Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has developed an integration of Apache Mesos and CDH that can be deployed and managed through Cloudera Manager.
Why would you want to run things like web servers, proxies, and caches on a Hadoop cluster, though? For example, building a security information and event management (SIEM) solution on top of Hadoop means including a live-traffic inspection layer as well as an active archive of security-related event logs and reporting tools.
Mesos comes with a framework called Marathon to launch tasks on the cluster, and a scheduler framework called Chronos, which offers a highly available, fault-tolerant alternative to Unix cron.
In order to launch an application, Mesos Marathon uses Docker, an application virtualization system that enables portable, standardized and containerized deployment of applications and components across the cluster. There are other ways to launch applications on Mesos, but Docker offers a robust solution with extensive features. The next step is to download, distribute, and activate the Mesos and Docker parcels via Cloudera Manager.
We can now set up and configure a new Mesos service on our cluster from Cloudera Manager, in the same way we would set up any other Hadoop service.
You can choose which nodes of the cluster to use as Mesos slaves and Mesos masters, and where to deploy the Marathon service. The best way to do this is to provide a Docker Registry, which is comparable to a Git repository for Docker images. Note that when using an insecure private registry, like the one from the JSON file, it is important to add the --insecure-registry argument to the start command. Make sure the IP address and the port number of the registry are set correctly and the registry is added as an insecure registry on the Docker daemon.
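As a rough sketch of what a Marathon application definition for a Docker-based component might look like, the Python snippet below builds the JSON for a Memcached app whose image is pulled from a private registry. The registry address, app id, and resource figures are assumptions chosen for illustration, and the field layout follows the general shape of Marathon's application JSON rather than any file from this article; check it against the Marathon version you run before posting it to the `/v2/apps` endpoint.

```python
import json

# Hypothetical address of a private, insecure Docker registry (an assumption).
REGISTRY = "10.0.0.10:5000"


def memcached_app(registry: str, instances: int = 3) -> dict:
    """Build an illustrative Marathon app definition that runs Memcached in Docker."""
    return {
        "id": "/memcached",          # hypothetical app id
        "instances": instances,      # number of copies to run across the cluster
        "cpus": 0.5,                 # illustrative resource figures
        "mem": 256,
        "container": {
            "type": "DOCKER",
            "docker": {
                # The image name is prefixed with the registry's IP address and port.
                "image": f"{registry}/memcached",
                "network": "BRIDGE",
                "portMappings": [
                    # hostPort 0 asks for a dynamically assigned host port.
                    {"containerPort": 11211, "hostPort": 0, "protocol": "tcp"}
                ],
            },
        },
    }


app = memcached_app(REGISTRY)
payload = json.dumps(app, indent=2)
print(payload)
```

Building the definition in code rather than by hand makes it harder to mistype the registry address, which is the detail most likely to break the image pull.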
In this article we have explained some of the features and benefits of Apache Mesos, seen how to deploy Mesos and Docker under CDH using Cloudera Manager and custom parcels, and had a look at launching an application component (Memcached) across the cluster using Mesos Marathon. The source code for the Cloudera Manager Mesos and Docker extensions is available on GitHub and is Apache v2 licensed. Rob Gibbon is architect, manager, and partner at Big Industries, the industry-leading Hadoop SI partner for Belgium and Luxembourg.
As part of our series of announcements at the recent Hadoop Summit, Cloudera released two of its previously internal projects into open source. We’ve seen our customers have great success using Hadoop for processing their data, but the question of how to get the data there to process in the first place was often significantly more challenging. At the same time, we observed that much more data are being produced than most organisations have the software infrastructure to collect.
Scalability – Flume is designed to capture a lot of data from a wide variety of sources.

All of this configuration is controlled by the Flume Master, a separate distributed service that monitors and updates all the logical nodes in a Flume deployment. The collector receives the annotated events from the processor, and writes them to a path in HDFS that is determined by the browser string that was extracted earlier. This simple example shows how Flume is easily able to capture your data, and to perform some non-trivial processing on it.
The Flume User Guide is a one-stop resource for a lot more detail on Flume’s architecture, usage and internals. There have recently been some good conversations on the Flume user list comparing Flume to a message queue, to Chukwa and to Scribe. In this post, Big Industries’ Rob Gibbon explains the benefits of deploying Mesos on your cluster and walks you through the process of setting it up. Well, when assembling a technical solution, especially an off-the-shelf solution, it is common that the buyer expects the vendor to provide a complete, ready-to-go platform, with a single bill of materials. While Hadoop perfectly fits the needs for the active archiving element, without Mesos integration to run a live-traffic inspection system and a reporting server, it would be quite difficult to deliver on the complete system requirements in a consistent way from a single platform.
The engineer writes a Dockerfile, which is a text file containing a set of automation instructions for deploying and configuring the application.
With this approach, deploying Mesos and Docker is a similar experience to deploying other Hadoop components like YARN, Impala, or Hive.
One of those was the HUE user interface environment, which we’ll be saying a bit more about later this week. Many customers had produced ad-hoc solutions with complicated shell scripts and periodically running batch copies. We are very keen to allow our users to take advantage of all the data that their cluster is generating. Flume is a distributed service that makes it very easy to collect and aggregate your data into a persistent store such as HDFS. We also recognise that different kinds of data have different levels of importance, and therefore designed Flume to have fine-grained tunable reliability guarantees that dictate how much effort Flume makes to ensure that your data are delivered when failures happen. As such we want to be able to run Flume on hundreds of machines with no scalability bottleneck. Data passes through a simple network of logical nodes, which are lightweight entities that know how to do exactly one thing: read data from some source and send it on to a sink.
At the beginning of a flow is the original source of the data, and the sink of the final logical node in the chain defines where the data will eventually be delivered. This makes it very easy to configure, create, delete and restart individual logical nodes without having to restart the Flume process.
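To make the idea of a chain of logical nodes concrete, here is a minimal Python sketch. It is not Flume code: the `LogicalNode` class, the node names, and the in-memory queue and list standing in for the network hop and the final store are all assumptions made for illustration. Each node does one thing only, which is to read from its source and hand every event to its sink.

```python
from collections import deque
from typing import Callable, Iterable


class LogicalNode:
    """Illustrative stand-in for a logical node: one source, one sink."""

    def __init__(self, name: str, source: Iterable[str], sink: Callable[[str], None]):
        self.name = name
        self.source = source
        self.sink = sink

    def run(self) -> None:
        # Read every event from the source and pass it on to the sink.
        for event in self.source:
            self.sink(event)


# An in-memory queue stands in for the hop between two nodes (an assumption).
hop = deque()
# A plain list stands in for the final destination, such as HDFS (an assumption).
final_store = []

# The first node reads the original data; its sink feeds the next node.
first = LogicalNode("first", source=["event-1", "event-2"], sink=hop.append)
# The last node's sink defines where the data ends up.
last = LogicalNode("last", source=hop, sink=final_store.append)

first.run()
last.run()
print(final_store)
```

Because a node only knows its own source and sink, swapping either one changes the shape of the flow without touching the node itself, which is the property that makes central reconfiguration practical.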

Flume’s HDFS sink uses a simple template language to figure out where to write events. This entire flow can be constructed and configured from the Flume Master, without any need to log on to the individual machines involved.
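The template idea can be sketched in a few lines of Python. This is only an illustration of substituting event attributes into an output path; the `%{name}` placeholder style, the `expand_path` helper, the fallback value, and the example path are assumptions and should not be read as Flume's actual implementation or configuration syntax.

```python
import re


def expand_path(template: str, attributes: dict) -> str:
    """Replace each %{name} placeholder with the matching event attribute."""

    def lookup(match: re.Match) -> str:
        key = match.group(1)
        # Fall back to a fixed bucket when the attribute is missing (an assumption).
        return str(attributes.get(key, "unknown"))

    return re.sub(r"%\{(\w+)\}", lookup, template)


# Hypothetical attributes attached to an event earlier in the flow.
event_attributes = {"browser": "Firefox"}

path = expand_path("hdfs://namenode/flume/webdata/%{browser}/", event_attributes)
print(path)  # hdfs://namenode/flume/webdata/Firefox/
```

Events whose attributes differ end up under different directories, so the data arrives already partitioned for later processing.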
Mesos is designed to scale, as YARN is, and Mesos services can be deployed on clusters of tens of thousands of nodes.
Solutions often make use of operational, front-end serving components (reverse proxies, load balancers, web servers, application servers) and middle-tier components (object caches, JMS, workflow engines, etc.) in addition to backend components. While Hadoop is great at solving backend data processing challenges, until Mesos it has been pretty difficult to deploy and operate front-end and middle-tier components in a consistent manner as part of a complete, Hadoop-powered solution. In order to ensure solid resource isolation, you can use Cloudera Manager’s Linux Control Groups integration to allocate appropriate system resource shares to the Mesos framework; this way Mesos and other Hadoop components like YARN and Impala can coexist.
Such solutions, while minimally effective, didn’t allow the user any insight into how they were running, whether or not they were succeeding and whether or not any data were being lost. Looking around, we saw no solutions that supported all the features that we wanted to provide to our customers, including reliable delivery of data and an easy configuration system that didn’t involve logging in to a hundred machines to restart a process, as well as a powerful extensibility solution for easy integration with a wide variety of data sources. Flume is horizontally scalable, which broadly means that performance is roughly proportional to the number of machines on which it is deployed.
We want Flume to enable flexible data collection, and that means making it easy to make changes. That way, anyone can prototype and build new connectors for Flume to almost any data source very quickly indeed.
A source might be a file, process output or any other of the many sources that Flume supports – or it might be another Flume logical node.
We call the first logical node in a flow the agent, and the last logical node the collector. All this can be configured dynamically from a single, central location – no more tedious configuration file editing and process restarting. Similarly, a sink might be another Flume logical node, or it might be HDFS, S3 or even IRC or an e-mail. The intermediate nodes are called processors, and their job is to do some light transformation and filtering on the events as they come through in order to get them ready for storage. In this blog post I’d like to introduce Flume to the world, and say a little about how it might help with data collection problems that you might be facing right now. Flume will collect the data from wherever existing applications are storing it, and whisk it away for further analysis and processing.
