For most of its (short) existence, the open source Big Data framework called Hadoop has been all about innovative ways to process, store, and eventually analyze huge volumes of multi-structured data. Indeed, the entire concept of Hadoop – processing petabytes of unstructured data in parallel across potentially thousands of commodity boxes using an open source file system and related tools – flies in the face of the traditional database model: relational data only, scale-up not scale-out, proprietary hardware and software.
But this paradigm started to change with the emergence of the first commercial Hadoop distribution vendor, Cloudera, back in 2009, and it’s not hard to understand why. This drive to turn Hadoop into an enterprise-grade platform accelerated about this time last year. With five-plus commercial Hadoop distribution vendors vying for the top spot in a potentially $50 billion market, the race to build the first truly enterprise-ready Hadoop platform was officially on. The Hadoop distribution race is important because no technology can achieve mass adoption in the enterprise unless risk-averse IT departments believe it will stand up under pressure, and Hadoop is no exception. For all its promise, Hadoop has a number of inherent weaknesses that make it a less than ideal platform for supporting mission-critical applications and workloads. As originally developed, a single node within a Hadoop cluster is responsible for storing and managing metadata.
Without the internal expertise to administer and monitor Hadoop clusters, or to take advantage of Hadoop for Big Data analytics, the wider enterprise market will not adopt the platform. No sane mainstream enterprise in the 21st century will deploy a new technology, particularly a data management technology, absent a robust security layer.
These four issues do not constitute a comprehensive list of Hadoop-related enterprise readiness concerns, but they are the issues most commonly raised when companies begin evaluating Hadoop for enterprise deployments.
With Hadoop Summit 2012 upon us, now is a good time to review the progress each Hadoop distribution vendor has made in regard to enterprise readiness.
One of the complicating factors in such an exercise is taking into account the open-source nature of Hadoop. As the first commercial Hadoop vendor on the scene, Cloudera boasts what is generally considered the most mature Hadoop distribution on the market.
High Availability: CDH4 includes true high availability (HA) in the form of a secondary namenode that now lives up to its name.

Table and Column-Level Access Controls: The most recent version of CDH also introduced highly granular table and column-level access controls for HBase, a popular NoSQL database used in conjunction with Hadoop.

More APIs: New sets of APIs allow administrators to integrate Cloudera Manager with existing IT monitoring and management systems.

Cloudera has been criticized by some for not making Cloudera Manager open source and contributing its code back to the Apache community.

DataStax is best known as the commercial Cassandra company, but the vendor also plays an important role in the Hadoop ecosystem.

Advanced Workload Management: The latest version of the DataStax platform also incorporates Apache Solr, the open source enterprise search platform.

Workload Isolation: The DataStax platform also boasts workload isolation capabilities to ensure that real-time data workloads do not compete with analytic workloads for compute resources such as memory, CPU, or disk space.
What DataStax does not offer are visual interfaces, easy-to-use tools, or other aids to make it easier for admins and Data Scientists to write MapReduce jobs or Hive and Pig routines outside of currently available open source methods.

In addition to reselling MapR’s M5 Hadoop distribution, EMC Greenplum now offers its own Apache-compatible Hadoop distribution known as Greenplum HD.
1,000-Node Hadoop Workbench: EMC Greenplum deployed a 1,000-node Hadoop cluster storing 24 petabytes of data for Hadoop practitioners to experiment with.

Reference Configuration via Cisco UCS: Greenplum also offers a Hadoop appliance, bundling together the MapR M5 Hadoop distribution with the Cisco Unified Computing System reference architecture.

EMC Greenplum’s approach to Hadoop is to surround it with a slew of complementary tools and technologies – some open source but most closed source – to deliver a robust enterprise-grade platform.

Released on June 12, 2012, the Hortonworks Data Platform (HDP) is a 100% open-source enterprise version of the Apache Hadoop distribution.
HCatalog: HCatalog is an Apache-based table and storage management service, developed largely by Hortonworks, that is designed to enable administrators to delineate the location and structure of data within Hadoop.

Talend Open Studio: Hortonworks formed a partnership with Talend, an open-source data integration and master data management vendor, to embed its functionality into HDP.
Being 100% open source, HDP eliminates the risk of vendor lock-in posed, to a greater or lesser extent, by competing Hadoop platforms.
MapR takes what is probably the most controversial approach to Hadoop of any of these five vendors.
Direct Access NFS: MapR’s enterprise Hadoop distribution, M5, uses Direct Access NFS, enabling administrators to let Web- and file-based applications write data directly to Hadoop.

Security Enhanced Linux: MapR’s M5 supports Security-Enhanced Linux (SELinux), a set of access control security policies, as well as Active Directory methods.

Action Item: Whether a technology is enterprise ready or not is something of a subjective proposition.
Footnotes: Special thanks to Cloudera, DataStax, EMC Greenplum, Hortonworks, MapR and the Apache Hadoop community for their help in compiling this report.
If we’re going to support the many data scientists and their application developer counterparts in an even more experimental, data-driven enterprise, we had better pay attention to the roots of the Hadoop framework. Providing enterprises a way to build their own internal sandbox applications using Hadoop building blocks will require the talent of a Hadoop-savvy team. Ideally, existing staff should be educated and provided with simple application development environments that support the use of unstructured data technologies. With an internal deployment of a Big Data PaaS, the organization gets a full, turnkey stack not too dissimilar from Hortonworks, Cloudera, etc., but with perhaps one compelling difference: a PaaS includes a comprehensive suite of services for the app-dev teams and a robust API that can be expanded as services are developed and added to the platform. Is the market ready for a turnkey Big Data platform offering? Is it too early for a Big Data PaaS? Production applications are born out of the sandbox cloud environments, and companies are moving to support the development team to facilitate productization.

Last week, I blogged about the Big Data announcements we made in conjunction with the Strata + Hadoop World conference. Cloudera Impala was developed in response to one of the biggest complaints about using Hadoop (and Hive) for analytics: latency.
Customers interested in trying out a preview version of the Impala connector should contact their Tableau account manager for details about how to participate in the early access program. PS: many thanks to Franz Funk for creating the dashboard above that we showed at the Strata + Hadoop World conference!
Learn the new features and enhancements in Cloudera Manager 5, including support for YARN, management of third-party apps and frameworks, and more.
Cloudera Manager 5 is a key part of this release, and in this post, I will provide a brief overview of some key features in Beta 1 as well as introduce some of those planned for Beta 2 (to be released in early 2014).


A major theme of the beta release is the notion of supporting multiple workloads on the same data substrate.
Customers are also asking for an easier way to manage non-CDH services and ISV applications that are deployed on top of, and along with, the CDH stack. The Service Extensibility mechanism in Cloudera Manager 5 provides different avenues for non-CDH services and ISV applications to be managed via Cloudera Manager. The plan is to have a good set of examples, documentation, and sample code available as part of Beta 2 (or by GA) for customers to try this on their own for any new service they would like to deploy.
The team is now busy working on a Beta 2 release, which is currently scheduled to include support for Apache Oozie HA, YARN ResourceManager HA, HDFS caching, user-defined triggers, and more.
From the time of its inception by Doug Cutting at Yahoo until 2011 or so, the majority of enhancements to the platform were focused on new and better ways to accomplish this core function.
In order to successfully sell Hadoop to the enterprise, the platform must meet certain levels of, for lack of a better term, enterprise readiness. It was then that MapR, though already a two-year-old company, joined Cloudera with its own commercial Hadoop distribution and a high-profile reseller agreement with EMC Greenplum. Consider the apocryphal saying, “Nobody ever got fired for buying IBM,” but replace IBM with Hadoop. This node, called the namenode, is essentially responsible for understanding which nodes store particular data and for providing clients with this information when a MapReduce job is initiated. The idea is to bring together and mine all relevant data sources, regardless of data structure or domain, for actionable insights. Most database administrators and other data practitioners speak SQL, or Structured Query Language, used by most relational database management systems.
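To make the skills gap concrete, compare a one-line SQL aggregation with what the same logic looks like as a raw MapReduce job. The sketch below is the canonical word-count pattern written against the standard Hadoop MapReduce API; the class names and input data are illustrative, not drawn from any vendor's distribution.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Roughly the equivalent of: SELECT word, COUNT(*) FROM docs GROUP BY word;
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token on the input line.
            for (String token : line.toString().split("\\s+")) {
                context.write(new Text(token), ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted for this word across all mappers.
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}
```

A query a SQL-literate analyst writes in seconds becomes two Java classes plus job-submission boilerplate, which is exactly the barrier Hive, Pig, and the vendor tooling discussed in this report aim to lower.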
While education and training is one answer to the problem, Hadoop vendors must also make their products simpler to administer and use in order to further break down adoption barriers. The consequences of running afoul of privacy and compliance regulations, succumbing to data breaches from hackers, or even just appearing to take a lackadaisical approach to data security are simply too high. As mentioned, however, commercial Hadoop vendors and the open source Apache community have been working diligently to address each of these issues for the last year or more. Below are overviews of enhancements aimed at improving Hadoop in this regard made by, in alphabetical order, Cloudera, DataStax, EMC Greenplum, Hortonworks, and MapR, over the last two years. The vendors reviewed below each take different approaches to the Apache Hadoop project, with some contributing all of their code to the community and others keeping certain features and functions proprietary. Cloudera’s distribution including Apache Hadoop (CDH) is currently in its fourth iteration, which was released in early June 2012. It is now available as an automatic, hot fail-over should the primary namenode in a cluster go down. This allows administrators to view Hadoop in the context of the larger enterprise IT infrastructure.
But the company has decided on a go-to-market strategy that leverages an open source core with proprietary management software.
DataStax swaps out the Hadoop Distributed File System (HDFS) with Cassandra, a NoSQL, column-oriented database designed to support near-real time applications and workloads.
With three distinct projects – Hadoop, Cassandra and Solr – integrated into one Big Data platform, advanced workload management capabilities are critical. Also, few DataStax customers deploy the platform for strictly Hadoop jobs, but rather look to the vendor when they are in need of mixed workload capabilities.

The distribution itself does not significantly distinguish Greenplum HD from its competitors, however. When integrated with Greenplum HD, administrators can take advantage of Isilon’s existing enterprise-grade security features with Hadoop. The Hadoop workbench, as EMC calls it, provides enterprises with a dedicated and secure area to test Hadoop deployments and applications before rolling them out in production. This includes tight integration with the Greenplum analytic database and Chorus in the form of the Greenplum Unified Analytic Appliance.

Spun out of Yahoo last year, Hortonworks is trying to replicate the Red Hat playbook: offering free, open source software monetized with for-pay technical support services. The goal is to make it simple to view and access all Hadoop-based data regardless of which program or application is being used. It is a Hadoop monitoring and management console that includes data visualization tools for tracking Hadoop cluster health. It allows administrators and others to integrate data from various sources inside Hadoop via an intuitive GUI, eliminating the need to write complex code. That said, Hortonworks’ business model – relying on technical support revenue alone – has yet to prove itself effective in the context of Big Data.

The company replaces standard HDFS with its own proprietary storage services layer that enables random read-writes and allows users to mount the cluster on NFS. This enables users to easily get data into and out of M5 without the need for custom data connectors. It also employs snapshot and data mirroring capabilities in addition to a distributed namenode architecture to ensure five 9’s of continuous uptime. This prevents smaller jobs, namely ad hoc queries, from getting stuck behind larger, lumbering MapReduce jobs that could take minutes, hours, or longer.
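To illustrate what NFS mountability buys in practice, here is a minimal sketch of an ordinary application appending to a file on an NFS-mounted MapR cluster using nothing but standard Java file I/O; the mount point and file path are hypothetical placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsDirectWrite {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point: the cluster is NFS-mounted at /mapr/my.cluster
        Path logFile = Paths.get("/mapr/my.cluster/apps/web/access.log");
        // Ordinary file I/O -- no Hadoop client libraries or custom connectors needed.
        Files.write(logFile, "GET /index.html 200\n".getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Against stock HDFS, the same write would require the Hadoop client libraries and an HDFS-specific API, which is why this capability matters for existing Web- and file-based applications.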
M5 has been dubbed a Hadoop “fork” by some in the community because its storage services code is closed.

There was one additional highlight that didn’t make my blog post because the technology had not yet been announced at the time of the conference. In its internal tests, Cloudera has reported that Impala is anywhere from 3x-90x faster than Hive depending on the type of query and workload. For more information about Impala, I encourage you to read Cloudera’s blog post that describes the technology in more detail.

The response to the 2013 release of Cloudera Enterprise 5 Beta has been overwhelming, and Cloudera is busily working closely with several customers to incorporate their feedback. Effective resource management becomes an important criterion to make this vision a reality. A good example is Cloudera’s recent collaboration with Syncsort to facilitate the deployment of its DMX-h libraries via Parcels. In the interim, we continue to work with select partners like SAS, 0xData, Syncsort, and others to fine-tune the implementation.
More specifically, data visualization has been beefed up, including the ability to chart the time-series metrics as bar graphs, scatter plots, heat maps, and so on. While Hadoop pioneers, Web giants like Yahoo, Google, Facebook, and LinkedIn, can devote armies of administrators and engineers to keep Hadoop clusters up-and-running efficiently, most enterprises cannot. The four major weaknesses are Single Points of Failure, Integration with Existing IT Systems, Administration and Ease-of-Use, and Security.
Three copies of data are typically stored on three separate nodes in a cluster, and the namenode maintains a directory tree with this information.
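A client-side sketch makes the namenode's role visible: every lookup of which datanodes hold a file's blocks is answered from that directory tree. The snippet below uses the standard HDFS FileSystem API; the file path is a hypothetical placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // client connects to the namenode
        Path file = new Path("/data/events.log"); // hypothetical file path
        FileStatus status = fs.getFileStatus(file);
        // Each block lookup is answered from the namenode's in-memory directory tree:
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block); // offset, length, and the datanodes holding replicas
        }
    }
}
```

When the namenode is unavailable, lookups like this fail across the entire cluster, which is the heart of the SPOF concern discussed in this report.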
Therefore, the ability of any Big Data platform to integrate with legacy and new IT systems – transactional databases, analytic databases, existing applications, file stores and other NoSQL databases – is critical.


Hadoop in its original incarnation, unfortunately, lacked fine-grained user access controls. These are not comprehensive accounts of each and every feature in each Hadoop distribution, but rather serve to highlight functions and features that address the four concerns listed above.
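For a flavor of the fine-grained controls now arriving in the distributions (such as the HBase table and column-level access controls noted above for CDH4), the sketch below grants a user read access to a single column rather than a whole table. Note the hedge: the AccessControlClient API shown here comes from later Apache HBase releases, not necessarily the CDH4-era mechanism, and the table, user, and column names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.security.access.AccessControlClient;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.util.Bytes;

public class GrantColumnAccess {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Grant the hypothetical user "analyst" read-only access to one
            // column family/qualifier instead of the entire "patients" table:
            AccessControlClient.grant(conn,
                    TableName.valueOf("patients"),
                    "analyst",
                    Bytes.toBytes("billing"),  // column family
                    Bytes.toBytes("balance"),  // column qualifier
                    Permission.Action.READ);
        }
    }
}
```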
Code contributed to the open source community by one vendor may be used by another, meaning there is significant overlap among all five vendor distributions.
To aid enterprise adoption, Cloudera has also developed Cloudera Manager, proprietary software for deploying, managing, and securing CDH.
Cloudera believes that is the best way to build its business, and there is little chance that it will open source Cloudera Manager in the future.
Because Hadoop is at heart a batch-oriented system, this capability gives the framework important new functionality.
OpsCenter allows administrators to replicate data between the three as well as manage node assignment.

Rather, Greenplum offers customers an enterprise-grade Hadoop “wrapper,” leveraging EMC’s enterprise storage technology and new collaborative analytic workspace called Chorus.

Still, Hortonworks decided the vanilla Apache Hadoop distribution needed improvement, hence its development of HDP. HCat also supports standard table formats to enable integration with relational databases and data warehouses. It also includes a REST interface for defining and manipulating Hadoop clusters, as well as the ability to upgrade existing Hadoop clusters to newer versions of the software without losing existing data.
MapR has decided to keep the source code of the core of its Hadoop distribution to itself, but it is 100% API-compatible with Apache Hadoop. Because of that compatibility, users can fairly easily move from MapR back to Apache Hadoop if desired.
While Hadoop has come a very long way in a relatively short period of time from an enterprise readiness perspective, clearly it is still an emerging technology. However, we could argue that many of these production applications were born out of discovery sandbox initiatives.
It is actually a very significant announcement for customers using Tableau on top of Cloudera’s Hadoop distribution (CDH). This should provide significant performance gains over Tableau’s existing Hive connectivity.
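Tableau's connector works through its own driver layer, but to give a rough sense of what querying Impala looks like from an application, the sketch below connects over JDBC using the HiveServer2 protocol that Impala exposes. The hostname, port, and table are hypothetical, and the exact driver and connection-string details should be checked against Cloudera's documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
    public static void main(String[] args) throws Exception {
        // Impala speaks the HiveServer2 wire protocol; host, port, and the
        // "orders" table below are hypothetical placeholders.
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(sales) FROM orders GROUP BY product")) {
            while (rs.next()) {
                // Results stream back without a MapReduce job being launched.
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```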
With Cloudera Manager 5, the plan is to have YARN production ready to support dynamic resource allocation (for different applications that leverage YARN). The complete functionality will enable customers to manage the entire lifecycle of new services (such as Apache Accumulo, Apache Spark [incubating], and so on) via this mechanism.

With just one node responsible for understanding where all data is located, the namenode becomes a single point of failure (SPOF). While the Hadoop Distributed File System can ingest data in virtually any format, moving data back and forth between Hadoop and, for example, existing enterprise databases was not originally a trivial task.
As such, existing data management professionals require significant training to administer Hadoop clusters. Hadoop can run Kerberos, a network authentication protocol, which allows nodes to “prove” their identity before an action is taken.
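For a flavor of what Kerberos-enabled access looks like from client code, the sketch below performs a keytab-based login with Hadoop's UserGroupInformation API before any HDFS or MapReduce calls are made; the principal and keytab path are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client stack that the cluster requires Kerberos:
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Principal and keytab path are hypothetical placeholders:
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```

Authentication of this kind is the baseline; the finer-grained authorization features discussed in this report are layered on top of it by the individual vendors.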
Indeed, some of the features highlighted below for one vendor may also be a part of another vendor’s distribution even if not listed here.

While CDH is available for free download and is fully Apache compliant, Cloudera bundles the closed-source Cloudera Manager along with support services with CDH as part of its Cloudera Enterprise offering, which it sells under a subscription model. From an enterprise readiness perspective, DataStax’s Hadoop distribution takes advantage of Cassandra’s decentralized architecture to mitigate the SPOF issue. EMC also partners with Cisco to deliver the high-performance M5 Hadoop distribution as an optimized appliance. Chorus, as mentioned, is a collaborative workspace that sits on top of Greenplum HD and allows Data Scientists to annotate their work, share notes, and otherwise build off one another’s analyses. The platform is very much EMC-focused, however, and comes with a higher price tag than most competing Hadoop offerings.
In any event, as a result of its approach, MapR has developed a high-performance Hadoop distribution that leapfrogged many of the weaknesses that dogged the Apache community, most notably the SPOF issue.

While the various distribution vendors have increased Hadoop’s management, security, and performance capabilities with their respective distributions, there are still no industry-wide standards accepted by all.

At the conference, Cloudera announced a new Hadoop technology called Impala, which is a real-time query processing engine for Hadoop. In addition, we continue to support static partitioning (via cgroups) to divide cluster resources (CPU, memory, and so on) among these stand-alone processes.
The end goal is to have customers write a simple service descriptor (a JSON file along with a set of control scripts) for a new service that gets managed by Cloudera Manager.

Meanwhile, a NoSQL database company called DataStax produced its own version of Hadoop for the enterprise. There is significantly less value in any Big Data platform that cannot easily integrate with existing enterprise IT systems, lest it become yet another data silo. Likewise, most analytic professionals are not versed in MapReduce, meaning the pool of qualified Hadoop analytic pros is limited to a select few. Still, as perceived by most data security professionals, Hadoop has lacked the level of security functionality needed for safe deployment in enterprise environments loaded with sensitive data.

The company’s OpsCenter Enterprise Edition provides a visual interface to allow administrators to manage deployments, including monitoring the health and status of Hadoop jobs. It is a more efficient Hadoop distribution than most, though it is also more expensive from a price-per-node perspective.
As maturation continues, expect a set of standards to emerge as we’ve seen in other technology areas. Tableau was chosen as one of its first partners to integrate with Cloudera Impala and we previewed an early version of the Impala connector at the conference. Cloudera Manager 5 adds several knobs and parameters to manage all these resource management aspects in a simplified and streamlined fashion.
Once a namenode fails, it can take hours or longer to return the cluster to working order, during which time no jobs can be processed.
And as the Apache Hadoop distribution catches up with MapR, the company will come under increasing pressure to prove its value proposition. In the meantime, enterprises evaluating their Hadoop distribution options should weigh the features and functions of the different vendors against their particular needs in both the short and long-term.
While a secondary namenode is part of most Hadoop configurations, as originally designed it only periodically replicates and stores data from the namenode, meaning it cannot be relied upon as a failsafe should the namenode go offline.


