Category Archives: Apache Hadoop


HA Namenode for HDFS with Hadoop 1.0 – Part 1

Introduction

A Highly Available NameNode for HDFS has been in development since last year. That effort focused singularly on the automatic failover of the NameNode for Hadoop 2.0. During that time we have realized two things.

First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high-availability of individual components; that work lead to the Full Stack HA Architecture.

Second, we realized that we can build an HA NameNode for Hadoop 1.0 using industry proven solutions such as Linux HA and vSphere; this is important because HDFS in Hadoop 1 is been proven to be stable and reliable, while HDFS in Hadoop 2 is just beginning beta testing. This blog describes some technical details of HDFS NameNode HA in Hadoop 1. A future blog will give some more details on Full Stack HA.

The first and foremost question in people’s mind is: What is the difference between HA Hadoop 1 and Hadoop 2? My colleague Suresh and I wrote the original design for Hadoop 2 (see HDFS-1623) and have worked closely with the community on the implementation. HA in Hadoop 1 is the direct result of our experiences during that work.

Hadoop 2 HA focused on three areas:

  • Hot failover: We have found that the difference in failover times are small between cold and hot failover for small to medium clusters; Hadoop 1 uses cold failover.
  • Automatic failover: For Hadoop 1 we have used industry proven HA framework rather than use the Hadoop 2 Failover-Controller. The figure above illustrates NameNode HA in Hadoop 1 using Linux HA.
  • Remove dependency on shared storage: this is still work in progress for Journal Daemons in trunk[6]. Both Hadoop 2 and Hadoop 1 use shared storage.

To summarize, the difference in Hadoop 1 HA is cold failover and the use of industry standard HA frameworks. Lets look at details and implications of these differences below.

Failover Times and Cold versus Hot Failover

The failover time of a high available system with active-passive failover is the sum of (1) time to detect that the active service has failed, (2) time to elect a leader and/or for the leader to make a failover decision and communicate to the other party, and (3) the time to transition the standby service to active.

The first and second items are the same for cold or hot failover: they both rely on heartbeat timeouts, monitoring probe timeouts, etc. We have observed that total combined time for failure detection and leader election to range from 30 seconds to 2.5 minutes depending on the kind of failure; the lowest times are typical when the active server’s host or host operating system fails; hung processes take longer due to the grace period needed to be confident that the process is not blocked during Garbage Collection.

For the third item, the time to transition the standby service to active, Hadoop 1 requires starting a second NameNode and for the NameNode to get out of safe mode. In our experiments we have observed the following times:

  • A 60 node cluster with 6 million blocks using 300TB raw storage, and 100K files: 30 seconds. Hence total failover time ranges from 1-3 minutes.
  • A 200 node cluster with 20 million blocks occupying 1PB raw storage and 1 million files: 110 seconds. Hence total failover time ranges from 2.5 to 4.5 minutes.

Industry Standard HA Frameworks

As we stated in the Hadoop HDFS NameNode HA design document, HDFS-1623, HDFS NameNode HA design document [HDFS-1623], the notion of having a failover controller outside the NameNode was influenced by frameworks like Linux HA[1], Red Hat HA[2] and Veritas Cluster[4,5]. Part of the decision for the Hadoop community to build our own failover controller was made because we felt it was useful for Hadoop to provide an out-of-the-box solution. Linux HA, being GPL, did not allow that.

For Hadoop 1 we decided not to back-port the failover controller from trunk, but instead use an industry standard HA frameworks. Why?

We wanted to add HA to the stable Hadoop line in as risk-free a way possible which lead us to use proven and robust HA frameworks. Many customers already have experience using these HA frameworks. These frameworks deal with monitoring timeout, service startup timeout, shutdown timeout, and have a way to flag and deal with repeated failures. Further, the industry proven frameworks offer several alternative fencing solutions including power-based fencing.

The same HA framework can be used to failover other Hadoop services such as the Job-Tracker; we have have already started the work on an HA Job-Tracker. These HA frameworks also provide the ability to share a common pool of server machines to host highly avaiolable NameNode, JobTracker and other Master daemons; the shared pools allow N-N, N-on-N and N+K failover. Finally they offer a way to perform manual switchover, coordinated shutdown of both and being able to run with one of the NameNodes down.

Using Failover solutions which keep the Namenode IP Address constant ensure that the web interfaces such as WebHDFS or the Hadoop consoles also failover. With IP failover, URLs will follow the service, regardless of where it is running. The use of mature IP failover-based solutions reduced the complexity, making it possible to implement HA on the stable Hadoop 1.0 line with a few, low risk changes to the Hadoop core.

Recently, Symantec independently described how to make the NameNode highly available using Veritas Cluster [5]. The Figure above illustrates the NameNode HA in Hadoop 1 using Linux HA that will be available along with a similar solution using vSphere HA.

FAQ

You are using cold failover and hence it is not practical.
The HDFS 2.0 HA design was driven by the needs of the very large clusters at Yahoo and Facebook. For small to medium clusters cold failover is only 30 to 120 seconds slower, as described above.

If this was so easy to do with Linux HA or other tools why didn’t the HDFS community do this earlier?
This is partly because the original HDFS team focused on very large clusters where cold failover was not practical. We assumed that Hadoop needed to provide its own built-in solution As we’ve developed this technology, we’ve heard directly from our customers that HA solutions are complex and that they prefer using their existing, well understood, solutions.

You have taken a different path from HA in Hadoop 2.
Not true – both are based on the same design principals. The difference is the focus on cold failover instead of hot failover. Further, the work is complimentary. All the work in the Hadoop 1.0 line is also being added to the Hadoop 2.0 line.

I need to use Linux HA, or another framework – isn’t that a hindrance?
Many of Hadoop users already use Linux HA, Red Hat HA, Veritas Cluster, or vSphere HA in their data centers. Linux HA is freely available and the cost of Red Hat’s HA is fairly low.

So is the Full Stack HA only a part of Hadoop 1?
No, Full Stack HA is orthogonal to the failover of a specific component such as the NameNode or the Job Tracker (see this post). Making the rest of the stack robust against transient failures of the layers underneath improves the entire stack. We will cover more detail of Full Stack HA in a up coming blog.

Does vSphere HA deal with service failures in addition to VM failures?
vSphere allows application level health checks and we have added an application-level monitor for the NameNode. A similar JobTracker-specific monitor will be available shortly as well.

When using vSphere HA solution, what is the advantage of hosting the NameNode and JobTracker each in their own vSphere VirtualMachine?
Hosting the master services in isolated vSphere VMs is an effective design. As vSphere monitors and manages each VM independently, vSphere servers can host independent VMs containing the NameNode, JobTracker, and other master services. If one service fails, that VM is killed and restarted – while the other VMs continue uninterrupted. You also gain the ease of maintenance and rollback that VMs offer.

Status

All patches to core hadoop have been committed to Apache Hadoop trunk and branch 1.1; these are also incorporated in Hortonworks HDP 1 and HDP 1.1 releases. We plan to commit these to Hadoop 2-alpha.

Our monitoring code is targeted for inclusion into the still-in-incubation Ambari project; it has already been submitted as a patch [8].

Future Outlook

Work is in progress to stabilize HDFS 2 which is currently entering beta testing and also Hadoop 2‘s HA (Hot Failover, failover controller, etc). We are also in the progress of providing failover for other Hadoop components, such as the JobTracker, in Hadoop 1.

References

  1. Linux HA: http://www.linux-ha.org/wiki/Main_Page
  2. Red Hat High Availability Add-On: http://www.redhat.com/products/enterprise-linux-add-ons/high-availability/
  3. vSphere HA: http://www.vmware.com/products/high-availability/overview.html
  4. Symantics’ Veritas Custer Framework: https://www.symantec.com/cluster-server
  5. NN HA using Symantics’ Veritas Custer Framework: http://www.symantec.com/connect/blogs/symantec-high-availability-solution-hadoop-namenode
  6. Hadoop Journal Daemon: HDFS-3092 and HDFS-3077
  7. HDFS NameNode HA design (for Hadoop 2): HDFS-1623
  8. Monitoring Library: Ambari-504
  9. Full Stack HA: http://hortonworks.com/blog/high-availability-and-hadoop-1-0-perfect-together/

Apache Hadoop YARN – Background and an Overview

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Apache Hadoop YARN – Background & Overview

Celebrating the significant milestone that was Apache Hadoop YARN being promoted to a full-fledged sub-project of Apache Hadoop in the ASF we present the first blog in a multi-part series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discreet chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.

In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.

One of the keys to this is the lack of data motion i.e. move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack – a core advantage.

Apache Hadoop MapReduce, circa 2011 – A Recap

Apache Hadoop MapReduce is an open-source, Apache Software Foundation project, which is an implementation of the MapReduce programming paradigm described above. Now, as someone who has spent over six years working full-time on Apache Hadoop, I normally like to point out that the Apache Hadoop MapReduce project itself can be broken down into the following major facets:

  • The end-user MapReduce API for programming the desired MapReduce application.
  • The MapReduce framework, which is the runtime implementation of various phases such as the map phase, the sort/shuffle/merge aggregation and the reduce phase.
  • The MapReduce system, which is the backend infrastructure required to run the user’s MapReduce application, manage cluster resources, schedule thousands of concurrent jobs etc.

This separation of concerns has significant benefits, particularly for the end-users – they can completely focus on the application via the API and allow the combination of the MapReduce Framework and the MapReduce System to deal with the ugly details such as resource management, fault-tolerance, scheduling etc.

The current Apache Hadoop MapReduce System is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers.

The JobTracker is responsible for resource management (managing the worker nodes i.e. TaskTrackers), tracking resource consumption/availability and also job life-cycle management (scheduling individual tasks of the job, tracking progress, providing fault-tolerance for tasks etc).

The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.

For a while, we have understood that the Apache Hadoop MapReduce framework needed an overhaul. In particular, with regards to the JobTracker, we needed to address several aspects regarding scalability, cluster utilization, ability for customers to control upgrades to the stack i.e. customer agility and equally importantly, supporting workloads other than MapReduce itself.

We’ve done running repairs over time, including recent support for JobTracker availability and resiliency to HDFS issues (both of which are available in Hortonworks Data Platform v1 i.e. HDP1) but lately they’ve come at an ever-increasing maintenance cost and yet, did not address core issues such as support for non-MapReduce and customer agility.

Why support non-MapReduce workloads?

MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (Google Pregel / Apache Giraph) and iterative modeling (MPI). When all the data in the enterprise is already available in Hadoop HDFS having multiple paths for processing is critical.

Furthermore, since MapReduce is essentially batch-oriented, support for real-time and near real-time processing such as stream processing and CEPFresil are emerging requirements from our customer base.

Providing these within Hadoop enables organizations to see an increased return on the Hadoop investments by lowering operational costs for administrators, reducing the need to move data between Hadoop HDFS and other storage systems etc.

Why improve scalability?

Moore’s Law… Essentially, at the same price-point, the processing power available in data-centers continues to increase rapidly. As an example, consider the following definitions of commodity servers:

  • 2009 – 8 cores, 16GB of RAM, 4x1TB disk
  • 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.

Generally, at the same price-point, servers are twice as capable today as they were 2-3 years ago – on every single dimension.  Apache Hadoop MapReduce is known to scale to production deployments of ~5000 nodes of hardware of 2009 vintage. Thus, ongoing scalability needs are ever present given the above hardware trends.

What are the common scenarios for low cluster utilization?

In the current system, JobTracker views the cluster as composed of nodes (managed by individual TaskTrackers) with distinct map slots and reduce slots, which are not fungible.  Utilization issues occur because maps slots might be ‘full’ while reduce slots are empty (and vice-versa).  Fixing this was necessary to ensure the entire system could be used to its maximum capacity for high utilization.

What is the notion of customer agility?

In real-world deployments, Hadoop is very commonly deployed as a shared, multi-tenant system. As a result, changes to the Hadoop software stack affect a large cross-section if not the entire enterprise. Against that backdrop, customers are very keen on controlling upgrades to the software stack as it has a direct impact on their applications. Thus, allowing multiple, if limited, versions of the MapReduce framework is critical for Hadoop.

Enter Apache Hadoop YARN

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and per-application ApplicationMaster (AM).

The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, system for managing applications in a distributed manner.

The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, cpu, disk, network etc.

The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container.

Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce within the new YARN system that I’d like to point out is that we have reused the existing MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. More on this later.

The next post will dive further into the intricacies of the architecture and its benefits such as significantly better scaling, support for multiple data processing frameworks (MapReduce, MPI etc.) and cluster utilization.

 

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Understanding Apache Hadoop’s Capacity Scheduler

As organizations continue to ramp the number of MapReduce jobs processed in their Hadoop clusters, we often get questions about how best to share clusters. I wanted to take the opportunity to explain the role of Capacity Scheduler, including covering a few common use cases.

Let me start by stating the underlying challenge that led to the development of Capacity Scheduler and similar approaches.

As organizations become more savvy with Apache Hadoop MapReduce and as their deployments mature, there is a significant pull towards consolidation of Hadoop clusters into a small number of decently sized, shared clusters. This is driven by the urge to consolidate data in HDFS, allow ever-larger processing via MapReduce and reduce operational costs & complexity of managing multiple small clusters. It is quite common today for multiple sub-organizations within a single parent organization to pool together Hadoop/IT budgets to deploy and manage shared Hadoop clusters.

Initially, Apache Hadoop MapReduce supported a simple first-in-first-out (FIFO) job scheduler that was insufficient to address the above use case.

Enter the Capacity Scheduler.

Overview of Capacity Scheduler

The Capacity Scheduler was designed to allow organizations to share Hadoop clusters in a predictable and simple manner – primarily using the very common notion of ‘job queues’. It provides capacity guarantees for queues while providing elasticity for queues’ cluster utilization in the sense that unused capacity of a queue can be harnessed by overloaded queues that have a lot of temporal demand. This results in significantly higher cluster utilization while still providing predictability for Hadoop workloads.

For example, if you set up 5 queues, each queue would then have 20% of the total capacity for processing jobs. You can define the queue for which each job is assigned.

For providing the necessary elasticity, the Capacity Scheduler allocates free resources to any queue beyond its guaranteed capacity. These excess resources can be reclaimed as necessary and assigned to other queues in order to meet capacity guarantees.

Also, Capacity Scheduler supports the notion of queue and per-user limits within the queue.

Admins can set up ‘maximum capacity’ for a queue, which provides an upper bound on the elasticity of the queue i.e. the resources it can claim from other idle queues beyond it’s guaranteed capacity.

Each queue enforces a limit on the percentage of resources that a given user can access if multiple users are accessing the queue at the same time. The user-limits are dynamic, and the actual limits depend upon the active users and their demand.

Furthermore, to further aid stability and reliability of Hadoop MapReduce clusters, the Capacity Scheduler has other limits such as those on the number of accepted/active jobs per queue, the number of pending tasks per queue, the number of accepted/active jobs per user, the number of pending tasks per user, etc.

All of these limits help ensure that a single job, user or queue cannot adversely harm a MapReduce cluster resulting in catastrophic failures.

Finally, the Capacity Scheduler also supports the notion of ‘high resource’ applications such that MapReduce job can ask for multiple-slots for every map or reduce task. This is a very useful feature for resource-heavy applications. The resources consumed by the ‘high resource’ jobs are naturally accounted for against the queue capacity. Also, it’s useful to remember that the Capacity Scheduler has mechanisms to strictly enforce the resource limits given to every map or reduce task.

Understanding Capacity Scheduler’s User Limits

Setting up user limits is an important concept to understand when using Capacity Scheduler. Here is a working example:

Let’s say that a given queue is configured to be shared amongst 5 users, with each user being given an equal 20% share.

Initially, if only a single job from a single user (user A) is active in the queue, the entire capacity of the queue will be consumed by user A’s job.

If another user (user B) creates a job, user A and user B are both given 50% of the capacity of the queue. If one of the user’s doesn’t require the full capacity to run their job(s), Capacity Scheduler automatically assigns more capacity to the other user that requires the additional capacity.

If users A and B continue to submit jobs, the behaviors don’t change.

If user C begins to submit jobs, each user’s share becomes 33.3% of the queue capacity (again, assuming there is sufficient demand from the user’s jobs). This goes on until all 5 users get their 20% share each.

Now, if a 6th user enters the picture, things change a bit. Instead of further dividing capacity, Capacity Scheduler instead attempts to complete jobs from the original 5 users. This logic was added so that organizations can prevent large volumes of users from essentially dividing up resources across too many jobs, which would spread resources too thin (e.g. local storage) to complete tasks in an acceptable time frame.

On the other hand, as the users’ applications wind down, the other users start to reclaim the share. For e.g. if 3 users A, B, C are using 33% and user B’s jobs complete, both A and C can now get 50% of the queue capacity.

Furthermore, if one of the existing active users have applications which do not need additional resources in the cluster, for e.g. at the tail end of the job, the other users get larger share of the queue.

Understanding the Resource Utilization of MapReduce Jobs & Its Impact on Sharing

One of the keys to achieving stability in MapReduce clusters is to understand that even though job tasks may not be running, they can still consume resources in the cluster. A very common use case is map-tasks outputs – i.e. even after a map task completes, it is still consuming resources to store its map outputs.

For example, if a large job consumes tens or hundreds of TB and is sharing capacity with other jobs for a long period of time (e.g. 12 hours), the cluster could have a lot of disk space tied up to map outputs. This is one of the main reasons that Capacity Scheduler attempts to finish individual jobs with a limited amount of sharing, thereby maximizing throughput and improving cluster stability.

Note that the discussion of map outputs doesn’t change if we store them locally on individual disks (current Hadoop MapReduce) or on HDFS with replica of 1, since storage consumed is still the same. Also, having an HDFS replica of map outputs >1 is not advised because it is more expensive both in terms of performance and storage. It is simply much better to re-run individual maps on failure rather than having to pay the expense on each and every job.

Other Features of Capacity Scheduler

Capacity Scheduler supports that notion of resource specifications for individual jobs – i.e. Capacity Scheduler is not limited to a one-slot-per-task paradigm. A user can request for multiple slots for each task by asking for more resources per task (specifically, memory). This is important when you have several users each running resource-intensive jobs. With this feature, the job and queue get “charged” against the queue for using more resources per task.

Another important feature is the extensive stabilization of Capacity Scheduler over the past several years. Given our team’s experience running Capacity Scheduler on very large, shared clusters, we have continually enhanced it to ensure that single jobs or single users cannot overwhelm a cluster either maliciously or because of a user bug. This is especially important because JobTracker is a single point of control. For example, in days past, we had a user error in which a crontab submitted the same job every second instead of every day. This overwhelmed the cluster with tens of thousands of jobs that were initialized, although not necessarily run. Still, initializing the jobs was consuming a large amount of JobTracker memory. Capacity Scheduler now has limits for initialized jobs/tasks, running jobs/tasks, etc. in order to prevent these and other potentially damaging scenarios.

If you find that you are continually growing your Hadoop cluster and adding new users, I strongly suggest that you consider using Capacity Scheduler.

~ Arun C. Murthy

Thinking about the HDFS vs. Other Storage Technologies

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS.  Here is why…

To compare HDFS to other technologies one must first ask the question, what is HDFS good at:

  • Extreme low cost per byte
    HDFS uses commodity direct attached storage and shares the cost of the network & computers it runs on with the MapReduce / compute layers of the Hadoop stack. HDFS is open source software, so that if an organization chooses, it can be used with zero licensing and support costs. This cost advantage lets organizations store and process orders of magnitude more data per dollar than tradition SAN or NAS systems, which is the price point of many of these other systems.  In big data deployments, the cost of storage often determines the viability of the system.
  • Very high bandwidth to support MapReduce workloads
    HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per computer into the map-reduce layer, on a very low cost shared network. Hadoop can go much faster on higher speed networks, but 10gigE, IB, SAN and other high-end technologies double the cost of a deployed cluster. These technologies are optional for HDFS.  2+ gigabits per second per computer may not sound like a lot, but this means that today’s large Hadoop clusters can easily read/write more than a terabyte of data per second continuously to the MapReduce layer.
  • Rock solid data reliability
    When deploying large distributed systems like Hadoop, the laws of probability are not on your side. Things will break every day, often in new and creative ways.  Devices will fail and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast. It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases.  The only way you can prove a system is reliable is to run it for years against a variety of production applications at full scale.  Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and governments.

How does the HDFS competition stack up?  
This is an article about Hadoop, so I’m not going to call out the other systems by name, but I assert that all of the systems listed in the “8 ways” article don’t compare well to Hadoop in one of the above dimensions. Let me list some of the failure modes:

  • System not designed for Hadoop’s scale
    Many systems simply don’t work at Hadoop scale. They haven’t been designed or proven to work with very large data or many commodity nodes. They often will not scale up to petabytes of data or thousands of nodes. If you have a small use-case and value other attributes, such as integration with existing apps in your enterprise, maybe this is a good trade-off, but something that works well in a 10 node test system may fail utterly as your system scales up. Other systems don’t scale operationally or rely on non-scalable hardware. Traditional NAS storage is a simple example of this problem. A NAS can replace Hadoop in a small cluster. But as the cluster scales up, cost and bandwidth issues come to the fore.
  • System that don’t use commodity hardware or open source software
    Many proprietary software / non-commodity hardware solutions are well tested and great at what they were designed to do. But, these solutions cost more than free software on commodity hardware. For small projects, this may be ok, but most activities have a finite budget and a system that allows much more data to be stored and used at the same cost often becomes the obvious choice. The disruptive cost advantage of Hadoop & HDFS is fundamental to the current success and growing popularity of the platform. Many Hadoop competitors simply don’t offer the same cost advantage.  Vendor price lists speak for themselves in this area (where the prices are even published).
  • Not designed for MapReduce’s I/O patterns
    Many of these systems are not designed from the ground up for Hadoop’s big sequential scans & writes.  Sometimes the limitation is in hardware. Sometimes it is in software. Systems that don’t organize their data for large reads cannot keep up with MapReduce’s data rates. Many databases and NoSql stores are simply not optimized for pumping data into MapReduce.
  • Unproven technology
    Hadoop is interesting because it is used in production at extreme scale in the most demanding big data use cases in the world. As a result thousands of issues have been identified and fixed. This represents several hundred person-centuries of software development investment. It is easy to design a novel alternative system. A paper, a prototype or even a history of success in a related domain or a small set of use cases does not prove that a system is ready to take on Hadoop. Tellingly, along with listing some new and interesting systems, the “8 ways” article says goodbye to some systems that have previously been considered HDFS contenders by vocal advocates. I’ve got a rolodex full of folks who used to work on such systems who are now major players in the Apache Hadoop community.

It is easy to find example use cases where some other storage system is a better choice than Hadoop. But I assert that HDFS is the best system available today to do exactly what it was built for, being Hadoop’s storage system. It delivers rock solid data reliability and very high sequential read/write bandwidth, at the lowest cost possible. As a result, HDFS is, and I predict it will remain THE storage infrastructure for the vast majority of Hadoop clusters.

~E14

Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.

ElasticSearch makes search simple. ElasticSearch is built over Lucene and provides a simple but rich JSON over HTTP query interface to search clusters of one or one hundred machies. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr Recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.

Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available in Avro format here: https://s3.amazonaws.com/rjurney.public/enron.avro. Lets check them out:

Read More

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro.The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive.Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Read More

High Availability and Hadoop 1.0 – Perfect Together

In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be initially buggy and incomplete functionally but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past 3 years and include such things as:

  1. Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
  2. New HDFS pipe-line to support append and flush,
  3. HDFS federation and performance improvements that enable Hadoop to scale to 1000’s more nodes in a cluster, and
  4. High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.

In the case of high availability (HA), it can take many months or years to get these types of solutions rock solid. While Hadoop 2.0 contains important HA-related features such as HDFS hot standby, we want to make sure we give it time to complete its community release process and allow extra time after that for bugs to be found and fixed to harden it for broad enterprise production use.

Read More

Hortonworks Data Platform v1.0 Download Now Available

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

Read More

Introducing Hortonworks Data Platform v1.0

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore, Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering into, and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step forward for achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Read More

Balancing Community Innovation and Enterprise Stability

Having worked at JBoss and Red Hat from 2004 to 2008 and SpringSource and VMware from 2008 to 2011, I’ve been focused on the world of open source software for a long while. I’ve been blessed to be able to serve enterprise customer needs with high quality open source software such as JBoss Application Server, Hibernate, Drools, Apache Web Server, Apache Tomcat, Spring … and now Apache Hadoop.

As specific open source technologies mature and their use becomes mainstream, it becomes increasingly important to understand and communicate the balancing act that needs to happen between community innovation and enterprise stability.

Community innovation needs to have a fast pace, where “ship early and often” is a key tenet.  Open source projects need to visibly improve and keep innovating if they are to attract a vibrant following. As the open source project’s community grows, they will expect big improvements and will be fine with early, buggy releases, etc. After all, that’s part of the process

Read More

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

Read More

Apache Hadoop 2.0 (Alpha) Released

As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

Read More

The Data Lifecycle, Part One: Avroizing the Enron Emails

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available on here on github.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

Read More

Executive Video Series: Introduction to HDFS

The latest video in the Hortonworks Executive Video Series features Sanjay Radia, Hortonworks co-founder and Apache Hadoop PMC member. Sanjay is well known in the HDFS circles, having contributed to Hadoop for the past 4+ years. In this video, Sanjay gives a good overview of HDFS, the primary storage system for Hadoop, and provides some insight into both the 0.23 release as well as what can be expected from HDFS over the rest of 2012. He hits on some key elements such as federation, snapshots and improving the overall storage efficiency of HDFS.

If you would like to learn more about HDFS and where it is heading, make sure to attend Hadoop Summit next month in San Jose. At the conference, Sanjay will be presenting HDFS – What is New and Future together with Suresh Srinivas of Hortonworks, as well as Apache Hadoop and Virtual Machines together with Richard McDougall of VMWare.

Read More

Go to page:« First...56789...Last »