Category Archives: MapReduce


Delivering on Hadoop .Next: Benchmarking Performance

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change that splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical MapReduce sense or a DAG of jobs. Thus, Hadoop becomes a general-purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, graph processing and iterative processing.
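
To make the division of labor in the second bullet concrete, here is a minimal toy sketch in Python. It is not the Hadoop API: the class names mirror the ResourceManager and ApplicationMaster roles described above, but every method, field and the "container" counter are assumptions invented for illustration. The point is only that the global component hands out capacity, while each application schedules and monitors its own tasks.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy global daemon: it only tracks and hands out cluster capacity."""
    total_containers: int
    allocated: dict = field(default_factory=dict)  # app_id -> containers held

    def free(self) -> int:
        # Capacity not currently held by any application.
        return self.total_containers - sum(self.allocated.values())

    def allocate(self, app_id: str, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free())
        self.allocated[app_id] = self.allocated.get(app_id, 0) + granted
        return granted

    def release(self, app_id: str) -> None:
        # Return everything the application holds to the pool.
        self.allocated.pop(app_id, None)


@dataclass
class ApplicationMaster:
    """Toy per-application daemon: it schedules and monitors its own tasks."""
    app_id: str
    pending_tasks: list
    completed: list = field(default_factory=list)

    def run(self, rm: ResourceManager) -> None:
        while self.pending_tasks:
            granted = rm.allocate(self.app_id, len(self.pending_tasks))
            if granted == 0:
                raise RuntimeError("no capacity available in this toy model")
            # Run one batch of tasks on the granted containers.
            batch = self.pending_tasks[:granted]
            self.pending_tasks = self.pending_tasks[granted:]
            self.completed.extend(batch)  # pretend each task succeeded
            rm.release(self.app_id)


rm = ResourceManager(total_containers=2)
am = ApplicationMaster(app_id="job-1", pending_tasks=["m0", "m1", "m2"])
am.run(rm)
```

In this sketch the ResourceManager never looks at individual tasks, which is the separation the overhaul is aiming for: a different kind of application (an MPI run, say) could bring its own ApplicationMaster and reuse the same capacity pool.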


Delivering the Next Generation of Apache Hadoop

Today we announced our plans to release a public preview of the Hortonworks Data Platform (HDP) version 2. HDP v2 will leverage Apache Hadoop 0.23, which is the first major update to Hadoop in more than three years. Among other advancements, HDP v2 will include the NextGen MapReduce architecture, HDFS NameNode HA and HDFS Federation. It will also include the most up-to-date stable components, including HCatalog, HBase, Hive and Pig, all fully integrated and tested at scale.

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop and is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available in the second half of 2012.


Apache Hadoop Reaches Milestone: Release 1.0.0

Congratulations! The Hadoop Community has given itself a big holiday present: Release 1.0.0! This release has been six years in the making, and has involved:

  • Hard work and cooperation from dozens of software developers and contributors from across the industry, including of course Doug Cutting and Mike Cafarella’s early work in Nutch and the founding Hadoop team at Yahoo!, Doug, Owen O’Malley and many others, with leadership from Eric14. Special thanks to all the Hadoop committers.
  • Commitment to stability, joined with testing and indispensable production experience at scale, at industry-leading companies like Yahoo!, Facebook, LinkedIn, and others, including hundreds of millions of compute-hours and exabytes of data processed.
  • Feedback from hundreds of knowledgeable users, data scientists, systems engineers and architects.
  • Commitment to the philosophy and practice of open source from Google, who published their seminal papers and have long supported Apache.
  • The Apache Software Foundation, which provided a structured home for the growth of the ecosystem and blossoming of multiple associated projects.


Apache Hadoop Meets Informatica Data Parsing

As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale processing of data. We want users and organizations to spend their time on analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard.

For those new to Apache Hadoop, MapReduce is a parallel computing framework for processing large volumes of data. It deals with the four V’s of big data (as Forrester described) that present challenges to existing data systems, namely: volume, velocity, variety and variability. Together with the Hadoop Distributed File System (HDFS) and a handful of other important Apache Hadoop projects, it provides a massively scalable and highly reliable platform for storing, processing, managing and ultimately analyzing the ever-increasing data coming not only from transactional systems but also from unstructured sources such as server logs, customer interaction records, social media updates, email, PDFs, CDRs and so forth.


Update on Apache Hadoop-0.23

There has been a lot of progress on hadoop-0.23. We’re continuing to crank through issues as we get ready to ship.

We are mostly past the initial challenges of moving our entire build infrastructure to Maven. Many thanks to Alejandro, Tom, Giri & Eric Yang for making it happen.

HDFS is nearly there:

  • HDFS Federation and client-side mount tables have been tested on ~300-node clusters with security on.
  • HDFS upgrades have been tested from 0.20.2xx.
  • Functional tests for HDFS are complete.


Preparing for Next Release of Apache Hadoop

We are glad to have branched for a hadoop-0.23 release. We have already talked about some of the significant enhancements coming in this release, such as HDFS Federation and NextGen MapReduce, and we are excited to begin stabilizing it. Please check out this presentation for more details.

As always, this is a community effort and we are very thankful for all the contributions from the Apache Hadoop community. As the Release Manager for Apache Hadoop 0.23, I look forward to getting a great release out of the door!


NextGen MapReduce Hits Apache Hadoop Mainline

We are very excited to announce that NextGen Apache Hadoop MapReduce is getting close. We just merged the code base to Apache Hadoop mainline and Arun is about to create a hadoop-0.23 branch to prepare for a release!

We’ve talked about NextGen Apache Hadoop MapReduce and its advantages. The drawbacks of current Apache Hadoop MapReduce are both old and well understood. The proposed architecture has been in the public domain for over 3 years now. The team started the work in August 2010, beginning with a prototype on which we iterated rapidly. This culminated in an initial check-in to Apache Hadoop SVN in March 2011. Since then we’ve done all development on the MR-279 branch in Apache and have run really hard to get NextGen Hadoop MapReduce ready. We hope to see it soon on *your* cluster.


Gone Viral: Next Generation of Apache Hadoop MapReduce

Hi Folks,

I’d like to congratulate Arun Murthy on his very popular Hadoop Summit talk. SlideShare.net reports that his presentation has gone viral. They originally promoted it as the most discussed SlideShare.net presentation on LinkedIn, and yesterday they promoted it as the most tweeted-about presentation. In both cases, the presentation was moved up to the front page.

Arun is a Hortonworks founder and MapReduce expert. His talk does a great job of highlighting some of the current limitations in MapReduce and then outlining the roadmap for improving areas such as scalability, high availability, cluster utilization and support for paradigms other than MapReduce.

