February 07, 2018

How Apache Hadoop 3 Adds Value Over Apache Hadoop 2

Thank you to Vinod Vavilapalli and Saumitra Buragohain for contributing to this blog.

This is the second post in the Hadoop blog series (part 1, part 3, part 4, part 5). In this post, we show how Apache Hadoop 3 adds value over Apache Hadoop 2: agility and faster time to market, lower total cost of ownership, better scalability and availability, and new use cases.

Everyone is asking: what is the difference between Apache Hadoop 3 and Apache Hadoop 2? What do all the commotion and ruckus mean? What is Hadoop 3 paving the way toward?

Where to start!  Hadoop 3 combines the efforts of hundreds of contributors over the last five years since Hadoop 2 launched. Several of these committers work at Hortonworks.

Let’s start with the top value propositions of Hadoop 3 and how they can help your organization.

Agility & Time to Market
Although Hadoop 2 uses containers, Hadoop 3 containerization brings the agility and package isolation of Docker. A container-based service makes it possible to build apps quickly and roll them out in minutes, which means faster time to market for services.

Total Cost of Ownership
Hadoop 2 has a lot more storage overhead than Hadoop 3. For example, in Hadoop 2, if there are 6 blocks and 3x replication of each block, the result will be 18 blocks of space.

With erasure coding in Hadoop 3, the same 6 blocks occupy 9 blocks of space (6 data blocks plus 3 parity blocks). Instead of the 3x hit on storage, erasure coding incurs a 1.5x footprint while maintaining the same level of data recoverability. Storage overhead drops from 200% to 50%, which halves the storage cost of HDFS while retaining data durability.
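The arithmetic behind these numbers can be checked with a short Python sketch. It is only an illustration of the block counts quoted above, not HDFS code: the function names are hypothetical, and it assumes a layout of 6 data blocks plus 3 parity blocks per group, counted in whole blocks.

```python
def replication_footprint(data_blocks: int, replicas: int = 3) -> int:
    """Blocks stored when every data block is kept `replicas` times."""
    return data_blocks * replicas


def erasure_coding_footprint(data_blocks: int, data_units: int = 6,
                             parity_units: int = 3) -> int:
    """Blocks stored when each group of `data_units` data blocks
    gets `parity_units` parity blocks (a simplified, whole-block model)."""
    # Ceiling division: in this model a partially filled group
    # still carries a full set of parity blocks.
    groups = -(-data_blocks // data_units)
    return data_blocks + groups * parity_units


def overhead_percent(total_blocks: int, data_blocks: int) -> float:
    """Extra storage beyond the raw data, as a percentage of the raw data."""
    return 100.0 * (total_blocks - data_blocks) / data_blocks


if __name__ == "__main__":
    data = 6
    replicated = replication_footprint(data)        # 18 blocks
    erasure_coded = erasure_coding_footprint(data)  # 9 blocks
    print(replicated, overhead_percent(replicated, data))        # 18 200.0
    print(erasure_coded, overhead_percent(erasure_coded, data))  # 9 50.0
```

For 6 data blocks, the sketch reproduces the figures in the text: 18 blocks (200% overhead) with 3x replication versus 9 blocks (50% overhead) with erasure coding.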

Scalability & Availability
Hadoop 1 and Hadoop 2 use only a single NameNode to manage all namespaces. Hadoop 3 supports multiple NameNodes for multiple namespaces through NameNode Federation, which improves scalability.

In Hadoop 2, there is only one standby NameNode.  Hadoop 3 supports multiple standby NameNodes. If one standby node goes down over the weekend, you have the benefit of other standby NameNodes so the cluster can continue to operate.  This feature gives you a longer servicing window.

Hadoop 2 uses an older timeline service that has scalability issues. Hadoop 3 brings Timeline Service v2, which improves the scalability and reliability of the timeline service.

New Use Cases
Hadoop 2 doesn’t support GPUs. Hadoop 3 enables scheduling of additional resource types, such as disks and GPUs, for better integration with containers, deep learning, and machine learning. This feature provides the basis for supporting GPUs in Hadoop clusters, which speeds up the computations required for data science and AI use cases.

Hadoop 2 cannot accommodate intra-node disk balancing; Hadoop 3 can. If you repurpose a server, or add new storage to an existing server that has older-capacity drives, disk usage within that server becomes uneven. With intra-node disk balancing, data is distributed evenly across the disks in each server.
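To make the idea concrete, here is a toy Python sketch of what "evenly distributed" can mean when disks have different capacities: every disk ends up at the same utilization ratio. This is an assumption-laden illustration, not the Hadoop implementation; the function name, the disk names, and the sizes are all hypothetical.

```python
def plan_intra_node_balance(disks):
    """Compute how much data each disk should give up or receive so that
    all disks in one server end at the same utilization ratio.

    `disks` maps a disk name to a (capacity_gb, used_gb) tuple.
    Returns disk name -> GB to move: positive means move data off the
    disk, negative means the disk should receive data.
    """
    total_capacity = sum(capacity for capacity, _ in disks.values())
    total_used = sum(used for _, used in disks.values())
    target_ratio = total_used / total_capacity  # same ratio for every disk
    return {
        name: used - capacity * target_ratio
        for name, (capacity, used) in disks.items()
    }


if __name__ == "__main__":
    # Two older, nearly full drives plus one newly added, empty drive.
    server = {
        "disk1": (2000, 1500),
        "disk2": (2000, 1500),
        "disk3": (8000, 0),
    }
    for name, delta in plan_intra_node_balance(server).items():
        print(name, delta)
```

In this example the overall utilization is 25%, so each old drive should shed 1000 GB and the new drive should take on 2000 GB. The moves always sum to zero because data is only shifted within the server.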

Hadoop 2 has only inter-queue preemption, that is, preemption across queues. Hadoop 3 introduces intra-queue preemption, which goes a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within a queue based on user limits and/or application priority.
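A simplified Python model of priority-based preemption within one queue might look like the following. It is a sketch under stated assumptions, not the scheduler's actual algorithm: the `App` class, the `select_preemptions` function, and the convention that a larger number means higher priority are all hypothetical, and user limits are ignored.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int    # assumption: larger number = more important
    containers: int  # containers the app currently holds


def select_preemptions(apps, needed, requester_priority):
    """Choose containers to reclaim from lower-priority apps in the same
    queue until `needed` containers are freed or no candidates remain.

    Returns app name -> number of containers to preempt.
    """
    plan = {}
    # Reclaim from the least important apps first.
    for app in sorted(apps, key=lambda a: a.priority):
        if needed == 0 or app.priority >= requester_priority:
            break  # demand met, or only equal/higher-priority apps left
        take = min(app.containers, needed)
        if take:
            plan[app.name] = take
            needed -= take
    return plan


if __name__ == "__main__":
    queue = [App("etl", 1, 4), App("report", 2, 3), App("adhoc", 5, 2)]
    # A priority-4 application in the same queue needs 5 containers.
    print(select_preemptions(queue, needed=5, requester_priority=4))
    # {'etl': 4, 'report': 1}
```

The higher-priority "adhoc" application is never touched; containers are taken only from applications ranked below the requester, starting with the lowest.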

In conclusion, we are very excited about the upcoming Hadoop 3 releases. The accelerated release schedule anticipated for this year will bring even more capabilities into the hands of users as soon as possible. If you look at the blog published last year, Data Lake 3.0: The Ez Button To Deploy In Minutes And Cut TCO By Half, you will see many of the Data Lake 3.0 architecture ideas and innovations from the Apache Hadoop community come to life in our next release of the Hortonworks Data Platform.

 


Comments

Syed Murtaza Saleem says:

How will existing users of Hadoop 2 leverage the advanced features of v3? It seems they have to set up a completely new environment (cluster) for Hadoop 3 and then migrate from Hadoop 2. Or will an upgrade do the job?

Saumitra Buragohain says:

The Hadoop 2 to Hadoop 3 upgrade will be a seamless in-place upgrade with no requirement for data migration. Data stored with 3 replicas in Hadoop 2 will be retained as 3 replicas in Hadoop 3. If users want to reduce storage overhead for cold data, they can selectively decide which folders to erasure code.

Sam says:

When is Hadoop 3.0 slated for release from Hortonworks? Is there a product roadmap that you can share for 3.x+?
I’m looking forward to leveraging multiple NameNodes for multiple namespaces to achieve better multi-tenancy isolation.

Roni says:

Thank you for your interest. Unfortunately, future release dates for HDP have not been made public yet. We’re glad you’re excited about using multiple NameNodes to help with multi-tenancy isolation.

onebox app says:

Great read, thank you so much for the wonderful and I am waiting for the HDP. Thank you.

Raj Kumar says:

Thanks for the update and for the introduction of Hadoop 3.
