newsletter

Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

AVAILABLE NEWSLETTERS:

Sign up for the Developers Newsletter

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

cta

Get Started

cloud

Ready to Get Started?

Download sandbox

How can we help you?

* I understand I can unsubscribe at any time. I also acknowledge the additional information found in Hortonworks Privacy Policy.
closeClose button
September 20, 2018
prev slideNext slide

Upgrading your clusters and workloads from Apache Hadoop 2 to Apache Hadoop 3

If you’re interested in learning more, go to our recap blog here!

Introduction

The Apache Hadoop community announced Hadoop 3.0 GA in December, 2017 and Hadoop 3.1 in April, 2018 loaded with great features and improvements.

One of the biggest challenges while upgrading to a new major release of a software platform is its compatibility. Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.

There are some challenges to be addressed by admins while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should ideally be able to seamlessly run or migrate their workloads onto Hadoop 3. This blog talks about the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and workload migration guide for users of Hadoop.

Why upgrade to Hadoop 3?

Hadoop 3 has been an eagerly awaited major release with great features and improvements in both YARN and HDFS. Below,  we have highlighted only a few of the newly added features. There are tons of exciting features and improvements that were part of Hadoop 3 which will delight customers and increase Hadoop adoption without costing significantly more.

HDFS

HDFS introduced Erasure coding in Hadoop 3 which enables significant storage cost savings and reduction in storage overhead from 200% to 50%.

HDFS Intra-Data Node disk Balancer is a new feature added in Hadoop 3 which adds the capability to balance data across disks within a single Datanode.

HDFS Federation with support for namespaces for better scalability is GA in Hadoop 3.

YARN

YARN scheduler added a bunch of improvements particularly Capacity Scheduler with changes to scheduler’s core for faster(global) scheduling and native support for new resource types like GPUs.

Support for Containerized applications on YARN ( via Docker) is GA in Hadoop 3 along with support for long running services and service discovery.

Express or Rolling Upgrades?

Admins are recommended to Express Upgrade clusters from Hadoop 2 to 3

Why not rolling upgrades ?

With this being a major version upgrade, there are few challenges and issues in supporting Rolling Upgrades. Some of the open issues are

  • HDFS Rolling Upgrade has issues with NN restarts due to change in edit log format  HDFS-13596
  • HDFS-6440 Incompatible changes in image transfer protocol
  • Hadoop daemons configured with a custom MetricsPlugin sink fail to start after upgrade  HADOOP-15502 (API In-compatibility)

Compatibility with Hadoop 2

Wire compatibility

  • Hadoop 3 preserves wire compatibility with Hadoop 2 clients
  • Distcp/WebHDFS compatibility is preserved

API compatibility

Hadoop 3 doesn’t preserve full API level compatibility due to the following changes

  • Classpath – Dependency version bumps like guava
  • Removal of deprecated APIs and tools
  • Shell script rewrites
  • Incompatible bug fixes

Major Configuration/Script changes in Hadoop 3

Some of the major configuration/script changes introduced in Hadoop 3 are listed below

  • Change in Default Daemon Ports for HDFS HDFS-9427
  • Hadoop Script rewrites HADOOP-9902
  • Configuring Hadoop Daemon Heap Size HADOOP-10950
    • HADOOP_HEAPSIZE deprecated in favour of HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN
    • Auto-tuning based on memory of the host
  • RM Max Completed Applications in State Store/Memory
    • Defaults reduced from 10000 to 1000

For more details on changes in Hadoop 3 and how they would affect upgrades or workloads post upgrade, refer to the slides for the talk we gave recently at Dataworks Summit

Recommended Versions for Upgrade

Recommend users to upgrade to at least Hadoop 2.8.4 before migrating to Hadoop 3.

Upgrade Process

Environment

Java

Hadoop 3 requires Java 8 and above to be deployed on the cluster

Docker

Minimum supported Docker version on Hadoop 3 is 1.12.5

Shell

Minimum version required is Bash V3. POSIX shells are NOT supported

New Service Daemons post upgrade

Hadoop 3 requires additional YARN components in the cluster for running YARN Service applications

  • YARN Registry DNS
  • YARN Timeline Service V2.0 Reader

Apache Hadoop Upgrade

Apache Hadoop upgrade is fairly involved with elaborate steps for pre-upgrade, upgrade process and post-upgrade. Please refer to the slides for the talk we gave recently at Dataworks Summit for further details.

Migrating Workloads

MapReduce applications

MapReduce is fully binary compatible and workloads should run as is without any changes required.

Slider apps

Apache Slider is retiring from Apache Incubator and users are required to port their Slider applications to YARN Services.  Please refer to our recent blog on Apache Hive LLAP as a YARN Service on how easy it is to port Slider apps to YARN Services.

Benefits of YARN Services

  • Easier to manage and deploy
  • Single yarnfile to configure a YARN Service
  • Supports container placement scheduling such as affinity and anti-affinity YARN-6592
  • Rolling upgrades for containers and service YARN-7512 and YARN-4726.
  • Services UI in YARN UI2 for viewing and submitting YARN Service Apps

Versions of Apache projects which support Hadoop 3.x

The following Apache versions already support Hadoop 3 or are in the process of supporting Hadoop 3 in the community.

Conclusion

Express upgrades are the recommended way to go for upgrading from Apache Hadoop 2 to Hadoop 3.

Apache Hadoop community recently released Hadoop 3.1.1 with some features and fixes for bugs identified since Apache Hadoop 3.1.0.  Recommend users to upgrade to Apache Hadoop 3.1.1 from Apache Hadoop 2.8.4.  For end users, applications on HDFS and YARN should run mostly as is !

References

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums