It’s amazing how much Apache Hadoop and its extended ecosystem have grown in the last 10 years. I read through Owen’s “Ten Years of Herding Elephants” blog and downloaded the early Docker image of his first patch. It reminded me of the days it took me to do my first Hadoop install and the effort it took to learn the Java MapReduce basics just to understand the infamous WordCount example. How far have we come? Let’s go through a basic Hadoop tutorial on the Sandbox today to see all the progress we’ve made.
In addition, let’s take a closer look at the six labs in the intro-to-Hadoop tutorial and reflect on how things have changed. The six labs in the tutorial are:
Before I begin going through the labs, I have to give a special shout-out to the Sandbox. The Sandbox provides an easy way for users to get started with a distributed data platform on their personal machines via virtual machines or the cloud. It provides a pre-installed, pre-configured, and optimized single-node Hadoop environment that stays current with the evolving ecosystem.
After installing and setting up Hadoop, the next challenge developers tend to run into is getting data into Hadoop.
In the beginning, Hadoop was a collection of services with no user interface. When you wanted to load data into Hadoop, you had to go to the command line and learn the Hadoop and HDFS shell commands.
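For reference, loading data from the command line looked something like this. These are standard `hdfs dfs` subcommands, but the paths and file names are hypothetical, and the commands assume a running HDFS cluster:

```shell
# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /user/hadoop/geolocation
hdfs dfs -put geolocation.csv /user/hadoop/geolocation/

# List the directory and peek at the data
hdfs dfs -ls /user/hadoop/geolocation
hdfs dfs -cat /user/hadoop/geolocation/geolocation.csv | head
```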
The Ambari Files View allows users to explore and manipulate data in the Hadoop Distributed File System (HDFS). You can easily perform many common operations, like loading data, creating and removing directories, and moving data.
After you get some data into Hadoop, the next thing a developer might try is to explore the data.
In the early Hadoop days, your only option for data processing was writing a Java MapReduce program. MapReduce is a powerful tool that opened up many new opportunities for data processing, but only a limited number of developers knew it.
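The model itself is easy to sketch, even if the early Java API was not. Here is the classic WordCount expressed as plain Python functions — a minimal illustration of the map, shuffle, and reduce phases, not Hadoop’s actual API:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

The point of the framework was never the per-word arithmetic — it was running those three phases reliably across many machines and many terabytes.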
The lingua franca for data processing is SQL, and Facebook led the charge in bringing SQL access to Hadoop with Hive. If you are a SQL person trying to learn Hadoop, you can leverage your SQL skills to start exploring data. Additionally, you can use the Ambari Hive View to write queries, see results, and tune them.
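A Hive query looks just like the SQL those developers already knew — for example (the table and column names here are hypothetical):

```sql
-- Count events per driver from a hypothetical truck_events table
SELECT driverid, COUNT(*) AS events
FROM truck_events
GROUP BY driverid
ORDER BY events DESC
LIMIT 10;
```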
Getting your SQL query to run is only the first step; the next challenge is getting your Hive query to perform well. We now have Tez as an interactive execution engine and the Ambari Tez View to visually inspect the explain plan for a SQL statement.
Another big advancement is having smart UIs for making configuration changes. Previously, if you wanted to make a change to Hive, you needed to edit the hive-site.xml file. Today, you can set configuration visually via Ambari, which provides recommendations and value ranges. If you make a change, it can alert you of dependencies, and it keeps track of versions so you can revert if there is an issue.
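For context, a hand-edited hive-site.xml entry looked something like this. The property name is a real Hive setting; the value and description are illustrative:

```xml
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
  <description>Execution engine for Hive: mr, tez, or spark.</description>
</property>
```

Multiply this by hundreds of properties across a dozen XML files, and it is clear why Ambari’s recommendations and version tracking matter.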
For anything you wanted to do with Hadoop (query, transform, aggregate, …), there was only one answer: write yet another MapReduce job. SQL is great, but for data transformation and processing it can have its limitations.
Pig provides a powerful, flexible and simple scripting language to transform and query data.
When Pig was first released, your interface was, yet again, the command line.
You can quickly develop and run Pig scripts with the Ambari Pig View.
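A short Pig Latin script gives a feel for the language — loading, grouping, and counting in a few declarative lines (the file path and field names here are hypothetical):

```pig
-- Load a hypothetical tab-delimited log and count events per level
logs    = LOAD '/user/hadoop/logs.tsv' USING PigStorage('\t')
          AS (ts:chararray, level:chararray, msg:chararray);
grouped = GROUP logs BY level;
counts  = FOREACH grouped GENERATE group, COUNT(logs);
DUMP counts;
```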
Good luck integrating with Java, Python, and other languages.
If you wanted to use another programming language with Hadoop, you needed the Hadoop Streaming utility.
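Streaming let any language play mapper or reducer by speaking lines on stdin and stdout, with a tab separating key and value. Here is a simplified sketch of that protocol in Python — a real job would read `sys.stdin` and be launched via the hadoop-streaming jar, which also handles the sort between the two phases:

```python
def mapper(lines):
    # A Streaming mapper writes "word\t1" lines to stdout
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # Streaming delivers mapper output sorted by key; sum runs of equal keys
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

mapped = sorted(mapper(["the quick fox", "the dog"]))
print(list(reducer(mapped)))  # → ['dog\t1', 'fox\t1', 'quick\t1', 'the\t2']
```

It worked, but every job carried this plumbing by hand — which is exactly the friction the richer APIs below removed.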
YARN enables a rich collection of data access engines to process data in Hadoop. Spark provides fast in-memory processing via Java, Scala, Python, SQL, and R APIs. These days, Spark comes out of the box with the Sandbox, and you can work with your favorite IDE, the shell, or Zeppelin.
Once you have your data in HDFS and have chosen your data processing tool, you want to communicate back your insight via a report.
With early Hadoop, you had to export or FTP a subset of your data to a reporting tool, since there was no remote access to the cluster.
Now there is a rich collection of drivers to connect your favorite reporting tool to Hadoop via ODBC or JDBC.
There is a growing ecosystem of tools that are providing development, reporting, visualization and dashboarding capabilities natively with Hadoop. For example, you can use Zeppelin as a web notebook to explore your data with a growing collection of interpreters for Spark, Hive, Flink and other data sources.
Comparing the intro Hadoop tutorial to Owen’s Docker image of his first patch only scratches the surface of the many other technologies and capabilities that are part of the Hadoop ecosystem. All the work to make Hadoop enterprise-ready with security (Ranger, Knox), governance (Falcon, Atlas), and operations (Ambari, Oozie, ZooKeeper) has been fundamental to its adoption. A discussion of how far we have come would not be complete without talking about MAPREDUCE-279 and the introduction of YARN. YARN converted Hadoop into a multitenant platform with a rich ecosystem of data access engines.
If you are wondering what’s coming in the next 10 years, check out Arun’s keynote at ApacheCon 2015 on the Destiny of Data.