As the original architect of MapReduce, I’ve been fortunate to see Apache Hadoop and its ecosystem projects grow by leaps and bounds over the past seven years.
Today, most of my time is spent as an architect and committer on Apache Hive. Hive is the gateway for doing advanced work on the Hadoop Distributed File System (HDFS) and the MapReduce framework. We are on the verge of releasing major improvements to Apache Hive, in coordination with work going on in Apache Tez and YARN.
Here are three reasons why there has never been a more exciting time for talented developers to learn Apache Hive.
Apache Hive is also the gateway for business intelligence and visualization tools integrated with Apache Hadoop.
Learning Apache Hive puts developers on the path to innovate on new data architecture projects and new business applications. I think it’s always more exciting to write code for something that’s ramping up than to maintain mature systems.
Facebook originally created Hive because they had a pressing need to analyze their petabytes of data at Internet scale. But they did not have enough time to teach all of their data analysts to write Java programs that would kick off MapReduce jobs.
Their analysts already knew how to write SQL queries, so Facebook created Hive as a tool those analysts could use with their existing SQL skills. After Facebook contributed their code to the Apache Software Foundation, the open community continued developing Hive along these same lines. So the same is true today as it was in the beginning: developers already familiar with SQL can learn Hive quickly and then take part in all of the new opportunities promised by Hadoop v2.0.
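To make the contrast concrete, here is a minimal Python sketch (not Hive code) of the map, shuffle, and reduce steps that a single GROUP BY query stands in for. The table name, column names, and sample rows are hypothetical, and Hive's real query planning is considerably more involved than this.

```python
from collections import defaultdict

# Hypothetical page-view records; names and values are illustrative only.
page_views = [
    {"country": "US", "page": "/home"},
    {"country": "IN", "page": "/home"},
    {"country": "US", "page": "/search"},
]

# The kind of query an analyst would write (kept as a string; not executed here).
QUERY = "SELECT country, COUNT(*) FROM page_views GROUP BY country"


def map_phase(records):
    """Emit a (key, 1) pair per record, keyed by the GROUP BY column."""
    for record in records:
        yield record["country"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(page_views)))
print(counts)
```

The one-line query carries the same intent as the three hand-written functions, which is the work an analyst avoids by staying in SQL.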
Apache Tez generalizes the MapReduce framework so that Apache Pig and Hive can meet demands for faster response times. At Hortonworks, we launched the Stinger Initiative to improve Hive performance by 100x and to make Hive SQL-compatible.
We have completed Phase 1 of Stinger with impressive results. Phase 2 of Stinger is underway, and we expect to push performance on several types of Hive queries across the 100x threshold.
As Hive performance improves, the number of Hive use cases will grow (along with the number of opportunities for engineers who understand Hive).
This is the latest in our series of quick interviews with Apache Hadoop project committers at Hortonworks.