3 Reasons Developers Should Learn Apache Hive

As the original architect of MapReduce, I’ve been fortunate to see Apache Hadoop and its ecosystem projects grow by leaps and bounds over the past seven years.

Today, most of my time is spent as an architect and committer on Apache Hive. Hive is the gateway for doing advanced work on the Hadoop Distributed File System (HDFS) and the MapReduce framework. We are on the verge of releasing major improvements to Apache Hive, in coordination with work going on in Apache Tez and YARN.

Here are three reasons why there has never been a more exciting time for talented developers to learn Apache Hive.

Reason #1: Apache Hadoop is the future of enterprise data management.

And Apache Hive is the gateway for business intelligence and visualization tools integrated with Apache Hadoop.

Learning Apache Hive puts developers on the path to innovate on new data architecture projects and new business applications. I think it’s always more exciting to write code for something that’s ramping up, rather than for maintaining mature systems.

Reason #2: Learning Hive is easy for those who already know SQL.

Facebook originally created Hive because they had a pressing need to analyze their petabytes of data at Internet scale. But they did not have enough time to teach all of their data analysts to write Java programs that would kick off MapReduce jobs.

Their analysts already knew how to write SQL queries, so Facebook created Hive as a tool those analysts could use with their existing SQL skills. After Facebook contributed their code to the Apache Software Foundation, the open community continued developing Hive along these same lines. So the same is true today as it was in the beginning: developers already familiar with SQL can learn Hive quickly and then take part in all of the new opportunities promised by Hadoop v2.0.
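To make that contrast concrete, here is a minimal sketch in Python. It does not use Hive or Hadoop: SQLite stands in for the SQL engine, and the `page_views` table, the sample events, and both function names are hypothetical, invented only for illustration. The first function counts page views by hand in the map, shuffle, and reduce style; the second gets the same answer from a single `GROUP BY` query, which is the kind of statement an analyst who knows SQL would expect to write.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, page) view events.
EVENTS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
    ("bob", "/cart"),
    ("alice", "/home"),
]


def count_with_map_reduce(events):
    """Count views per page by hand, in the map/shuffle/reduce style."""
    # Map: emit a (page, 1) pair for every event.
    mapped = [(page, 1) for _user, page in events]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for page, one in mapped:
        grouped[page].append(one)
    # Reduce: sum the values for each key.
    return {page: sum(ones) for page, ones in grouped.items()}


def count_with_sql(events):
    """Count views per page with one declarative SQL query.

    SQLite is only a stand-in here; this is not a Hive connection.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", events)
    rows = conn.execute(
        "SELECT page, COUNT(*) FROM page_views GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(count_with_map_reduce(EVENTS))
    print(count_with_sql(EVENTS))
```

Both functions return the same per-page counts. The point of the sketch is only the difference in effort: the hand-written version spells out every stage, while the SQL version states the result it wants and leaves the execution plan to the engine.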

Reason #3: Hive is about to get a lot faster (thanks to the Stinger Initiative and Apache Tez)

Apache Tez generalizes the MapReduce framework so that Apache Pig and Hive can meet demands for faster response times. At Hortonworks, we launched the Stinger Initiative to improve Hive performance by 100x and to bring Hive's query language closer to familiar SQL semantics.

We have completed Phase 1 of Stinger with impressive results. Phase 2 of Stinger is underway, and we expect to push performance on several types of Hive queries across the 100x threshold.

As Hive performance improves, the number of Hive use cases will grow (along with the number of opportunities for engineers who understand Hive).

This is the latest in our series of quick interviews with Apache Hadoop project committers at Hortonworks.

Learn more about Hive here or at the Apache Hive project site.

