Enterprise SQL at Hadoop Scale with Apache Hive

Extending community momentum to the next generation of SQL in Hadoop

In April of this year, Hortonworks, along with the broad Hadoop community, delivered the final phase of the Stinger Initiative on schedule, completing the work to bring interactive SQL query to Apache Hive. The original directive of Stinger was to advance SQL capabilities at petabyte scale in pure open source, and over 13 months, 145 developers from 44 companies delivered exactly that, contributing over 390,000 lines of code to the Hive project alone.

While this community collaboration has had a tremendously positive impact for data workers, business analysts and the many data center tools around Hadoop that rely on Hive for SQL in Hadoop, it was just the beginning.

Apache Hive and Enterprise SQL at Hadoop Scale

The Stinger Initiative enabled Hive to support an even broader range of use cases at truly Big Data scale, bringing it beyond its batch roots to support interactive queries – all with a common SQL access layer. Stinger.next is a continuation of this initiative, focused on further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive, while also bringing support for transactional capabilities. And just as the original Stinger Initiative did, this will be addressed through a familiar three-phase delivery schedule and developed completely in the open Apache Hive community.

Stinger.next Project Goals

  • Deliver sub-second query response times.
  • Provide the only SQL interface to Hadoop designed for queries that scale from gigabytes, to terabytes and petabytes.
  • Enable transactions and SQL:2011 Analytics for Hive.

Hive has always been the de facto standard for SQL in Hadoop, and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios. Explicitly, some of the key deliverables that will enable these new business applications of Hive include:

  • Transactions with ACID semantics allow users to easily modify data with inserts, updates and deletes. They extend Hive from its traditional write-once, read-often roots to support analytics over changing data. This enables reporting with occasional corrections and modifications, and allows operational reporting with periodic bulk updates from an operational database.
  • Sub-second queries will allow users to deploy Hive for interactive dashboards and explorative analytics that have more demanding response-time requirements.
  • SQL:2011 Analytics allows rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL. A powerful cost based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.

Transactions with ACID semantics in Hive

Hive has been used as a write-once, read-often system, where users add partitions of data and query that data often. ACID is a major shift in this paradigm, adding SQL transactions that allow users to insert, update and delete existing data. This enables a much wider set of use cases that require periodic modifications to existing data. ACID will add BEGIN, COMMIT and ROLLBACK for multi-statement transactions in future releases.
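As a rough sketch of what these single-statement ACID operations look like in HiveQL — table and column names here are purely illustrative, and the sketch assumes an ACID table's usual requirements (bucketed, ORC-backed, flagged transactional):

```sql
-- Illustrative ACID table: bucketed, stored as ORC, marked transactional.
CREATE TABLE customer_contacts (
  id    INT,
  name  STRING,
  email STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Single-statement transactions: each statement commits on its own.
INSERT INTO TABLE customer_contacts VALUES (1, 'Ada', 'ada@example.com');
UPDATE customer_contacts SET email = 'ada@new.example.com' WHERE id = 1;
DELETE FROM customer_contacts WHERE id = 1;
```

With multi-statement BEGIN/COMMIT/ROLLBACK still to come, each statement above is its own transaction.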


Sub-Second Queries with Hive LLAP

Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up the scale and flexibility that users depend on. This requires a new approach: a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process; #llap online).

LLAP is an optional daemon process running on multiple nodes that provides the following:

  • Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
  • Multi-threaded execution including reads with predicate pushdown and hash joins
  • High-throughput I/O using an async I/O elevator, with a dedicated thread and core per disk
  • Granular column level security across applications

YARN will provide workload management in LLAP by using delegation. Queries will bring information from YARN to LLAP about their authorized resource allocation. LLAP processes will then allocate additional resources to serve the query as instructed by YARN.

The hybrid engine approach provides fast response times through efficient in-memory data caching and low-latency processing, provided by node-resident processes. However, by limiting LLAP's use to the initial phases of query processing, Hive sidesteps the limitations around coordination, workload management and failure isolation that running the entire query within such a process, as other databases do, would introduce.
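To make the hybrid model concrete, a session might opt into LLAP with settings along these lines. The property names below follow those LLAP eventually shipped with and are assumptions relative to this announcement, not part of it:

```sql
-- Hypothetical session settings for the hybrid engine: route query
-- fragments to LLAP daemons where possible, with Tez containers
-- handling the rest.
SET hive.execution.engine=tez;
SET hive.execution.mode=llap;        -- send work to the LLAP daemons
SET hive.llap.execution.mode=auto;   -- let Hive pick llap vs. container per fragment
```

The "auto" mode reflects the hybrid design described above: LLAP serves the cache-friendly initial phases, while later phases run in ordinary Tez containers.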


Comprehensive SQL:2011 Analytics

Hive will support a subset of SQL:2011 Analytics, with new features added over multiple iterations, driven by customer demand. Hive is already much further along than other SQL options for Hadoop, with strong SQL support including:

  • Window Functions
  • Common Table Expressions
  • Common sub-queries – correlated and uncorrelated
  • Advanced UDFs
  • Rollup, Cube, and Standard Aggregates
  • Inner, outer, semi and cross joins

Upcoming work will extend this lead to cover most of the frequently used SQL constructs:

  • Non Equi-Joins
  • Set operations – UNION, EXCEPT and INTERSECT
  • Interval types
  • Most sub-queries, nested and otherwise
  • Fixes to syntactic differences from the SQL:2011 spec, such as ROLLUP
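As a flavor of the analytics support already in place, the following sketch combines a common table expression, a window function and ROLLUP in one report; the table and column names are hypothetical:

```sql
-- Illustrative analytics query: CTE + window function.
WITH regional_sales AS (
  SELECT region, product, SUM(amount) AS total
  FROM   sales
  GROUP  BY region, product
)
SELECT region,
       product,
       total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rnk
FROM   regional_sales;

-- ROLLUP adds per-region subtotal rows and a grand-total row.
SELECT region, product, SUM(amount) AS total
FROM   sales
GROUP  BY region, product WITH ROLLUP;
```

Queries like these are exactly what BI tools generate, which is why the cost-based optimizer matters as much as the syntax coverage.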

Integration with Machine Learning Frameworks

Hive–Spark machine learning integration will also allow Hive users to run machine learning models via Hive, so that predictive and descriptive analytics can both run in Hive against the same dataset.

Hive on Spark?

There is a lot of talk about Spark as a powerful engine running on YARN, and we at Hortonworks share that excitement and are working actively to make it enterprise-ready for Spark users. In fact, in order to integrate with Spark, the broad Hive community is making use of several of the infrastructure components already added to Hive as part of the Tez integration, which was delivered in Hive 0.13.

Some Additional Advances

In addition to these primary use cases, some additional enhancements include:

  • Hive Streaming Ingest helps Hive users expand operational reporting on the latest data.
  • Hive Cross-Geo Query allows users to query and report on datasets distributed across geographies due to legal or efficiency constraints. Today, users cannot do this without writing their own application code to stitch together multiple result sets.
  • Materialized views allow storing multiple views of the same data allowing faster analyses. The views can be held speculatively in-memory and discarded when memory is needed.
  • Usability improvements will help users work more simply with Hive.
  • Simplified deployment will focus on providing near plug and play deployment solutions for the most common use cases.
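For the streaming ingest case above, clients append continuously into an ACID table, which carries the same requirements sketched earlier (bucketed, ORC, transactional). The table below is a hypothetical example of such a streaming target:

```sql
-- Hypothetical target table for streaming ingest: clients write small
-- batches of rows that become visible to queries as they commit.
CREATE TABLE web_events (
  event_time TIMESTAMP,
  user_id    BIGINT,
  action     STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```

Because the writes ride on ACID, operational reports can query the table while ingest continues, without seeing partial batches.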

These capabilities will be delivered at a rapid pace over the next 18 months. Transactions will ship in late 2014. Sub-second queries are coming in the first half of 2015, with a preview in the next few months. We expect this work to be completed as the initial work was: in scope and on schedule.


Enthusiasm abounds

It is not just Hortonworks that is enthusiastic about this next phase in the delivery of Enterprise SQL at Hadoop Scale.  Some of our key partners have weighed in on their excitement as well.  Watch this space over the next few days as Microsoft, Informatica, Microstrategy and Tableau all weigh in on this important initiative.

And as always, we are excited to continue our work within the Hive community to extend Hive, the leading SQL on Hadoop solution, further in terms of speed, scale, and SQL semantics.

Hive delivers a message of simplicity. It already provides a single tool for all SQL across batch and interactive workloads, and with this initiative it is extended to near real-time. We’re enthusiastic about the upcoming journey as Hive adds exciting new features toward this goal. Watch this blog for future posts from Apache Hive committers and contributors from around the world, as they share enhancement ideas with the community.
