Stinger.next: Enterprise SQL at Hadoop Scale
Apache Hive is the de facto standard for SQL-in-Hadoop with more enterprises relying on this open source project than any alternative. The Stinger.next initiative is a broad, community-based effort to drive the future of Apache Hive, delivering true enterprise SQL at Hadoop scale.
In April, the final phase of the initial Stinger Initiative was delivered on schedule. This bought interactive SQL query to Apache Hive, advancing Hadoop’s SQL capabilities at petabyte scale in pure open source. Over 13 months, 145 developers from 44 companies delivered contributed over 390,000 lines of code to the Hive project alone. This work has already had tremendous impact for the Hadoop ecosystem
Stinger.next is a continuation of this initiative to further speed, scale and SQL in Hive across a familiar and attainable three-phase delivery schedule all in the open Apache Hive community, with these three objectives:
Hive has always been the defacto standard for SQL in Hadoop and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios. Explicitly, some of the key deliverables that will enable these new business applications of Hive include:
- Transactions with ACID semantics allow users to easily modify data with inserts, updates and deletes. They extend Hive from the traditional write-once, and read-often system to support analytics over changing data. This enables reporting with occasional corrections and modifications and allows operational reporting with periodic bulk updates from an operational database.
- Sub-second queries will allow users to deploy Hive for interactive dashboards and explorative analytics that have more demanding response-time requirements.
- SQL:2011 Analytics allows rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL. A powerful cost based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
In addition to these primary use cases, some additional enhancements include:
- Hive Streaming Ingest helps Hive users expand operational reporting on the latest data.
- Hive Cross-Geo Query allows users to query and report on datasets distributed across geography due to legal or efficiency constraints. Users are currently are unable to do this or write application code to stitch together multiple results.
- Materialized views allow storing multiple views of the same data allowing faster analyses. The views can be held speculatively in-memory and discarded when memory is needed.
- Usability improvements will help users work more simply with Hive.
- Simplified deployment will focus on providing near plug and play deployment solutions for the most common use cases.
Stinger.next will be delivered at a rapid pace over the next 18 months. Transactions will release in late 2014. Sub-second queries are coming in the first half of 2015, with a preview in the next few months. An initial outline of the delivery is to the right. We expect this work to be completed as the initial work was, in scope and on schedule.
We look forward to these next steps but a lot has been completed already. below we outline the delivery of the original Stinger Initiative!
Original Stinger Initiative: Delivered
The Stinger initiative outlined three phases, and the Apache community delivered each on schedule. During Stinger’s thirteen months of continuous community collaboration 145 developers from 44 companies closed 1,672 Jira and added 392,000 lines of Java code. In just over one year, Stinger delivered speed, scale and SQL semantics.
In the first phase of delivery, with HDP 1.3, we saw:
- Performance improvements of 35x-45x for common analytical queries and
- Introduction of SQL windowing functions such as Rank, Lead, Lag, etc.
- Introduction of the ORCFile format
The release of HDP 2.0 marked the second major milestone of Stinger based improvements for Hive, introducing:
- A preview of the vectorized query engine, jointly developed with Microsoft and other community contributors, that speeds all types of queries, adding another 5x-10x improvement.
- Simplified SQL interoperability through the new VARCHAR and DATE datatypes and
- A new query optimizer that speeds complex queries by several factors.
Phase 3, delivered with HDP 2.1, completed the Stinger Initiative on schedule with:
- Apache Hive on Apache Tez, with pre-launched hot containers
- Vectorized query execution
- ACID framework for managing dimension tables and other master data
- Support for additional SQL functions and operators
- Support for data access: HTTP, SSL, Kerberos authentication
The improvements to Apache Hive—through the Stinger Initiative—delivered order-of-magnitude improvements in query latency and pushed several types of queries past 100x faster than in Hive 0.10.
HDP 2.0 introduced several major new performance features that benefit both small reporting queries and deep analytical queries. Some of which are described in this table:
We looked at TPC-DS Query 27, a fairly simple reporting query, back in February 2013 and showed that some improvements to the Hive query planner led to massive performance benefits. HDP 2.0 brings incremental progress by introducing vectorized query, which makes the map stages far more efficient.
HDP 2.1 delivers Apache Hive 0.13 on Apache Tez. With Hive on Tez, users have the option of executing queries on Tez. Tez’s dataflow model on a DAG of nodes facilitates simpler, more efficient query plans, which translates to significant performance improvements.
Hive 0.13 also delivers vectorized query execution mode that performs CPU computations 5-10x faster, translating to a 2-3x improvement in query performance.
ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding. These improvements meant:
- Sustained Query Times. Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale.
- Smaller Footprint. Better encoding with ORCFile in Apache Hive 12 reduces resource requirements for a cluster.
This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data including strings, floating point and integer data.
We’ve already seen customers whose clusters were maxed out for storage move to ORCFile as a way to free up space while preserving complete compatibility with existing jobs.
Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process works seamlessly. Hive 0.12 built on Hive 0.11’s impressive compression ratios and delivered deep integration at the Hive and execution layers, which further accelerated queries even over larger datasets.
Our goal with SQL support is simple: Make Apache Hive a comprehensive and compliant SQL engine that meets enterprise needs. Hive 0.13 introduces the DECIMAL and CHAR datatypes. With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. The Apache Hive community extended SQL language to support grant and revoke on entities. Hive also now supports show roles, user privileges, and active privileges.
- Base Optimizations
- SQL Types
- SQL Analytic Functions
- ORCFile Modern File Format
- Advanced Optimizations
- SQL Types
- SQL Analytic Functions
- Performance Boosts via YARN
- Container Pre-Launch
- Container Re-Use
- Tez Integration
- In-Memory Cache
- Transactions with ACID, allowing insert, update & delete
- Temporary tables
- Cost Based Optimizer for star & bushy join queries
- Sub-second queries with LLAP
- Hive-Spark Machine Learning integration
- Operational reporting w/ Hive streaming ingest & transactions
- SQL:2011 Analytics
- Materialized views
- Cross-geo queries
- Workload management via YARN and LLAP integration