Stinger: Interactive Query for Apache Hive
Apache Hive is the de facto standard for SQL-in-Hadoop with more enterprises relying on this open source project than any alternative. The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics.
The Stinger Initiative outlines three phases. In the first phase of delivery we saw:
- Performance improvements of 35x-45x for common analytical queries and
- Introduction of SQL windowing functions such as Rank, Lead, Lag, etc.
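To make the windowing functions concrete, here is a minimal Python sketch of what RANK, LAG and LEAD compute over a partitioned, ordered set of rows. It illustrates the SQL semantics only and is not how Hive implements them; the function name `window_rank_lag_lead` and the sample `sales` rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def window_rank_lag_lead(rows, partition_key, order_key):
    """Annotate rows with RANK, LAG and LEAD, as in
    OVER (PARTITION BY partition_key ORDER BY order_key)."""
    out = []
    # Sort by partition first, then by the ordering column within it.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        part = list(group)
        rank = 0
        for i, row in enumerate(part):
            # RANK: tied rows share a rank; the next distinct value
            # takes its 1-based position, so gaps appear after ties.
            if i == 0 or row[order_key] != part[i - 1][order_key]:
                rank = i + 1
            out.append({
                **row,
                "rank": rank,
                # LAG / LEAD: value from the previous / next row in the
                # same partition, or None at the partition boundary.
                "lag": part[i - 1][order_key] if i > 0 else None,
                "lead": part[i + 1][order_key] if i + 1 < len(part) else None,
            })
    return out


# Hypothetical sample data.
sales = [
    {"store": "A", "amount": 10},
    {"store": "A", "amount": 30},
    {"store": "A", "amount": 30},
    {"store": "B", "amount": 5},
]
result = window_rank_lag_lead(sales, "store", "amount")
```

Each output row keeps its original columns and gains `rank`, `lag` and `lead`, which is the key difference from a GROUP BY: the rows are annotated rather than collapsed.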
The release of HDP 2.0 marked the second major milestone of Stinger-based improvements for Hive, introducing:
- A preview of the vectorized query engine, jointly developed with Microsoft and other community contributors, that speeds all types of queries, adding another 5x-10x improvement.
- Simplified SQL interoperability through the new VARCHAR and DATE datatypes and
- A new query optimizer that makes complex queries several times faster.
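The idea behind a vectorized query engine can be sketched in a few lines of Python: instead of evaluating a filter and an aggregate one row at a time, the engine works on batches of column values. This is a rough illustration under stated assumptions, not Hive's implementation. The batch size of 1024 and the names `to_column_batches` and `filter_sum_vectorized` are chosen for the example, and plain Python will not show the speedups a real engine gets from tight loops over primitive arrays.

```python
BATCH_SIZE = 1024  # illustrative batch size, an assumption for this sketch


def filter_sum_row_at_a_time(rows, threshold):
    """Baseline: evaluate the predicate and the aggregate per row."""
    total = 0
    for row in rows:
        if row["quantity"] > threshold:
            total += row["price"]
    return total


def to_column_batches(rows, batch_size=BATCH_SIZE):
    """Regroup rows into batches that hold one list per column."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            "quantity": [r["quantity"] for r in chunk],
            "price": [r["price"] for r in chunk],
        }


def filter_sum_vectorized(batches, threshold):
    """Same query, but each operator processes a whole column batch."""
    total = 0
    for batch in batches:
        # Filter operator: one pass over the quantity column that
        # records which positions in the batch pass the predicate.
        selected = [i for i, q in enumerate(batch["quantity"]) if q > threshold]
        # Aggregate operator: one pass over the selected prices.
        prices = batch["price"]
        total += sum(prices[i] for i in selected)
    return total


# Hypothetical data: 3000 rows, so three batches are produced.
rows = [{"quantity": i % 7, "price": i} for i in range(3000)]
row_total = filter_sum_row_at_a_time(rows, 3)
vec_total = filter_sum_vectorized(to_column_batches(rows), 3)
```

Both functions return the same answer; what changes is the shape of the inner loops, which is where a vectorized engine saves per-row overhead.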
While this represents great progress, in Phase 3 we will see Hive on Apache Tez. This will be released separately in beta form soon and will deliver order-of-magnitude improvements in query latency and push several types of queries past the 100x barrier.
Increasing Hive performance 100x remains the primary goal of the Stinger Initiative. The HDP 2.0 Beta introduces several major new performance features that benefit both small reporting queries and deep analytical queries. Some of these are described in this table:
We looked at TPC-DS Query 27, a fairly simple reporting query, back in February and showed that some improvements to the Hive query planner led to massive performance benefits. HDP 2 brings incremental progress by introducing vectorized query execution, which makes the map stages far more efficient. The next big boost for Query 27 will come with the upcoming Tez Beta, which unlocks true low latency on Hadoop. We believe Apache Tez will push this query and others like it past the 100x improvement mark.
Hive is not just for simple queries; it can also handle quite complex ones. TPC-DS Query 95 is an extremely complex query that includes a 3-way fact table join. This query benefited from our Hive 0.11 improvements, but not to the extent that simple star schema join queries like Query 27 did. HDP 2 and the upcoming Hive 0.12 introduce a query optimizer that benefits complex queries by generating more efficient MapReduce plans. Even at its fastest, Query 95 runs as 6 distinct MapReduce jobs, so the introduction of Tez will prove to be a massive boost here as well.
ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding. This focus on efficiency leads to some impressive compression ratios. This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings. This dataset contains randomly generated data including strings, floating point and integer data. These improvements mean:
- Sustained Query Times. Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale.
- Smaller Footprint. Better encoding with ORCFile in Apache Hive 0.12 reduces resource requirements for a cluster.
We’ve already seen customers whose clusters are maxed out from a storage perspective moving to ORCFile as a way to free up space while being 100% compatible with existing jobs.
Data stored in ORCFile can be read or written through HCatalog, so any Pig or MapReduce process can play along seamlessly. Hive 0.12 builds on these impressive compression ratios and delivers deep integration at the Hive and execution layers to accelerate queries, both for handling larger datasets and for achieving lower latencies.
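Two of the techniques behind ORCFile's compression, dictionary encoding for strings and run-length encoding, are easy to sketch in Python. The example below shows the general idea only. It does not reproduce ORCFile's actual on-disk format, and the function names and the sample `states` column are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]


def dictionary_encode(strings):
    """Store each distinct string once and replace values with small ids."""
    dictionary = sorted(set(strings))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]


def dictionary_decode(dictionary, ids):
    """Look each id up in the dictionary to recover the strings."""
    return [dictionary[i] for i in ids]


# Hypothetical string column with few distinct values and long runs.
states = ["CA", "CA", "CA", "NY", "NY", "CA", "TX", "TX", "TX", "TX"]
dictionary, ids = dictionary_encode(states)  # strings -> integer ids
runs = run_length_encode(ids)                # ids -> (id, count) runs
restored = dictionary_decode(dictionary, run_length_decode(runs))
```

Here ten strings reduce to a three-entry dictionary plus four runs, and decoding restores the column exactly. Columns with few distinct values or long repeated stretches benefit the most from this kind of encoding.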
Our goal with SQL support is simple: make Apache Hive a comprehensive and compliant SQL engine that meets enterprise-class needs. This round of Hive development introduces two critical new data types: VARCHAR, a very commonly used SQL type, and DATE, which is also very common and a natural choice for partitioning.
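Why DATE is a natural partitioning column can be shown with a small Python sketch: when rows are grouped by date, a query over a date range only has to read the partitions inside that range. This is a simplified model under stated assumptions rather than Hive's partitioning code. The helper names `partition_by_date` and `scan_date_range` and the `orders` rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_by_date(rows, date_column):
    """Group rows into one partition per distinct date value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[date_column]].append(row)
    return dict(partitions)


def scan_date_range(partitions, start, end):
    """Return rows with start <= date <= end and the number of
    partitions that had to be read to find them."""
    # Pruning: the range is compared against partition keys only, so
    # partitions outside it are skipped without touching their rows.
    wanted = [d for d in sorted(partitions) if start <= d <= end]
    rows = [row for d in wanted for row in partitions[d]]
    return rows, len(wanted)


# Hypothetical order data spanning two months.
orders = [
    {"order_date": date(2013, 9, 1), "total": 20},
    {"order_date": date(2013, 9, 1), "total": 35},
    {"order_date": date(2013, 9, 2), "total": 15},
    {"order_date": date(2013, 10, 5), "total": 50},
]
partitions = partition_by_date(orders, "order_date")
september, partitions_read = scan_date_range(
    partitions, date(2013, 9, 1), date(2013, 9, 30)
)
```

In this example the September query reads two of the three partitions and never looks at the October rows. Because dates compare and sort naturally, range filters map directly onto partition keys.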
Phase 1: Hive 0.11 (HDP 1.3)
- Base Optimizations
- SQL Types
- SQL Analytic Functions
- ORCFile Modern File Format
Phase 2: Hive 0.12 (HDP 2.0)
- Advanced Optimizations
- SQL Types
- SQL Analytic Functions
- Performance Boosts via YARN
Phase 3
- Hive on Apache Tez
- Query Service (always on)
- Buffer Cache
- Cost Based Optimizer
Latest Progress reports:
- 3 Reasons to try Stinger Phase 3 Technical Preview
- Delivering on Stinger: a Phase 3 Progress Update
- Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop
- ORCFile in HDP 2: Better Compression, Better Performance
Blog Series on Apache Tez
- Apache Tez: A New Chapter in Hadoop Data Processing
- Data Processing API in Apache Tez
- Runtime API in Apache Tez
- Writing a Tez Input/Processor/Output
- Apache Tez: Dynamic Graph Reconfiguration