Announcing Apache Hive 0.13 and Completion of the Stinger Initiative!
The Apache Hive community has voted on and released version 0.13 today. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.
Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community-based initiative to drive the future of Apache Hive by delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and bring true interactive SQL query to Hadoop.
Ultimately, over 145 developers representing 44 companies from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.
The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. The Apache Hive team also coordinated the 0.13 release with the simultaneous release of Apache Tez 0.4, whose DAG-based execution speeds up Hive queries that run on Tez.
Speed & Scale
With the delivery of Hive on Tez, users now have the option of executing queries on Tez. Tez’s dataflow model, which represents a query as a DAG of nodes, allows simpler, more efficient query plans, which translate into significant performance improvements and interactive query on Hive and Hadoop.
Some of the techniques that account for the speedup are:
- Broadcast Joins – like MapJoin, but without the need to build a hash table on the client,
- Dynamic Partitioned Hash Joins – which distribute the small table based on the big table’s bucketing trait,
- Cardinality estimation-based decision on Join algorithm and parallelism, and
- Pre-launch of containers
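The idea behind a broadcast join can be sketched in a few lines of Python. This is a conceptual illustration only, not Hive or Tez code: the table contents and the function names `build_hash_table` and `probe_partition` are invented for the example. The small table is indexed once, that index is shared with every task, and each task streams its own partition of the big table against it.

```python
from collections import defaultdict

def build_hash_table(small_rows, key):
    """Index the small table by join key (built once, then shared with every task)."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table

def probe_partition(big_rows, hash_table, key):
    """Stream one partition of the big table and emit the joined rows."""
    for row in big_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}

# Hypothetical data: a small dimension table and two partitions of a large fact table.
stores = [{"store_id": 1, "city": "Austin"}, {"store_id": 2, "city": "Boston"}]
sales_partitions = [
    [{"store_id": 1, "amount": 10}, {"store_id": 3, "amount": 5}],
    [{"store_id": 2, "amount": 7}, {"store_id": 1, "amount": 2}],
]

broadcast = build_hash_table(stores, "store_id")
joined = [
    out
    for partition in sales_partitions
    for out in probe_partition(partition, broadcast, "store_id")
]
```

Because every task holds the full small table, the big table never has to be shuffled across the cluster, which is where most of the savings come from.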
Hive now has a vectorized query execution mode that performs CPU computations 5-10x faster, translating to a 2-3x improvement in query performance. Vectorized mode supports:
- All common SQL operators: Project, Filter, MapJoin, SMBJoin, and GroupBy.
- All common SQL functions: In, Case, Between, Comparators, String and Date.
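To give a feel for what vectorized execution means, here is a small Python sketch that contrasts row-at-a-time evaluation with batch-at-a-time evaluation over columns. It is an illustration of the general technique under simplifying assumptions, not Hive's implementation: the batch size, column names and function names are made up, and plain Python will not show the CPU gains that a tight loop over primitive arrays gets in the JVM.

```python
BATCH_SIZE = 4  # illustrative only; real engines use much larger batches

def row_at_a_time(rows, threshold):
    """Baseline: evaluate the filter and the projection one row at a time."""
    return [row["price"] * row["qty"] for row in rows if row["price"] > threshold]

def vectorized(price_col, qty_col, threshold):
    """Process column slices batch by batch, tracking survivors in a selection vector."""
    out = []
    for start in range(0, len(price_col), BATCH_SIZE):
        prices = price_col[start:start + BATCH_SIZE]
        qtys = qty_col[start:start + BATCH_SIZE]
        # Filter: one tight loop over a single column produces the selection vector.
        selected = [i for i, p in enumerate(prices) if p > threshold]
        # Project: a second tight loop touches only the selected positions.
        out.extend(prices[i] * qtys[i] for i in selected)
    return out

# Hypothetical data, stored both column-wise and row-wise for comparison.
price = [5, 12, 8, 20, 3, 15]
qty = [1, 2, 3, 1, 4, 2]
rows = [{"price": p, "qty": q} for p, q in zip(price, qty)]

assert vectorized(price, qty, 10) == row_at_a_time(rows, 10)
```

Both functions return the same result; the difference is that the vectorized version runs each operator over a whole batch of one column at a time, which is friendlier to CPU caches and pipelines.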
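Hive 0.13 also lays the groundwork for cost-based optimization: operator-level cardinality estimates already drive join algorithm selection and parallelism planning on Tez, while a broader cost-based planner is left for a future release.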
Hive 0.13 also includes these other speed improvements:
- Stats-based shortcuts for aggregate queries (e.g. min, max and count)
- Split elimination in ORC, using stripe stats
- Metastore partition pruning for more datatypes
- Faster plan serialization
- Faster MapJoins through a smaller hash table footprint
- Order-of-magnitude speedup in fetching column-level stats
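Two of the items above, stats-based shortcuts and split elimination, rest on the same idea: keep small summaries of the data and consult them before reading any rows. The Python sketch below shows that idea under stated assumptions. The `StripeStats` class and its fields are hypothetical stand-ins, not the ORC format's actual metadata layout, and the shortcut assumes the stored statistics are accurate and the column has no NULLs.

```python
from dataclasses import dataclass

@dataclass
class StripeStats:
    """Hypothetical per-stripe summary for a single column."""
    stripe_id: int
    min_value: int
    max_value: int
    row_count: int

def stripes_to_read(stats, lo, hi):
    """Keep only stripes whose [min, max] range overlaps the predicate lo <= col <= hi."""
    return [s.stripe_id for s in stats if s.max_value >= lo and s.min_value <= hi]

def answer_from_stats(stats):
    """Answer min, max and count for the column without scanning any rows."""
    return (
        min(s.min_value for s in stats),
        max(s.max_value for s in stats),
        sum(s.row_count for s in stats),
    )

# Made-up statistics for a file with three stripes.
stats = [
    StripeStats(0, 1, 100, 5000),
    StripeStats(1, 101, 250, 5000),
    StripeStats(2, 251, 400, 4200),
]

needed = stripes_to_read(stats, 120, 200)   # only stripe 1 can contain matches
summary = answer_from_stats(stats)          # (min, max, count) straight from metadata
```

A query filtering on values between 120 and 200 reads a single stripe instead of three, and a min, max or count query reads none at all.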
With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. We extended the SQL language to support GRANT and REVOKE on entities. Hive also now supports commands to show roles, user privileges, and active privileges. Version 0.13 has a revamped, pluggable authorization API, which plugs gaps in authorization checks.
Other features added in the SQL category include:
- Support for the DECIMAL and CHAR datatypes
- Unqualified joining conditions
- Standard-based Quoted Identifier behavior
- Common table expressions
- Sub-query for IN, NOT IN, EXISTS and NOT EXISTS (correlated and uncorrelated)
- Permanent functions
- JOIN conditions in the WHERE clause
The ongoing ACID work lays the groundwork for managing dimension tables and other master data, with the guarantee of consistent and repeatable reads. It also introduces transactions and support for streaming data into Hive. Hive 0.13 gives a preview of this functionality by allowing data to be streamed into Hive using Apache Flume, making data available for query within seconds.
Hive 0.13 adds many improvements to HiveServer2, HCatalog and JDBC access:
- HiveServer2
  - HTTP support
  - SSL support for both binary and HTTP (HTTPS)
  - Kerberos authentication over HTTP(S)
  - Support for HTTP(S) through a trusted proxy
- HCatalog parity for all Hive data types
- Reconciliation of HCatalog and Hive “INSERT INTO” semantics
- Support for JDBC job cancel
- Async execution
All of these Hive improvements mean that Hive 0.13 accepts a very large percentage of TPC-DS benchmark queries without rewrites.
Other Important Advances
Hive 0.13 introduces operator-level cardinality estimation, which lays the groundwork for cost-based query planning. It is already used for join algorithm selection and parallelism planning in Tez. Stay tuned for the introduction of a broader cost-based planner in a future release.
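As a rough illustration of how cardinality estimates can feed join selection and parallelism planning, consider the Python sketch below. Everything in it is an assumption made for the example: the function names, the selectivity figures, the broadcast threshold and the rows-per-task target are hypothetical and are not Hive's actual heuristics or configuration defaults.

```python
def estimate_filter_rows(input_rows, selectivity):
    """Estimate the rows leaving a filter operator from its input size and selectivity."""
    return int(input_rows * selectivity)

def choose_join(left_rows, right_rows, broadcast_limit=1_000_000):
    """Pick a join strategy from estimated row counts (the threshold is hypothetical)."""
    if min(left_rows, right_rows) <= broadcast_limit:
        return "broadcast"
    return "shuffle"

def choose_parallelism(estimated_rows, rows_per_task=500_000, max_tasks=100):
    """Size the number of tasks to the estimated data volume."""
    tasks = -(-estimated_rows // rows_per_task)  # ceiling division
    return max(1, min(max_tasks, tasks))

# Made-up inputs: a selective filter on a small table and a mild filter on a large one.
customers = estimate_filter_rows(1_600_000, 0.125)
orders = estimate_filter_rows(2_000_000_000, 0.5)

strategy = choose_join(customers, orders)
tasks = choose_parallelism(orders)
```

With these numbers the smaller side is estimated at 200,000 rows, so the sketch picks a broadcast join, and the large side is capped at the maximum task count. Better estimates lead directly to better choices, which is why operator-level estimation is a prerequisite for a full cost-based planner.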
The team also delivered:
- Mavenization of Hive
- A parallel test framework
- A new Hive wiki
- Support for Parquet file format
As the community continued to fix hundreds of bugs, we built a strong base for improving our team’s operational efficiency. We moved to builds based on Maven, which significantly increased developer productivity. The parallel test framework cut down the time to run Hive’s large test suite.
The pre-commit testing workflow takes away most of the onerous work of validating new JIRAs, and we now have a new wiki and a documentation protocol that ensure much better documentation of new features and behavior changes.
MANY THANKS to these contributors on the 0.13 release: Alan Gates, Amareshwari Sriramadasu, Anandha Ranganathan, Ashutosh Chauhan, Bing Li, Brock Noland, Carl Steinbach, Chaoyu Tang, Chinna Rao Lalam, Chris Drome, Chun Chen, Daniel Dai, Deepesh Khandelwal, Edward Capriolo, Eric Hanson, Eugene Koifman, Gopal Vijayaraghavan, Gunther Hagleitner, Hari Sankar Sivarama Subramaniyan, Jason Dere, Jitendra Nath Pandey, Justin Coffey, Karl Gierach, Kevin Wilfong, Killua Huang, Kostiantyn Kudriavtsev, Kousuke Saruta, Lefty Leverenz, Mark Grover, Maxim Bolotin, Mithun Radhakrishnan, Mohammad Kamrul Islam, Navis Ryu, Nick Dimiduk, Owen O’Malley, Prasad Mujumdar, Prasanth Jayachandran, Rajesh Balamohan, Remus Rusanu, Robert Roland, Sarvesh Sakalanaga, Satish Mittal, Sergey Shelukhin, Shanyu Zhao, Shivaraju Gowda, Shreepadma Venugopalan, Shuaishuai Nie, Steven Wong, Sun Rui, Sushanth Sowmyan, Swarnim Kulkarni, Szehon Ho, Teddy Choi, Teruyoshi Zenmyo, Thejas Nair, Thiruvel Thirumoolan, Timothy Chen, Tony Murphy, Travis Crawford, Vaibhav Gumashta, Venki Korukanti, Vikram Dixit, Viraj Bhat, Xiao Meng, Xuefu Zhang, Yi Tian, Yin Huai, Zhichun Wu and Zhiwen Sun.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.