The Apache Hive community has voted on and released version 0.13 today. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.
Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and bring true interactive SQL queries to Hadoop.
Ultimately, over 145 developers representing 44 companies from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.
The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. Additionally, the Apache Hive team coordinated this 0.13 release with the simultaneous release of Apache Tez 0.4. Tez’s DAG execution speeds up Hive queries that run on Tez.
With the delivery of Hive on Tez, users have the option of executing queries on Tez. Tez’s dataflow model on a DAG of nodes facilitates simpler, more efficient query plans, which translates to significant performance improvements and interactive query on Hive / Hadoop.
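To make the DAG idea concrete, here is a minimal Python sketch. It is not Tez's API, and the stage names are hypothetical. It only illustrates how a query plan expressed as a DAG can be scheduled as one job, with independent stages becoming ready together and each later stage starting as soon as its inputs are done.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_sales": set(),
    "scan_stores": set(),
    "join": {"scan_sales", "scan_stores"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def execution_waves(plan):
    """Group stages into waves; every stage in a wave has all inputs ready."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages with no unmet dependencies
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave finished
    return waves


print(execution_waves(plan))
# [['scan_sales', 'scan_stores'], ['join'], ['aggregate'], ['sort']]
```

In this toy plan the two scans are ready in the same wave, and the join, aggregate and sort stages follow within the same graph rather than as separately launched jobs.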
Some of the techniques that account for the speedup are:
Hive now has a vectorized query execution mode that performs CPU computations 5-10x faster, translating to a 2-3x improvement in query performance. Vectorized mode supports:
Hive 0.13 also lays the groundwork for a cost-based optimizer, which a future release can build on for optimizations such as join reordering.
Hive 0.13 also includes these other speed improvements:
With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. We extended the SQL language to support grant and revoke on entities. Hive also now supports showing roles, user privileges, and active privileges. Version 0.13 has a revamped, pluggable authorization API, which closes gaps in authorization checks.
Other features added in the SQL category include:
The ongoing ACID work lays the groundwork for managing dimension tables and other master data, with the guarantee of consistent and repeatable reads. It also introduces transactions and support for streaming data into Hive. Hive 0.13 gives a preview of this functionality by allowing data to be streamed into Hive using Apache Flume, making data available for query within seconds.
Hive 0.13 adds many improvements to HiveServer2, HCatalog and JDBC access:
All of these Hive improvements mean that Hive 0.13 accepts a very large percentage of TPC-DS benchmark queries without rewrites.
Hive 0.13 introduces operator-level cardinality estimation, which lays the groundwork for cost-based query planning. It is already used for join algorithm selection and parallelism planning in Tez. Stay tuned for the introduction of a broader cost-based planner in a future release.
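As a rough illustration of how cardinality estimates can drive join algorithm selection and parallelism, here is a small Python sketch. It is not Hive's implementation: the estimate is the textbook equi-join formula, and the names `TableStats`, `broadcast_row_limit` and `rows_per_reducer`, along with their threshold values, are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    rows: int            # estimated row count
    distinct_keys: int   # estimated number of distinct join-key values


def estimate_join_rows(left, right):
    """Textbook equi-join estimate: |L| * |R| / max(ndv(L), ndv(R))."""
    ndv = max(left.distinct_keys, right.distinct_keys, 1)
    return left.rows * right.rows // ndv


def choose_join(left, right, broadcast_row_limit=1_000_000):
    """Pick a broadcast join when the smaller input is under a (hypothetical) limit."""
    smaller = min(left.rows, right.rows)
    return "broadcast join" if smaller <= broadcast_row_limit else "shuffle join"


def choose_reducers(estimated_rows, rows_per_reducer=10_000_000, max_reducers=999):
    """Size parallelism from the estimated output, clamped to [1, max_reducers]."""
    needed = -(-estimated_rows // rows_per_reducer)  # ceiling division
    return max(1, min(max_reducers, needed))


sales = TableStats(rows=1_000_000_000, distinct_keys=50_000)
stores = TableStats(rows=50_000, distinct_keys=50_000)

out_rows = estimate_join_rows(sales, stores)
print(out_rows, choose_join(sales, stores), choose_reducers(out_rows))
# 1000000000 broadcast join 100
```

The point of the sketch is only that the same per-operator row estimate feeds both decisions: which join strategy to use and how many parallel tasks to request.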
The team also delivered:
As the community continued to fix hundreds of bugs, we built a strong base for improving our team’s operational efficiency. We moved to builds based on Maven, which significantly increased developer productivity. The parallel test framework cut down the time to run Hive’s large test suite.
The pre-commit testing workflow takes away most of the onerous work of validating new JIRA patches, and we now have a new wiki and a documentation protocol that ensures much better documentation of new features and behavior changes.
MANY THANKS to these contributors on the 0.13 release: Alan Gates, Amareshwari Sriramadasu, Anandha Ranganathan, Ashutosh Chauhan, Bing Li, Brock Noland, Carl Steinbach, Chaoyu Tang, Chinna Rao Lalam, Chris Drome, Chun Chen, Daniel Dai, Deepesh Khandelwal, Edward Capriolo, Eric Hanson, Eugene Koifman, Gopal Vijayaraghavan, Gunther Hagleitner, Hari Sankar Sivarama Subramaniyan, Jason Dere, Jitendra Nath Pandey, Justin Coffey, Karl Gierach, Kevin Wilfong, Killua Huang, Kostiantyn Kudriavtsev, Kousuke Saruta, Lefty Leverenz, Mark Grover, Maxim Bolotin, Mithun Radhakrishnan, Mohammad Kamrul Islam, Navis Ryu, Nick Dimiduk, Owen O’Malley, Prasad Mujumdar, Prasanth Jayachandran, Rajesh Balamohan, Remus Rusanu, Robert Roland, Sarvesh Sakalanaga, Satish Mittal, Sergey Shelukhin, Shanyu Zhao, Shivaraju Gowda, Shreepadma Venugopalan, Shuaishuai Nie, Steven Wong, Sun Rui, Sushanth Sowmyan, Swarnim Kulkarni, Szehon Ho, Teddy Choi, Teruyoshi Zenmyo, Thejas Nair, Thiruvel Thirumoolan, Timothy Chen, Tony Murphy, Travis Crawford, Vaibhav Gumashta, Venki Korukanti, Vikram Dixit, Viraj Bhat, Xiao Meng, Xuefu Zhang, Yi Tian, Yin Huai, Zhichun Wu and Zhiwen Sun.