Apache Hive 0.11: Stinger Phase 1 Delivered
In February, we announced the Stinger Initiative, which outlined an approach to bringing interactive SQL queries to Hadoop. Simply put, our choice was to double down on Hive and extend it so that it could address human-time use cases (i.e., queries in the 5-30 second range). So, with input and participation from the broader community, we established a fairly audacious goal of 100X performance improvement and SQL compatibility.
Introducing Apache Hive 0.11 – 386 JIRA tickets closed
As representatives of this open, community-led effort, we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, including 28 new features and 276 bug fixes. Fifty-five developers were involved in this release, and I would like to thank every one of them. See below for a full list.
Delivering on the promise of Stinger Phase 1
As promised, we have delivered Phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of Phase 1 we promised windowing, new data types, the Optimized RC (ORC) file and base optimizations to the Hive query engine, and the community has delivered these key features.
Key features in Hive 0.11
- ORCFile. It’s Optimized.
The ORC File (Optimized RC File) introduces key new features that speed up access to data in Apache Hive. It adds metadata at the file and block level so that queries can use that metadata to optimize access. Further, with the ORC file, only the bytes from the required columns are read from HDFS, which minimizes I/O and speeds up the query chain. These are major advances for improved performance in Hive.
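To make the two ideas in this paragraph concrete, here is a minimal Python sketch of a toy columnar layout. It is not the ORC format or any Hive API: the `Block` class, the `scan` function, and the choice of per-block min/max values as the metadata are all assumptions made for illustration. It shows a reader touching only the requested columns and skipping blocks whose metadata rules out a match.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One block of a toy columnar file (hypothetical, not the ORC layout)."""
    columns: dict                      # column name -> list of values
    stats: dict = field(init=False)    # column name -> (min, max), kept as metadata

    def __post_init__(self):
        # A real writer would record these once, at write time.
        self.stats = {name: (min(vals), max(vals))
                      for name, vals in self.columns.items()}


def scan(blocks, wanted, filter_col, lo, hi):
    """Return (rows, blocks_read) for rows where lo <= filter_col <= hi.

    Only the `wanted` columns and the filter column are touched, and a
    block is skipped entirely when its min/max range cannot match.
    """
    rows, blocks_read = [], 0
    for block in blocks:
        bmin, bmax = block.stats[filter_col]
        if bmax < lo or bmin > hi:
            continue  # metadata proves no row in this block can match
        blocks_read += 1
        values = block.columns[filter_col]
        keep = [i for i, v in enumerate(values) if lo <= v <= hi]
        rows.extend(tuple(block.columns[c][i] for c in wanted) for i in keep)
    return rows, blocks_read


blocks = [
    Block({"id": [1, 2, 3], "price": [10, 20, 30], "note": ["a", "b", "c"]}),
    Block({"id": [4, 5, 6], "price": [40, 50, 60], "note": ["d", "e", "f"]}),
]
rows, blocks_read = scan(blocks, ["id"], "price", 45, 100)
print(rows, blocks_read)
```

In this toy run the first block is never read because its maximum price is below the filter range, and the `note` column is never touched at all.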
- Improved Data Types
As Apache Hive marches towards full SQL compatibility, the decimal data type was updated to make it more usable.
- Analytic Functions
Hive 0.11 introduces windowing functions such as RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more. It also introduces aggregate functions with an OVER clause, supporting PARTITION BY and ORDER BY.
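For readers new to windowing, the sketch below shows what these functions compute. It runs standard SQL window syntax against an in-memory SQLite database from Python, purely as a stand-in that needs no cluster; it does not connect to Hive, and the `sales` table and its columns are hypothetical. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300),
        ("east", "bob", 200),
        ("east", "cy", 200),
        ("west", "dee", 500),
        ("west", "eli", 100),
    ],
)

query = """
SELECT region, rep, amount,
       -- ties share a rank
       RANK()       OVER (PARTITION BY region ORDER BY amount DESC)      AS rnk,
       -- always unique within the partition
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC, rep) AS rn,
       -- value from the previous row in the partition, NULL for the first
       LAG(amount)  OVER (PARTITION BY region ORDER BY amount DESC, rep) AS prev_amount,
       -- aggregate computed per partition without collapsing rows
       SUM(amount)  OVER (PARTITION BY region)                           AS region_total
FROM sales
ORDER BY region, rn
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike GROUP BY, every input row survives: each one is annotated with its rank, its neighbor's value, and its partition total.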
- Joins improved in Hive 0.11
Both the broadcast join and the SMB join were improved considerably in Hive 0.11. Both joins work without user hints, so that the Hive optimizer now picks the correct join rather than depending on the user to do so. More broadcast joins are now packed into a single MapReduce job, making star join queries much more efficient.
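The idea behind a broadcast join on a star schema can be sketched in a few lines of Python. This is a conceptual illustration only, not Hive's implementation: `build_lookup`, `star_join`, and the sample tables are hypothetical names. Each small dimension table is loaded into an in-memory hash table, and the large fact table is then streamed once, joining against all dimensions in that single pass.

```python
def build_lookup(rows, key):
    """Hash a small (dimension) table on its join key."""
    table = {}
    for row in rows:
        table.setdefault(row[key], []).append(row)
    return table


def star_join(fact_rows, dimensions):
    """Inner-join a fact table to several small tables in one pass.

    `dimensions` is a list of (fact_key, dim_key, dim_rows) tuples.
    """
    lookups = [(fact_key, build_lookup(dim_rows, dim_key))
               for fact_key, dim_key, dim_rows in dimensions]
    for fact in fact_rows:          # the large table is streamed exactly once
        partial = [dict(fact)]
        for fact_key, lookup in lookups:
            matches = lookup.get(fact[fact_key], [])
            partial = [{**p, **m} for p in partial for m in matches]
            if not partial:         # no match in this dimension: drop the row
                break
        yield from partial


sales = [
    {"sale_id": 1, "store_id": 10, "item_id": 100, "qty": 2},
    {"sale_id": 2, "store_id": 11, "item_id": 101, "qty": 1},
    {"sale_id": 3, "store_id": 99, "item_id": 100, "qty": 5},  # unknown store
]
stores = [{"store_id": 10, "city": "Austin"}, {"store_id": 11, "city": "Boston"}]
items = [{"item_id": 100, "name": "widget"}, {"item_id": 101, "name": "gadget"}]

result = list(star_join(sales, [("store_id", "store_id", stores),
                                ("item_id", "item_id", items)]))
for row in result:
    print(row)
```

Because the dimension tables fit in memory, no shuffle of the fact table is needed, which is why chaining several such joins in one job is attractive for star-schema queries.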
Towards YARN and the Power of SQL-IN-Hadoop
Hadoop 2.0, and specifically YARN, turns Hadoop from a single-application system into a multi-application operating system. The next generation of Apache Hive, built on YARN, becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed beyond interactive query. It is the delivery of a multi-application data system. In this new world, Hive is a first-class citizen alongside a variety of workloads within a cluster, and resources can be managed more discretely.
Ultimately, this leads to further performance enhancements for Hive, and with the inclusion of Tez we will be able to demonstrate even more significant improvements as service startup times are removed and a newly optimized execution chain within core Hadoop is delivered. The near future is exciting!
Apache Hive is empowering an ecosystem of SQL Based Applications
This release represents significant enhancements to Hive that will improve direct SQL interaction with Hive and light up the hundreds of applications that already rely on Hive as the de facto SQL interface for Hadoop. If you are one of the hundreds of software companies using Hive already, we hope you test out this new release and are happy with the results. We look forward to supporting it in HDP 1.3 in the very near future. 😉
Thank You to the Community
Thanks to 55 developers who contributed time and effort on this release: Alan Gates, Amareshwari Sriramadasu, Andrew Chalfant, Arup Malakar, Ashish Singh, Ashish Vaidya, Ashutosh Chauhan, Bennie Schut, Bhushan Mandhani, Billie Rinaldi, Brock Noland, Carl Steinbach, Chen Chun, Chris Drome, Dilip Joseph, Edward Capriolo, Gang Tim Liu, Gopal V, Gunther Hagleitner, Harish Butani, Ivan Gorbachev, Jarek Jarcec Cecho, Jean Xu, Jingwei Lu, Johnny Zhang, Jonathan Chang, Kevin Wilfong, Lars Francke, Li Yang, Mark Grover, Mayank Garg, Mikhail Bautin, Namit Jain, Navis, Nick Collins, Owen O’Malley, Pamela Vagata, Prajakta Kalmegh, Prasad Mujumdar, Roshan Naik, Sam Tunnicliffe, Samuel Yuan, Sean Busbey, Shreepadma Venugopalan, Sushanth Sowmyan, Teddy Choi, Thejas M Nair, Thiruvel Thirumoolan, Travis Crawford, Vikram Dixit K, Vinod Kumar Vavilapalli, Wonho Kim, Xiao Jiang, Zhenxiao Luo
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.