

Apache Hive 0.11: Stinger Phase 1 Delivered

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL queries to Hadoop.  Simply put, our choice was to double down on Hive and extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community, we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community-led effort, we are very proud to announce the first release of the new and improved Apache Hive, version 0.11.  This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others.  Together we have addressed 386 JIRA tickets, including 28 new features and 276 bug fixes. Fifty-five developers were involved in this release, and I would like to thank every one of them.  See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised, we have delivered phase 1 of the Stinger Initiative in late spring.  This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor.  As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive query engine, and the community has delivered these key features.


Key features in Hive 0.11

  • ORCFile.  It’s Optimized.
    The ORC File (Optimized RC File) introduces key new features that speed access to data in Apache Hive. It adds metadata at the file and block level so that queries can be more intelligent and use that metadata to optimize access.  Further, with the ORC file, only the bytes from the required columns are read from HDFS, which minimizes I/O and speeds the query chain.  These are major advances for improved performance in Hive.
  • Improved Data Types
    As Apache Hive marches towards full SQL compatibility, the decimal data type was updated to make it more usable.
  • Analytic Functions
    Hive 0.11 introduces windowing functions such as RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more. It also introduces aggregate OVER functions with PARTITION BY and ORDER BY.
  • Joins improved in Hive 0.11
    Both the broadcast join and the SMB (sort-merge-bucket) join were improved considerably in Hive 0.11.  Both joins now work without user hints, so the Hive optimizer picks the correct join rather than depending on the user to do so. More broadcast joins are now packed into a single MapReduce job, making star join queries much more efficient.
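
To make the windowing semantics above concrete, here is a minimal pure-Python sketch of what ROW_NUMBER, RANK and LAG compute over a PARTITION BY / ORDER BY window. It is an illustration only and says nothing about how Hive implements these functions; the sample rows, column names and the `window` helper are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (dept, name, salary).
ROWS = [
    ("eng", "ann", 120),
    ("eng", "bob", 100),
    ("eng", "cat", 120),
    ("ops", "dan", 90),
    ("ops", "eve", 80),
]


def window(rows):
    """Emulate, in plain Python, a query of roughly this shape:

    SELECT dept, name, salary,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC),
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC),
           LAG(salary)  OVER (PARTITION BY dept ORDER BY salary DESC)
    FROM emp;
    """
    out = []
    # PARTITION BY dept: sort on the partition key, then group.
    by_dept = sorted(rows, key=itemgetter(0))
    for dept, part in groupby(by_dept, key=itemgetter(0)):
        # ORDER BY salary DESC within the partition (stable for ties).
        ordered = sorted(part, key=itemgetter(2), reverse=True)
        rank = 0
        prev_salary = None
        for row_number, (_, name, salary) in enumerate(ordered, start=1):
            # RANK repeats for ties and then skips ahead;
            # ROW_NUMBER always increases by one.
            if salary != prev_salary:
                rank = row_number
            out.append({
                "dept": dept,
                "name": name,
                "salary": salary,
                "row_number": row_number,
                "rank": rank,
                # LAG(salary): the previous row's value, None on the first row.
                "lag_salary": prev_salary,
            })
            prev_salary = salary
    return out


if __name__ == "__main__":
    for r in window(ROWS):
        print(r)
```

Note how the two tied salaries in the "eng" partition share a rank while still receiving distinct row numbers, which is the practical difference between RANK and ROW_NUMBER.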

Towards YARN and the Power of SQL-IN-Hadoop

Hadoop 2.0, and specifically YARN, turns Hadoop from a single-application system into a multi-application operating system.  The next generation of Apache Hive, built on YARN, becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed beyond interactive query.  It is the delivery of a multi-application data system.  In this new world, Hive is a first-class citizen alongside a variety of workloads within a cluster, and resources can be managed more discretely.

Ultimately, this leads to further performance enhancements for Hive, and with the inclusion of Tez we will be able to demonstrate even more significant improvements as service startup times are removed and a newly optimized execution chain within core Hadoop is delivered.  The near future is exciting!

Apache Hive is empowering an ecosystem of SQL-Based Applications

This release represents significant enhancements to Hive that will improve direct SQL interaction with Hive and light up the hundreds of applications that already rely on Hive as the de facto SQL interface for Hadoop.  If you are one of the hundreds of software companies using Hive already, we hope you test out this new release and are happy with the results.  We look forward to supporting it in HDP 1.3 in the very near future.  ;)

Thank You to the Community

Thanks to 55 developers who contributed time and effort on this release: Alan Gates, Amareshwari Sriramadasu, Andrew Chalfant, Arup Malakar, Ashish Singh, Ashish Vaidya, Ashutosh Chauhan, Bennie Schut, Bhushan Mandhani, Billie Rinaldi, Brock Noland, Carl Steinbach, Chen Chun, Chris Drome, Dilip Joseph, Edward Capriolo, Gang Tim Liu, Gopal V, Gunther Hagleitner, Harish Butani, Ivan Gorbachev, Jarek Jarcec Cecho, Jean Xu, Jingwei Lu, Johnny Zhang, Jonathan Chang, Kevin Wilfong, Lars Francke, Li Yang, Mark Grover, Mayank Garg, Mikhail Bautin, Namit Jain, Navis, Nick Collins, Owen O’Malley, Pamela Vagata, Prajakta Kalmegh, Prasad Mujumdar, Roshan Naik, Sam Tunnicliffe, Samuel Yuan, Sean Busbey, Shreepadma Venugopalan, Sushanth Sowmyan, Teddy Choi, Thejas M Nair, Thiruvel Thirumoolan, Travis Crawford, Vikram Dixit K, Vinod Kumar Vavilapalli, Wonho Kim, Xiao Jiang, Zhenxiao Luo

Hortonworks Data Platform 2.0 Alpha 2 now available: focus on performance

We are very pleased to announce that the Alpha 2 release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha 2) is now available for download!

A key focus of HDP 2.0 Alpha 2 is performance, as announced in the Stinger Initiative, and the release includes a series of enhancements to the performance of Apache Hive for interactive SQL queries.  In fact, HDP 2.0 Alpha 2 was used to perform the tests announced yesterday, showing a 45X performance increase using Hive.  There is much more to come, but we are pleased with the early results and encourage Hive users to take a look and continue to give us feedback.

Consistent with HDP 2.0 Alpha 1, this version is built from the developmental Apache Hadoop 2.0 line and includes Apache YARN, a next-generation resource-management and application framework that enables Hadoop to support an ever-expanding range of use cases.  We are extremely excited about the opportunities that YARN enables – for background, check out Arun Murthy’s blog post series where he provides a YARN overview.

Notable new components over Alpha 1 include:

  • Apache Tez: A new Apache project that provides an optimized data processing framework on top of YARN. Tez is a general-purpose, highly customizable framework that simplifies data processing tasks across both small-scale, low-latency and large-scale, high-throughput workloads in Hadoop. Tez can provide an order of magnitude performance boost for the broader ecosystem of data processing tools such as Apache Hive and Apache Pig.
  • Apache Hive Interactive Query: Beyond the speedups made possible by Apache Tez, several new features were added to speed Hive queries. A new file format called the ORCFile (optimized RC file) optimizes how data is stored and accessed in Hive, and significant query optimizations reduce latency and improve performance.

Note that Tez is not enabled by default.  Instructions for doing so, and allowing Hive to use Tez, are in the installation guide.
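
As a rough illustration of why a columnar format with per-block metadata, such as the ORCFile mentioned above, can reduce I/O, the following Python sketch stores rows column by column and keeps min/max statistics for each block. It is a toy model under stated assumptions, not the actual ORC layout or API; the `Block`, `write_blocks` and `scan` names, the block size and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One block of rows, stored column by column, plus per-column stats."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def write_blocks(rows, names, block_size):
    """Split rows into blocks and record min/max metadata for every column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        cols = {n: [r[i] for r in chunk] for i, n in enumerate(names)}
        stats = {n: (min(v), max(v)) for n, v in cols.items()}
        blocks.append(Block(cols, stats))
    return blocks


def scan(blocks, select, where_col, lo, hi):
    """Return values of `select` for rows where lo <= where_col <= hi.

    Also returns how many blocks were actually read, to show the effect
    of skipping blocks based on metadata alone.
    """
    result, blocks_read = [], 0
    for b in blocks:
        bmin, bmax = b.stats[where_col]
        if bmax < lo or bmin > hi:
            continue  # metadata proves no row in this block can match
        blocks_read += 1
        # Only the two columns the query needs are touched.
        for key, val in zip(b.columns[where_col], b.columns[select]):
            if lo <= key <= hi:
                result.append(val)
    return result, blocks_read


if __name__ == "__main__":
    rows = [(i, f"user{i}", i * 10) for i in range(1, 10)]
    blocks = write_blocks(rows, ("id", "name", "amount"), block_size=3)
    print(scan(blocks, "amount", "id", 4, 5))
```

In this toy run only one of the three blocks is read, and the "name" column is never touched, which mirrors the two savings described for ORC: skipping data via metadata and reading only the required columns.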

Learn More
Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 2.0 Alpha 2.

Download It
You can download HDP 2.0 Alpha 2 from the Hortonworks Download site.

Tell Us About It
Please visit the HDP 2.0 Alpha Forum to ask questions, get help, provide feedback and hear what others are doing with HDP. 

We are excited about the opportunities that Hadoop 2 provides for the future of Hadoop and large-scale data processing. HDP 2.0 Alpha 2 is a key milestone that provides organizations with a packaged release to evaluate and gain experience with the upcoming Apache Hadoop 2 technology stack. We look forward to your feedback on HDP 2.0 Alpha 2 while we work with the community to make Hadoop 2 a stable reality. Enjoy!

Note: This Alpha release is a technology preview to gather feedback from outside of Hortonworks. Some features are missing or incomplete. Some APIs may change. Do not use Alpha 2 in production. Keep away from open flame. Support is only available via Forums.