While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the defacto standard. Although many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate.
Last week, the Apache Hive community released Apache Hive 0.14, which includes the results of the first phase in the Stinger.next initiative and takes Hive beyond its read-only roots and extends it with ACID transactions. Thirty developers collaborated on this version and resolved more than 1,015 JIRA issues.
Although there are many new features in Hive 0.14, there are a few highlights we’d like to highlight. For the complete list of features, improvements, and bug fixes, see the release notes.
Allows users to modify data using insert, update and delete SQL statements. This provides snapshot isolation and uses locking for writes. Now users can make corrections to fact tables and changes to dimension tables.
Now the query compiler uses a more sophisticated cost based optimizer that generates query plans based on statistics on data distribution. This works really well with complex joins and joins with multiple large fact tables. The CBO generates bushy plans that execute much faster.
Temporary tables exist in scratch space that goes away when the user session disconnects. This allows users and BI tools to store temporary results and further process that data with multiple queries.
After Hive 0.14, we’re planning on working with the community to deliver sub-second queries and SQL:2011 Analytics coverage in Hive. We also plan to work on Hive-Spark integration for machine learning and operational reporting with Hive streaming ingest and transactions.