One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.
Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.
A few key callouts:
Now let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.
Hive’s core execution engine and operators are highly optimized, and designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.
Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.) For reference we’ve posted the full set of trivially-modified queries used in the the remainder of this blog post in the updated Hortonworks testbench repository.
Apache Impala is a SQL-on-Hadoop engine built specifically for interactive query. For this benchmark we compared Hive to Impala using the identical set of 99 trivially-modified TPC-DS queries, which can be found in our GitHub repository.
To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.
This graph breaks down a full comparison of all runtimes, across the 60 queries which could be run in both engines.
When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:
These missing features are not obscure SQL features. Instead they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.
For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.