Apache Hive™ is the most complete SQL-on-Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions, and fine-grained dynamic security.
Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, requiring a shift to in-memory processing. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, a memory-centric architecture demands a careful design that makes the most of available resources.
In this blog, we’ll update benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More.”
The in-memory shift is one of the most important trends of the last 10 years of computing. Still, RAM costs about 100x more than disk, making it by far the most carefully budgeted and oversubscribed resource in the datacenter. Applications can’t expect to hold all data in memory all the time, and we know from experience that you won’t see substantial benefits from an in-memory architecture unless it is carefully designed to make the most of memory.
Putting this together, we see a very sharp contrast between naive in-memory implementations (Type 1 systems) and architectures optimized from the ground up to make the most of memory (Type 2 systems). In the Hadoop ecosystem we have seen firsthand that Type 1 systems offer limited benefits, so the Hive community decided to skip directly to a Type 2 in-memory system. Hive LLAP delivers on all of these attributes, with a single shared, managed cache that stores data compressed in memory. In addition, work is actively underway to optimize this even further, using flash to supersize the cache and extending LLAP’s ability to operate directly on compressed data without decompressing it.
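As a rough sketch of what the LLAP cache looks like in practice, the settings below use standard Hive/LLAP configuration keys; the specific values shown are illustrative assumptions, not the configuration used in this benchmark:

```sql
-- Illustrative LLAP cache settings (values are assumptions, not benchmark config).
SET hive.llap.io.enabled=true;        -- turn on the LLAP I/O layer and its cache
SET hive.llap.io.memory.mode=cache;   -- keep hot column data in the shared managed cache
SET hive.llap.io.memory.size=4096m;   -- cache size per LLAP daemon (assumed value)
SET hive.execution.mode=llap;         -- route query fragments to the LLAP daemons
```

Because the cache holds data in compressed form, a given amount of RAM effectively covers a much larger slice of the dataset than a cache of decompressed rows would.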
Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More,” discussed four reasons LLAP delivers dramatically faster performance than Hive on Tez. In that benchmark we saw 25x or greater speedups on ad-hoc queries against a dataset that fit entirely into the cluster’s memory.
In most cases, datasets will be far too large to fit in RAM, so we need to understand whether LLAP can truly tackle the big data challenge or whether it is limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and ran a number of far more complex queries.
Table 3 below shows that Hive LLAP is capable of running both at speed and at scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.
If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join queries, but even extremely complex queries realized almost a 2x benefit versus Hive 1.
If we focus just on queries that run in 30 seconds or less, we see about a 9x speedup; at this 10 TB scale, a total of 24 queries complete in 30 seconds or less with LLAP.
[Table: per-query runtimes, Hive 1 (HDP 2.5) vs. Hive 2 (HDP 2.5). All Hive 2 queries were run through LLAP.]
All HDP software was deployed with Apache Ambari using the HDP 2.5 release, with installation defaults used throughout. Even so, it is worth calling out the net-new HDP 2.5 settings relevant to this benchmark. These optimizations are enabled by default on fresh HDP 2.5 installs; upgraded HDP clusters would need to set them explicitly.
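For illustration, a few of the Hive 2 optimizations of this kind look like the following. The property names are real Hive settings, but treat this as a hedged example of the category rather than the benchmark’s exact configuration list:

```sql
-- Examples of settings an upgraded cluster might need to enable explicitly.
SET hive.execution.engine=tez;               -- run queries on Tez rather than MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in vectorized batches
SET hive.cbo.enable=true;                    -- Calcite-based cost-based optimizer
SET hive.llap.execution.mode=all;            -- push all eligible work to LLAP daemons
```

On a fresh HDP 2.5 install these come pre-set, which is why the defaults-only deployment described above still exercises the new execution path.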
For the most part, OS defaults were used, with one exception:
For more information, please check out the following resources: