July 20, 2016

Announcing Apache Hive 2.1: 25x Faster Queries and Much More

Apache Hive 2.1 was released about a month ago and it’s a great opportunity to review how Hive 2 is drastically changing the landscape for SQL on Hadoop.


There is so much new in Hive it’s hard to pick highlights, but here are a few:

  • Interactive query with Hive LLAP. LLAP was introduced in Hive 2.0 and improved in Hive 2.1 to deliver 25x faster performance than Hive 1 (covered in detail below).
  • Robust SQL ACID support with more than 60 stabilization fixes.
  • 2x Faster ETL through a smarter CBO, faster type conversions and dynamic partitioning optimizations.
  • Procedural SQL support, dramatically simplifying migration from EDW solutions.
  • Vectorization support for text files, introducing an option for fast analytics without any ETL.
  • A host of new diagnostics and monitoring tools including a new HiveServer2 UI, a new LLAP UI and an improved Tez UI.
  • In total there are more than 2,100 features, improvements and fixes between Hive 2.0 and Hive 2.1. The rate of innovation is tremendous and momentum continues to grow.

We’ll explore these other topics in later posts. For now we’ll focus on the most anticipated feature of Hive 2 and the massive performance gains it unlocks.

25x Faster Query Performance with Hive LLAP

The biggest story in Hive 2 is also its most anticipated feature: LLAP, which stands for ‘Live Long and Process’. In a nutshell, LLAP combines persistent query servers with optimized in-memory caching, allowing Hive to launch queries instantly and avoid unnecessary disk I/O. Put another way, LLAP is a second-generation big data system: it brings compute to memory (rather than compute to disk), it caches data intelligently and it shares that data among all clients, while retaining the ability to scale elastically within a cluster.
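The caching half of that idea can be illustrated with a toy Python model (a sketch of the concept only, not Hive internals; the class and block names are invented for illustration): the first query to touch a data block pays the disk read, and every later query, from any client, is served from the daemon's shared in-memory copy.

```python
# Toy model of LLAP's shared data cache (illustrative, not Hive code).
class DaemonCache:
    def __init__(self, storage):
        self.storage = storage      # stands in for HDFS / ORC files on disk
        self.cache = {}             # shared in-memory cache inside the daemon
        self.disk_reads = 0

    def read(self, block):
        if block not in self.cache:             # cold: go to disk once
            self.disk_reads += 1
            self.cache[block] = self.storage[block]
        return self.cache[block]                # warm: served from memory

daemon = DaemonCache({"orc:part=2016-07": [1, 2, 3]})
daemon.read("orc:part=2016-07")   # client A: cold read, hits disk
daemon.read("orc:part=2016-07")   # client B: served from the shared cache
print(daemon.disk_reads)          # 1
```

Because the cache lives in a long-running daemon rather than in per-query containers, it survives across queries and is shared among all clients on the node.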

Figure 1: The LLAP Architecture


Figure 2: Tez LLAP process compared to Tez execution process and MapReduce process


Comparing Hive with LLAP to Hive on Tez

To measure the improvement LLAP brings, we ran 15 queries taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and its data generation tools. The queries are adapted to Hive SQL but are otherwise unmodified from the standard TPC-DS queries; none of the query-rewriting tricks that some big data vendors routinely use to show better performance for their tools were applied. This blog only covers 15 queries, but a more comprehensive performance test is underway.

The full test environment is explored below but at a high level the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.

Figure 3: Hive 1 with Tez versus Hive 2 with LLAP

As you can see, LLAP delivers a dramatic performance gain. Minimum query runtime with Hive LLAP is a mere 1.3 seconds, compared to 9.58 seconds in Hive 1.

Let’s discuss some of the main reasons for these performance gains.

Reason 1: Smarter Map Joins

Hive on Tez is a shared-nothing architecture: each processing unit works independently with its own memory and disk resources. LLAP, by contrast, is a multi-threaded process that allows memory sharing between workers. A map-side join requires the small table's hash table to be distributed to each map task: if you have 24 containers on a node, you need to make 24 copies of the hash table and ship them out. With LLAP you build the hash table once per node and cache it in memory for all workers. This is especially important for low-latency SQL.
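The contrast can be sketched with a toy Python model (illustrative only, not Hive internals; the function names are invented): the expensive step, building the hash table from the dimension side, happens once per node, and every worker probes the same shared table instead of receiving its own copy.

```python
# Sketch of a map-side (broadcast) join. In LLAP the hash table built
# here lives in the daemon and is shared by every worker thread on the
# node, instead of being copied into each of e.g. 24 containers.
def build_hash_table(dim_rows):
    """Build the hash table from the small (dimension) side once."""
    return {key: value for key, value in dim_rows}

def map_join(fact_rows, shared_table):
    """Each worker probes the same in-memory table; no per-task copy."""
    return [(k, f, shared_table[k]) for k, f in fact_rows if k in shared_table]

dim = [(1, "2016-07"), (2, "2016-08")]        # small dimension table
fact = [(1, 100), (2, 250), (3, 75)]          # fact rows to probe with
table = build_hash_table(dim)                 # built once per node
print(map_join(fact, table))                  # [(1, 100, '2016-07'), (2, 250, '2016-08')]
```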

A great example of this is Query 55. In TPC-DS, Query 55 touches the smallest amount of data of any query against a fact table, just one month's worth. To run this query in Hive on Tez, the small date_dim and item tables must first be distributed to all Tez tasks. With LLAP this happens only once per node, which is a large part of the reason LLAP's average execution time is 1.3s, compared to Hive on Tez's 24.72s.

Reason 2: Better MapJoin Vectorization
Many MapJoin optimizations have made their way into Hive 2. As one example, joins against small dimension tables now run as fast as explicitly expanded lists.

A great example of where this helps is Query 43, which has a 37% selectivity in the store dimension join. Better MapJoin vectorization, which takes advantage of repeating sequences in the fact table, helps take Query 43 from 195.2s down to 4.2s.
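One way to picture this (a hypothetical Python sketch of the idea, not Hive's actual vectorized operators): when the join-key column in a batch of fact rows contains long runs of repeated values, the probe loop only needs to do a real hash lookup when the key changes, and can reuse the previous result for the rest of the run.

```python
def vectorized_probe(keys, hash_table):
    """Probe a batch of join keys, reusing the previous lookup when the
    key repeats -- a rough model of how vectorized MapJoin exploits
    repeating sequences in the fact table."""
    out, last_key, last_val = [], object(), None
    for k in keys:
        if k != last_key:                    # only hash-probe on a key change
            last_key, last_val = k, hash_table.get(k)
        out.append(last_val)                 # runs of repeats reuse the result
    return out

stores = {10: "open", 11: "open"}            # small dimension lookup
batch = [10, 10, 10, 11, 11, 12, 12]         # fact keys arrive in runs
print(vectorized_probe(batch, stores))       # ['open', 'open', 'open', 'open', 'open', None, None]
```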

Reason 3: A Fully Vectorized Pipeline
Hive 2 introduces MapJoin vectorization on the reduce side with dynamically partitioned hash joins, essentially a reduce-side version of the MapJoin optimization. With this optimization, reducer inputs are unsorted and streamed through a hash table held on the reduce side. The optimization divides a large dimension table into many small disjoint partitions, allowing the dimension-table optimizations described above to scale up in size.
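The partitioning idea can be sketched in a few lines of Python (an illustrative model with invented function names, not Hive's implementation): both sides are split by a hash of the join key into disjoint partitions, and each partition pair is then small enough to join through an in-memory hash table.

```python
# Sketch of a dynamically partitioned hash join: a dimension table too
# large for a single hash table is split by hash of the join key into
# small disjoint partitions; each partition pair is joined independently.
def partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def partitioned_hash_join(build, probe, n=4):
    result = []
    bparts = partition(build, lambda r: r[0], n)
    pparts = partition(probe, lambda r: r[0], n)
    for bp, pp in zip(bparts, pparts):       # matching partitions share keys
        table = {k: v for k, v in bp}        # each partition fits in memory
        result += [(k, x, table[k]) for k, x in pp if k in table]
    return result

dim = [(i, f"d{i}") for i in range(8)]       # "large" build side
fact = [(3, "a"), (5, "b"), (9, "c")]        # probe side
print(sorted(partitioned_hash_join(dim, fact)))
```

Because rows with equal keys always hash to the same partition, joining the partition pairs independently produces the same result as one big join, while each hash table stays small.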

A great example here is Query 13 which touches a combination of several very large and several very small dimension tables, so it needs to run as a shuffle join for safety but gets high selectivity from the other dimension filters. This optimization helps take Query 13 from 90.2s down to 4.8s.

Reason 4: A Smarter CBO
The integration with Apache Calcite for sophisticated cost-based optimization continues to deepen and is reaping big rewards. For a few examples, Hive’s CBO can now factor join keys out of deeply nested predicates (avoiding cross joins), infer transitive predicates across joins, and apply basic transformations even when tables have no stats (a big win for ETL jobs).
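The transitive-predicate idea is simple to sketch (a toy Python model of the inference step, not Calcite code; the function name and representation are invented): given an equi-join condition a.k = b.k and a filter a.k = 5, the optimizer can derive b.k = 5 and push it below the join, shrinking the scan on the other table.

```python
# Sketch of transitive predicate inference across equi-joins.
def infer_transitive(filters, join_equalities):
    """filters: {column: constant}; join_equalities: [(col1, col2)].
    Propagate constants across equality edges until a fixed point."""
    inferred = dict(filters)
    changed = True
    while changed:
        changed = False
        for c1, c2 in join_equalities:
            for a, b in ((c1, c2), (c2, c1)):   # equality works both ways
                if a in inferred and b not in inferred:
                    inferred[b] = inferred[a]
                    changed = True
    return inferred

print(infer_transitive({"a.k": 5}, [("a.k", "b.k"), ("b.k", "c.k")]))
# {'a.k': 5, 'b.k': 5, 'c.k': 5}
```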

Test Environment

Hive 1: Hive 1.2.1, Tez 0.7.0, Hadoop 2.7.1

Hive 2: Hive 2.2.0, Tez 0.9.0-snapshot, Hadoop 2.7.1

Other Hive / Hadoop Settings:

All of this software was deployed with Apache Ambari, using HDP software that is currently in technical preview. In addition to the default settings from Ambari, some new optimizations were applied for Hive 2. These optimizations will become the defaults for new installs at GA.

  • hive.vectorized.execution.mapjoin.native.enabled=true;
  • hive.vectorized.execution.mapjoin.minmax.enabled=true;
  • hive.vectorized.execution.reduce.enabled=true;
  • hive.llap.client.consistent.splits=true;
  • hive.optimize.dynamic.partition.hashjoin=true;

Hardware:

  • 10x d2.8xlarge EC2 nodes
OS Configuration:

For the most part, OS defaults were used, with one exception:

  • /proc/sys/net/core/somaxconn = 4000

This screenshot gives a sense of how LLAP was configured within Ambari to take advantage of the hardware. Note that LLAP configuration in Ambari is still evolving as of this writing, so current installations may look slightly different from this image.

Figure 4: Streamlined LLAP configuration in Apache Ambari
  • TPC-DS Scale 1000 data, partitioned by date_sk columns, stored in ORC format. The same data and tables were used both for Hive 1 and for Hive 2.
  • The test was driven by the Hive Testbench in data generation and in queries. The same query text was used both for Hive 1 and for Hive 2.
  • The 15 queries in this benchmark are a sample; a more comprehensive benchmark of Hive with LLAP will be published in a few weeks.

Try Hive 2.1 and LLAP Today!

As always, Apache Hive is 100% open source and can be used on the Hadoop distribution of your choice; that goes for the performance improvements as well as all the other features discussed in this post.

If you’re looking to try Hive 2.1 for yourself a few options include:

  1. Download the release directly from Apache and run on your Hadoop cluster.
  2. Try LLAP in the HDP 2.5 Technical Preview Sandbox. If you’re willing to tweak your Sandbox a little to give it more memory you can take LLAP for a test drive with minimal commitment.
  3. Use the LLAP template in the HDP-AWS Technical Preview. Pick the LLAP cluster type and off you go. Requires an AWS account.


Figure 5: LLAP in HDP-AWS


Dheeraj Rokade says:

Thanks for the informative blog. Hive 2.1 definitely looks more promising. I hope ACID support is also enriched. Thanks

Srihari Bhupathiraju says:

Looking forward to using Hive 2.1 and seeing if it helps us reduce cluster usage.

Srinivas N says:

Is it possible, as of version 2.1, to update Hive tables without creating ORC files?

Eugene says:

If you mean SQL Update statement, then no.

Ion says:

Hive 2.1 is not in the recent 2.. Ambari release. Are there plans to make available a “version definition file” that will have Hive 2.1 (with LLAP) available for Ambari?

JP says:

Both Hive 1 and Hive 2 are located under the Hive service. To enable Hive 2, select “Enable Interactive Query (Tech Preview)”.

tommyjiang says:

My Hadoop cluster version is Hadoop 2.6.0-cdh5.8.2, and Hive 2.1.0 is deployed standalone. How can I deploy the LLAP service?


Does Hive 2 support table indexing? If yes, please give an example of creating an index. I am facing the issue “Operation not allowed: CREATE INDEX (line 1, pos 0)”.

Can anyone help me get indexing working in Hive 2?

