October 06, 2016

Where is Apache Hive going? To In-memory computing.

Apache Hive™ is the most complete SQL-on-Hadoop system, supporting comprehensive SQL, a sophisticated cost-based optimizer, ACID transactions and fine-grained dynamic security.

Though Hive has proven itself on multi-petabyte datasets spanning thousands of nodes, many interesting use cases demand more interactive performance on smaller datasets, which requires a shift to in-memory processing. Hive 2 marks the beginning of Hive’s journey from a disk-centric architecture to a memory-centric architecture through Hive LLAP (Live Long and Process). Since memory costs about 100x as much as disk, memory-centric architectures demand a careful design that makes the most of available resources.

In this blog, we’ll update the benchmark results from our earlier blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”.

[Figure: LLAP architecture]

Not All Memory Architectures are Equal

The in-memory shift is one of the most important trends of the last 10 years of computing. Still, RAM costs about 100x more than disk, so it’s by far the most carefully budgeted and over-allocated resource in the datacenter. Applications can’t expect to hold all data in memory all the time, and we know from experience that an in-memory architecture delivers substantial benefits only when it is carefully designed to make the most of memory.

This means:

  1. You must cache only the specific data needed for processing queries.
  2. The cache needs to be shared across all users who need the data.
  3. The data needs to stay compressed in RAM to offset the 100x higher cost of RAM.
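
The three requirements above can be illustrated with a toy cache. This is a minimal sketch, not LLAP's actual implementation: the class name `SharedCompressedCache`, the `(table, column, chunk)` key layout, the LRU eviction policy and the use of `zlib` are all assumptions chosen for illustration. It caches only the column chunks a query asks for, shares one instance across callers behind a lock, and keeps the cached bytes compressed.

```python
import zlib
from collections import OrderedDict
from threading import Lock


class SharedCompressedCache:
    """Toy shared cache: stores column chunks zlib-compressed, evicts LRU."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # (table, column, chunk_id) -> compressed bytes
        self._lock = Lock()            # a single instance is shared by all users

    def get(self, key, loader):
        """Return the raw bytes for `key`, calling `loader()` only on a miss."""
        with self._lock:
            blob = self._entries.get(key)
            if blob is not None:
                self._entries.move_to_end(key)  # mark as most recently used
                return zlib.decompress(blob)

        raw = loader()              # e.g. read one column chunk from disk
        blob = zlib.compress(raw)   # data stays compressed while cached

        with self._lock:
            # Skip caching if another caller got there first or the chunk cannot fit.
            if key not in self._entries and len(blob) <= self.capacity_bytes:
                self._entries[key] = blob
                self.used_bytes += len(blob)
                # Evict least recently used chunks until we are back under budget.
                while self.used_bytes > self.capacity_bytes:
                    _, evicted = self._entries.popitem(last=False)
                    self.used_bytes -= len(evicted)
        return raw


# Hypothetical usage: two "users" ask for the same column chunk.
cache = SharedCompressedCache(capacity_bytes=1024)
loads = []


def load_chunk():
    loads.append(1)                   # count how often we hit "disk"
    return b"2016-10-06," * 1000      # repetitive data compresses well


chunk_key = ("store_sales", "ss_sold_date_sk", 0)
first = cache.get(chunk_key, load_chunk)
second = cache.get(chunk_key, load_chunk)
```

In this sketch the second request is served from the shared cache without calling the loader again, and the cached copy occupies far fewer bytes than the raw chunk because it is held in compressed form.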

When we put this together, we see a very sharp contrast between naive in-memory implementations and architectures optimized from the ground up to make the most of memory.

[Figure: in-memory maturity]

In the Hadoop ecosystem we have seen firsthand that Type 1 systems offer limited benefits, so the Hive community decided to skip directly to a Type 2 In-Memory system. Hive LLAP delivers on all these attributes, with a single shared, managed cache that stores data compressed in memory. In addition, there is work actively underway to optimize this even further, using Flash to supersize the cache and extending LLAP’s ability to operate directly on compressed data without decompressing it.

[Figure: in-memory maturity matrix]

Benchmarking LLAP at 10 TB Scale

Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed four reasons that LLAP delivers dramatically faster performance than Hive on Tez. In that benchmark we saw speedups of 25x or more on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.

In most cases, datasets will be far too large to fit in RAM, so we need to understand whether LLAP can truly tackle the big data challenge or whether it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, about 4x larger than aggregate cluster RAM, and ran a number of far more complex queries.

Table 3 below shows how Hive LLAP is capable of running both At Speed and At Scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.

[Table 3: at speed and at scale]

If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join queries, but even extremely complex queries saw almost a 2x benefit versus Hive 1.


If we focus just on smaller queries that run in 30 seconds or less, we see about a 9x speedup. At this 10 TB scale, a total of 24 queries run in 30 seconds or less with LLAP.


HDP Test Environment

Software:

Hive 1 (HDP 2.5)  |  Hive 2 (HDP 2.5)
Hive 1.2.1        |  Hive 2.1.0
Tez 0.7.0         |  Tez 0.8.4
Hadoop 2.7.3      |  Hadoop 2.7.3
No LLAP           |  All queries run through LLAP

Other Hive / Hadoop Settings:

All HDP 2.5 software was deployed with Apache Ambari, using default settings throughout. The settings listed here are new in HDP 2.5 and relevant to this benchmark. They are enabled by default on new HDP 2.5 installs; upgraded HDP clusters need to set them explicitly.

  • hive.vectorized.execution.mapjoin.native.enabled=true;
  • hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
  • hive.vectorized.execution.mapjoin.minmax.enabled=true;
  • hive.vectorized.execution.reduce.enabled=true;
  • hive.llap.client.consistent.splits=true;
  • hive.optimize.dynamic.partition.hashjoin=true;
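
For an upgraded cluster, one way to apply these settings is to issue them as Hive `SET` statements at the start of a session. The sketch below only renders those statements from the list above; the names `LLAP_BENCHMARK_SETTINGS` and `render_set_statements` are ours, and whether a given deployment allows these properties to be changed per session (rather than through Ambari) is an assumption you should verify.

```python
# Settings copied verbatim from the list above; the dict name is hypothetical.
LLAP_BENCHMARK_SETTINGS = {
    "hive.vectorized.execution.mapjoin.native.enabled": "true",
    "hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled": "true",
    "hive.vectorized.execution.mapjoin.minmax.enabled": "true",
    "hive.vectorized.execution.reduce.enabled": "true",
    "hive.llap.client.consistent.splits": "true",
    "hive.optimize.dynamic.partition.hashjoin": "true",
}


def render_set_statements(settings):
    """Render one Hive `SET key=value;` statement per setting, in order."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


script = render_set_statements(LLAP_BENCHMARK_SETTINGS)
print(script)
```

The printed text can be pasted into a Hive session or saved as an initialization script; setting the same properties cluster-wide through Ambari is likely the more durable choice.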

Hardware:

  • 10x d2.8xlarge EC2 nodes
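
As a sanity check on the claim that the 10 TB dataset is about 4x larger than aggregate cluster RAM, the arithmetic can be sketched as follows. The figure of 244 GiB of RAM per d2.8xlarge node is an assumption taken from AWS's published instance specifications, not from this post, and not all of that RAM would be available to LLAP in practice.

```python
NODE_COUNT = 10                 # from the hardware list above
RAM_PER_NODE_GIB = 244          # assumption: AWS spec for d2.8xlarge
DATASET_BYTES = 10 * 10**12     # 10 TB, decimal terabytes

GIB = 2**30
aggregate_ram_bytes = NODE_COUNT * RAM_PER_NODE_GIB * GIB

# How many times larger the dataset is than total cluster RAM.
dataset_to_ram_ratio = DATASET_BYTES / aggregate_ram_bytes

print(f"Aggregate RAM: {aggregate_ram_bytes / 10**12:.2f} TB")
print(f"Dataset / RAM: {dataset_to_ram_ratio:.1f}x")
```

Under these assumptions the ratio comes out a little under 4, which is consistent with the rounded "4x" figure quoted earlier.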

OS Configuration:

OS defaults were used, with one exception:

  • /proc/sys/net/core/somaxconn = 4000
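
To confirm that a Linux node matches this setting, you could read the value back from the same `/proc` path. This is a small sketch with helper names of our own choosing; treating any value of 4000 or higher as acceptable is our assumption, since the post only states the value that was used.

```python
from pathlib import Path

SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")
BENCHMARK_SOMAXCONN = 4000  # value used in this benchmark


def parse_somaxconn(text):
    """Parse the contents of the somaxconn file into an integer."""
    return int(text.strip())


def meets_benchmark_setting(current, required=BENCHMARK_SOMAXCONN):
    """Assumption: a listen backlog at or above the benchmark value is fine."""
    return current >= required


# Only attempt the read on systems that expose this file (i.e. Linux).
if SOMAXCONN_PATH.exists():
    current = parse_somaxconn(SOMAXCONN_PATH.read_text())
    status = "OK" if meets_benchmark_setting(current) else "below benchmark value"
    print(f"somaxconn = {current} ({status})")
```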

Data:

  • TPC-DS Scale 10000 data (10 TB), partitioned by date_sk columns, stored in ORC format with Zlib compression.

Queries:

Learn More and Get Started

For more information, please check out the following resources:

  1. If you’re looking for a quick test on a single node, try the Hortonworks Sandbox 2.5. Download the Sandbox, and this LLAP tutorial will have you up and running in minutes. Note: you’ll need a system with at least 16 GB of RAM for this approach.
  2. Want a quick start in the cloud? Hortonworks Data Cloud (in Technical Preview) has you covered and this handy tutorial will guide you each step of the way.
  3. Learn more on the Apache Hive resource page on Hortonworks.
  4. Share your experiences with us on the Hortonworks Community Connection.

Comments

  • Any chance you could update this post with some higher resolution images & charts (like ones as wide as the text on the page or links to high res images)?
