September 21, 2015

Impala vs. Hive Performance Benchmark

Yahoo! JAPAN needed a data platform that could scale to generate 100,000 reports per day while processing large amounts of data. It needed to keep the last 13 months' worth of data, approximately 500 billion rows, organized and easily accessible. Relational database management systems (RDBMS) cannot scale to these levels from a cost and processing-power perspective. Yahoo! JAPAN therefore explored Hadoop and evaluated two platforms against its requirements: Hive with Tez on YARN from Hortonworks, and Cloudera Impala. Hive and Tez on YARN scaled beyond 15,000 queries per hour, while Impala hovered at about 2,500 queries per hour.

BACKGROUND

Our business report systems generate 15,000 reports per hour from about 500 billion rows of time-series data covering the last 13 months, based on conditions predefined by users. These report systems had the following problems because of file-based processing:

  • Although data for the last 13 months is stored, only three months of it is usable.
  • Only small variations of grouping, such as by region and gender, are available.
  • There is no way to filter for values above specific thresholds, etc.

Putting the data in an RDBMS and using SQL queries would allow more processing variations and filter functions. However, migrating to an RDBMS was not possible given the volume of data and the number of jobs.

We reviewed Hadoop-based SQL engines from Cloudera and Hortonworks, and solved these challenges by adopting Hortonworks HDP 2.2 with Apache Hive 0.14 and Apache Tez.

Use Case

When a user specifies query conditions, the system creates report data asynchronously for later download. The conditions are:

  1. Period (dates: FROM-TO)
  2. Grouping by gender, region, etc.
  3. Timing of report generation (daily, weekly, etc.)
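To illustrate, a report query combining these conditions might look like the following sketch. The table and column names are invented for illustration and are not from the actual system.

```sql
-- Hypothetical report query: period filter, grouping, and a threshold filter.
-- Table and column names are illustrative only.
SELECT region, gender, COUNT(*) AS events, SUM(amount) AS total
FROM   access_log
WHERE  log_date BETWEEN '2015-08-01' AND '2015-08-31'  -- 1. period (FROM-TO)
GROUP BY region, gender                                -- 2. grouping conditions
HAVING SUM(amount) > 10000;                            -- filter on aggregated values
```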

The report-generation timing option was by far the most challenging. Many users specify periodic reporting, such as daily or weekly reports, so a huge number of report-creation processes run immediately after each daily data update, which sharply increases the number of queries per unit of time.

Data

This report system handles gzip-compressed text files with the following characteristics:

Format: gzip-compressed text
File size per day: 10 GB
Lines per day: 1.3 billion
Data retention period: 13 months
Total data size: 6,000 GB
Total lines: 450 billion
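Data of this shape is commonly exposed to Hive as an external table over the daily gzip files, partitioned by date. The DDL below is a sketch with an invented schema, not the actual table definition.

```sql
-- Hypothetical external table over the daily gzip text files.
CREATE EXTERNAL TABLE access_log_raw (
  user_id STRING,
  region  STRING,
  gender  STRING,
  amount  BIGINT
)
PARTITIONED BY (log_date STRING)                 -- one partition per daily drop
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE                               -- Hive reads .gz text transparently
LOCATION '/data/access_log';
```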

Approach

MapReduce was originally designed for batch processing. We thought it could be practical for the report system if we could control the latency of each query and ensure parallel execution performance.

Considering Impala

We tried Impala, which uses a different execution engine from MapReduce.

Impala was promising because it executes a query in a relatively short time, typically several seconds to several tens of seconds.

If the response time for one query is 15 seconds, about 240 queries can be executed serially per hour.

To increase throughput, we considered raising the number of queries executed in parallel.

However, we encountered a problem: Impala's processing time increases linearly as the degree of query parallelism increases.

It could not handle all of the batch queries that run automatically every day after the data update, so we had to give up on Cloudera Impala.

Considering Hive and Tez on YARN

With the advent of Hadoop 2.x and YARN, more fine-tuned control of parallel execution became possible, and the Tez engine significantly reduced the latency of MapReduce-style processing.

YARN and Tez add significant processing power, scaling beyond 15,000 jobs per hour. This is a dramatic leap from the 100,000-jobs-per-day capability of our previous cluster, which was based on Hadoop 1.0.

TEST

Test Data and SQL

First, for testing purposes, a little under 2,000 SQL queries were sampled from those actually used by the real report-generation processing.

Most of the queries return result sets of fewer than 1,000 rows, but some return large sets of more than 100,000 rows.

[Figure 1: Distribution of result-set row counts for the sample queries]

Hive on Tez vs Impala

First, we compared Hive with Impala, which we had originally planned to deploy. The 2,000 queries were run with 32-way parallelism; Figure 2 breaks down the processing time of every query.

Hive processed most queries within 20 seconds, and even the slower queries with large result sets completed within 70 seconds.

With Impala, on the other hand, many queries took 30 to 90 seconds, and some, notably those with large result sets, took more than 10 minutes; performance degradation under parallel execution was remarkable.

[Figure 2: Impala vs. Hive, number of queries by processing-time range]

Impala can respond in milliseconds under low load, and it is a valid choice when no parallel SQL execution is required.

However, our use case involves batch execution, and parallel SQL execution was a mandatory requirement, so we could not choose Impala.

Concurrency of Hive

We then tested how many parallel queries a single HiveServer2 instance can execute, and which degree of parallelism it processes most effectively.

We raised the number of parallel SQL executions by 16 at a time and recorded the processing times of all the sample queries.

Throughput rose rapidly up to 64-way parallelism, at which point it began to decline. In this environment, we concluded that our needs would be best met by setting the parallelism to 64.

[Figure 3: Throughput versus degree of parallelism]

Conclusion

A single SQL execution takes around 15 seconds on Hive on Tez, and throughput scales by raising the degree of parallelism. (The parallelism limit depends on the size and performance of the cluster.)

Impala returns very fast responses when the cluster is lightly loaded, but it is not suitable for workloads that run many SQL queries in parallel.

We decided to adopt Hive on Tez because it can process the 15,000 SQL queries per hour that our service requires.


Comments

  • Very interesting results.

    Can you also share details of hardware config, Impala tuning and Hive/Tez optimization? Did you work with Impala experts at Cloudera to tune the system?

    Appreciate you sharing this info.


  • I got some offline requests about the Impala part. I don't know the details, but the customer had a subscription contract with Cloudera and did get support from Cloudera.

  • Let me explain about the Impala cluster.
    The Impala cluster was tuned by Yahoo! JAPAN engineers,
    not Cloudera's engineers or any Impala experts.
    We only used the information that we got from Q&A within our Cloudera subscription contract.

  • Hello,

    This is Yifeng from Hortonworks. I worked closely with the Yahoo! JAPAN team on this project. Let me share more details of the Hive/Tez optimization part.

    Goal is full utilization of cluster and reuse of Tez containers.

    1) Partitioning
    Basically, maximize the number of partitions while keeping the total under a couple thousand. Hive and HCatalog seem to handle this very well for queries.

    2) Loading Data
    We had to convert the data into ORC files. There are three main parameters for defining how ORC stores the data: block size, stripe size, and compression. In general the Hive defaults of 64 MB stripes and 256 MB blocks should work well. We chose ZLIB compression, which is highly optimized in recent Hive versions.

    We also need to make sure that the ORC files are big enough. General best practice is that they should be between 1 and 10 HDFS blocks if possible. One mapper/reducer generates one file, so when creating ORC files we make sure we don't have lots of reducers each writing to all partitions. This can be done by enabling sorted dynamic partition optimization or by using the DISTRIBUTE BY clause. We chose DISTRIBUTE BY for fine-grained control.
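    As a sketch of this loading step (table and column names are hypothetical; the ORC property and setting names are standard, but check them against your Hive version):

    ```sql
    -- Convert raw text data into a partitioned ORC table.
    CREATE TABLE access_log_orc (
      user_id STRING, region STRING, gender STRING, amount BIGINT
    )
    PARTITIONED BY (log_date STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB');   -- ZLIB compression as discussed

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- DISTRIBUTE BY routes each partition's rows to a single reducer,
    -- so each reducer writes a few large ORC files instead of many small ones.
    INSERT OVERWRITE TABLE access_log_orc PARTITION (log_date)
    SELECT user_id, region, gender, amount, log_date
    FROM access_log_raw
    DISTRIBUTE BY log_date;
    ```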

    3) Query Execution
    Our queries were already highly optimized, so we focused on other parts. A query execution is essentially composed of:
    – Client execution [ 0s if done correctly ]
    – Optimization [HiveServer2] [~ 0.1s]
    – HCatalog lookups [Hcatalog, Metastore] [ very fast in hive 14 ]
    – Application Master creation [4-5s]
    – Container Allocation [3-5s]
    – Query Execution

    Here are the optimizations we did.

    Connection setup has high overhead
    – Open one connection and execute large number of queries
    – Standard connection pooling

    HiveServer2 becomes a bottleneck at roughly 8-15 queries/s
    – Run 2 HiveServer2 instances and distribute queries between them
    – A fix that parallelizes query compilation is available in later versions

    Reinitializing Tez session takes 5+ seconds
    – Tez sessions are reused and can be pre-initialized with pre-warm enabled (with some drawbacks).
    – With pre-warm, full speed is reached practically instantly

    Re-creating containers takes 3 seconds
    – Enable container reuse and keep containers alive for a short period of time
    – Key is to reach 100% utilization without wasting resources
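    The session and container reuse described above maps to settings along these lines (parameter names come from the Hive and Tez documentation of that era; the values are illustrative, not the ones actually used):

    ```sql
    -- Keep Tez sessions and containers warm between queries (values illustrative).
    SET hive.prewarm.enabled = true;              -- pre-initialize Tez containers
    SET hive.prewarm.numcontainers = 10;
    SET tez.am.container.reuse.enabled = true;    -- reuse containers across tasks
    SET tez.am.container.idle.release-timeout-min.millis = 10000;
    SET tez.am.container.idle.release-timeout-max.millis = 20000;
    ```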

    4) Parallelization
    In most query-tuning scenarios this is ignored at first. However, it is perhaps the most important point in many real-world scenarios. Single queries often benefit from additional resources, but giving them more can reduce overall throughput.

    Tez memory settings are key for parallelization. With optimized Tez memory settings, we achieved 90% CPU utilization in the cluster.
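    The Tez memory knobs in question are typically these (the values below are placeholders to be sized against node RAM and desired containers per node, not the actual settings used here):

    ```sql
    -- Illustrative Tez memory settings; more, smaller containers per node
    -- generally means more queries running in parallel.
    SET hive.tez.container.size = 4096;    -- MB per Tez container
    SET hive.tez.java.opts = -Xmx3276m;    -- heap ~80% of container size
    SET tez.runtime.io.sort.mb = 1024;     -- sort buffer within the heap
    ```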

    Cheers,
    Yifeng

    • Yifeng, Can you be more specific on the memory optimizations for Tez? Specifically, what parameters were tuned, and to what values? In conjunction with what other variables were they tuned? (Or, upon which variables did the memory tuning depend? Block size? Data size?)

  • Did you use Parquet+Partitioning with Impala?

    AFAIK Impala doesn't support ORC, so I think it's a little unfair if you used ORC+Partitioning with Hive+Tez but didn't use Parquet+Partitioning with Impala.

    • Yes, I used Parquet+Partitioning with Impala.
      I chose the storage format appropriate to each of Impala and Hive,
      but gave both the same partition layout.

      • Are there any details on hardware, cluster size and the underlying Hadoop configurations?

        Of the testing that was done, I only see a direct comparison vs. Impala with the second diagram (“Number of queries in time range”). It’s not clear to me what this test is all about.

        Did you submit the same query simultaneously 2000 times? Or are these 2000 distinct queries? What was the query? Was Impala used through JDBC/HiveServer2, or were the queries issued directly to the daemons? Did you load balance between the Impala daemons? (i.e., Did you remove the client/proxy as the bottleneck, as was done for Hive/Tez?)

        I’d also like to see how Impala did on the other two tests.

  • For me this is an obvious result based on the architectures of Hive and Impala.

    Impala is an MPP-like engine, so each query you execute starts an executor on every node of the cluster. This delivers the best performance for a single query, but total throughput degrades heavily under high concurrency. In such systems you should limit the number of parallel queries to a fairly low value, around 10.

    Hive is the opposite: it runs on top of the MapReduce/Tez framework, which requests resources based on the amount of data to process. On large clusters this gives much better concurrency for the "small" queries tested here, since each query requests only a small amount of execution resources, allowing more queries to run in parallel.

    Still, Impala should be faster for more complex queries, such as those combining several window functions. These require many separate execution steps, so Impala can take advantage of better pipelining of intermediate data (provided, of course, that it fits into memory).

    • Hello Alexey,

      I think you are spot-on about the advantages of Hive/Tez for large numbers of concurrent queries. Since each session runs in its own completely separate Tez instance, we could scale the number of queries linearly until we ran into CPU limitations. I don't think the answer is that simple for queries over large amounts of data, however. Tez is, after all, a DAG framework and can reuse steps as well, and it can keep and reuse containers on all nodes. Impala had some advantage at very short execution times, but with LLAP and other performance enhancements in future versions we hope to close that gap as well. There are obviously specific queries that work better on either system, but I doubt it can be generalized quite so easily.

      Best regards,

      Benjamin
