May 20, 2015

Evaluating Hive with Tez as a Fast Query Engine

Kristen Hardwick, Vice President of Big Data Solutions at Spry, Inc., is our guest blogger. In this post, Kristen shares the performance analysis from Spry's evaluation of Apache Hive with Tez as a fast query engine.

In early 2014, Spry developed a solution that heavily utilized Hive for data transformations. When the project was complete, three distinct data sources were integrated through a series of HiveQL queries using Hive 0.11 on HDP 2.0. While the project was ultimately successful, the workflow itself took an astounding two full days to execute, with one query taking 11 hours.

Intuitively, runtimes this long weigh heavily on each stage of the development process – imagine troubleshooting why a query that takes eleven hours returns an empty set! After the project was delivered, the whole team began to question whether Hive was a viable option for these types of analytic development projects, and the wide variety of commercial tools in the Hadoop-compatible SQL market prompted us to perform an analysis of alternatives.

Devising the Benchmarking Plan

Just as we were reaching this conclusion, Hortonworks announced the GA release of HDP 2.1 with Apache Hive 0.13 and Apache Tez, along with the promise of significant improvements to Hive through the Stinger Initiative. After upgrading our cluster and seeing some initial performance improvements, we began to design an experiment that would allow us to quantify Hive's performance improvements against the promises of other commercial tools.
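For readers who want to reproduce this kind of comparison, switching a Hive 0.13 session from MapReduce to Tez is a single session-level setting. A minimal sketch (the table name is illustrative):

    -- Run the current Hive session on Tez instead of MapReduce
    -- (available as of Hive 0.13 on HDP 2.1)
    SET hive.execution.engine=tez;

    -- Subsequent queries in the session compile to Tez DAGs, e.g.:
    SELECT COUNT(*) FROM some_table;  -- "some_table" is illustrative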

Query Engines

Throughout the benchmarking effort, we worked with many different query engines, including SpliceMachine, SparkSQL, and Impala. We also experimented with a variety of file formats and optimization strategies to determine the best configuration for our analytic needs.

Of the larger set, the following query engine configurations were most appropriate for our needs:

  • Hive 0.11 on HDP 2.0, using MapReduce without ORCFiles, 1 GbE
  • Hive 0.13 on HDP 2.1, using Tez with ORCFiles and zlib compression, 10 GbE
  • Hive 0.14 on HDP 2.2, using Tez with ORCFiles and zlib compression, 10 GbE
  • Vendor 1, following vendor-specified configuration best practices, 10 GbE
  • Vendor 2, following vendor-specified configuration best practices, 10 GbE
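For reference, the ORC-with-zlib storage layout used in the Hive 0.13 and 0.14 configurations above can be declared directly in HiveQL. The sketch below uses hypothetical table and column names, not the actual benchmark schema:

    -- Illustrative table stored as ORC with zlib compression
    CREATE TABLE events_orc (
      event_id   BIGINT,
      event_time TIMESTAMP,
      payload    STRING
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB");

    -- Convert existing (e.g., delimited text) data into the ORC table
    INSERT OVERWRITE TABLE events_orc
    SELECT event_id, event_time, payload FROM events_raw;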

Cluster Specifications

We performed the benchmarking process using virtual machines on the hardware normally used for analytic development. Each server hosted a single virtual machine with the maximum available specifications. In this exercise, the cluster contained six nodes, each allocated 96 GB of RAM, six 1 TB drives, and 24 threads.

The cluster was rebuilt after the evaluation of each query engine, with a fresh installation of the Hadoop distribution performed each time to eliminate any lingering configuration effects between runs. All Hadoop and query engine properties were configured with vendor assistance, although future benchmarking efforts could spend additional time here to ensure that all Hive and OS-level configurations are fully optimized. Each query was executed with no other users or queries making use of the system resources.

Data Sizes

To generalize the conclusions for reuse in future efforts, we defined four classifications of data size based on the number of records. Record counts were used instead of disk footprint so that the classifications remain relevant regardless of the selected compression scheme.

These classifications are:

  • Small – Fewer than one million records
  • Medium – Fewer than one billion records, but more than one million records
  • Large – Fewer than one trillion records, but more than one billion records
  • Very Large – More than one trillion records

For the purposes of this benchmarking exercise, we worked with data in the Small, Medium, and Large ranges.

Query Use Cases

Of the hundreds of queries used to develop the analytic that sparked the investigation, we determined that there were fourteen prominent query characteristics across the entire set. We selected ten queries that provided suitable examples for various combinations of the characteristics, and used those queries as our test set.

The characteristics and associated query examples are described in the table below:

[Table: hive_tez_1 – the fourteen query characteristics with example queries]

Each vendor was given access to the set of queries prior to execution on the cluster. The queries were updated to follow all vendor recommendations in terms of query best practices and data storage guidelines.

Collected Metrics

In order to compare the query engines fairly, we developed a “metrics collector” application that was installed and configured on each node of the cluster. The metrics collector monitored and logged system usage during the execution of each query, giving us better insight into the resource utilization for each query use case. We chose a custom metrics collector over a built-in tool (like Ganglia) so that we could have full control over the granularity at which data was captured for this benchmarking process.

The following metrics were captured for each query:

  • Runtime
  • CPU
    • User: % of CPU running in user space
    • System: % of CPU running in system space
    • Idle: 100 – User – System
  • RAM
    • RAM Usage: Number of GiB used by the system
    • Swap Usage: Number of GiB used by the system for swap
  • Disk
    • Read: Aggregated MiB/s over all the drives
    • Write: Aggregated MiB/s over all the drives
  • Network
    • Transmit: MiB/s transmitted over bond0
    • Receive: MiB/s received over bond0

Note that the metrics were gathered from a single run rather than averaged over multiple runs. This was intentional, since the goal was to obtain results closest to normal usage of the system.

How Did the Query Engines Compare?

Anecdotally, every query engine we evaluated showed a significant improvement over what we experienced using HDP 2.0 and Hive 0.11. The following graph shows the runtime associated with each query, with all runtimes drastically improved from when the same queries were executed in early 2014. The query that originally took more than 11 hours dropped to under one hour on all query engines, and to 15 minutes in the best case.

[Figure: hive_tez_2 – runtime per query, all engines including Hive 0.11]

The following runtime graph removes the Hive 0.11 data and focuses on the differences in runtime between the two newer versions of Hive and the two commercial products. In all cases, the Hive 0.14 runtime was comparable to or significantly better than the runtimes of the enterprise tools.

Note that Query 4 was not run against Vendor 2 due to compatibility issues.

[Figure: hive_tez_3 – runtime per query, Hive 0.13/0.14 vs. Vendor 1 and Vendor 2]

How Did Each Query Engine Utilize the Resources?

After seeing the runtime results, the immediate next step was to investigate the detailed resource utilization metrics that were collected during query execution to better understand how each query engine made use of the available resources, and to identify any factors that were limiting performance.

Query 1, HDP 2.2

The following graphs illustrate the performance metrics captured during the execution of Query 1 through Hive 0.14 with Tez. Query 1 contains a one-to-many join between a Large and a Medium dataset, and it showed both a long runtime and high RAM utilization in the original Hive 0.11 run.
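The production query itself is not reproduced here, but a one-to-many, large-to-medium join of this kind has roughly the following HiveQL shape (all table and column names are hypothetical):

    -- Hypothetical shape of Query 1: each row in the Medium table
    -- matches many rows in the Large table (one-to-many join)
    SELECT   m.category,
             COUNT(*)      AS event_count,
             SUM(l.amount) AS total_amount
    FROM     large_events l    -- "Large": more than one billion records
    JOIN     medium_dims  m    -- "Medium": one million to one billion records
      ON     l.dim_id = m.dim_id
    GROUP BY m.category;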

Runtime: 16:13

CPU

[Figure: hive_tez_4 – CPU utilization, Query 1 on HDP 2.2]

RAM

[Figure: hive_tez_5 – RAM utilization, Query 1 on HDP 2.2]

Disk

[Figure: hive_tez_6 – disk throughput, Query 1 on HDP 2.2]

Network

[Figure: hive_tez_7 – network throughput, Query 1 on HDP 2.2]

Analysis

From analyzing the graphs, it appears that CPU limited this query. One way to overcome this would be to use machines with higher processing power. An alternative would be to add nodes with the same CPU specifications, especially since the network transmission rates are nearly zero.

Query 1, Vendor 2

When comparing execution times across all engines, Vendor 2 was 22 seconds faster than Hive with Tez for this query. The following graphs illustrate the performance metrics captured during the execution of Query 1 through Vendor 2’s query engine.

Runtime: 15:51

CPU

[Figure: hive_tez_8 – CPU utilization, Query 1 on Vendor 2]

RAM

[Figure: hive_tez_9 – RAM utilization, Query 1 on Vendor 2]

Disk

[Figure: hive_tez_10 – disk throughput, Query 1 on Vendor 2]

Network

[Figure: hive_tez_11 – network throughput, Query 1 on Vendor 2]

Analysis

We immediately noticed that CPU, network, and RAM utilization were extremely low throughout the execution. This contrasts sharply with the Hive with Tez run, which showed significant RAM and CPU usage. Overall, the limiting resource for Vendor 2’s query engine in this use case appears to be disk write speed, which was not an issue for the Hive with Tez execution of the query.

Query 4, HDP 2.2

The following graphs illustrate the performance metrics captured during the execution of Query 4 through Hive 0.14 with Tez. Query 4 contains both large-to-large dataset joins and unions of multiple sub-queries.
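As with Query 1, the production query is not shown, but a combination of large-to-large joins and unions of sub-queries looks roughly like this hypothetical HiveQL:

    -- Hypothetical shape of Query 4: two large-to-large joins,
    -- with the sub-query results combined via UNION ALL
    SELECT user_id, SUM(amount) AS total
    FROM (
      SELECT l1.user_id, l2.amount
      FROM   large_table_1 l1
      JOIN   large_table_2 l2 ON l1.user_id = l2.user_id
      UNION ALL
      SELECT l3.user_id, l4.amount
      FROM   large_table_3 l3
      JOIN   large_table_4 l4 ON l3.user_id = l4.user_id
    ) unioned
    GROUP BY user_id;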

CPU

[Figure: hive_tez_12 – CPU utilization, Query 4 on HDP 2.2]

RAM

[Figure: hive_tez_13 – RAM utilization, Query 4 on HDP 2.2]

Disk

[Figure: hive_tez_14 – disk throughput, Query 4 on HDP 2.2]

Network

[Figure: hive_tez_15 – network throughput, Query 4 on HDP 2.2]

Analysis

It is clear that CPU and RAM are major resources for executing this query. The data seems to be processed in batches, so increasing the amount of available RAM or adding additional nodes to the cluster would both be appropriate ways to further improve the query performance.
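On the tuning side, the RAM available to each Tez task is governed by Hive's container settings, which can be raised per session. The sketch below uses illustrative values, not the settings from this benchmark:

    -- Illustrative session-level memory tuning for Hive on Tez.
    -- hive.tez.container.size is in MB and must fit within the
    -- cluster's YARN container limits.
    SET hive.tez.container.size=8192;
    -- Task JVM heap, conventionally ~80% of the container size
    SET hive.tez.java.opts=-Xmx6554m;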

Query 4, Vendor 1

When comparing execution times across all engines, Vendor 1 had the best runtime for this query. The following graphs illustrate the performance metrics captured during the execution of Query 4 through Vendor 1’s query engine.

CPU

[Figure: hive_tez_16 – CPU utilization, Query 4 on Vendor 1]

RAM

[Figure: hive_tez_17 – RAM utilization, Query 4 on Vendor 1]

Disk

[Figure: hive_tez_18 – disk throughput, Query 4 on Vendor 1]

Network

[Figure: hive_tez_19 – network throughput, Query 4 on Vendor 1]

Analysis

The CPU resources were used very efficiently for this query, especially when compared to the utilization required by Hive with Tez. Additionally, the RAM utilization remains consistent over the full query execution time, which contrasts with the “batch” behavior illustrated with Hive with Tez.

Conclusion

After this investigation, it is clear that Hive 0.14 is a strong competitor, even against commercial-grade query engines. We entered the benchmarking process expecting Hive to be in last place at every turn, and exited with the utmost respect for the Stinger Initiative and the advancements that team has accomplished. The fact that Hive 0.14 with Tez performed faster than or on par with the other query engines we evaluated, combined with the fact that Hive is also the best choice from a cost perspective, makes us excited to continue relying on Hive for our needs.

For more information, or for other blogs from the Spry team, visit www.spryinc.com or contact Kristen Hardwick, Vice President of Big Data Solutions (khardwick@spryinc.com).


Comments

  • I’ve also been turned around on Hive on Tez as a very decent platform for large-scale SQL queries on Hadoop. Initially I was skeptical about smaller low-latency queries since I came from Impala, but I’m also pleased with the Stinger initiative’s results when combining Hive-on-Tez with ORC (the initiative has also made Hive’s execution backend pluggable and opened the door for the upcoming Hive-on-Spark initiative being led by Cloudera).

    I’m extremely eager to see the results of the Stinger.next initiative with LLAP for low latency. It’s taken a while to do all this heavy lifting for Hive, but Hive is a workhorse – it just works compared to other SQL systems we’ve tried – and we rely on Hive exclusively at the moment. We might additionally use Apache Drill for low-latency SQL exploration and BI if it delivers what we need, since it has just gone GA and it’s hard to deny that Dremel’s architecture is attractive (in which case I hope Hortonworks integrates it natively, since it’s in Apache).
