Kristen Hardwick, Vice President of Big Data Solutions at Spry, Inc., is our guest blogger. In this post, Kristen shares a performance analysis from Spry's evaluation of Apache Hive with Tez as a fast query engine.
In early 2014, Spry developed a solution that relied heavily on Hive for data transformations. By the time the project was complete, three distinct data sources had been integrated through a series of HiveQL queries using Hive 0.11 on HDP 2.0. While the project was ultimately successful, the workflow itself took an astounding two full days to execute, with one query alone taking 11 hours.
Naturally, such lengthy runtimes heavily impact each stage of the development process: imagine troubleshooting why a query that takes eleven hours returns an empty set! After the project was delivered, the whole team started to question whether Hive was a viable option for these types of analytic development projects. The wide variety of commercial tools available in the Hadoop-compatible SQL market prompted us to perform an analysis of alternatives.
Just as we were reaching this conclusion, Hortonworks announced the GA release of HDP 2.1, Apache Hive 0.13, and Apache Tez, including the promise of significant improvements to Hive through the Stinger Initiative. After upgrading our cluster and seeing some initial performance improvements, we designed an experiment that would allow us to quantify Hive's performance improvements against the claims of other commercial tools.
Throughout the benchmarking effort, we worked with many different query engines, including SpliceMachine, SparkSQL, and Impala. We also experimented with a variety of file formats and optimization strategies to determine the best configuration for our analytic needs.
Of the larger set, the following query engine configurations were most appropriate for our needs:
We performed the benchmarking process using virtual machines on the hardware normally used for analytic development. Each server hosted a single virtual machine with the maximum available specifications. In this exercise, the cluster contained six nodes, each allocated 96 GB of RAM, six 1 TB drives, and 24 threads.
The cluster was rebuilt after the evaluation of each query engine, with a fresh installation of the Hadoop distribution performed each time to eliminate any lingering configuration effects between runs. All Hadoop and query engine properties were configured with vendor assistance, although future benchmarking efforts could spend additional time here to ensure that all Hive and OS-level configurations are fully optimized. Each query was executed with no other users or queries consuming system resources.
To make the conclusions reusable for future efforts, we defined three classifications of data sizes based on the number of records. Record count was used instead of disk footprint so that the classification remains relevant regardless of the selected compression scheme.
These classifications are:
For the purposes of this benchmarking exercise, we worked with data in the Small, Medium, and Large ranges.
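A classification of this kind is easy to express as a small function. The sketch below is purely illustrative: the threshold values are placeholders, not the actual record-count boundaries used in this study (those are given in the classification list above).

```python
def classify_by_records(num_records, thresholds=None):
    """Bucket a dataset by record count rather than disk footprint,
    so the label is independent of the compression scheme in use.

    The default thresholds below are hypothetical placeholders,
    not the boundaries used in the benchmarking study.
    """
    if thresholds is None:
        thresholds = [("Small", 10_000_000), ("Medium", 1_000_000_000)]
    for label, upper_bound in thresholds:
        if num_records < upper_bound:
            return label
    return "Large"
```

Keeping the thresholds as a parameter lets the same helper be reused if a future effort adds classes (for example, an "Extra Large" tier) or shifts the boundaries.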
Of the hundreds of queries used to develop the analytic that sparked this investigation, we identified fourteen prominent query characteristics across the entire set. We selected ten queries that provided suitable examples of various combinations of those characteristics and used them as our test set.
The characteristics and associated query examples are described in the table below:
Each vendor was given access to the set of queries prior to execution on the cluster. The queries were updated to follow all vendor recommendations in terms of query best practices and data storage guidelines.
In order to compare the query engines fairly, we developed a “metrics collector” application that was installed and configured on each node of the cluster. This metrics collector monitored and logged system usage during the execution of each query, giving us better insight into the resource utilization for each query use case. We turned to a custom metrics collector instead of a built-in tool (like Ganglia) so that we could have full control over the granularity at which data was captured during the benchmarking process.
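The post does not include the collector's implementation, but the core of such a tool is a per-node sampling loop. The sketch below is a minimal, hypothetical version covering just one metric (CPU utilization derived from `/proc/stat` on Linux); it is not the actual collector used in the study, which also tracked RAM, disk, and network.

```python
import time

def parse_cpu_ticks(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks.

    Field layout per proc(5): user, nice, system, idle, iowait, irq, ...
    Idle time is counted as idle + iowait.
    """
    fields = [int(v) for v in stat_line.split()[1:]]
    idle = fields[3] + fields[4]
    total = sum(fields)
    return total - idle, total

def cpu_percent(prev, curr):
    """CPU utilization between two (busy, total) samples."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    return 100.0 * busy_delta / total_delta if total_delta else 0.0

def sample_once():
    with open("/proc/stat") as f:
        return parse_cpu_ticks(f.readline())

def collect(interval=1.0, samples=60):
    """Log a timestamped CPU-utilization reading every `interval` seconds."""
    prev = sample_once()
    for _ in range(samples):
        time.sleep(interval)
        curr = sample_once()
        print(f"{time.time():.0f},{cpu_percent(prev, curr):.1f}")
        prev = curr
```

In practice, one such process would run on every node for the duration of a query, with the sampling interval chosen to match the desired granularity, and the per-node logs merged afterward for analysis.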
The following metrics were captured for each query:
One note about the collected metrics: they were gathered from a single run rather than an average of multiple runs. This was intentional, as the goal was to obtain results closest to normal usage of the system.
Anecdotally, all query engines that we evaluated showed a significant improvement over what we experienced using HDP 2.0 and Hive 0.11. The following graph shows the runtime associated with each query, with all runtimes drastically improved from when the same queries were executed in early 2014. The query that originally took more than 11 hours dropped to under one hour for all query engines, and to 15 minutes in the best case.
The following runtime graph removes the Hive 0.11 data and focuses on the differences in runtime between two versions of Hive and two commercial products. In all cases, the Hive 0.14 runtime was either comparable to or significantly better than the runtimes of the enterprise tools.
Note that Query 4 was not run against Vendor 2 due to compatibility issues.
After seeing the runtime results, the immediate next step was to investigate the detailed resource utilization metrics that were collected during query execution to better understand how each query engine made use of the available resources, and to identify any factors that were limiting performance.
The following graphs illustrate the performance metrics captured during the execution of Query 1 through Hive 0.14 with Tez. Query 1 contains a one-to-many, large-to-medium dataset join, and exhibited both a long runtime and high RAM utilization in the original Hive 0.11 run.
From the graphs, it appears that this query was CPU-bound. One way to overcome this would be to use machines with higher processing power. An alternative would be to add nodes with the same CPU specifications, especially since the network transmission rates are nearly zero.
When comparing execution times across all engines, Vendor 2 was 22 seconds faster than the Hive with Tez execution time for this query. The following graphs illustrate the performance metrics captured during the execution of Query 1 through Vendor 2’s query engine.
We immediately noticed that CPU, network, and RAM utilization was extremely low throughout the execution. This contrasts drastically with the Hive with Tez run, which had significant RAM and CPU usage. Overall, the limiting resource for Vendor 2’s query engine in this use case appears to be disk write speed, which was not an issue for the Hive with Tez execution of the query.
The following graphs illustrate the performance metrics captured during the execution of Query 4 through Hive 0.14 with Tez. Query 4 is a query containing both large-to-large dataset joins and the unions of multiple sub-queries.
It is clear that CPU and RAM are major resources for executing this query. The data seems to be processed in batches, so increasing the amount of available RAM or adding additional nodes to the cluster would both be appropriate ways to further improve the query performance.
When comparing execution times across all engines, Vendor 1 had the best runtime for this query. The following graphs illustrate the performance metrics captured during the execution of Query 4 through Vendor 1’s query engine.
The CPU resources were used very efficiently for this query, especially when compared to the utilization required by Hive with Tez. Additionally, the RAM utilization remains consistent over the full query execution time, which contrasts with the “batch” behavior illustrated with Hive with Tez.
After this investigation, it is clear that Hive 0.14 is a strong competitor, even against commercial-grade query engines. We entered the benchmarking process expecting Hive to finish in last place at every turn, and exited with the utmost respect for the Stinger Initiative and the advancements that team has accomplished. The fact that Hive 0.14 with Tez performed faster than or on par with the other query engines we evaluated, combined with Hive being the best choice from a cost perspective, makes us excited to continue relying on Hive for our needs.