Earlier, we discussed the reasons for integrating Druid and Hive in a three-part series (Part 1, Part 2, Part 3) on ultra-fast OLAP analytics with Apache Hive and Druid. Since then, we have invested further time and effort in bug fixes, correctness, performance, and support for new features.
Today, we are excited to share an update to our performance benchmark blog and compare the numbers from running the OLAP benchmark at 1 TB scale on HDP 2.6 versus HDP 3.0. The benchmark results show an overall improvement of 42% in average query performance.
Numerous improvements went into HDP 3.0, and the performance gains shown are the aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration:
To benchmark the Hive/Druid integration, we used the Star Schema Benchmark (SSB), which is based on the TPC-H benchmark. Overall, SSB is meant to simulate the process of iteratively and interactively querying a data warehouse to play out what-if scenarios, drill down, and better understand trends, as opposed to the pre-canned, batch-style reports used by TPC-H.
In our previous post, we made some cosmetic query adjustments, such as removing the ORDER BY clause and rewriting the BETWEEN predicate as two separate inequality predicates.
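To make those two adjustments concrete, here is a minimal Python sketch of what they amount to at the SQL-text level. It is an illustration only, not the tooling used for the earlier post: the helper names `rewrite_between` and `drop_order_by` are hypothetical, the regular expressions handle only simple literals and are not a SQL parser, and the sample predicate follows the shape of an SSB filter on `lo_discount` and `lo_quantity`.

```python
import re

# Matches "<column> BETWEEN <low> AND <high>" for simple literals only.
# Assumption: no quoted strings with spaces, no subqueries, no expressions.
_BETWEEN = re.compile(r"(\w+)\s+between\s+(\S+)\s+and\s+(\S+)", re.IGNORECASE)


def rewrite_between(sql: str) -> str:
    """Rewrite each simple BETWEEN predicate as two inclusive inequalities."""
    return _BETWEEN.sub(r"\1 >= \2 AND \1 <= \3", sql)


def drop_order_by(sql: str) -> str:
    """Remove a trailing ORDER BY clause, if present."""
    return re.sub(r"\s+order\s+by\s+.*$", "", sql, flags=re.IGNORECASE | re.DOTALL)


if __name__ == "__main__":
    predicate = "lo_discount between 1 and 3 and lo_quantity < 25"
    print(rewrite_between(predicate))
    print(drop_order_by("select a from t order by a desc"))
```

BETWEEN is inclusive on both ends, which is why the rewrite uses `>=` and `<=`.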
It is worth noting that in this blog post we run the original SSB queries as is, without any modifications. The Hive query optimizer is now able to generate optimized plans without any query adjustments on the user side.
In the benchmark, SSB queries were run via JDBC through HiveServer2 and backed by Druid. The table below shows the improvements in minimum, maximum, and average query response times.
|HDP Version|Average Query Time|Min Query Time|Max Query Time|
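For readers who want to produce the same kind of summary from their own runs, the following Python sketch computes the average, minimum, and maximum over a list of per-query response times, plus a relative improvement between two averages. It assumes that "improvement" means the relative reduction in average query time, which is our reading rather than a definition stated here. The function names and the timings are made up for illustration and are not the benchmark results.

```python
from statistics import mean


def summarize(times_s):
    """Return average, minimum, and maximum of per-query response times (seconds)."""
    return {"avg": mean(times_s), "min": min(times_s), "max": max(times_s)}


def improvement_pct(old_avg, new_avg):
    """Relative reduction of the average query time, in percent (assumed definition)."""
    return (old_avg - new_avg) / old_avg * 100.0


if __name__ == "__main__":
    # Hypothetical timings, NOT measured values from this benchmark.
    old_run = [2.0, 4.0, 6.0]
    new_run = [1.0, 2.0, 3.0]
    old_stats, new_stats = summarize(old_run), summarize(new_run)
    print(old_stats, new_stats)
    print(improvement_pct(old_stats["avg"], new_stats["avg"]))
```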
For additional reference, here are the specifics of the cluster on which these numbers were generated:
The GitHub repo also contains some additional tuning notes with detailed Java command line arguments.