Hortonworks Data Platform (HDP) provides Hadoop for the Enterprise, with a centralized architecture of core enterprise services, for any application and any data. HDP is uniquely built around native YARN services to enable a centralized architecture through which multiple data access applications interact with a shared data set. Apache Hive is one of the most important of those data access applications—the de facto standard for interactive SQL queries over petabytes of data in Hadoop.
This blog shares benchmark-testing results that show performance improvements made by the Apache Hive community with the Cost Based Optimizer (CBO). The CBO is part of Phase 1 of the Stinger.next Initiative to further enhance Hive’s speed, scale and SQL capabilities.
As more analysts and data scientists adopt Hive across their organizations, performance becomes crucial. Many users rely on BI tools to generate queries that are far from optimal in Hive, so Apache Hive 0.14 includes the CBO to maintain high performance while also allowing your team members to continue using their preferred BI tools.
A CBO does this by using statistics about the data and sophisticated algorithms to choose query plans that optimize performance and maximize resource utilization across the Hadoop cluster. We want Hive to just work, without users having to concern themselves with performance.
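To make this concrete, here is a minimal sketch of how the CBO is typically enabled and fed statistics in Hive 0.14. The properties shown are standard Hive settings, but the table name store_sales is only an illustration, and these are not necessarily the exact statements used in this benchmark:

```sql
-- Enable the cost-based optimizer and let it use column and partition statistics.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather the table and column statistics the CBO relies on.
-- "store_sales" is just an example table name.
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
```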
We’re delighted with the performance gains shown in the tables below, which use the TPC-DS benchmark.
We would like to thank the members of the Apache Hive community who contributed code to make these improvements possible – Ashutosh Chauhan, Gunther Hagleitner, Harish Butani, Jesus Camacho Rodriguez, Julian Hyde, John Pullokkaran, Mostafa Mokhtar, and Pengcheng Xiong.
Below are the highlights comparing run times of “data mining”, “interactive” and “reporting” queries in Apache Hive 0.14. The numbers compare run times with the CBO enabled against run times for the same queries with the CBO disabled. We categorized queries by these six CBO effects:
| CBO Effect | Description |
| --- | --- |
| Same plan | The execution plan was the same with CBO on and CBO off, because the query explicitly specified a join order. |
| Lucky query | The order in which the tables were referenced in the TPC-DS SQL query happened to match the optimal join order. |
| Predicate push down | A filter was pushed below the join, avoiding an extra filter operator. |
| Join ordering | The join order was changed by the CBO. |
| Bushy join | A bushy (wide) join tree was produced instead of a traditional left-deep tree. |
| Join simplification | The CBO identified common join predicates within conjunction and disjunction clauses and created an implied join predicate, avoiding costly cross-product joins. |
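To see which of these effects applies to a particular query on your own cluster, one approach is to compare EXPLAIN output with the optimizer toggled on and off. The following is a minimal sketch; the query and the TPC-DS-style table names are placeholders for illustration, not the benchmark queries themselves:

```sql
-- Plan with the cost-based optimizer enabled.
SET hive.cbo.enable=true;
EXPLAIN
SELECT i.i_category, SUM(s.ss_net_paid)
FROM store_sales s
JOIN item i ON s.ss_item_sk = i.i_item_sk
WHERE i.i_category = 'Books'
GROUP BY i.i_category;

-- Same query with the optimizer disabled, to compare join order
-- and filter placement between the two plans.
SET hive.cbo.enable=false;
EXPLAIN
SELECT i.i_category, SUM(s.ss_net_paid)
FROM store_sales s
JOIN item i ON s.ss_item_sk = i.i_item_sk
WHERE i.i_category = 'Books'
GROUP BY i.i_category;
```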
TPC-DS Query Results (query times in seconds)

DATA MINING @ 30TB SCALE

| Query | CBO ON | CBO OFF | X Factor | CBO Effect |
| --- | --- | --- | --- | --- |
| Q31 | 956 | 1,020 | 1.1 | Same plan |
| Q34 | 207 | 465 | 2.2 | Predicate push down |
| Q39 | 129 | 176 | 1.4 | Join ordering |
| Q71 | 332 | 374 | 1.1 | Same plan |
| Q73 | 174 | 391 | 2.2 | Predicate push down |
| Q88 | 1,125 | 1,691 | 1.5 | Predicate push down, Join ordering |
| Q98 | 208 | 432 | 2.1 | Predicate push down, Join ordering |
INTERACTIVE @ 30TB SCALE

| Query | CBO ON | CBO OFF | X Factor | CBO Effect |
| --- | --- | --- | --- | --- |
| Q3 | 143 | 226 | 1.6 | Predicate push down, Join ordering |
| Q7 | 266 | 263 | 1.0 | Same plan |
| Q12 | 106 | 162 | 1.5 | Join ordering |
| Q13 | 5,903 | 10,000 | 1.7 | Join simplification |
| Q15 | 1,557 | 10,000 | 6.4 | Bushy join, Join ordering |
| Q19 | 228 | 215 | 0.9 | Same plan |
| Q26 | 173 | 178 | 1.0 | Same plan |
| Q27 | 249 | 252 | 1.0 | Predicate push down |
| Q42 | 143 | 148 | 1.0 | Lucky query |
| Q43 | 540 | 512 | 0.9 | Lucky query |
| Q52 | 149 | 149 | 1.0 | Lucky query |
| Q55 | 143 | 164 | 1.1 | Lucky query |
| Q82 | 1,295 | 2,223 | 1.7 | Bushy join, Join ordering |
| Q84 | 73 | 138 | 1.9 | Bushy join, Join ordering |
| Q91 | 28.57 | 100 | 3.5 | Bushy join, Join ordering |
| Q96 | 156 | 175 | 1.1 | Join ordering |
REPORTING @ 30TB SCALE

| Query | CBO ON | CBO OFF | X Factor | CBO Effect |
| --- | --- | --- | --- | --- |
| Q17 | 429 | 2,957 | 6.9 | Bushy join, Join ordering |
| Q20 | 132 | 267 | 2.0 | Join ordering |
| Q21 | 28 | 46 | 1.6 | Join ordering |
| Q25 | 443 | 3,187 | 7.2 | Bushy join, Join ordering |
| Q28 | 754 | 2,902 | 3.1 | Lucky query |
| Q29 | 950 | 747 | 1.0 | Bushy join, Join ordering |
| Q32 | 339 | 351 | 1.0 | Lucky query |
| Q40 | 1,929 | 2,462 | 1.3 | Join ordering |
| Q45 | 88 | 1,094 | 12.4 | Bushy join, Join ordering |
| Q46 | 318 | 407 | 1.3 | Bushy join, Join ordering |
| Q48 | 257 | 10,000 | 39.0 | Join simplification |
| Q49 | 390 | 2,304 | 5.9 | Bushy join, Join ordering |
| Q50 | 1,113 | 1,380 | 1.2 | Bushy join, Join ordering |
| Q54 | 347 | 1,525 | 4.4 | Bushy join, Join ordering |
| Q56 | 361 | 407 | 1.1 | Bushy join, Join ordering |
| Q58 | 335 | 1,897 | 5.7 | Bushy join, Join ordering |
| Q60 | 356 | 403 | 1.1 | Bushy join, Join ordering |
| Q66 | 381 | 1,331 | 3.5 | Join ordering |
| Q68 | 298 | 327 | 1.1 | Lucky query |
| Q75 | 2,659 | 3,511 | 1.3 | Join ordering |
| Q79 | 438 | 535 | 1.2 | Predicate push down |
| Q85 | 1,355 | 2,252 | 1.7 | Join ordering |
| Q87 | 1,398 | 1,468 | 1.1 | Same plan |
| Q89 | 288 | 273 | 0.9 | Same plan |
| Q90 | 113 | 116 | 1.0 | Lucky query |
| Q92 | 922 | 1,028 | 1.1 | Same plan |
| Q97 | 1,270 | 1,258 | 1.0 | Same plan |
The benchmark ran on Hortonworks Data Platform (HDP) 2.2, deployed using Apache Ambari 1.7.0. For the most part, the cluster used the Ambari defaults (except where noted below). Hive 0.14.0 runs were done using Java 7 (the default JVM).
Apache Tez and MapReduce were tuned to process all queries using 4 GB containers, at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrashing and maximizes throughput.
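For reference, container sizing of this kind is normally expressed through settings such as the following. The 4 GB figure comes from the text above; the specific properties shown are illustrative assumptions, not the exact benchmark configuration:

```sql
-- Illustrative 4 GB (4096 MB) container sizing; not the exact benchmark settings.
SET hive.tez.container.size=4096;
SET tez.task.resource.memory.mb=4096;
SET mapreduce.map.memory.mb=4096;
SET mapreduce.reduce.memory.mb=4096;
```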
Other Settings:
The following additional optimizations were used for Hive 0.14.0:
The cluster consisted of 20 physical nodes.
Note: Based on the YARN Node Manager’s memory resource setting, only 48 GB of RAM per node was dedicated to query processing; the remaining 200 GB of RAM was available for system caches and HDFS.
Note: Queries were run as-is with no additional hinting, and there was no special tuning used for any of the queries.
Comments
I cannot reproduce the plan changes described above as “Predicate push down”. Could you share the execution plans with and without CBO for any of the queries in that category (Q1, Q27, Q34, Q73, Q88,…)?
I’m curious to see CBO interact with the impending LLAP for Hive release… this is starting to really shape up, exciting times 🙂
In the benchmark results you posted on 6/2/14 that compared Hive 0.10 to 0.13, you reported that “Total time to run all queries decreased from 7.8 days to 9.3 hours”. In this benchmark of Hive 0.14 with and without CBO, you report that “Total workload runtime shrank from 20.6 to 9 hours”. In both cases you use a 30TB dataset and presumably the same TPC-DS queries were used in both. It also looks like the cluster configurations are identical. What am I missing that would account for the total runtime to go from 9.3 hours with Hive 0.13 to 20.6 hours with Hive 0.14 (without CBO)?