Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

cta

Get Started

cloud

Ready to Get Started?

Download sandbox

How can we help you?

closeClose button
September 21, 2017
prev slideNext slide

3x Faster Interactive Query with Apache Hive LLAP

One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.

These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.

Starting Off: 3x Performance Gains in HDP 2.6 with Hive LLAP

Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.

Hive Interactive Query

 

A few key callouts:

  1. HDP 2.6 runs this set of queries in 5155.3 seconds, compared to 14983.4 seconds for HDP 2.5, almost a 3x improvement for HDP 2.6.
  2. HDP 2.6 is faster than HDP 2.5 on almost all queries.
  3. HDP 2.6 is more than 5x faster than HDP 2.5 on 8 queries.

Now let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.

Hive Working Smarter, Not Harder

Hive’s core execution engine and operators are highly optimized, and designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.

Hive - Key Performance Advancements - HDP 2.6

Hive LLAP Supports the Full Suite of 99 TPC-DS Queries

Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.) For reference we’ve posted the full set of trivially-modified queries used in the the remainder of this blog post in the updated Hortonworks testbench repository.

Comparing Apache Hive to Apache Impala

Apache Impala is a SQL-on-Hadoop engine built specifically for interactive query. For this benchmark we compared Hive to Impala using the identical set of 99 trivially-modified TPC-DS queries, which can be found in our GitHub repository.

To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.

Comaring Hive to Impala

This graph breaks down a full comparison of all runtimes, across the 60 queries which could be run in both engines.

Hive versus Impala

When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:

Impala SQL Gaps

 

These missing features are not obscure SQL features. Instead they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.

Hive LLAP Performance Across All 99 Queries

For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.

Hive Query Timings

 

Hardware Configuration (Used in All Test Configurations)

9 worker nodes, each with:

  • 256 GB RAM
  • Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  • 2x HGST HUS726060AL4210 A7J0 disks for HDFS and YARN storage
  • Cisco VIC 1227 10 Gigabit Network Connection

HDP 2.6 Software Configuration

  • Ambari-managed HDP 2.6.2 stack
  • Hive version = 2.1.0.2.6.2.4-4
  • Tez version = 0.8.4.2.6.2.4-4
  • LLAP container size: 180GB (per node)
  • LLAP Heap Size: 128GB (per daemon)
  • LLAP Cache Size: 32GB (per daemon)
  • OS Setting: net.core.somaxconn set to 16k, ntpd and nscd running
  • Data: ORCFile format, scale 10,000, with daily partitions on fact tables (load scripts available in the GitHub repository).
  • Note: This HDP build will also be available in Hortonworks Data Cloud for AWS as version 1.16.5 within 7 days from this blog publication.

CDH 5.12 Software Configuration

  • CDH-managed CDH 5.12 stack
  • Impala version = 2.9
  • Impala was given all memory on the worker nodes, using all defaults provided by Cloudera Manager.
  • Data: Parquet format, scale 10,000, with daily partitions on fact tables.

Comments

  • Why don’t you ever post the results of concurrent queries like most benchmarks do? What good is a single user query engine?

    Also, you had to make “trivial” modifications to the Hive queries; Did you do the same for the Impala queries? If the answer is “no”, then why not?

    • Robert, agreed this is useful info. We didn’t try to address it in this blog because (1) the blog was already pretty lengthy, even without the supersized comments we occasionally get and (2) we wanted to focus on the key evaluation criteria we consistently see among our customers. Some users get to multi threaded testing and some don’t, single threaded numbers are always a useful point of reference.

      We will be posting more about concurrency in the future. In the meantime you can pull some data points from https://www.slideshare.net/HadoopSummit/llap-subsecond-analytical-queries-in-hive-74811820#8 which shows the relationship between response times and concurrent queries ranging from 5 up to 100. That slide doesn’t mention it but this is with respect to HDP 2.6, same as in this post.

  • Can you explain why you disallow explicitly permitted modifications per the TPC-DS specification for Apache Impala (or call out failures) yet allow non-permitted modifications (hence the use of “trivial”) for Apache Hive?

    For reference the latest TPC-DS spec can be found here:
    http://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v2.5.0.pdf

    For Apache Impala:

    No date datatype

    The claim is made that 12 TPC-DS queries require date datatype. I’d like to highlight two Clauses from the TPC-DS spec v2.5, the most recent version.

    2.2.2.1.f Date means that the column shall be able to express any calendar day between January 1, 1900 and December 31, 2199.

    2.2.2.2 The datatypes do not correspond to any specific SQL-standard datatype. The definitions are provided to highlight the properties that are required for a particular column. The benchmark implementer may employ any internal representation or SQL datatype that meets those requirements.

    Given there is clearly no TPC-DS requirement for a SQL DATE datatype and the properties of Apache Impala’s SQL TIMESTAMP datatype clearly meet the TPC-DS requirements, the claim that 12 queries “fail” for this reason is clearly false.

    No support for grouping or rollup

    While it is true that Apache Impala does not, as of v2.10, support these two SQL features, there are TPC-DS approved query variants for said queries that Apache Impala does indeed support. To claim that Apache Impala does not support these queries is to ignore the fact that TPC-DS has approved variants for said queries, and disallow the use of approved variants as found in Appendix C of the TPC-DS specification.

    Given that use of either the original or the approved variant is explicitly permitted per the TPC-DS specification, the claim that Apache Impala can not run these 8 TPC-DS queries is clearly false.

    Returns is a keyword

    While it is true that returns is a keyword in Apache Impala, nowhere in the TPC-DS specification is there a specific requirement or definition for what is, or is not a keyword. Furthermore the use of returns in TPC-DS is that of a column alias like such: sum(return_amt) as returns. Clause 4.2.3.4.e specifically permits the addition or modification of column aliases.

    Given there is clearly no TPC-DS requirement for specific keywords and it explicitly permits the addition or modification of column aliases as a permitted minor query modification, the claim that Apache Impala does not support these 3 queries is clearly false.

    No support for SQL standard string concatenation

    Nowhere in the TPC-DS specification is there any such requirement for a specific string concatenation operator. In fact, just the opposite is clearly stated in Clause 4.2.3.4.c.3:

    String concatenation operator – For queries which use string concatenation operators, vendor specific syntax can be used (e.g. || can be substituted with +).

    Apache Impala does support string concatenation using CONCAT() and substituting it for the || (double pipe) operator is clearly identified as a minor query modification and is explicitly permitted.

    Given there is clearly no TPC-DS requirement for specific string concatenation operator and vendor specific syntax is explicitly permitted, the claim that Apache Impala does not support these 2 queries is clearly false.

    No interval support

    Nowhere in the TPC-DS specification is there a stated requirement for INTERVAL support. In fact, Apache Impala does support INTERVAL expressions [1] and the use of those expressions in date math calculations clearly is permitted per Clause 4.2.3 Minor Query Modifications, and more specifically Clause 4.2.3.4.f.1:

    Date expressions – For queries that include an expression involving manipulation of dates (e.g., adding/subtracting days/months/years, or extracting years from dates), vendor-specific syntax may be used instead of the specified syntax. Examples of acceptable implementations include “YEAR()” to extract the year from a date column or “DATE() + 3 MONTHS” to add 3 months to a date.

    Given there is clearly no TPC-DS requirement for any INTERVAL support or syntax, the claim that Apache Impala does not support this query is clearly false.

    For Apache Hive:
    Additionally I’d like to contest the following Hortonworks’ claim:

    Claim of full TPC-DS query support in Apache Hive

    In this blog the heading “Hive LLAP Supports The Full Suite Of 99 TPC-DS Queries” and the following statement are mutually exclusive per TPC-DS definitions and rules:

    With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.)

    In order for a system to claim support for all TPC-DS queries, one must do so under the rules stated in the TPC-DS specification. If the queries are modified beyond the rules then the queries are thus no longer TPC-DS approved queries. The only permitted modifications to TPC-DS queries are clearly specified in Section 4.2 of the TPC-DS spec. The Hortonworks stated definition of “trivial” thus is at odds with what is permitted in Section 4.2:

    trivial – defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations

    Given that Hortonworks made modifications to the queries that fall outside the permitted query modification methods of Section 4.2, the claim that Apache Hive supports all 99 TPC-DS queries is clearly false.

    [1] http://impala.apache.org/docs/build/html/topics/impala_datetime_functions.html

    • Greg, I want to re-focus on the core themes of the post: SQL support and performance.
       
      First, SQL:
       
      We agree that alternative re-writes of TPC queries can run in Impala. Where we seem to disagree is how much this matters.  We do not claim that it is impossible to re-write TPC-DS queries to run in Impala. Cloudera makes regular updates about this and we leave it to their expertise. We took 99 queries and only 60 of these run in Impala. Should a user care? Absolutely! The missing features called out above are not obscure SQL features, they are basic expectations of any mature SQL system and their absence should be considered a severe limitation.
       
      Let’s take a concrete example. I’ve been asked dozens of times whether Hive supports EXCEPT (or MINUS), from people looking to port workloads to Hadoop from Oracle or other databases. Query 87 covers this case, compare the Hive Version at 20 lines versus Impala, at 60+ lines and a very complex rewrite. The comment above acknowledges this limitation but does not elaborate on how severe the limitation is. We know that virtually no user would attempt the complex Impala re-write, it is simply a show-stopper to adopting the technology. With Hive the SQL is ready-to-run.
       
      There are other examples and anyone is free to compare the trivial changes needed in Hive versus the complex rewrites needed in Impala. Almost anyone will agree that Hive makes it a lot easier to get TPC-DS running. The comment above doesn’t really dispute that, rather it argues that these barriers can be overcome with enough time and expertise. In our experience most customers are unwilling to invest the time and effort.
       
      To get a feel of the Hive modifications: the majority of the 35 query modifications fall into categories such as changing or adding column names, switching double quotes to backticks, use of intervals for days offsets and so forth. The changes are mechanical changes that could be automated without difficulty.
       
      If you consider the Levenshtein edit distance, the average number of changed characters per query is 29 characters and across all 99 queries only 2,876 characters changed versus the original 170,597 characters (less than 2% of the workload changed in some way). The Impala suite has several individual queries with larger changes than all Hive changes combined, such as Query 3 or Query 46. The only large re-write for Hive is Query 9. 49% of all edits in the Hive suite were because of this query, needing a rewrite because of HDP 2.6’s lack of subquery support in the select clause. Going forward even this re-write will not be needed thanks to HIVE-16091 (available as part of Hive 2.3).
       
      To net it out, with Hive you get the superior economics of a scale-out system, comparable performance and a lot less compromise.

  • Leave a Reply

    Your email address will not be published. Required fields are marked *

    If you have specific technical questions, please post them in the Forums

    You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>