July 25, 2017

SQL and Hadoop Query Performance Smackdown

LLAP delivers the fastest execution among the SQL engines!

Comcast is one of the nation’s leading providers of communications, entertainment and cable products and services. Headquartered in Philadelphia, PA, they employ over 100,000 people nationwide whose goal is to deliver the highest level of service and improve the customer experience. Comcast decided to run what they call their “Hadoop Query Performance Smackdown” for SQL engines.

The Comcast Big Data team runs an enterprise data lake with over 1,000 daily active users on 70 racks, with petabytes of usable enterprise data available via Hive tables. Their use cases range from ad hoc jobs to batch and streaming data to reporting. They wanted to pick the SQL engine that would give them the best performance for the most practical use cases, so they ran tests against MapReduce, Spark, LLAP, Tez and Presto. The end result was a SQL engine to recommend to the CTO and the community.

Their test methodology used the TPC-DS queries defined in the Hive benchmark for each of the SQL engines. Each query was run one at a time so it could utilize all the resources of the cluster, and the team took care to tune and configure each engine. Furthermore, each query was run three times to rule out anomalies, and the team calculated an average run time from the three rounds. Take notice of how the tests run against LLAP have much faster execution times than the other engines. LLAP had been described as best optimized only for ORC, but the Comcast team found that it achieved much better performance across the board. LLAP was by far the fastest engine, ahead of Tez and Presto. SparkSQL did not manage to complete the benchmark successfully.


To learn more, read the Datanami article.

In addition, watch the Comcast YouTube video session from the Hortonworks DataWorks Summit on June 14, 2017 in San Jose, CA to learn how you can use these results to help guide your company’s big data initiative toward supporting interactive queries.

For more information on boosting Apache Hive performance with LLAP, read our blog here.


Deepak says:

Is LLAP best optimized only for native Hive with ORC?
Our run with the Hive storage handler (with LLAP) showed performance degradation compared to Hive with Tez.
The benchmark suite was TPC-H with 10 GB of data.

Carter Shanklin says:

Hive, whether you use LLAP or not, works best with ORCFile because of optimizations like vectorization. That said, LLAP works with any Hive format, but you only get partial benefits. Most importantly, LLAP’s persistent architecture means you don’t spin up containers per query, so you get lower latency and higher concurrency than you would with older execution engines.
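To illustrate the point about ORCFile, here is a minimal HiveQL sketch of creating a table stored as ORC so that optimizations like vectorization apply. The table and column names are illustrative, not from the Comcast benchmark.

```sql
-- Hypothetical example: an ORC-backed table, the format Hive's
-- vectorized execution and LLAP caching are optimized for.
CREATE TABLE page_views_orc (
  user_id  BIGINT,
  url      STRING,
  view_ts  TIMESTAMP
)
STORED AS ORC;

-- Vectorized execution is governed by this setting
-- (enabled by default in recent Hive releases):
SET hive.vectorized.execution.enabled = true;
```

A table in another format (text, Avro, etc.) still runs under LLAP and benefits from the persistent daemons, but misses the ORC-specific optimizations described above.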

When you say storage handler it’s not clear where your data is held (it could be in a remote system like HBase, for example), so it’s difficult to say what you should expect to see.

In HDP 2.6, LLAP only caches data that is in ORCFile format, but that will soon change with the introduction of Parquet caching. In principle, LLAP can cache any form of data, including whatever your storage format produces.

Nitin Pasumarthy says:

How does LLAP compare with columnar stores like Vertica for aggregate queries?

Tanmay says:

Hi, does LLAP support optimized non-equi joins?

Carter Shanklin says:

Yes, Hive LLAP does support optimized non-equi joins.
In more detail, Apache Hive 2.2 added support for non-equijoins (HIVE-15211) while Apache Tez recently added the ability to run non-equijoins (aka theta joins) in a parallel fashion (TEZ-2104). This is all enabled in Hive LLAP within HDP 2.6.
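A non-equi (theta) join of the kind described above matches rows on a condition other than equality, for example assigning each event to the time window it falls inside. The following HiveQL sketch uses illustrative table and column names, not anything from the thread:

```sql
-- Hypothetical non-equi join: pair each event with the promotion
-- whose date range contains the event timestamp.
SELECT e.event_id,
       p.promo_id
FROM events e
JOIN promotions p
  ON e.event_ts >= p.start_ts
 AND e.event_ts <  p.end_ts;
```

Before HIVE-15211, a join like this had to be expressed as a cross join with a WHERE filter; with the Tez-side work (TEZ-2104), Hive LLAP in HDP 2.6 can also execute it in parallel.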

