3 Reasons to try Stinger Phase 3 Technical Preview

Whether you were busy finishing up last-minute Christmas shopping or just taking time off for the holidays, you might have missed that Hortonworks released the Stinger Phase 3 Technical Preview back in December. The Stinger Initiative is Hortonworks’ open roadmap to making Hive 100x faster while adding standard SQL. Here we’ll discuss 3 great reasons to give the Stinger Phase 3 Preview a try to start off the new year.

Reason 1: It’s The Fastest Hive Yet

Whether you want to process more data or lower your time-to-insight, the benefits of a faster Hive speak for themselves. Stinger Phase 3 brings 3 key new components into Hive that lead to a massive speed boost.

Component | Benefit
Tez | A modern implementation of Map/Reduce. New Tez operators simplify and accelerate complex data processing natively in Hadoop.
Tez Service | Maintains warm containers and caches key information to allow fast query launch.
Vectorized Query | New execution engine that takes advantage of modern hardware architectures to accelerate computations of data in memory up to 100x.
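
As a rough illustration of the idea behind the Vectorized Query entry above, the Python sketch below contrasts row-at-a-time evaluation with batch-at-a-time evaluation over columns. It is not Hive's implementation: the function names, the sample data and the batch size of 1024 rows are assumptions made for the example, and pure Python will not show the hardware-level gains described above.

```python
from typing import Iterator, List, Tuple

BATCH_SIZE = 1024  # assumed batch size, chosen only for illustration


def sum_row_at_a_time(rows: List[Tuple[int, int]], min_qty: int) -> int:
    """Evaluate the filter and the aggregate one row at a time."""
    total = 0
    for price_cents, qty in rows:
        if qty >= min_qty:
            total += price_cents
    return total


def column_batches(
    prices: List[int], qtys: List[int], size: int = BATCH_SIZE
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield aligned slices of the two columns, one batch at a time."""
    for start in range(0, len(prices), size):
        yield prices[start:start + size], qtys[start:start + size]


def sum_vectorized(prices: List[int], qtys: List[int], min_qty: int) -> int:
    """Evaluate the same query batch by batch over columnar data."""
    total = 0
    for price_batch, qty_batch in column_batches(prices, qtys):
        # Filter step: one loop over a single column builds a selection vector.
        selected = [i for i, qty in enumerate(qty_batch) if qty >= min_qty]
        # Aggregate step: a second loop touches only the selected positions.
        total += sum(price_batch[i] for i in selected)
    return total


# Hypothetical data: 5,000 rows of (price in cents, quantity).
prices = [100 + (i % 50) for i in range(5000)]
qtys = [i % 20 for i in range(5000)]
rows = list(zip(prices, qtys))

# Both strategies answer the same query; only the evaluation order differs.
print(sum_row_at_a_time(rows, 10) == sum_vectorized(prices, qtys, 10))
```

The point of the batch form is that each inner loop does one simple thing over one column, which is the shape of work that compiled engines can run efficiently on modern CPUs.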

What does it add up to? To find out we compared Stinger Phase 3 Preview head-to-head against Hive 12 on the same hardware and over the same dataset.

[Figure: stinger-p3-1]

In this broad-based benchmark, which includes both large reporting-type queries and more targeted drill-down queries, Stinger Technical Preview shows an average 2.7x speedup versus Hive 12. Remember that Hive 12 includes all the performance benefits that have gone into Stinger Phases 1 and 2, and is the fastest Hive generally available today.

We also did some limited comparisons between Hive on Tez and Hive 10. Hive 10 pre-dates the Stinger initiative and its focus on improving Hive performance.

[Figure: stinger-p3-2]

In this limited subset of queries we see speedups ranging from 5x to 40x going from Hive 10 to Hive on Tez.

Configuration Details

Hardware:
20 physical nodes, each with:

  • 2 x 2.3GHz Xeon E5-2630, for a total of 12 cores per node.
  • 64GB RAM.
  • 6 x 1TB drives per node.
  • 1 Gigabit interconnect between the nodes.

Software:

  • Hadoop 2.3.0-SNAPSHOT
  • Tez 0.2
  • Hive 0.13 Snapshot taken from Stinger Technical Preview.
  • Hive 0.12 taken from Hortonworks HDP 2.0 GA.
  • Hive 0.10 built manually against Hadoop 0.23. (The GA HDP package is not compatible with Hadoop 2).
  • Configuration settings for Hive 12 and Hive 13 were the same as those found in the Stinger Preview Quickstart. Hive 10 settings were those found in Hortonworks HDP 1.2.

Data:

  • TPC-DS Scale 200 data, partitioned by day.
  • Hive 12 and Stinger Preview were run against data stored in ORCFile using all default settings.
  • Hive 10 was run against text data because it doesn’t support ORCFile.

Queries:

  • Queries were those published in the Hive Testbench. Queries were run as they appear in the testbench and not individually tuned. The data generator we use is also included in the Hive Testbench.
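
For a sense of the cluster's overall size, the per-node hardware figures listed above can be multiplied out. The short Python sketch below does only that arithmetic; the class and function names are made up for the example, and the disk figure is raw capacity before any replication or filesystem overhead.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NodeSpec:
    cores: int     # cores per node
    ram_gb: int    # RAM per node, in GB
    drives: int    # drives per node
    drive_tb: int  # capacity of each drive, in TB


NODES = 20
NODE = NodeSpec(cores=12, ram_gb=64, drives=6, drive_tb=1)


def cluster_totals(nodes: int, node: NodeSpec) -> Dict[str, int]:
    """Multiply the per-node figures by the number of nodes."""
    return {
        "cores": nodes * node.cores,
        "ram_gb": nodes * node.ram_gb,
        "raw_disk_tb": nodes * node.drives * node.drive_tb,
    }


totals = cluster_totals(NODES, NODE)
print(totals)  # {'cores': 240, 'ram_gb': 1280, 'raw_disk_tb': 120}
```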

Reason 2: Hive is now Interactive

Stinger Phase 3 Preview introduces the Tez Service, a persistent service that runs as a YARN Application Master. The Tez Service’s job is to facilitate fast query launch, which it does in two ways. First, the Tez Service keeps hot containers on standby to ensure fast query launch.

Second, the Tez Service caches key information such as split calculations. Any time data in Hadoop is processed, maps are assigned to splits of files on the filesystem in order to divide-and-conquer the work. This involves querying the NameNode to identify where the data is physically located and can take several seconds for large datasets. Because Tez Service caches this data, subsequent queries over the same data launch much faster.

[Figure: stinger-p3-3]

Let’s take a look at a few examples of how the Tez Service helps.

Query | Tez Cold (s) | Tez Warm (s) | Speedup from Tez Service (s)
query27 | 24.3 | 8.8 | 15.5
query79 | 80.9 | 45.2 | 35.8

Some example speedups using Tez Service.

Query 27 is a simple star-schema join involving one fact table and many dimension tables. When Tez Service has cached data and has warm containers, time to execute falls by more than 50% to under 10 seconds, which many people regard as the bar for “interactive query”.

Query 79 is a more complex fact-to-fact join that addresses much more data. Because more data is addressed, caching benefits the query even more, saving more than 30 seconds.
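
The figures quoted for these two queries can be checked directly from the cold and warm timings in the table. The Python sketch below is only that arithmetic; the helper names are invented for the example. Note that the rounded timings give 35.7 seconds saved for query79, while the table lists 35.8, presumably because the published value was computed from unrounded measurements.

```python
# Timings in seconds, copied from the table above: (cold, warm).
timings = {
    "query27": (24.3, 8.8),
    "query79": (80.9, 45.2),
}


def time_saved(cold: float, warm: float) -> float:
    """Seconds saved when the query runs warm instead of cold."""
    return round(cold - warm, 1)


def percent_reduction(cold: float, warm: float) -> float:
    """Reduction in run time, as a percentage of the cold time."""
    return round(100 * (cold - warm) / cold, 1)


for name, (cold, warm) in timings.items():
    print(name, time_saved(cold, warm), percent_reduction(cold, warm))
# query27 15.5 63.8
# query79 35.7 44.1
```

So query 27 gains the most in relative terms (its run time drops by well over half), while query 79 saves the most wall-clock time.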

[Figure: stinger-p3-4]

In the results, hot queries ran an average of 17 seconds faster than cold queries. This is a big deal for smaller, interactive queries because Hive can now run queries over large datasets in less than 10 seconds, enabling interactive query in Hadoop.

Reason 3: Hive is 100% Community Open Source

At Hortonworks we spend a lot of time talking about Hive but it’s important to remember that Hive is a community effort and represents the hard work of hundreds of individuals who either contribute privately or represent one of more than 10 companies that contribute to Hive. Through this collective effort, Hive is quickly becoming the most robust, mature and secure SQL solution for Hadoop. Apache Hive is the only SQL solution for Hadoop supported by every major Hadoop distribution. Choosing Hive means 100% Community Open Source and 0% lock-in.

Try It For Yourself

We hope you’ll try the Stinger Phase 3 Preview for yourself. All you need is an HDP 2.0 cluster or Sandbox. To get started, follow the instructions on the announcement blog post. As always, if you have questions or need help, head to the Hortonworks Forums for tips and advice.
