Delivering on Stinger: a Phase 3 Progress Update

With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it’s seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem for not just extremely large-scale, batch queries but also for low-latency, human-interactive queries.

You can catch me at our session ‘Apache Hive & Stinger: Petabyte Scale SQL, IN Hadoop’ along with Owen and Alan where we’ll be happy to dive into more of the details. It’s on 30th Oct, at 11:50am, Gramercy Suite B.

Many of you have heard of Project Stinger already, but for those who have not, Stinger is a community-facing roadmap laid out to improve Hive’s performance 100x and bring true interactive query to Hadoop. You can read more at www.hortonworks.com/labs/stinger.

We’ve gotten really excited lately as we’ve started to piece together the performance gains brought on by the past 9 months of hard work, including more than 700 closed Hive JIRAs and the launch of Apache Tez, which moves Hadoop beyond batch into a truly interactive big data platform.

Query Performance

Let’s start with an update of queries we first reported back in March of this year during Stinger Phase 1.

stinger1

We’re very excited to see the progress Hive has made toward becoming a truly interactive SQL system built entirely in Hadoop.

As a benchmark, TPC-DS does a great job simulating decision support around retail operations, but it’s certainly not the only benchmark around. One benchmark that takes a more web-centric view is the Berkeley AMPLab Big Data Benchmark, which defines queries that range from simple filters to more complex revenue analytics.

stinger2

stinger3

stinger4

This isn’t an apples-to-apples comparison since we can’t run hive-0.10 on ORC etc., but the key point here is that with upcoming hive-0.13, we get very low latency SQL queries with Hive using Apache Tez.

Interactive Query, In Hadoop

Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.

Here’s a list of Stinger Phase 3 features and how they contribute to making Hadoop truly interactive:

stinger5

Looking Ahead

We’ve mentioned before that we will be releasing Stinger 3 Beta in Q4 and are still working steadily to that goal. For everyone excited to take a look at Interactive Query in Hadoop using Hadoop’s de-facto SQL engine, you’ll be able to try it out for yourself soon.

Join us on 10/30 at 11:50am, Gramercy Suite B for ‘Apache Hive & Stinger: Petabyte Scale SQL, IN Hadoop

Categorized by :
Architect & CIO Data Analyst & Scientist Developer Hadoop in the Enterprise HDP 2 Hive Performance Tez YARN

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

Stinger Initiative

The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. More »

Join the Webinar!

YARN Ready – Integrating to YARN using Slider (part 2 of 3)
Thursday, August 7, 2014
12:00 PM Eastern / 9:00 AM Pacific

More Webinars »

Integrate with existing systems
Hortonworks maintains and works with an extensive partner ecosystem from broad enterprise platform vendors to specialized solutions and systems integrators.
Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower cost, higher capacity infrastructure.
Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade having been built, tested and hardened with enterprise rigor.