September 09, 2014

Stinger.next: The Next Step in Hive Performance

In this partner guest blog, Microsoft Principal Software Development Engineer Eric Hanson weighs in on how Stinger.next will benefit HDInsight customers. Eric, who worked on Microsoft SQL Server for years and is a committer to Apache Hive, explains that the Stinger.next initiatives and capabilities are essential to taking Hive to the next level.

Apache Hive is one of the most-used features of Microsoft’s cloud Hadoop service, Azure HDInsight. So of course our HDInsight customers will enjoy new capabilities that make Hive faster.

Hive performance has improved tremendously since the start of the Stinger initiative in the winter of 2013. Hive used to take minutes to run quite simple queries even on a moderate amount of data, due to process startup overhead, file spooling costs, un-optimized file formats, and a CPU-hungry inner loop. Stinger made great progress on a bunch of these problems with the ORC file format, vectorized query execution, Tez, and container (process) reuse. Queries that used to take 10 minutes or more can now complete in 20 or 30 seconds. But there is still work to be done to give Hive truly quick response times.

The primary focus for the next version of Stinger, Stinger.next, is to drive toward sub-second response time. This is great. I can tell you from personal experience with Microsoft SQL Server columnstore enhancements that customers get incredible satisfaction from a system when it feels like it’s giving instant answers. They get more creative about getting insights from their data because they aren’t inhibited by expectations of slow response time. And they don’t lose their train of thought waiting for answers to come back.

Technically, getting sub-second response time for queries requires two things:

  1. Fast per-row query processing – Stinger enhancements have already driven down the per-row query overhead in Hive to quite low levels. Efficient Apache Tez pipelines and vectorized query execution both contributed to making query execution faster. They’ll be improved further in Stinger.next.
  2. Low query setup time – This is the key next technical step for Hive to get quick response time. The plan for Stinger.next is to use a multi-threaded service process (daemon) called Live Long and Process (LLAP) on each node. LLAP will maintain an in-memory data cache. Keeping a long-lived process and its cache around between queries will dramatically reduce process startup costs, I/O latency, and deserialization overhead. The LLAP architecture squarely addresses the goal of interactivity.
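To make the second point concrete, here is a minimal Python sketch of why a long-lived process with an in-memory cache cuts per-query setup work. It is a toy model, not LLAP code: the class `LongLivedWorker`, the helper `decode_column`, and the sample `sales` data are all hypothetical names invented for illustration.

```python
from typing import Dict, List


def decode_column(raw: bytes) -> List[int]:
    """Hypothetical stand-in for reading and deserializing one column."""
    return [int(token) for token in raw.decode("ascii").split(",")]


class LongLivedWorker:
    """Toy model of a daemon that stays up between queries.

    Not LLAP itself: it only shows that decoded data kept in memory
    does not have to be read and deserialized again by later queries.
    """

    def __init__(self, storage: Dict[str, bytes]) -> None:
        self.storage = storage
        self.cache: Dict[str, List[int]] = {}
        self.decode_calls = 0  # how many times we paid the decode cost

    def column(self, name: str) -> List[int]:
        # Decode only on a cache miss; later queries reuse the result.
        if name not in self.cache:
            self.decode_calls += 1
            self.cache[name] = decode_column(self.storage[name])
        return self.cache[name]

    def sum_where_greater(self, name: str, threshold: int) -> int:
        # A stand-in "query": filter a column and aggregate it.
        return sum(v for v in self.column(name) if v > threshold)


storage = {"sales": b"5,12,7,30,1"}
worker = LongLivedWorker(storage)

first = worker.sum_where_greater("sales", 6)    # cold: decodes the column
second = worker.sum_where_greater("sales", 10)  # warm: served from the cache
print(first, second, worker.decode_calls)
```

Both queries run against the same worker, so the column is decoded once rather than once per query. A process started fresh for each query would pay that cost every time, on top of its own startup.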

These techniques have been used in traditional database systems for a long time. So applying them in Hive is technically feasible. Of course, it’s a big job to re-architect the Hive runtime to use these capabilities, but the Hive community has a track record of success on comparable development tasks.

The functionality of Hive is terrific for big data analysis. Hive’s useful features for this include non-procedural querying with SQL, a rich external tables feature and data input format adapters (game-changers for flexible big-data analysis), an on-by-default Java user-defined function feature, window functions, and more.

The final thing required to make selecting Hive a no-brainer is more interactive performance. We’re on our way with Stinger, and Stinger.next can move Hive to instant response time and make it simply fun to use.
