In this partner guest blog, Microsoft Principal Software Development Engineer Eric Hanson weighs in on how Stinger.next will benefit HDInsight customers. Eric, who worked on Microsoft SQL Server for years and is a committer to Apache Hive, explains that the Stinger.next initiatives and capabilities are essential to taking Hive to the next level.
Hive performance has improved tremendously since the start of the Stinger initiative in the winter of 2013. Hive used to take minutes to run quite simple queries, even on a moderate amount of data, because of process startup overhead, file spooling costs, unoptimized file formats, and a CPU-hungry inner loop. Stinger made great progress on several of these problems with the ORC file format, vectorized query execution, Tez, and container (process) reuse. Queries that used to take 10 minutes or more can now complete in 20 or 30 seconds. But there is still work to be done to give Hive truly quick response times.
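To make the inner-loop point concrete, here is a minimal Python sketch (not Hive code) that contrasts row-at-a-time evaluation with batch-at-a-time, or vectorized, evaluation of a simple filter-and-sum. The function names, the batch size of 1024, and the two-column layout are assumptions made for illustration; Hive's real vectorized operators are written in Java and use their own batch structures.

```python
from typing import Iterable, List, Tuple

BATCH_SIZE = 1024  # assumed batch size, chosen only for illustration


def sum_row_at_a_time(rows: Iterable[Tuple[int, int]], threshold: int) -> int:
    """Evaluate the filter and the aggregate once per (price, qty) row."""
    total = 0
    for price, qty in rows:
        if qty > threshold:
            total += price
    return total


def sum_vectorized(prices: List[int], qtys: List[int], threshold: int) -> int:
    """Process the two columns in fixed-size batches.

    Each operator (filter, then aggregate) runs its own tight loop over a
    whole batch instead of being invoked once per row.
    """
    total = 0
    for start in range(0, len(prices), BATCH_SIZE):
        price_batch = prices[start:start + BATCH_SIZE]
        qty_batch = qtys[start:start + BATCH_SIZE]
        # Filter operator: record the positions in the batch that pass.
        selected = [i for i, q in enumerate(qty_batch) if q > threshold]
        # Aggregate operator: sum only the selected positions.
        total += sum(price_batch[i] for i in selected)
    return total
```

Both functions return the same answer; what differs is the shape of the loop. In an engine written in a compiled language, the batch form cuts per-row call overhead and makes better use of CPU caches. In Python the speedup is not guaranteed, so read this as a structural sketch rather than a benchmark.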
The primary focus for the next version of Stinger, Stinger.next, is to drive toward sub-second response time. This is great. I can tell you from personal experience with Microsoft SQL Server columnstore enhancements that customers get incredible satisfaction from a system when it feels like it’s giving instant answers. They get more creative about getting insights from their data because they aren’t inhibited by expectations of slow response time. And they don’t lose their train of thought waiting for answers to come back.
Technically, getting sub-second response time for queries requires two things:
These techniques have been used in traditional database systems for a long time, so applying them in Hive is technically feasible. Of course, it’s a big job to re-architect the Hive runtime to use these capabilities, but the Hive community has a track record of success on comparable development tasks.
The functionality of Hive is terrific for big data analysis. Its useful features include non-procedural querying with SQL, a rich external tables feature and data input format adapters (game-changers for flexible big-data analysis), an on-by-default Java user-defined function feature, window functions, and more.
The final thing required to make selecting Hive a no-brainer is more interactive performance. We’re on our way with Stinger, and Stinger.next can move Hive to instant response time and make it simply fun to use.