May 11, 2017

Ultra-fast OLAP Analytics with Apache Hive and Druid – Part 1 of 3

This is Part 1 of a three-part series (Part 2, Part 3) on ultra-fast OLAP analytics with Apache Hive and Druid.

Unlock Sub-Second SQL Analytics over Terabytes of Data with Hive and Druid

Modern corporations increasingly look for near real-time analytics and insights to make actionable decisions. To that end, this blog introduces ultra-fast analytics with Apache Hive and Druid.

We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool for accelerated business results, and we’ll share benchmark results of BI queries running in just milliseconds over a 1 TB dataset.

What is Druid?

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6, and you can read more about Druid on our project page or at the project website.

What Makes Druid Great for Low-Latency Analytics?

Druid is great for low-latency analytics because it combines the best qualities of a column store with inverted indexing. The benefits of column stores are well known: they minimize I/O costs for analytical queries.

Many analytical applications require drilling down to extremely fine levels of detail, e.g. “How much did we sell of Brand ‘Best Choice’ on this particular date, broken down by State”. A query like this needs 5 or 10 rows out of potentially billions of total rows. An inverted index makes it possible to load only those specific rows you need to satisfy the query.
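For concreteness, a drill-down of that shape might look like the minimal Hive query below; the table and column names (store_sales, brand, sold_date, state, sales_amount) are illustrative rather than taken from any particular schema.

-- Illustrative drill-down: touches only a handful of rows out of billions.
SELECT state,
       SUM(sales_amount) AS total_sales
FROM store_sales
WHERE brand = 'Best Choice'
  AND sold_date = '2017-05-01'
GROUP BY state;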

Figure 1: Druid combines the best qualities of a column store and inverted indexing

This is a huge departure from the existing crop of SQL-on-Hadoop solutions like Apache Hive or SparkSQL, which rely on columnar storage to provide high-throughput aggregation but do not deal well with finding the “needles in the haystack”. Druid was built from the ground up to be great at this: as more dimension filters are added, fewer rows are needed and Druid queries run even faster.

Druid Sounds Great! Should I Use It For Everything?

Druid’s strong points are very compelling, but there are many important problems that Druid does not attempt to solve. The most important of these is joins. Druid has some basic capabilities to join its internal data against small dimension tables loaded from external systems (this is called a query-time lookup in Druid terms). This ability aside, the Druid authors make it clear that large-scale join support is not a priority.

Second, Druid’s native SQL implementation is extremely new. Though progress is being made, SQL is a huge and complex language, and it is unlikely that we will see comprehensive SQL analytical capabilities such as set operators and windowing functions any time soon. In addition, we can expect that Druid’s SQL implementation will not support heavy-duty joins.

Integrating Apache Hive and Druid

Although scalable joins and comprehensive SQL are not core project goals for Druid, they are core project goals for Apache Hive. Integrating these projects blends their capabilities nicely. More importantly perhaps, querying through Hive means that any analytics tool can now query Druid using mature and high-quality ODBC/JDBC drivers that Hortonworks provides.

Apache Hive supports federation between native and external data sources through its StorageHandler interfaces while Apache Calcite provides a flexible way to optimize and re-write queries to divide work among different engines based on their strengths. Substantial progress has been made, with Read Path Integration allowing you to query arbitrary Druid tables from Hive and Write Path Integration allowing you to create and update Druid tables from Hive.

In the detailed example below we will use the Write Path Integration to create OLAP cubes based on data in Hive. After doing that, the OLAP cube appears as an ordinary Hive table and can be queried with any Hive SQL expression, thanks to the Read Path Integration.

Figure 2: Pre-Aggregate into Druid using Hive SQL
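As a rough sketch of the Write Path Integration (using an illustrative source table, sales_fact; the exact properties depend on your data and cluster), an OLAP cube is materialized into Druid with an ordinary CREATE TABLE AS SELECT that names the Druid storage handler. The integration expects a timestamp column named `__time` as the first column of the select list.

-- Sketch: materialize a Druid-backed OLAP cube from an existing Hive table.
CREATE TABLE sales_olap_cube
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",    -- one Druid segment per month
  "druid.query.granularity" = "DAY")        -- roll data up to daily grain
AS
SELECT
  CAST(sold_date AS timestamp) AS `__time`, -- required Druid time column
  state, brand, product_name,               -- dimensions
  sales_amount, quantity                    -- measures
FROM sales_fact;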

The result is that Druid essentially provides an OLAP index over data in Hive, offering pre-aggregation and fast access to the needles in the haystack, thanks to Druid’s inverted indexing.
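Because the cube behaves like a regular Hive table, it can also participate in arbitrary Hive SQL, for example a join against a plain Hive dimension table (again with illustrative names); the Druid-friendly parts of the plan are pushed down while the join itself runs in Hive, as described in the next section.

-- Sketch: join the Druid-backed cube with an ordinary Hive table.
SELECT p.category,
       SUM(c.sales_amount) AS total_sales
FROM sales_olap_cube c
JOIN product_dim p ON c.product_name = p.product_name
GROUP BY p.category;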

Fast or Deep

We mentioned how Apache Calcite divides work according to each engine’s strengths. Simple queries can be answered directly from Druid and benefit from Druid’s extensive OLAP optimizations. For more complex operations, Hive pushes work down into Druid where it can, then runs the remaining parts of the query itself within the YARN cluster.

Figure 3: Simple queries are pushed directly to Druid

What does “simple” mean in this context? Druid is well suited for Top N queries, time-series queries and simple aggregations grouped by defined dimensions. Thanks to the Calcite integration, Hive is able to fill in any gaps in the SQL-like capabilities Druid exposes.
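A sketch of such a “simple” query (illustrative names again): a Top N over the cube with a time-range filter, which maps naturally onto Druid’s topN/groupBy query types and can be answered entirely within Druid.

-- Sketch: top 10 brands by revenue for April 2017, answerable by Druid alone.
SELECT brand,
       SUM(sales_amount) AS total_sales
FROM sales_olap_cube
WHERE `__time` >= CAST('2017-04-01 00:00:00' AS timestamp)
  AND `__time` <  CAST('2017-05-01 00:00:00' AS timestamp)
GROUP BY brand
ORDER BY total_sales DESC
LIMIT 10;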

Figure 4: Types of Analytics

Future Directions

Hive/Druid integration remains an active area of development. Some improvements you can expect include:

  • Materialized View Navigation: You noticed above that we re-wrote the original query to target the OLAP index. This is good enough for some use cases: feel free to hook Tableau or other BI tools up to these tables, because they act like regular Hive tables. It doesn’t give you transparent acceleration, though. In the future, we will use the Calcite layer to transparently re-write queries to use the OLAP index when possible; you can track this work in CALCITE-1731.
  • Indexing Speed: Building the OLAP index from Hive is bottlenecked because there is a 1-1 relationship between segments and reducers. If you use coarse-grained segments, such as monthly segments, you will experience long load times; for example, building the index for this benchmark at 1 TB takes about 3 hours. In the future we will support parallel writes to segments, which should reduce build times by 5x or more.
  • Seamless Lambda Architectures: Our goal is fast, flexible analytics capable of combining real-time and historical data. HDP 2.6 lays a solid foundation by unlocking deep SQL over data streamed to Druid, while at the same time simplifying historical data maintenance with ACID MERGE. In the future, look for better ability to perform joins between real-time and historical data, and simplified ingestion / retention into the historical layer.
  • AtScale Integration: Hive/Druid integration provides a compelling query engine that supports OLAP-style queries, but it is not a complete OLAP platform in itself.  For example, while Hive provides a SQL query interface to Druid tables, many OLAP solutions need to support the MDX (multi-dimensional expression) query language. Additionally, Druid and Hive don’t express native OLAP concepts such as hierarchies or multi-fact relationship constructs. As such, we’re excited that AtScale will be enhancing their Adaptive Cache technology to provide native support for writing and reading Druid-backed tables using Hive. AtScale + Hive/Druid together will make an ideal platform for OLAP-style analysis: the combined solution delivers a robust, scalable data platform for large scale OLAP style queries while providing business users (and their BI tools of choice) the ability to use a consistent and robust OLAP interface using either SQL or MDX.

Hortonworks Makes Druid Easy

Druid is available as a Technical Preview in HDP 2.6. In addition to the Hive/Druid integration, Hortonworks has made it easy to deploy, configure and monitor Druid using Apache Ambari, so getting started is simple. With these simplifications, a knowledgeable Hadoop user should be able to reproduce anything in this document within a few hours with the help of the materials on GitHub. We encourage you to try Druid, in HDP or on Hortonworks Data Cloud, and give us your feedback in the Hortonworks Community. Please see the blog from our partner AtScale to learn more about AtScale + Hive/Druid.

 

Comments

  • This looks great! May I know why you have chosen Druid? HBase is the native NoSQL columnar DB for Hadoop, and Hive already supports HBase integration.

    Does Druid run on Hadoop / HDFS with the default replication factor of 3 enabled?

    • Yes, Druid runs on HDFS.
      Unlike HBase, Druid is more than a key-value store. Druid has indexes and bitmaps to filter the data and perform aggregation as part of the scan.

    • Hortonworks evaluated several OLAP engines before settling on Druid; no surprise that Kylin was one of those. All of the solutions had pros and cons.

      We favored Druid mostly because it was built from the ground up to support real-time analytics over streaming data. One of the biggest trends in analytics is that in future architectures, analytics will mostly be consumed by software rather than by people. With software there is an ability, and thus a need, to have latencies of both data freshness and analytics measured in milliseconds. Druid had that from day one while other technologies struggle to get even to minutes.

      On the SQL point, Kylin claims support for “Most ANSI SQL query functions” on their website. As far as I know they don’t use the Hive dialect. While it’s difficult to tell what they do and don’t support based on that statement, it’s likely that Hive can do more. For one data point, with HDP 2.6 Hive is able to run all 99 TPC-DS queries with some trivial modifications. Whether Kylin can do that would be an easy question to pose to a Kylin expert.

  • Do we have any performance stats for Druid vs. HBase using Hive or Presto? I would like to understand what the use cases are for Druid.

  • Wait! Wouldn’t using Presto as the execution engine make way more sense given what you are trying to achieve?! If there’s a large join here, would we get an actual Hadoop reduce phase? In the case of a map join, who’s doing the joining, Hive or Calcite?

    Is Calcite what is doing the predicate and aggregation pushdown? From the Calcite tutorial they write:
    Current limitations: The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite. Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.

    • Max, Hive has come a long way since Facebook open sourced it back in the day. With Hive LLAP you have an MPP-style architecture which avoids the need to launch new YARN jobs for every query and supports very low latency. True, if you ran this setup on CDH or MapR the performance would be garbage, but using modern bits like we do in HDP you can absolutely get low-latency analytics.

      Equally important is that the Druid/Calcite adapter does indeed support push-down. Those docs may be out of date, or maybe they only apply to the JDBC connector. I’ve asked the guy who knows to comment on this thread.

      On this blog there used to be a Part 2 in this series, which seems to have been deleted, so I have to refer to the absolutely unauthorized copy the DZone guys made (thanks, I guess): https://dzone.com/articles/ultra-fast-olap-analytics-with-apache-hive-and-dru-1

      On a 1 TB data set the average response time was < 1s, measured from the SQL query entering Hive to the final row returned over JDBC. Most of the work is being done in Druid. In cases where push-down is impossible, as much work as possible is done in Druid and the last mile is finished in Hive. Again, with LLAP, no YARN jobs are launched.

      • What are the requirements to get this working? I’d assume a certain version of Hortonworks’ packaged Hive and a certain minimum version of Druid?

    • Maxime, let me confirm what Carter mentioned previously. The documentation that you checked applies to the JDBC adapter only.

      Hive uses Calcite as its query optimizer, i.e., it does not use Calcite’s execution internals. Once we declare or create a table in Hive that is stored in Druid, Calcite (within Hive) is responsible for “pushing” as much computation as possible to Druid during the optimization phase, including filtering, projections, aggregations, sorting and limits. The rest of the query is optimized as usual, including join algorithm selection. Thus, once the results are obtained from Druid, the rest of the query execution is done in Hive.
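      One way to see what was pushed down (a sketch; the table name is illustrative) is to run EXPLAIN on a query over a Druid-backed table: the table scan over the Druid table should show the generated Druid query in its properties, so you can inspect which filters and aggregations were handed to Druid.

      -- Sketch: inspect the Druid query generated by Hive/Calcite.
      EXPLAIN
      SELECT state,
             SUM(sales_amount) AS total_sales
      FROM sales_olap_cube
      GROUP BY state;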

  • We are using HDP 2.6.2 and have installed Druid, but because of the Hive version included in HDP we weren’t able to test the Hive integration with Druid. Is there any chance of upgrading Hive in HDP to at least version 2.2?

    • Enable Hive Interactive (LLAP) in Ambari and use the associated JDBC URL for HiveServer2 Interactive to load into Druid.
