Modern corporations are increasingly looking for near real time analytics and insights to make actionable decisions. To fuel this, this blog introduces Ultra fast analytics with Apache Hive and Druid.
We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results. And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.
Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.
Druid is great for low latency analytics because it combines the best qualities of a column store and inverted indexing. The benefits of column stores are well known, because they minimize I/O costs for analytical queries.
Many analytical applications require drilling down to extremely fine levels of detail, e.g. “How much did we sell of Brand ‘Best Choice’ on this particular date, broken down by State”. A query like this needs 5 or 10 rows out of potentially billions of total rows. An inverted index makes it possible to load only those specific rows you need to satisfy the query.
This is a huge departure from the existing crop of SQL-on-Hadoop solutions like Apache Hive or SparkSQL which rely on columnar storage to provide high-throughput aggregation, but do not deal well with finding the “needles in the haystack”. Druid was built from the ground up to be great at this: As more dimensions are added, fewer rows are needed and Druid queries run even faster.
Druid’s strong points are very compelling but there are many important problems that Druid does not attempt to solve. Most important of these is joins. Druid has some basic capabilities to join its internal data against small dimension tables loaded from external systems (this is called query-time lookup in Druid terms.) This ability aside, the Druid authors make it clear that large-scale join support is not a priority.
Second, Druid’s SQL native implementation is extremely new. Though progress is being made, SQL is a huge and complex language and it is unlikely that we will see comprehensive SQL analytical capabilities such as set operators and windowing functions any time soon. In addition, we can expect that Druid’s SQL implementation will not support heavy-duty joins.
Although scalable joins and comprehensive SQL are not core project goals for Druid, they are core project goals for Apache Hive. Integrating these projects blends their capabilities nicely. More importantly perhaps, querying through Hive means that any analytics tool can now query Druid using mature and high-quality ODBC/JDBC drivers that Hortonworks provides.
Apache Hive supports federation between native and external data sources through its StorageHandler interfaces while Apache Calcite provides a flexible way to optimize and re-write queries to divide work among different engines based on their strengths. Substantial progress has been made, with Read Path Integration allowing you to query arbitrary Druid tables from Hive and Write Path Integration allowing you to create and update Druid tables from Hive.
In the detailed example below we will use the Write Path Integration to create OLAP cubes based on data in Hive. After doing that, the OLAP cube appears as an ordinary Hive table and can be queried with any Hive SQL expression, thanks to the Read Path Integration.
The result is that Druid provides essentially an OLAP Index over data in Hive, providing pre-aggregation and fast access to the needles-in-a-haystack, thanks to Druid’s inverted indexing.
We mentioned how Apache Calcite divides work according to the engine’s strengths. Simple queries can be answered directly from Druid and benefit from Druid’s extensive OLAP optimizations. More complex operations will push work down into Druid when it can, then run the remaining bits of the query in Hive within the YARN cluster.
What does “simple” mean in this context? Druid is well suited for things like Top N queries, time-series queries and for simple aggregations grouped by defined dimensions. Thanks to the Calcite integration, Hive is able to cement in any gaps in the SQL-like capabilities Druid exposes.
Hive / Druid integration remains an active area. Some improvements you can expect include:
Druid is available as a Technical Preview in HDP 2.6. In addition to the Hive/Druid integration, Hortonworks has made it easy to deploy, configure and monitor Druid using Apache Ambari, making it easy to get started. With these simplifications, a knowledgeable Hadoop user should be able to reproduce anything in this document within a few hours with the help of the materials on GitHub. We encourage you to try Druid, in HDP or on Hortonworks Data Cloud and give us your feedback in the Hortonworks Community.