Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
On May 15, Owen O’Malley and Carter Shanklin hosted the second of our seven Discover HDP 2.1 webinars. Owen and Carter discussed the Stinger Initiative and the improvements to Apache Hive that are included in HDP 2.1:
- Faster queries with Hive on Tez, vectorized query execution and a cost-based optimizer
- New SQL semantics and datatypes
- SQL-standard authorization
- The Hive job visualizer in Apache Ambari
- And many more
Here is the complete recording of the webinar.
Here is the presentation deck.
Attend our next Discover HDP 2.1 webinar on Wednesday, May 21 at 10am Pacific Time: Apache Falcon for Data Governance in Hadoop
We’re grateful to the many participants who joined this webinar and asked excellent questions. Here’s the complete Q & A from the webinar:
|Are you using the Tez engine for processing these queries?||Yes. We used Tez for all queries in the demo. Notice that the output of the query is different for Tez than it would be for MapReduce.|
|Does Tez leverage in-memory processing?||Yes. We take advantage of memory in several ways. We automatically figure out which tables to bring into memory and then stream the large table through them. Older versions of Hive could not do that.|
|Are there any plans to integrate ORC with other tools like Crunch, Cascading, Spark, Giraph?||Yes. We’ve got plans on our roadmap to have native support for Cascading, Crunch, and other services. We’ve already introduced native ORCFile support for Pig. We still have some additional work to completely decouple that from the Hive engine so that it’s very easy to use outside of Hive.|
|Does Hive use only one reducer?||No. It depends on what you’re doing. Certain operations (e.g. ORDER BY) have to run on a single reducer, but otherwise Hive automatically decides how many reducers it needs.|
|Are ORC stats used to optimize the number of map tasks? For example, if I have a very wide table but want to read only two columns, will Tez/Hive notice this and start fewer tasks than a full table scan would need?||We are improving the statistics gathering for Hive 0.14. The cost-based optimizer is a main thrust for Hive 0.14, and it will use the stats to scale the job up or down appropriately.|
|For this demo, are you using Ambari 1.5.1?||Yes. We showed Ambari 1.5.1.|
|When should you not use Tez with Hive?||Tez does not yet support a few Hive features, such as SMB join, SELECT TRANSFORM, and indexes, so for now you should still run those queries on MapReduce. The reverse of that question is, “When should I use Hive on Tez?” If you need interactive queries, use Hive on Tez. The long-term direction of the roadmap is to completely replace MapReduce with Tez.|
|Can you compare the performance of Hive on Tez with Impala?||Absolutely. We see people doing that every day. If you compare, make sure that you see how the systems perform at scale and under a variety of workloads. Don’t make a decision using 5 GB of data. Use a large amount of data and the SQL semantics that you need for deep processing of that data. Hive has many more SQL features than you’ll get from other SQL tools for Hadoop.|
|What is the maximum number of joins that a query can perform? Can we use outer, left, right, and inner joins?||One query we regularly run has a 9-way join with a mixture of inner and left outer joins. There’s no hard-coded limit to the number of joins in one query. Hive on MR and Hive on Tez support all the same join types.|
|Can we use Tez to query HBase Tables?||Yes, Hive on Tez can be used with HBase tables.|
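The answer above about when not to use Tez can be turned into a small per-query check. The following is only a minimal sketch in Python, not part of the webinar: the feature labels and the helper names `choose_engine` and `engine_statement` are hypothetical, and the unsupported-feature list is simply the one given in the Q&A for HDP 2.1. The only Hive-specific piece is the `hive.execution.engine` session setting, which accepts `mr` or `tez`.

```python
# Features the Q&A lists as not yet supported by Hive on Tez in HDP 2.1.
# The string labels are hypothetical tags chosen for this illustration.
TEZ_UNSUPPORTED = frozenset({"smb_join", "select_transform", "indexes"})


def choose_engine(features_used):
    """Return 'mr' if the query relies on a feature Tez lacks, else 'tez'."""
    return "mr" if TEZ_UNSUPPORTED & set(features_used) else "tez"


def engine_statement(features_used):
    """Build the Hive session setting that selects the execution engine."""
    return "SET hive.execution.engine={};".format(choose_engine(features_used))


if __name__ == "__main__":
    # A query that uses SELECT TRANSFORM falls back to MapReduce.
    print(engine_statement(["select_transform"]))
    # An ordinary join query can run on Tez.
    print(engine_statement(["left_outer_join"]))
```

The emitted statement would be run at the start of a Hive session, before the query itself. How a query gets tagged with the features it uses is left out of this sketch.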
Visit our Stinger: Interactive Query for Hive labs page to learn more.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.