Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive

Follow-up on the Apache Hive webinar

On May 15, Owen O’Malley and Carter Shanklin hosted the second of our seven Discover HDP 2.1 webinars. Owen and Carter discussed the Stinger Initiative and the improvements to Apache Hive that are included in HDP 2.1:

  • Faster queries with Hive on Tez, vectorized query execution and a cost-based optimizer
  • New SQL semantics and datatypes
  • SQL-standard authorization
  • The Hive job visualizer in Apache Ambari
  • And many more

Here is the complete recording of the webinar.

Here is the presentation deck.

Attend our next Discover HDP 2.1 webinar on Wednesday, May 21 at 10am Pacific Time: Apache Falcon for Data Governance in Hadoop

We’re grateful to the many participants who joined this webinar and asked excellent questions. Here is the complete Q&A from the webinar:

Q: Are you using the Tez engine for processing these queries?
A: Yes. We used Tez for all queries in the demo. Note that the query output for Tez looks different from what MapReduce would produce.
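As a quick illustration, the execution engine can be switched per session before running a query (property names as of Hive 0.13; the `customers` table is hypothetical):

```sql
-- Run subsequent queries on Tez instead of MapReduce
SET hive.execution.engine=tez;

SELECT state, COUNT(*) AS n
FROM customers
GROUP BY state;

-- Switch back to MapReduce if needed
SET hive.execution.engine=mr;
```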
Q: Does Tez leverage in-memory processing?
A: Yes. There are various ways in which we take advantage of memory. Hive automatically figures out which tables can be held in memory and then streams the large table through them. Older versions of Hive could not do that.
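The automatic map join that holds small tables in memory is controlled by settings like these (a sketch; the size threshold shown is illustrative, not a recommendation):

```sql
-- Let Hive decide which small tables to load into memory for map joins
SET hive.auto.convert.join=true;

-- Approximate combined size (in bytes) of small tables allowed in memory
SET hive.auto.convert.join.noconditionaltask.size=100000000;
```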
Q: Are there any plans to integrate ORC with other tools like Crunch, Cascading, Spark, Giraph?
A: Yes. We’ve got plans on our roadmap to have native support for Cascading, Crunch, and other services. We’ve already introduced native ORCFile support for Pig. We still have some additional work to completely decouple that from the Hive engine so that it’s very easy to use outside of Hive.
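Within Hive itself, storing a table as ORC is a one-line change in the DDL. A sketch (table and column names are illustrative):

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");  -- ZLIB is ORC's default codec
```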
Q: Does Hive use only one reducer?
A: No. It depends on what you’re doing. Hive automatically decides how many reducers it thinks it needs. There are certain operations (e.g., ORDER BY) where only one reducer can be used, but otherwise Hive will ascertain the number of reducers on its own.
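If you do want to influence the reducer count yourself, Hive exposes settings like these (the values shown are illustrative):

```sql
-- Target data size per reducer; Hive derives the reducer count from this
SET hive.exec.reducers.bytes.per.reducer=256000000;

-- Or force an explicit number of reducers for the session
SET mapred.reduce.tasks=8;
```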
Q: Are ORC stats used to optimize the number of map tasks? For example, if I have a very wide table but want to read only two columns, will Tez/Hive notice this and start fewer tasks than a full table scan would need?
A: We are improving statistics gathering for Hive 0.14. The cost-based optimizer is a main thrust for Hive 0.14, and it will use the stats to scale the job up or down appropriately.
Q: For this demo, are you using Ambari 1.5.1?
A: Yes. We showed Ambari 1.5.1.
Q: When should you not use Tez with Hive?
A: Tez does not yet support a few Hive features, such as SMB join, SELECT TRANSFORM, and indexes. For now, you should still run those queries on MapReduce.

The reverse of that question is, “When should I use Hive on Tez?” If you need interactive query, you need Hive on Tez.

The long-term direction of the roadmap is to completely replace MapReduce with Tez.

Q: Can you compare the performance of Hive on Tez with Impala?
A: Absolutely. We see people doing that every day. If you compare, make sure you see how the systems perform at scale and under a variety of workloads. Don’t make a decision using 5 GB of data.

Make sure to use a large amount of data and the SQL semantics that you need for deep processing of the data. Hive has many more SQL features than other SQL tools for Hadoop offer.

Q: What is the maximum number of joins a query can perform? Can we use outer, left, right, and inner joins?
A: One query we regularly run has a 9-way join with a mixture of inner and left outer joins. There is no hard-coded limit on the number of joins in one query. Hive on MapReduce and Hive on Tez support all the same join types.
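A mixed multi-way join like the one described might look like this (tables and key columns are hypothetical):

```sql
SELECT o.order_id, c.name, p.title, s.carrier
FROM orders o
JOIN customers c            ON o.customer_id = c.id        -- inner join
JOIN products  p            ON o.product_id  = p.id        -- inner join
LEFT OUTER JOIN shipments s ON o.order_id    = s.order_id; -- keeps unshipped orders
```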
Q: Can we use Tez to query HBase tables?
A: Yes, Hive on Tez can be used with HBase tables.
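Querying HBase from Hive goes through the HBase storage handler. A sketch of mapping an existing HBase table into Hive (table names and the column mapping are illustrative):

```sql
CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, city STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "users");

-- With hive.execution.engine=tez, this aggregation runs on Tez
SELECT city, COUNT(*) FROM hbase_users GROUP BY city;
```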

Visit our Stinger: Interactive Query for Hive labs page to learn more.
