Apache Hive

The standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative and the first phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Throughout this innovation, Hive has continued to integrate easily with other critical data center technologies through a familiar JDBC interface.

Stinger.next: Hortonworks Investment Themes for Apache Hive

The Stinger Initiative successfully delivered a fundamentally new Apache Hive, evolving Hive’s traditional architecture and making it faster, with richer SQL semantics and petabyte scalability. We continue to work within the community to advance these three key facets of Hive:

Speed
Deliver sub-second query response times
Scale
The only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
SQL
Enable transactions and SQL:2011 Analytics for Hive

Stinger.next is focused on the vision of delivering enterprise SQL at Hadoop scale, accelerating the production deployment of Hive for interactive analytics, reporting and ETL. More explicitly, some of the key areas that we will invest in include:

Investment Theme Planned Enhancements
Speed
  • LLAP, a process for multi-threaded execution, will work with Apache Tez to achieve sub-second response times
  • Sub-second queries will support interactive dashboards and explorative analytics
  • Materialized views will allow multiple views of the same data and speed analysis
Scale
  • Cross-geo query will allow users to query and report on geographically distributed datasets
SQL Semantics
  • Transactions with ACID semantics will allow users to easily modify data with inserts, updates and deletes
  • SQL:2011 Analytics will allow rapid deployment of rich Hive reporting
  • A powerful cost-based optimizer will ensure that complex queries run quickly
Spark Machine Learning Integration
  • Will allow users to build and run machine learning models via Hive, using Spark machine learning libraries
Streaming Ingest
  • Will help users expand operational reporting on the latest data by replicating from operational databases

Goals for Upcoming Releases

Goal Description
Transactions with ACID semantics
    Delivered in HDP 2.2, ACID transactions allow users to easily modify data with inserts, updates and deletes. They extend Hive from the traditional write-once, read-often system to support analytics over changing data. This enables reporting with occasional corrections and modifications, and allows operational reporting with periodic bulk updates from an operational database.
Sub-second queries
    Will allow users to deploy Hive for interactive dashboards and explorative analytics that have more demanding response-time requirements
SQL:2011 Analytics
    Will allow rich reporting to be deployed on Hive faster, more simply and more reliably using standard SQL. A powerful cost-based optimizer ensures that complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.

Recent Hive Releases

Apache Hive Version Prior Enhancements
0.14 (Coming Soon)
  • Speed: cost-based optimizer for star and bushy join queries
  • Scale: temporary tables
  • Scale: transactions with ACID semantics
0.13
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
0.12
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to explore, structure and analyze that data, then turn it into business insight.

Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop:

Feature Description
Familiar
    Query data with a SQL-based language
Fast
    Interactive response times, even over huge datasets
Scalable and Extensible
    As data variety and volume grow, more commodity machines can be added, without a corresponding reduction in performance

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are composed of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
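
As a rough illustration of the difference between overwriting and appending, the Python sketch below builds the two corresponding HiveQL statements, INSERT OVERWRITE TABLE and INSERT INTO TABLE. The helper name insert_statement and the table names page_views and staging_page_views are hypothetical, chosen only for this example. The sketch only assembles query text and does not connect to a cluster.

```python
def insert_statement(table, select_sql, overwrite=False):
    """Build a HiveQL INSERT statement (hypothetical helper).

    overwrite=True  -> existing data in the table is replaced
    overwrite=False -> new rows are appended to the existing data
    """
    mode = "OVERWRITE" if overwrite else "INTO"
    return f"INSERT {mode} TABLE {table} {select_sql}"


# Hypothetical table names, used only for illustration.
append_sql = insert_statement("page_views", "SELECT * FROM staging_page_views")
replace_sql = insert_statement(
    "page_views", "SELECT * FROM staging_page_views", overwrite=True
)

print(append_sql)
print(replace_sql)
```

The resulting strings could then be submitted through whichever client is already in use, for example a JDBC connection.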

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
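
The directory layout described above can be sketched in Python. This is a simplified model under several assumptions: the warehouse root is Hive's default of /user/hive/warehouse (configurable per cluster), non-default databases get a ".db" directory, each partition level is a "column=value" sub-directory, and the bucketing column is an integer whose hash is taken to be the value itself. The function names and the sales/orders example are hypothetical, and the escaping of special characters in partition values is left out.

```python
import posixpath

# Assumption: Hive's default warehouse location; clusters may configure another.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(database, table, root=WAREHOUSE_ROOT):
    """Return the HDFS directory that holds a table's data."""
    if database == "default":
        # Tables in the default database sit directly under the root.
        return posixpath.join(root, table)
    return posixpath.join(root, f"{database}.db", table)


def partition_dir(database, table, partition_spec):
    """Return the sub-directory for one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one "column=value" directory level under the table directory.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir(database, table), *levels)


def bucket_for(int_key, num_buckets):
    """Pick a bucket for an integer key (simplified assumption: the hash
    of an integer is the integer itself, made non-negative, then reduced
    modulo the number of buckets)."""
    return (int_key & 0x7FFFFFFF) % num_buckets


path = partition_dir("sales", "orders", [("year", 2014), ("month", 10)])
print(path)
print(bucket_for(42, 8))
```

Because a query that filters on the partition columns only needs to read the matching sub-directories, the choice of partition columns directly affects how much data a query scans.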

Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
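
To show how primitive types combine into complex ones, the Python sketch below assembles a HiveQL CREATE TABLE statement that uses STRUCT, MAP and ARRAY columns. The helper functions and the customers table with its columns are hypothetical and exist only for this example. The sketch produces DDL text and does not run it.

```python
def struct_type(fields):
    """STRUCT type from a list of (field_name, field_type) pairs."""
    inner = ", ".join(f"{name}:{ftype}" for name, ftype in fields)
    return f"STRUCT<{inner}>"


def array_type(element_type):
    """ARRAY type whose elements all share one type."""
    return f"ARRAY<{element_type}>"


def map_type(key_type, value_type):
    """MAP type from a primitive key type to a value type."""
    return f"MAP<{key_type}, {value_type}>"


def create_table(name, columns):
    """CREATE TABLE statement from a list of (column_name, column_type) pairs."""
    body = ",\n".join(f"  {col} {ctype}" for col, ctype in columns)
    return f"CREATE TABLE {name} (\n{body}\n)"


# Hypothetical table mixing primitive and complex column types.
ddl = create_table(
    "customers",
    [
        ("id", "BIGINT"),
        ("name", "STRING"),
        ("address", struct_type([("street", "STRING"), ("city", "STRING")])),
        ("phone_numbers", array_type("STRING")),
        ("preferences", map_type("STRING", "INT")),
    ],
)
print(ddl)
```

In a query, fields of such columns are typically reached with dot notation for structs (address.city), an index for arrays (phone_numbers[0]) and a key for maps (preferences['newsletter']).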

Apache Top-Level Project Since
September 2010
Hortonworks Committers
17

Try Hive with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.


Join the Webinar!

Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Thursday, October 30, 2014
1:00 PM Eastern / 12:00 PM Central / 11:00 AM Mountain / 10:00 AM Pacific

