Apache Hive

The standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative, and the first phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Throughout all this innovation, Hive has continued to integrate easily with other critical data center technologies through a familiar JDBC interface.

Stinger.next: Hortonworks Focus for Apache Hive

The Stinger Initiative successfully delivered a fundamentally new Apache Hive, which evolved Hive’s traditional architecture and made it faster, with richer SQL semantics and petabyte scalability. We continue to work within the community to advance these three key facets of Hive:

  • Deliver sub-second query response times
  • Remain the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
  • Enable transactions and SQL:2011 Analytics for Hive

Stinger.next is focused on the vision of delivering enterprise SQL at Hadoop scale, accelerating the production deployment of Hive for interactive analytics, reporting and ETL. More explicitly, some of the key areas we will invest in include:

Focus | Planned Enhancements

Speed
  • LLAP, a process for multi-threaded execution, will work with Apache Tez to achieve sub-second response times
  • Sub-second queries will support interactive dashboards and exploratory analytics
  • Materialized views will allow multiple views of the same data and speed analysis
  • Cross-geo query will allow users to query and report on geographically distributed datasets
SQL Semantics
  • Transactions with ACID semantics will allow users to easily modify data with inserts, updates and deletes
  • SQL:2011 Analytics will allow rapid deployment of rich Hive reporting
  • A powerful cost-based optimizer will ensure that complex queries run quickly
Spark Machine Learning Integration
  • Allow users to build and run machine learning models via Hive, using Spark machine learning libraries
Streaming Ingest
  • Help users expand operational reporting on the latest data by replicating from operational databases

Focus for Innovation

Goal | Description

Transactions with ACID semantics
    Delivered in HDP 2.2, ACID transactions allow users to easily modify data with inserts, updates and deletes. They extend Hive from a traditional write-once, read-often system to one that supports analytics over changing data. This enables reporting with occasional corrections and modifications, and operational reporting with periodic bulk updates from an operational database.
Sub-second queries
    Will allow users to deploy Hive for interactive dashboards and exploratory analytics that have more demanding response-time requirements.
SQL:2011 Analytics
    Will allow rich reporting to be deployed on Hive faster, more simply and more reliably using standard SQL. A powerful cost-based optimizer ensures that complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have long enjoyed, at Hadoop scale.
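To make those two goals concrete, here is a hedged HiveQL sketch; table and column names are illustrative, not from this page. In Hive of the HDP 2.2 era, ACID DML requires an ORC-backed table that is bucketed and marked transactional:

```sql
-- Illustrative only: hypothetical table, shown to demonstrate ACID DML
-- and SQL:2011-style analytics as described above.
CREATE TABLE accounts (
  id      INT,
  owner   STRING,
  balance DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Inserts, updates and deletes on the same table:
INSERT INTO TABLE accounts VALUES (1, 'alice', 100.00), (2, 'bob', 50.00);
UPDATE accounts SET balance = balance - 10.00 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;

-- SQL:2011-style analytics: a windowed ranking over the data.
SELECT owner, balance,
       RANK() OVER (ORDER BY balance DESC) AS balance_rank
FROM accounts;
```

The window function in the final query is the kind of standard-SQL analytic construct that reporting tools generate, which the cost-based optimizer can then plan efficiently.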

Recent Hive Releases


Apache Hive Version | Prior Enhancements
  • Speed: cost-based optimizer for star and bushy join queries
  • Scale: temporary tables
  • Scale: transactions with ACID semantics
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to explore, structure and analyze that data, then turn it into business insight.

Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop:

Feature | Description

Familiar
    Query data with a SQL-based language
Fast
    Interactive response times, even over huge datasets
Scalable and Extensible
    As data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are composed of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
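That directory hierarchy can be sketched in HiveQL. This is an illustrative example only; the database, table and column names are hypothetical, not from this page:

```sql
-- Illustrative only: hypothetical names, shown to demonstrate the
-- database -> table -> partition -> bucket hierarchy described above.
CREATE DATABASE IF NOT EXISTS sales_db;

-- Each table maps to an HDFS directory under the database's warehouse path.
-- PARTITIONED BY creates one sub-directory per distinct partition value;
-- CLUSTERED BY ... INTO n BUCKETS hashes rows into a fixed number of files.
CREATE TABLE IF NOT EXISTS sales_db.orders (
  order_id BIGINT,
  customer STRING,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer) INTO 16 BUCKETS
STORED AS ORC;

-- Appending versus overwriting, as mentioned above:
INSERT INTO TABLE sales_db.orders PARTITION (order_date = '2015-01-01')
  VALUES (1, 'acme', 99.50);
INSERT OVERWRITE TABLE sales_db.orders PARTITION (order_date = '2015-01-01')
  VALUES (1, 'acme', 105.25);
```

Here each distinct `order_date` value becomes its own sub-directory of the table directory, and rows within a partition are hashed on `customer` into sixteen bucket files.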

Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
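As a brief sketch of those complex types (again with hypothetical names), primitives can be nested into structs, maps and arrays, and their fields addressed with dot, bracket and index syntax:

```sql
-- Illustrative only: combining primitive types into complex types.
CREATE TABLE IF NOT EXISTS customer_profiles (
  name    STRING,
  address STRUCT<street: STRING, city: STRING, zip: STRING>,
  phones  MAP<STRING, STRING>,   -- e.g. key 'home' mapping to a number
  tags    ARRAY<STRING>
);

-- Accessing nested fields:
SELECT name, address.city, phones['home'], tags[0]
FROM customer_profiles;
```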

Apache Top-Level Project Since: September 2010

Try Hive with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

