Apache Hive

The standard for SQL queries in Hadoop

Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop. It is a comprehensive and compliant engine that offers the broadest range of SQL semantics for Hadoop, providing a powerful set of tools for analysts and developers to access Hadoop data.

HiveQL draws on the same skills and semantics that experienced analysts already use for SQL queries against relational databases, so interactive queries feel familiar. Hive also integrates easily with existing tools through a standard JDBC interface.

Apache Hive 0.13 introduces the CHAR datatype and extends DECIMAL with user-specified precision and scale. With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. The Apache Hive community extended the SQL language support to cover GRANT and REVOKE on entities. Hive also now supports showing roles, user privileges, and active privileges.

What Hive Does

Hadoop was built to organize and store massive amounts of data. A Hadoop cluster is a reservoir of heterogeneous data, from multiple sources and in different formats. Hive allows the user to explore and structure that data, analyze it, and then turn it into business insight.

Learn how the Stinger Initiative aims to bring 100x performance improvements and continued SQL compatibility to Hive.

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are made up of tables, which are in turn made up of partitions. Data can be accessed via a simple query language called HiveQL, which is similar to SQL. Hive supports overwriting or appending data, but not updates and deletes.
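As a rough illustration of the two supported write modes, the sketch below builds the HiveQL statements for appending to a table and for overwriting it. The helper `insert_statement` and the table names `page_views` and `staging_views` are hypothetical and exist only for this example; the script just prints the statements and does not connect to Hive.

```python
def insert_statement(table, select_sql, overwrite=False):
    """Build a HiveQL INSERT statement that loads query results into a table.

    overwrite=False appends rows (INSERT INTO);
    overwrite=True replaces the existing data (INSERT OVERWRITE).
    """
    keyword = "OVERWRITE" if overwrite else "INTO"
    return f"INSERT {keyword} TABLE {table} {select_sql}"


# Hypothetical table names, for illustration only.
append_sql = insert_statement("page_views", "SELECT * FROM staging_views")
replace_sql = insert_statement(
    "page_views", "SELECT * FROM staging_views", overwrite=True
)

print(append_sql)
print(replace_sql)
```

Both statements load whole result sets; neither modifies individual rows in place, which matches the append-or-overwrite model described above.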

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
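The layout described above can be sketched in a few lines of Python. This is a simplified model, not Hive's implementation: it assumes the commonly used default warehouse root `/user/hive/warehouse`, a `<database>.db` directory per database, and `column=value` sub-directories per partition. The bucket function is a simplification that only handles non-negative integer keys by taking the key modulo the bucket count. The database, table, and column names are hypothetical.

```python
from posixpath import join

# Assumed default warehouse location; real clusters may configure another path.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(database, table):
    """Directory that holds all data for one table."""
    return join(WAREHOUSE_ROOT, f"{database}.db", table)


def partition_dir(database, table, partition_spec):
    """Sub-directory for one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one nested 'column=value' directory level.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return join(table_dir(database, table), *levels)


def bucket_index(key, num_buckets):
    """Simplified bucket assignment for a non-negative integer key."""
    return (key & 0x7FFFFFFF) % num_buckets


# Hypothetical names, for illustration only.
path = partition_dir("sales", "orders", [("year", 2014), ("month", 4)])
print(path)
print(bucket_index(42, 8))
```

Because each partition maps to its own directory, a query that filters on the partition columns only needs to read the matching sub-directories rather than the whole table.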

Hive supports primitive data types such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, BINARY, DOUBLE, INT, TINYINT, SMALLINT and BIGINT. In addition, primitive data types can be combined to form complex data types, such as structs, maps and arrays.
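To make the combination of primitive and complex types concrete, here is a minimal sketch that assembles a HiveQL CREATE TABLE statement using the ARRAY, MAP and STRUCT type syntax. The helper functions and the `customers` table with its columns are hypothetical; the script only builds and prints the statement text.

```python
def array_of(element_type):
    """HiveQL type for an array of one element type."""
    return f"ARRAY<{element_type}>"


def map_of(key_type, value_type):
    """HiveQL type for a map from a primitive key type to a value type."""
    return f"MAP<{key_type},{value_type}>"


def struct_of(**fields):
    """HiveQL type for a struct; keyword order sets the field order."""
    body = ",".join(f"{name}:{type_}" for name, type_ in fields.items())
    return f"STRUCT<{body}>"


def create_table(name, columns):
    """Build a CREATE TABLE statement from (column, type) pairs."""
    cols = ",\n".join(f"  {column} {type_}" for column, type_ in columns)
    return f"CREATE TABLE {name} (\n{cols}\n)"


# Hypothetical table definition, for illustration only.
ddl = create_table("customers", [
    ("id", "BIGINT"),
    ("name", "STRING"),
    ("emails", array_of("STRING")),
    ("visits_by_page", map_of("STRING", "INT")),
    ("address", struct_of(street="STRING", city="STRING")),
])
print(ddl)
```

Complex types can also be nested, for example `array_of(struct_of(street="STRING", city="STRING"))` yields an array of structs.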

Here are some advantageous characteristics of Hive:

  • Familiar: Hundreds of unique users can simultaneously query the data using a language familiar to SQL users.
  • Fast: Response times are typically much faster than other types of queries on the same type of huge datasets.
  • Scalable and extensible: As data variety and volume grow, more commodity machines can be added to the cluster, without a corresponding reduction in performance.
  • Informative: Familiar JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting. Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats.

Apache Top-Level Project since September 2010

Try Hive with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.
