Apache Hive



The de facto standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop.

With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other critical data center technologies using a familiar JDBC interface.

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.

Advantages of using Hive for enterprise SQL in Hadoop

  • Familiar: query data with a SQL-based language
  • Fast: interactive response times, even over huge datasets
  • Scalable and extensible: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance
  • Compatible: works with traditional data integration and data analytics tools

How Hive Works

Hive on LLAP (Live Long and Process) uses persistent query servers with intelligent in-memory caching to avoid Hadoop’s batch-oriented latency, delivering sub-second query response times against smaller data volumes. Hive on Tez, meanwhile, continues to provide excellent batch query performance against petabyte-scale data sets.
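As an illustration, the execution path can be chosen per session with Hive configuration settings. A minimal sketch (property names and accepted values vary somewhat across Hive versions, so treat this as indicative rather than definitive):

```sql
-- Run queries on Tez, the DAG engine that LLAP builds on.
SET hive.execution.engine=tez;

-- Ask Hive to run work inside the persistent LLAP daemons where possible.
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
```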

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units: databases contain tables, which in turn are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
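The overwrite/append distinction shows up directly in HiveQL. A short sketch, using hypothetical table names:

```sql
-- Append new rows to an existing table.
INSERT INTO TABLE page_views
SELECT * FROM staging_page_views;

-- Replace the table's existing contents entirely.
INSERT OVERWRITE TABLE page_views
SELECT * FROM staging_page_views;
```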

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
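The database → table → partition → bucket taxonomy is declared at table-creation time. A sketch with a hypothetical table, partitioned by date (one HDFS sub-directory per day) and bucketed by user ID:

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
-- Each distinct view_date value becomes a sub-directory of the table directory.
PARTITIONED BY (view_date STRING)
-- Within each partition, rows are hashed on user_id into 16 bucket files.
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;
```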

Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive types to form complex types, such as structs, maps and arrays.
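For example, a table (hypothetical names throughout) can mix primitive and complex columns, and the complex fields are addressed with index, key, and dot notation:

```sql
CREATE TABLE employees (
  name     STRING,
  salary   DECIMAL(10,2),
  skills   ARRAY<STRING>,
  contact  MAP<STRING, STRING>,
  address  STRUCT<street: STRING, city: STRING, zip: STRING>
);

-- Array elements by index, map values by key, struct fields by dot notation.
SELECT name, skills[0], contact['email'], address.city
FROM employees;
```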

Innovation & Focus

The Stinger and Stinger.next initiatives delivered a fundamentally new Apache Hive, which evolved Hive’s traditional architecture and made it much faster, with richer, standards-compliant SQL semantics and petabyte scalability. We continue to work within the Hive community to advance these four key facets of Hive:

  • Deliver sub-second query response times
  • Remain the only SQL interface to Hadoop designed for queries that scale from terabytes to petabytes
  • Enable transactions and SQL:2011 Analytics for Hive
  • Provide native interfaces for MDX-compliant analytics tools, including Excel


The focus for the next major Hive release will be:

Focus Planned Enhancements
MDX interface
    Provide native support for the MDX (multi-dimensional expressions) query language, the de facto query language used by online analytical processing (OLAP) tools, including Microsoft Excel.
Complete SQL:2011 Compliance
    Close the remaining gaps to achieve complete SQL:2011 standard compliance, enabling maximum compatibility with SQL-based BI, visualization and machine learning tools.

Recent Hive Releases


Apache Hive Version Prior Enhancements
  • Hive LLAP: Persistent query servers with intelligent in-memory caching.
  • ACID GA: Hardened and proven at scale.
  • Expanded SQL Compliance: More capable integration with BI tools.
  • Performance: Interactive query, 2x faster ETL.
  • Security: Row / Column security extending to views, Column level security for Spark.
  • Operations: LLAP integration in Ambari, new Grafana dashboards.
  • Speed: HBase to store Hive Metadata
  • Workflow: HPL/SQL – Implementing Procedural SQL in Hive
  • Scale: first version of LLAP, and Hive on Spark
  • SQL: Hive-on-Spark Self Union/Join
  • Speed: Vectorized Map Join brings up to 5x faster map joins
  • Scale: Hybrid-Hybrid Grace Hash Join allows analytical queries at large scale without complex tuning
  • Scale: Bloom Filter support added to ORCFile
  • SQL: Added support for UNION DISTINCT and Interval Types
  • Speed: Cost-based optimizer for star and bushy join queries
  • Scale: Temporary tables
  • Scale: Transactions with ACID semantics
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions


Apache HCatalog

Apache™ HCatalog is a table management layer that exposes Hive metadata to other Hadoop applications. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.

What HCatalog Does

Apache HCatalog provides the following benefits to grid administrators:

  • Frees the user from having to know where the data is stored, with the table abstraction
  • Enables notifications of data availability
  • Provides visibility for data cleaning and archiving tools

How HCatalog Works

HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
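As a concrete illustration of the SerDe mechanism, a table can be declared with an explicit SerDe class in its DDL. A sketch using the JSON SerDe that ships with HCatalog (column names are hypothetical):

```sql
-- Rows are stored as one JSON object per line in plain text files;
-- the JsonSerDe maps JSON keys onto the declared columns.
CREATE TABLE events_json (
  id      BIGINT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```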

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).

SQL and HiveQL

Apache Hive is data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad hoc querying, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a number of differences between SQL and HiveQL, but there are many similarities too, and recent releases of Hive bring SQL-92 compatibility closer still.

Below is some basic information to get you started. Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

Retrieving Information (General)
    MySQL: SELECT from_columns FROM table WHERE conditions;
    Hive:  SELECT from_columns FROM table WHERE conditions;

Retrieving All Values
    MySQL: SELECT * FROM table;
    Hive:  SELECT * FROM table;

Retrieving Some Values
    MySQL: SELECT * FROM table WHERE rec_name = "value";
    Hive:  SELECT * FROM table WHERE rec_name = "value";

Retrieving With Multiple Criteria
    MySQL: SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
    Hive:  SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";

Retrieving Specific Columns
    MySQL: SELECT column_name FROM table;
    Hive:  SELECT column_name FROM table;

Retrieving Unique Output
    MySQL: SELECT DISTINCT column_name FROM table;
    Hive:  SELECT DISTINCT column_name FROM table;

Sorting
    MySQL: SELECT col1, col2 FROM table ORDER BY col2;
    Hive:  SELECT col1, col2 FROM table ORDER BY col2;

Sorting Reverse
    MySQL: SELECT col1, col2 FROM table ORDER BY col2 DESC;
    Hive:  SELECT col1, col2 FROM table ORDER BY col2 DESC;

Counting Rows
    MySQL: SELECT COUNT(*) FROM table;
    Hive:  SELECT COUNT(*) FROM table;

Grouping With Counting
    MySQL: SELECT owner, COUNT(*) FROM table GROUP BY owner;
    Hive:  SELECT owner, COUNT(*) FROM table GROUP BY owner;

Maximum Value
    MySQL: SELECT MAX(col_name) AS label FROM table;
    Hive:  SELECT MAX(col_name) AS label FROM table;

Selecting From Multiple Tables
    MySQL: SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;
    Hive:  SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);


Selecting a Database
    MySQL: USE database;
    Hive:  USE database;

Listing Tables in a Database
    MySQL: SHOW TABLES;
    Hive:  SHOW TABLES;

Describing the Format of a Table
    MySQL: DESCRIBE table;
    Hive:  DESCRIBE (FORMATTED|EXTENDED) table;

Creating a Database
    MySQL: CREATE DATABASE db_name;
    Hive:  CREATE DATABASE db_name;

Dropping a Database
    MySQL: DROP DATABASE db_name;
    Hive:  DROP DATABASE db_name (CASCADE);
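The metadata commands above compose into a typical interactive session. A sketch with a hypothetical database and table:

```sql
CREATE DATABASE analytics;
USE analytics;
SHOW TABLES;

CREATE TABLE demo (id INT);
-- FORMATTED prints storage location, SerDe, and other table details.
DESCRIBE FORMATTED demo;

-- CASCADE drops the database even though it still contains tables.
DROP DATABASE analytics CASCADE;
```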

