Apache Hive & Hadoop

The de facto standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop.

With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other critical data center technologies using a familiar JDBC interface.

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to explore, structure and analyze that data, then turn it into actionable business insight.

Advantages of using Hive for enterprise SQL in Hadoop:

  • Familiar — query data with a SQL-based language
  • Fast — interactive response times, even over huge datasets
  • Scalable and extensible — as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases consist of tables, which are in turn made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.

Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
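Taken together, these units can be sketched in HiveQL. The following is a minimal illustration, not a recommended schema; the database, table, and column names are hypothetical:

```sql
-- Hypothetical table illustrating partitions, buckets, and complex types.
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE sales.orders (
  order_id   BIGINT,
  amount     DECIMAL(10,2),
  tags       ARRAY<STRING>,                 -- complex type: array
  properties MAP<STRING, STRING>,           -- complex type: map
  customer   STRUCT<name:STRING, zip:INT>   -- complex type: struct
)
PARTITIONED BY (ds STRING)                  -- one HDFS sub-directory per ds value
CLUSTERED BY (order_id) INTO 16 BUCKETS;    -- buckets within each partition
```

Each partition value of `ds` becomes its own sub-directory under the table's HDFS directory, and rows within a partition are hashed into the 16 buckets.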

Innovation & Focus

The Stinger Initiative successfully delivered a fundamentally new Apache Hive, evolving Hive’s traditional architecture to make it faster, with richer SQL semantics and petabyte scalability. We continue to work within the community to advance three key facets of Hive:

  • Speed: deliver sub-second query response times
  • Scale: the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
  • SQL: enable transactions and SQL:2011 analytics for Hive

Stinger.next is focused on the vision of delivering enterprise SQL at Hadoop scale, accelerating the production deployment of Hive for interactive analytics, reporting and ETL. More explicitly, some of the key areas we will invest in include:


Speed
  • LLAP, a process for multi-threaded execution, will work with Apache Tez to achieve sub-second response times
  • Sub-second queries will support interactive dashboards and explorative analytics
  • Materialized views will allow multiple views of the same data and speed analysis
Scale
  • Cross-geo query will allow users to query and report on geographically distributed datasets
SQL Semantics
  • Transactions with ACID semantics will allow users to easily modify data with inserts, updates and deletes
  • SQL:2011 Analytics will allow rapid deployment of rich Hive reporting
  • A powerful cost-based optimizer will ensure that complex queries run quickly
Spark Machine Learning Integration
  • Allow users to build and run machine learning models via Hive, using Spark machine learning libraries
Streaming Ingest
  • Help users expand operational reporting on the latest data by replicating from operational databases
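The transactions mentioned above enable familiar DML from SQL. A hedged sketch, assuming a transactional, ORC-backed table (the table name and columns are hypothetical; exact requirements vary by Hive version):

```sql
-- ACID DML sketch; ACID currently requires a bucketed ORC table
-- with the 'transactional' property set.
CREATE TABLE events (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE events VALUES (1, 'new');
UPDATE events SET status = 'done' WHERE id = 1;
DELETE FROM events WHERE id = 1;
```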

Recent Hive Releases


Recent releases have delivered enhancements across speed, scale and SQL, including:

  • Speed: Vectorized map join brings up to 5x faster map joins
  • Scale: Hybrid-hybrid grace hash join allows analytical queries at large scale without complex tuning
  • Scale: Bloom filter support added to ORCFile
  • SQL: Added support for UNION DISTINCT and interval types
  • Speed: Cost-based optimizer for star and bushy join queries
  • Scale: Temporary tables
  • Scale: Transactions with ACID semantics
  • Speed: Hive on Tez, vectorized query engine and cost-based optimizer
  • Scale: Dynamic partition loads and smaller hash tables
  • SQL: CHAR and DECIMAL datatypes, subqueries for IN / NOT IN
  • Speed: Vectorized query engine and ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions


Apache™ HCatalog is a table management layer that exposes Hive metadata to other Hadoop applications. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.

What HCatalog Does

Apache HCatalog provides the following benefits to grid administrators:

  • Frees the user from having to know where the data is stored, with the table abstraction
  • Enables notifications of data availability
  • Provides visibility for data cleaning and archiving tools

How HCatalog Works

HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
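For example, pointing a table at HCatalog's JSON SerDe might look like the following sketch (the table and columns are hypothetical; the SerDe class ships with HCatalog):

```sql
-- Read one JSON object per line using the HCatalog-provided SerDe.
CREATE TABLE raw_events (id BIGINT, payload STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

A custom format would swap in your own SerDe class, plus an InputFormat and OutputFormat.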

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
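Concretely, the one-partition-per-key-value rule can be sketched as follows (table and partition names are hypothetical):

```sql
-- Each distinct (ds, region) pair maps to exactly one partition,
-- i.e. one sub-directory such as .../page_views/ds=2015-06-01/region=us/
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (ds STRING, region STRING);
```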

SQL and HiveQL

Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a number of differences between SQL and HiveQL, but there are also many similarities, and recent releases of Hive bring SQL-92 compatibility closer still.

Below is some basic information to get you started. Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

For most basic retrieval operations, the syntax is identical in MySQL and Hive:

  Retrieving information (general): SELECT from_columns FROM table WHERE conditions;
  Retrieving all values: SELECT * FROM table;
  Retrieving some values: SELECT * FROM table WHERE rec_name = "value";
  Retrieving with multiple criteria: SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
  Retrieving specific columns: SELECT column_name FROM table;
  Retrieving unique output: SELECT DISTINCT column_name FROM table;
  Sorting: SELECT col1, col2 FROM table ORDER BY col2;
  Sorting in reverse: SELECT col1, col2 FROM table ORDER BY col2 DESC;
  Counting rows: SELECT COUNT(*) FROM table;
  Grouping with counting: SELECT owner, COUNT(*) FROM table GROUP BY owner;
  Maximum value: SELECT MAX(col_name) AS label FROM table;

Joins differ: Hive requires an explicit JOIN ... ON clause rather than an implicit join in the WHERE clause.

  MySQL: SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;
  Hive:  SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);


Metadata and DDL commands are also close, with a few Hive extensions:

  Selecting a database: USE database; (same in both)
  Listing tables in a database: SHOW TABLES; (same in both)
  Creating a database: CREATE DATABASE db_name; (same in both)
  Describing the format of a table:
    MySQL: DESCRIBE table;
    Hive:  DESCRIBE [FORMATTED|EXTENDED] table;
  Dropping a database:
    MySQL: DROP DATABASE db_name;
    Hive:  DROP DATABASE db_name [CASCADE];
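One practical difference worth calling out: by default, Hive refuses to drop a database that still contains tables; CASCADE drops the contained tables first. A quick sketch (the database and table names are hypothetical):

```sql
CREATE DATABASE demo_db;
CREATE TABLE demo_db.t (x INT);
-- DROP DATABASE demo_db;        -- fails: database is not empty
DROP DATABASE demo_db CASCADE;   -- drops the tables, then the database
```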


