Apache Hive

The standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative and the first phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive integrates with other critical data center technologies through a familiar JDBC interface.

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to explore, structure and analyze that data, then turn it into business insight.

Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop:

Feature Description
Familiar
    Query data with a SQL-based language
Fast
    Interactive response times, even over huge datasets
Scalable and Extensible
    As data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are composed of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
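As a rough illustration of the difference between overwriting and appending, the sketch below builds the two kinds of HiveQL insert statement as strings. The helper `insert_statement` and the table names `daily_totals` and `staging_totals` are hypothetical, chosen only for this example; nothing is sent to a Hive server.

```python
def insert_statement(target: str, select_query: str, overwrite: bool = False) -> str:
    """Return a HiveQL insert statement as text (hypothetical helper).

    overwrite=True  -> replace the existing contents of the target table.
    overwrite=False -> append the selected rows to the target table.
    """
    mode = "OVERWRITE" if overwrite else "INTO"
    return f"INSERT {mode} TABLE {target} {select_query}"


# Hypothetical table names, used only for illustration.
append_sql = insert_statement("daily_totals", "SELECT * FROM staging_totals")
replace_sql = insert_statement(
    "daily_totals", "SELECT * FROM staging_totals", overwrite=True
)
print(append_sql)
print(replace_sql)
```

The only difference between the two statements is the keyword after `INSERT`, which is why the choice is modeled here as a single boolean.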

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
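The directory hierarchy described above can be sketched in a few lines of Python. This is a simplified model, not Hive's actual implementation: the warehouse root, the `.db` suffix, the `column=value` directory names and the zero-padded bucket file name are assumed conventions, the bucket is chosen with a plain modulo on an integer key, and the database, table and column names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed warehouse root; real clusters set this in their configuration.
WAREHOUSE_ROOT = PurePosixPath("/user/hive/warehouse")


def bucket_for(key: int, num_buckets: int) -> int:
    """Pick a bucket for an integer key (simplified: key modulo bucket count)."""
    return key % num_buckets


def data_file_path(database: str, table: str, partition: dict,
                   key: int, num_buckets: int) -> PurePosixPath:
    """Build the illustrative path of the file that would hold a row."""
    # One directory per table, nested under its database.
    path = WAREHOUSE_ROOT / f"{database}.db" / table
    # Each partition column adds one sub-directory level, in the order given.
    for column, value in partition.items():
        path = path / f"{column}={value}"
    # Within a partition, rows are spread across numbered bucket files.
    return path / f"{bucket_for(key, num_buckets):06d}_0"


# Hypothetical database, table and partition columns.
print(data_file_path("sales", "orders", {"year": 2014, "month": 11},
                     key=42, num_buckets=8))
```

Because the partition values appear in the path itself, a query that filters on a partition column only needs to read the matching sub-directories.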

Hive supports all the common primitive data types, such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
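To make the combination of primitive types into complex types concrete, the sketch below composes HiveQL type expressions and a `CREATE TABLE` statement as plain strings. The helper functions, the `customers` table and its columns are hypothetical names used only for illustration, and the statement is printed rather than executed.

```python
def array_of(element: str) -> str:
    """Type expression for an array whose items all share one type."""
    return f"ARRAY<{element}>"


def map_of(key: str, value: str) -> str:
    """Type expression for a map from a primitive key type to a value type."""
    return f"MAP<{key},{value}>"


def struct_of(**fields: str) -> str:
    """Type expression for a struct with named, typed fields."""
    body = ",".join(f"{name}:{type_}" for name, type_ in fields.items())
    return f"STRUCT<{body}>"


def create_table(name: str, columns: dict) -> str:
    """Render a minimal CREATE TABLE statement from column name/type pairs."""
    cols = ",\n".join(f"  {col} {type_}" for col, type_ in columns.items())
    return f"CREATE TABLE {name} (\n{cols}\n)"


# Hypothetical table mixing primitive and complex column types.
ddl = create_table("customers", {
    "id": "BIGINT",
    "emails": array_of("STRING"),
    "balances": map_of("STRING", "DOUBLE"),
    "address": struct_of(street="STRING", zip="INT"),
})
print(ddl)
```

Because each helper returns an ordinary type string, the types nest freely: `array_of(struct_of(...))` would describe an array of structs.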

Innovation & Focus

The Stinger Initiative successfully delivered a fundamentally new Apache Hive, which evolved Hive’s traditional architecture and made it faster, with richer SQL semantics and petabyte scalability. We continue to work within the community to advance these three key facets of Hive:

Speed
Deliver sub-second query response times
Scale
The only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
SQL
Enable transactions and SQL:2011 Analytics for Hive

Stinger.next is focused on the vision of delivering enterprise SQL at Hadoop scale, accelerating the production deployment of Hive for interactive analytics, reporting and ETL. More explicitly, some of the key areas that we will invest in include:

Focus Planned Enhancements
Speed
  • LLAP, a process for multi-threaded execution, will work with Apache Tez to achieve sub-second response times
  • Sub-second queries will support interactive dashboards and explorative analytics
  • Materialized views will allow multiple views of the same data and speed analysis
Scale
    Cross-geo query will allow users to query and report on geographically distributed datasets
SQL Semantics
  • Transactions with ACID semantics will allow users to easily modify data with inserts, updates and deletes
  • SQL:2011 Analytics will allow rapid deployment of rich Hive reporting
  • A powerful cost-based optimizer will ensure that complex queries run quickly
Spark Machine Learning Integration
    Allow users to build and run machine learning models via Hive, using Spark machine learning libraries
Streaming Ingest
    Help users expand operational reporting on the latest data by replicating from operational databases

Recent Hive Releases

Apache Hive Version Prior Enhancements
0.14
  • Speed: cost-based optimizer for star and bushy join queries
  • Scale: temporary tables
  • Scale: transactions with ACID semantics
0.13
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
0.12
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions
