Apache HAWQ (incubating) provides native SQL on Apache Hadoop based on an advanced, elastic MPP query engine. HAWQ represents a new generation of high-performance advanced analytics that transforms Hadoop into an enterprise analytic database. You can move and analyze entire workloads while simplifying management and expanding the breadth of data access and analytics, all natively in Hadoop.
HAWQ is an elastic SQL query engine that combines exceptional MPP-based analytics performance and robust ANSI SQL compliance – enabling you to run fast ad hoc queries. Hortonworks HDB powered by Apache HAWQ includes integrated Apache MADlib (incubating) machine learning – enabling SQL-based predictive analytics.
HAWQ and MADlib advantages include:
Evolved from over a decade’s worth of intellectual property from Pivotal Greenplum™ and open source PostgreSQL, HAWQ operates natively in Hadoop, which simplifies overall system management of cluster resources.
The flow for setting up, loading, managing and using HAWQ and MADlib is listed below:
The high-level architecture of Apache HAWQ is shown below. In a typical deployment, each slave node includes a physical HAWQ segment, an HDFS DataNode and a NodeManager. The masters for HAWQ, HDFS and YARN run on separate nodes.
HAWQ is tightly integrated with YARN for query resource management. HAWQ caches containers from YARN in a resource pool and then manages those resources locally, using its own finer-grained resource management for users and groups.
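The container caching described above can be pictured with a small Python sketch. This is a toy model under stated assumptions, not HAWQ's implementation: `ContainerPool` and the `request_from_yarn` callback are hypothetical names introduced only for illustration, and real YARN negotiation is considerably more involved.

```python
class ContainerPool:
    """Toy cache of YARN containers held by a HAWQ-like resource manager."""

    def __init__(self, request_from_yarn):
        # request_from_yarn(n) is a hypothetical callback returning n container ids.
        self._request_from_yarn = request_from_yarn
        self._idle = []

    def acquire(self, count):
        # Serve from the cache first; go to YARN only for the shortfall.
        shortfall = count - len(self._idle)
        if shortfall > 0:
            self._idle.extend(self._request_from_yarn(shortfall))
        granted, self._idle = self._idle[:count], self._idle[count:]
        return granted

    def release(self, containers):
        # Keep containers cached for later queries instead of handing them back.
        self._idle.extend(containers)
```

In this sketch a second query that needs no more containers than are already cached never triggers a new request to YARN, which is the point of holding a local pool.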
To execute a query, HAWQ allocates a set of virtual segments according to the cost of the query, the resource queue definitions, data locality and the current resource usage in the system. The query is then dispatched to the corresponding physical hosts, which can be a subset of the nodes in the cluster. The HAWQ resource enforcer on each node monitors and controls the resources the query uses in real time to avoid resource usage violations.
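As a rough illustration of how those four inputs could interact, here is a minimal Python sketch. It is not HAWQ's actual allocation algorithm: the `Host` type, the `allocate_virtual_segments` function, and the `cost_per_segment` and `queue_limit` parameters are all hypothetical, and the greedy policy is an assumption chosen for readability.

```python
import math
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_slots: int        # virtual segments this host can still run
    has_local_data: bool   # whether the query's HDFS blocks live on this host


def allocate_virtual_segments(hosts, query_cost, cost_per_segment, queue_limit):
    """Return one host name per virtual segment (toy greedy policy)."""
    # Size the request from the query cost, capped by the resource queue limit.
    wanted = min(max(1, math.ceil(query_cost / cost_per_segment)), queue_limit)

    remaining = {h.name: h.free_slots for h in hosts}  # current resource usage
    placement = []
    for _ in range(wanted):
        candidates = [h for h in hosts if remaining[h.name] > 0]
        if not candidates:
            break  # cluster is full; run with fewer segments
        # Prefer hosts holding the data, then the host with the most free slots.
        best = min(candidates,
                   key=lambda h: (not h.has_local_data, -remaining[h.name]))
        remaining[best.name] -= 1
        placement.append(best.name)
    return placement
```

With three hosts where only two hold the data, the sketch fills the data-local hosts before spilling onto the third, so the query may touch only a subset of the cluster.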
Nodes can be added dynamically without data redistribution, and expansion takes only seconds. When a new node is added, it automatically contacts the HAWQ master, which immediately makes the node’s resources available for future queries.
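A toy Python model of that registration step is sketched below. The `Master` class and its methods are hypothetical names, and the sketch assumes that table data stays in HDFS rather than on the segment, which is why nothing needs to be copied when a node joins.

```python
class Master:
    """Toy master that tracks which segment hosts can take work."""

    def __init__(self):
        self.free_slots = {}  # host name -> virtual segments it can run

    def register_segment(self, host, slots):
        # A newly started segment announces itself. No data is moved or
        # rebalanced here (assumption: table data lives in HDFS).
        self.free_slots[host] = slots

    def capacity(self):
        # Total virtual segments available to the next query.
        return sum(self.free_slots.values())
```

Registering a node is a single in-memory update in this model, so the extra capacity is visible to the very next query that asks the master for resources.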
The Hortonworks HDB support subscription combines Apache HAWQ and Apache MADlib, running on the Hortonworks Data Platform (HDP) and fully supported by Hortonworks. Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop.
Hortonworks HDB complements Hive by adding the following capabilities:
Interactive query performance
MADlib big data Machine Learning in SQL
Data federation using HAWQ Extension Framework
Choose the right SQL engine based on your application’s needs:
|Hortonworks HDB powered by Apache HAWQ||