Apache Tez

A framework for near real-time big data processing

Apache™ Tez generalizes the MapReduce paradigm into a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks. MapReduce has been the data-processing backbone for Hadoop®, but its batch-oriented nature makes it ill-suited to interactive queries. Tez will allow projects in the Apache Hadoop® ecosystem, such as Apache Hive and Apache Pig, to meet demands for fast response times and extreme throughput at petabyte scale. The Apache Tez project is part of the Stinger Initiative.

What Tez Does

Tez is the logical next step for Apache Hadoop after Apache Hadoop YARN. With YARN the community generalized Hadoop MapReduce to provide a general-purpose resource management framework wherein MapReduce became merely one of the applications that could process data in a Hadoop cluster. Tez provides a more general data-processing application to the benefit of the entire ecosystem.

Tez will speed up Pig and Hive workloads by an order of magnitude. By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, Tez accelerates data processing across both small-scale, low-latency and large-scale, high-throughput workloads.

Read more on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop.

How Tez Works

With the emergence of Apache Hadoop YARN as the basis of next-generation data-processing architectures, there is a strong need for an application that can execute a complex DAG of tasks and be shared by other applications in the Hadoop ecosystem. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often forces a computation to be split into multiple MapReduce jobs, which harms latency for short queries and throughput for large-scale queries. Tez provides a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing. For example, any given SQL query can be expressed as a single job using Tez.
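To make the DAG idea concrete, here is a toy sketch (not the Tez API, which is Java) of how a single SQL query's stages can be modeled as one DAG and executed in dependency order within a single job, rather than as a chain of separate MapReduce jobs. The stage names (two table scans feeding a join, which feeds an aggregation) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque

def topological_order(vertices, edges):
    """Return vertices ordered so every producer runs before its consumers."""
    indegree = {v: 0 for v in vertices}
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        indegree[dst] += 1
    # Stages with no pending inputs are ready to run immediately.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in adj[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return order

# Hypothetical stages of one query expressed as a single DAG.
vertices = ["scan_orders", "scan_users", "join", "aggregate"]
edges = [("scan_orders", "join"),
         ("scan_users", "join"),
         ("join", "aggregate")]

print(topological_order(vertices, edges))
# → ['scan_orders', 'scan_users', 'join', 'aggregate']
```

In MapReduce, the join and the aggregation would typically become separate jobs, each materializing its output to HDFS; expressed as one DAG, the whole pipeline runs as a single job and intermediate results can flow directly between stages.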

