Apache Tez

A Framework for YARN-based Data Processing Applications in Hadoop

Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data across thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.

Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.

Hive with Tez

As the de facto standard for SQL-in-Hadoop, Apache Hive is optimal for both batch and interactive queries at petabyte scale. Hive embeds Tez so that it can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability. Apache Tez innovations drove many of the Hive performance improvements delivered by the Stinger Initiative, a broad community effort that included contributions from 145 engineers across 44 different organizations. Tez helps make Hive interactive.

Tez and the Open Community

Originally developed by Hortonworks, the Apache Tez project entered the Apache Incubator in February 2013 and then graduated to a top-level project in July 2014. In just a short time, Tez has attracted many committers that represent a who’s who of leading Hadoop companies, including Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo. Significant contributions from this open community propelled Tez to become a cornerstone of core Apache projects like Apache Hive and Apache Pig and to adoption by other important open-source projects like Cascading.

Hortonworks Focus for Tez

Current work in Apache Tez innovation focuses on improvements to speed, scale and usability.

  • Granular counters: additional counters, including input and output counters, help developers with diagnostics.
  • Frequent updates: updates arrive while the job is running, which simplifies diagnosis of jobs that take longer than expected.
  • Improved recovery: more granular checkpointing reduces the amount of rework after failures.

Recent Progress in Apache Tez

Version 0.5.1
  • Stable developer API
  • Support for running Tez in local mode
  • Swim lane UI tool
Version 0.4
  • Application recovery
  • Data shuffle optimizations

How Tez Works

Apache Tez’s improvements to data processing in Hadoop extend well beyond the gains seen in Apache Hive and Apache Pig. The project has set the standard for true integration with YARN for interactive workloads. The following short descriptions explain how Apache Tez completes core tasks.

Express, model and execute processing logic

Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data. A rich dataflow definition API allows users to intuitively express complex query logic. The API fits well with query plans produced by higher-level declarative applications like Apache Hive and Apache Pig.

Model interaction between Input, Processor and Output Modules

Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. Input & Output determine the data format and how and where it is read or written. The Processor holds the data transformation logic. Tez does not impose any data format and only requires that Input, Processor and Output formats are compatible with each other.
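
The Input, Processor and Output composition can be illustrated with a small plain-Java model. This is only a sketch: VertexTaskSketch and its fields are hypothetical names invented for illustration and are not part of the Tez API. The type parameters stand in for the compatibility requirement, since the Processor must accept what the Input produces and the Output must accept what the Processor emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical model of the logic in one vertex: Input -> Processor -> Output.
// Not the Tez API; a plain-Java illustration of the composition only.
class VertexTaskSketch<I, O> {
    private final Supplier<List<I>> input;   // decides where and how records are read
    private final Function<I, O> processor;  // holds the data transformation logic
    private final Consumer<O> output;        // decides where and how records are written

    VertexTaskSketch(Supplier<List<I>> input, Function<I, O> processor, Consumer<O> output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    // Reads every record, transforms it, and hands the result to the output.
    void run() {
        for (I record : input.get()) {
            output.accept(processor.apply(record));
        }
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        VertexTaskSketch<String, Integer> task = new VertexTaskSketch<>(
                () -> Arrays.asList("tez", "yarn", "hive"), // input: an in-memory list
                String::length,                             // processor: word -> length
                sink::add);                                 // output: collect into a list
        task.run();
        System.out.println(sink); // prints [3, 4, 4]
    }
}
```

In this model, swapping the in-memory input for a file reader would change only the first argument, which mirrors the separation of concerns described above.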

Dynamically reconfigure graphs

Distributed data processing is dynamic, and it is difficult to determine optimal data movement methods in advance. More information becomes available at runtime, which can help optimize the execution plan further. Tez therefore supports pluggable vertex management modules that collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
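
As one illustration of a runtime decision of this kind, the sketch below picks the number of consumer tasks from the amount of data the producers actually wrote. The class, the method chooseParallelism, and the sizing rule are assumptions made for illustration; this is not Tez's own algorithm or API.

```java
// Hypothetical policy: size a consumer vertex from observed producer output.
// Not Tez's algorithm; an illustration of a runtime graph adjustment.
class ParallelismSketch {

    // Returns how many consumer tasks to run, never more than originally planned
    // and never fewer than one.
    static int chooseParallelism(long observedBytes, long bytesPerTask, int plannedTasks) {
        if (observedBytes < 0 || bytesPerTask <= 0 || plannedTasks <= 0) {
            throw new IllegalArgumentException("sizes and task counts must be positive");
        }
        // Ceiling division: tasks needed so each handles at most bytesPerTask.
        long needed = (observedBytes + bytesPerTask - 1) / bytesPerTask;
        return (int) Math.max(1, Math.min(plannedTasks, needed));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The plan assumed 100 tasks, but the producers wrote only 300 MB.
        System.out.println(chooseParallelism(300 * mb, 128 * mb, 100)); // prints 3
    }
}
```

The plan-time estimate of 100 tasks shrinks to 3 once the real data volume is known, which is the kind of adjustment that only runtime information makes possible.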

Optimize performance and resource management

YARN manages resources in a Hadoop cluster, based on cluster capacity and load. The Tez execution engine framework efficiently acquires resources from YARN and reuses every component in the pipeline such that no operation is duplicated unnecessarily.

API for defining directed acyclic graphs (DAGs)

Tez defines a simple Java API to express a DAG of data processing. The API has three components:

  • DAG – this defines the overall job. The user creates a DAG object for each data processing job.
  • Vertex – this defines the user logic and the resources & environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
  • Edge – this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
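
The roles of the three components can be sketched in plain Java. The DagSketch class below is a hypothetical stand-in written for illustration, not the Tez API: vertices are reduced to names, edges to producer and consumer pairs, and the vertex names in the example are assumptions. It shows only how a job is assembled from steps and connections and how a valid execution order follows from the edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of a DAG of processing steps. Not the Tez API.
class DagSketch {
    private final String name;
    // Vertex name -> names of its consumer vertices.
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    // Vertex name -> number of incoming edges.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Adds one step of the job.
    DagSketch addVertex(String vertex) {
        if (consumers.containsKey(vertex)) {
            throw new IllegalArgumentException("duplicate vertex: " + vertex);
        }
        consumers.put(vertex, new ArrayList<>());
        inDegree.put(vertex, 0);
        return this;
    }

    // Connects a producer vertex to a consumer vertex.
    DagSketch addEdge(String producer, String consumer) {
        if (!consumers.containsKey(producer) || !consumers.containsKey(consumer)) {
            throw new IllegalArgumentException("unknown vertex: " + producer + " -> " + consumer);
        }
        consumers.get(producer).add(consumer);
        inDegree.merge(consumer, 1, Integer::sum);
        return this;
    }

    // Orders vertices so every producer precedes its consumers (Kahn's algorithm).
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                if (remaining.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch("word-count")
                .addVertex("tokenizer").addVertex("summation").addVertex("sorter")
                .addEdge("tokenizer", "summation")
                .addEdge("summation", "sorter");
        System.out.println(dag.name() + ": " + dag.executionOrder());
        // prints word-count: [tokenizer, summation, sorter]
    }
}
```

In the real API, a Vertex also carries the user logic and resource requirements, and an Edge describes how data moves between the vertices; this model leaves both out to keep the structure visible.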

Re-use containers

Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the user’s behalf. This model comes with inherent costs for process startup and initialization, handling stragglers and allocating each container via the YARN resource manager. Tez reduces these costs by reusing a container to run multiple tasks rather than allocating a new one for each task.
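
A back-of-the-envelope model shows why reusing containers matters for the startup and allocation costs listed above. The class and method names are hypothetical, and the model assumes that a reused container simply picks up the next waiting task; it says nothing about how Tez schedules work in practice.

```java
// Hypothetical cost model: how many container launches a job needs.
// Illustrative arithmetic only, not a description of Tez internals.
class ContainerReuseSketch {

    // Without reuse every task pays for its own container launch. With reuse,
    // at most one container per concurrent slot is launched, and each one
    // runs task after task.
    static long containerLaunches(int tasks, int concurrentSlots, boolean reuse) {
        if (tasks < 0 || concurrentSlots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
        }
        return reuse ? Math.min(tasks, concurrentSlots) : tasks;
    }

    public static void main(String[] args) {
        System.out.println(containerLaunches(1000, 50, false)); // prints 1000
        System.out.println(containerLaunches(1000, 50, true));  // prints 50
    }
}
```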

Ecosystem Support for Apache Tez

  • Cascading, a high-level dataflow engine for processing data in Hadoop, supports running jobs on Apache Tez.
  • Datameer introduced Tez support in version 5.0 of their data analytics product for Hadoop.

Apache Tez has been an Apache top-level project since July 2014 and has 17 Hortonworks committers.
