A Framework for YARN-based Data Processing Applications in Hadoop
Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.
What Tez Does
Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.
Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.
How Tez Works
Apache Tez’s improvements to data processing in Hadoop extend well beyond the gains seen in Apache Hive and Apache Pig. The project has set the standard for true integration with YARN for interactive workloads. The following short descriptions explain how Apache Tez completes core tasks.
Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data. A rich data flow definition API allows users to intuitively express complex query logic. The API fits well with query plans produced by higher-level declarative applications like Apache Hive and Apache Pig.
Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. Input & Output determine the data format and how and where it is read or written. The Processor holds the data transformation logic. Tez does not impose any data format and only requires that Input, Processor and Output formats are compatible with each other.
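The Input, Processor and Output composition described above can be illustrated with a small, self-contained sketch. It is not the Tez runtime API: the class `VertexTaskSketch` and its constructor are hypothetical names chosen for illustration, and the standard Java functional interfaces stand in for the three modules. The only point it makes is that the three pieces are independent and merely need compatible record types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for the logic running in one vertex; NOT the Tez API.
final class VertexTaskSketch<I, O> {
    private final Supplier<List<I>> input;   // decides where and how records are read
    private final Function<I, O> processor;  // holds the data transformation logic
    private final Consumer<O> output;        // decides where and how results are written

    VertexTaskSketch(Supplier<List<I>> input, Function<I, O> processor, Consumer<O> output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    // Read every record from the input, transform it, and hand it to the output.
    void run() {
        for (I record : input.get()) {
            output.accept(processor.apply(record));
        }
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        VertexTaskSketch<String, Integer> task = new VertexTaskSketch<>(
                () -> Arrays.asList("tez", "yarn", "hive"), // input: an in-memory source
                String::length,                             // processor: word -> length
                sink::add);                                 // output: an in-memory sink
        task.run();
        System.out.println(sink); // prints [3, 4, 4]
    }
}
```

Swapping the in-memory source for a file reader, or the sink for a network writer, would leave the processor untouched, which is the kind of separation the paragraph above describes.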
Distributed data processing is dynamic, and it is difficult to determine optimal data movement methods in advance. More information is available during runtime, which may help optimize the execution plan further. So Tez includes support for pluggable vertex management modules to collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
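As a rough illustration of the kind of decision a vertex management module could make at runtime, the sketch below picks a task count for a consumer vertex once the size of its upstream output is known. Everything here is an assumption made for illustration: the class `ParallelismSketch`, the method `chooseParallelism`, its parameters, and the sizing rule are hypothetical and are not taken from the Tez code base.

```java
// Hypothetical sizing rule; NOT the Tez vertex manager API.
final class ParallelismSketch {

    // Choose how many tasks a consumer vertex should run, given the number of
    // bytes its producers actually wrote. The result never exceeds the task
    // count configured up front and is never lower than one.
    static int chooseParallelism(long observedBytes, long desiredBytesPerTask, int configuredTasks) {
        if (observedBytes < 0 || desiredBytesPerTask <= 0 || configuredTasks <= 0) {
            throw new IllegalArgumentException("sizes and task counts must be positive");
        }
        // Ceiling division: enough tasks so that none handles more than the desired size.
        long needed = (observedBytes + desiredBytesPerTask - 1) / desiredBytesPerTask;
        long clamped = Math.max(1L, Math.min(needed, (long) configuredTasks));
        return (int) clamped;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Little data arrived: shrink from 50 configured tasks to 3.
        System.out.println(chooseParallelism(300 * mb, 100 * mb, 50));   // prints 3
        // A lot of data arrived: stay at the configured ceiling.
        System.out.println(chooseParallelism(10240 * mb, 100 * mb, 50)); // prints 50
    }
}
```

The static plan would have launched 50 tasks in both cases. Deferring the choice until the data size is observed avoids starting dozens of nearly empty tasks in the first case.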
YARN manages resources in a Hadoop cluster, based on cluster capacity and load. The Tez execution engine framework efficiently acquires resources from YARN and reuses every component in the pipeline such that no operation is duplicated unnecessarily.
Tez defines a simple Java API to express a DAG of data processing. The API has three components:
- DAG – this defines the overall job. The user creates a DAG object for each data processing job.
- Vertex – this defines the user logic and the resources & environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
- Edge – this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
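To make the relationship between the three components concrete, here is a toy model written in plain Java. It deliberately does not use the real Tez classes: `ToyDag`, `addVertex`, `addEdge` and `executionOrder` are hypothetical names, and vertices are reduced to strings. The sketch only shows how vertices and producer-to-consumer edges together determine an order in which the steps of a job can run, here computed with Kahn's topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a job graph; class and method names are NOT the Tez API.
final class ToyDag {
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    private final Map<String, Integer> inputCounts = new LinkedHashMap<>();

    // Register one step of the job.
    ToyDag addVertex(String name) {
        if (consumers.containsKey(name)) {
            throw new IllegalArgumentException("duplicate vertex: " + name);
        }
        consumers.put(name, new ArrayList<>());
        inputCounts.put(name, 0);
        return this;
    }

    // Connect a producer vertex to a consumer vertex.
    ToyDag addEdge(String producer, String consumer) {
        if (!consumers.containsKey(producer) || !consumers.containsKey(consumer)) {
            throw new IllegalArgumentException("both vertices must be added before the edge");
        }
        consumers.get(producer).add(consumer);
        inputCounts.merge(consumer, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes ready once all of its producers have run.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inputCounts);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                if (remaining.merge(consumer, -1, Integer::sum) == 0) {
                    ready.add(consumer);
                }
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
                .addVertex("scan_orders").addVertex("scan_users")
                .addVertex("join").addVertex("aggregate")
                .addEdge("scan_orders", "join")
                .addEdge("scan_users", "join")
                .addEdge("join", "aggregate");
        System.out.println(dag.executionOrder());
        // prints [scan_orders, scan_users, join, aggregate]
    }
}
```

In this toy version the two scan vertices have no producers, so both are ready immediately and could run in parallel; the join waits for both, and the aggregate waits for the join.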
Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf. This model comes with inherent costs for process startup and initialization, handling stragglers and allocating each container via the YARN resource manager.
Hive with Tez
As the de facto standard for SQL-in-Hadoop, Apache Hive is optimal for both batch and interactive queries at petabyte scale. Hive embeds Tez so that it can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability. Apache Tez innovations drove many of the Hive performance improvements delivered by the Stinger Initiative, a broad community effort that included contributions from 145 engineers across 44 different organizations. Tez helps make Hive interactive.
Recent Progress in Apache Tez
Originally developed by Hortonworks, the Apache Tez project entered the Apache Incubator in February 2013 and then graduated to a top-level project in July 2014. In just a short time, Tez has attracted 31 committers that represent a who’s who of leading Hadoop companies, including Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo. Significant contributions from this open community propelled Tez to become a cornerstone of core Apache projects like Apache Hive and Apache Pig and to adoption by other important open-source projects like Cascading.
Ecosystem Support for Apache Tez
Hortonworks Focus for Tez
Current work in Apache Tez innovation focuses on improvements to speed, scale and usability.
- Granular Counters – Additional counters, including input and output counters, help developers with diagnostics
- Frequent Updates – Updates come in as the job is running to simplify diagnosis of jobs that take longer than expected
- Improved Recovery – Reduce the amount of re-work after failures through more granular checkpointing