Apache™ Tez is an extensible framework for building YARN-based, high-performance batch and interactive data processing applications in Hadoop that need to handle terabyte- to petabyte-scale datasets. It allows projects in the Hadoop ecosystem, such as Apache Hive and Apache Pig, as well as third-party software vendors, to express fit-to-purpose data processing applications in a way that meets their unique demands for fast response times and extreme throughput at petabyte scale.
What Tez Does
Apache Tez provides a developer API and framework for writing native YARN applications that bridge the spectrum of interactive and batch workloads. It allows applications to scale from gigabytes to petabytes of data and from tens to thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate with YARN and perform well within mixed-workload Hadoop clusters.
Because Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over general-purpose, end-user-facing engines such as MapReduce and Spark. It also offers a customizable execution architecture that lets you express complex computations as dataflow graphs and supports dynamic performance optimizations based on real information about the data and the resources required to process it.
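To make the idea of a dataflow graph concrete, here is a minimal sketch in plain Python. It is not the Tez API: the `Dag` class, its methods, and the vertex names are hypothetical, chosen only to illustrate how a computation can be described as processing steps (vertices) connected by data movement (edges), with each step running once all of its producers have finished.

```python
from collections import defaultdict, deque


class Dag:
    """Toy dataflow graph (illustrative only, not the Tez API)."""

    def __init__(self):
        self.steps = {}                      # vertex name -> function(list of inputs)
        self.downstream = defaultdict(list)  # vertex name -> consumer names
        self.upstream = defaultdict(list)    # vertex name -> producer names

    def add_vertex(self, name, func):
        self.steps[name] = func
        return self

    def add_edge(self, src, dst):
        self.downstream[src].append(dst)
        self.upstream[dst].append(src)
        return self

    def run(self):
        """Run each vertex after all of its producers (Kahn's algorithm)."""
        pending = {name: len(self.upstream[name]) for name in self.steps}
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            # Inputs arrive in the order the incoming edges were added.
            inputs = [results[producer] for producer in self.upstream[name]]
            results[name] = self.steps[name](inputs)
            for consumer in self.downstream[name]:
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    ready.append(consumer)
        if len(results) != len(self.steps):
            raise ValueError("graph contains a cycle")
        return results


def sum_by_key(inputs):
    """Aggregate (key, value) pairs from the single upstream vertex."""
    totals = defaultdict(int)
    for key, value in inputs[0]:
        totals[key] += value
    return dict(totals)


# Two scans feed a join, whose output feeds an aggregation.
dag = (
    Dag()
    .add_vertex("scan_orders", lambda _: [("alice", 30), ("bob", 20), ("alice", 5)])
    .add_vertex("scan_users", lambda _: {"alice": "US", "bob": "DE"})
    .add_vertex("join", lambda inputs: [(inputs[1][user], amount) for user, amount in inputs[0]])
    .add_vertex("sum_by_country", sum_by_key)
    .add_edge("scan_orders", "join")
    .add_edge("scan_users", "join")
    .add_edge("join", "sum_by_country")
)

results = dag.run()
print(results["sum_by_country"])
```

The whole pipeline is declared as one graph before anything runs, which is what gives an execution engine the chance to plan and tune the steps together. A real engine would distribute the vertices across a cluster; this sketch runs them in a single process.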
Hive with Tez
As the de facto standard for SQL-in-Hadoop, Apache Hive has been optimized to serve both batch and interactive queries at petabyte scale. As of the 0.13 release, Hive embeds Tez so that it can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability across a wide range of use cases and dataset sizes. This advance was a key driver of the Stinger Initiative, a broad community effort that included contributions from 145 engineers across 44 different organizations. Tez helps make Hive interactive.
Tez and an Open Community
Originally developed by Hortonworks, the Apache Tez project entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014. In a short time, Tez has gathered 31 committers who represent a who’s who of leading Hadoop organizations, including Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo. The substantial contribution from this open community has propelled Tez to become a cornerstone of core Apache projects like Apache Hive and Apache Pig and to be embraced by other important open-source projects like Cascading. There is much more to come.
How Tez Works
The motivations, architecture, and performance gains of Apache Tez for data processing in Hadoop extend well beyond Hive and Pig, and the project has set the standard for true integration with YARN for interactive workloads. We invite you to learn more about Tez through the following links:
- Apache Tez: A New Chapter in Hadoop Data Processing
- Data Processing API in Apache Tez
- Runtime API in Apache Tez
- Writing a Tez Input/Processor/Output
- Apache Tez: Dynamic Graph Reconfiguration
- Reusing containers in Apache Tez
- Introducing Tez Sessions