A Framework for YARN-Based Data Processing Applications in Hadoop
Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.
Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.
Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.
Apache Tez’s improvements to data processing in Hadoop extend well beyond the gains seen in Apache Hive and Apache Pig. The project has set the standard for true integration with YARN for interactive workloads. The following short descriptions explain how Apache Tez completes core tasks.
Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data. A rich data flow definition API allows users to intuitively express complex query logic. The API fits well with query plans produced by higher-level declarative applications like Apache Hive and Apache Pig.
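The graph model described above can be illustrated with a small, framework-free sketch. It is not the Tez API: the class `DataflowGraphSketch` and its methods are hypothetical names chosen for illustration. Vertices stand for units of application logic, edges for data movement from a producer to a consumer, and a topological sort (Kahn's algorithm) yields one order in which the vertices could run.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a dataflow graph; illustrative only, not the Tez API.
public class DataflowGraphSketch {
    // Vertex name -> names of the vertices that consume its output.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // Vertex name -> number of incoming edges.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    public DataflowGraphSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    // An edge represents data moving from producer to consumer.
    public DataflowGraphSketch addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        downstream.get(producer).add(consumer);
        inDegree.merge(consumer, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes ready once all of its producers are done.
    public List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        // Two scans feed a join, which feeds an aggregation.
        DataflowGraphSketch graph = new DataflowGraphSketch()
                .addEdge("scan_orders", "join")
                .addEdge("scan_customers", "join")
                .addEdge("join", "aggregate");
        System.out.println(graph.executionOrder());
    }
}
```

A query plan with two table scans, a join and an aggregation maps onto this shape directly, which is the sense in which the API fits plans produced by declarative tools.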
Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. Input & Output determine the data format and how and where it is read or written. The Processor holds the data transformation logic. Tez does not impose any data format and only requires that Input, Processor and Output formats are compatible with each other.
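As a rough illustration of this composition, the sketch below wires an Input, a Processor and an Output together using plain Java interfaces. The interface shapes and the `runTask` helper are assumptions made for this example and do not reproduce the actual Tez runtime interfaces. The generic type parameters stand in for the compatibility requirement: the records the Input produces must be what the Processor accepts, and the Processor's results must be what the Output can write.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical Input/Processor/Output composition; illustrative only, not the Tez API.
public class VertexTaskSketch {
    // Decides where and in what form records are read.
    interface Input<T> {
        Iterable<T> read();
    }

    // Holds the data transformation logic.
    interface Processor<I, O> {
        O process(I record);
    }

    // Decides where and in what form records are written.
    interface Output<T> {
        void write(T record);
    }

    // The type parameters enforce that the three modules agree on record types.
    static <I, O> void runTask(Input<I> input, Processor<I, O> processor, Output<O> output) {
        for (I record : input.read()) {
            output.write(processor.process(record));
        }
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        Input<String> input = () -> Arrays.asList("tez", "yarn");
        Processor<String, String> upperCase = s -> s.toUpperCase(Locale.ROOT);
        Output<String> output = sink::add;

        runTask(input, upperCase, output);
        System.out.println(sink);
    }
}
```

Swapping the in-memory list for a file reader or a network writer would change only the Input or Output, leaving the Processor untouched, which is the separation the paragraph above describes.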
Distributed data processing is dynamic, and it is difficult to determine optimal data movement methods in advance. More information is available during runtime, which may help optimize the execution plan further. So Tez includes support for pluggable vertex management modules to collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
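One kind of runtime decision a vertex management module might make is choosing how many downstream tasks to run once the actual size of the upstream output is known. The sketch below is a simplified, hypothetical policy, not Tez's own implementation: the method name `chooseParallelism` and its parameters are assumptions for illustration. It divides the observed output size by a target amount of data per task and clamps the result to the configured maximum.

```java
// Hypothetical runtime parallelism policy; illustrative only, not Tez's implementation.
public class ParallelismSketch {
    /**
     * Picks a task count for a downstream vertex from the output sizes
     * actually observed on its upstream tasks.
     */
    static int chooseParallelism(long[] observedOutputBytes,
                                 long targetBytesPerTask,
                                 int configuredMax) {
        if (targetBytesPerTask <= 0 || configuredMax < 1) {
            throw new IllegalArgumentException("target size and maximum must be positive");
        }
        long totalBytes = 0;
        for (long bytes : observedOutputBytes) {
            totalBytes += bytes;
        }
        // Ceiling division: enough tasks that none exceeds the target size.
        long needed = (totalBytes + targetBytesPerTask - 1) / targetBytesPerTask;
        // Always run at least one task and never exceed the planned maximum.
        return (int) Math.max(1, Math.min(configuredMax, needed));
    }

    public static void main(String[] args) {
        // The plan allowed up to 100 tasks, but upstream produced only 75 MB.
        long[] observed = {40_000_000L, 25_000_000L, 10_000_000L};
        int tasks = chooseParallelism(observed, 50_000_000L, 100);
        System.out.println("downstream tasks: " + tasks);
    }
}
```

With these example numbers the policy settles on 2 tasks rather than the 100 allowed by the static plan, which shows how information available only at runtime can reduce resource usage.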
YARN manages resources in a Hadoop cluster based on cluster capacity and load. The Tez execution engine acquires resources from YARN efficiently and reuses every component in the pipeline so that no operation is duplicated unnecessarily.
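Tez defines a simple Java API to express a DAG of data processing. The API has three components: DAG, which defines the overall job; Vertex, which defines the user logic for one step of the job; and Edge, which defines the connection between producer and consumer vertices.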
Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf. This model comes with inherent costs for process startup and initialization, handling stragglers and allocating each container via the YARN resource manager.
As the de facto standard for SQL-in-Hadoop, Apache Hive serves both batch and interactive queries at petabyte scale. Hive embeds Tez so that it can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability. Apache Tez innovations drove many of the Hive performance improvements delivered by the Stinger Initiative, a broad community effort that included contributions from 145 engineers across 44 different organizations. Tez helps make Hive interactive.
Originally developed by Hortonworks, the Apache Tez project entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014. In a short time, Tez has attracted 31 committers who represent a who’s who of leading Hadoop companies, including Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo. Significant contributions from this open community propelled Tez to become a cornerstone of core Apache projects like Apache Hive and Apache Pig and led to its adoption by other important open-source projects like Cascading.
Current work on Apache Tez focuses on improvements to speed, scale and usability.
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, Nifi Registry, HAWQ, Zeppelin, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.