June 05, 2015

Big Data Processing on Steroids!

Sumeet Kumar Agrawal, principal product manager for the Big Data Edition product at Informatica, is our guest blogger. In this blog, he explains how Informatica’s Big Data Edition integrates with Tez and allows for significant performance gains.

Informatica Big Data Edition’s codeless visual development environment accelerates the ability of enterprises to take advantage of amazing innovations in big data to solve new challenges with skill sets that already exist within many organizations. Informatica natively integrates with big data platforms like Hadoop and NoSQL to enable next-generation big data solutions, including data warehouse optimization and 360 customer analytics.

As the Hadoop computing platform moves into its next phase with YARN, it has decoupled itself from MapReduce. MapReduce is no longer the only distributed processing framework for Hadoop, which opens up the opportunity to create alternative data processing paradigms to meet newer challenges. Apache Tez has emerged as the next-generation MapReduce framework, enabling a broad range of batch and interactive applications at petabyte scale.

What is Apache Tez?

I like to categorize Apache Tez as “MapReduce” on steroids! Apache Tez generalizes the MapReduce paradigm to execute a complex directed acyclic graph (DAG) of tasks. It represents the next logical step for Hadoop 2 and puts YARN and its more general-purpose resource management framework to work.
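
To make the DAG idea concrete, here is a minimal Python sketch. It is not Tez API code: it only models a hypothetical query (two scans, a join, an aggregation, and a write) as a graph of tasks and uses the standard library's `graphlib` module to derive one valid execution order. All task names are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: each task maps to the set of tasks it depends on.
# The names are illustrative only; they are not Tez or Informatica identifiers.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}


def execution_order(dag):
    """Return one valid order in which the tasks of the DAG can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

Expressed as a chain of separate MapReduce jobs, a plan like this would typically persist intermediate results between jobs. Describing it as a single DAG lets the engine see the whole plan at once, which is the generalization the paragraph above refers to.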

More information on Tez can be found in this blog.

Informatica Integration with Tez

One way Informatica innovates in the big data world is by building applications on leading edge open source technologies. Informatica’s Intelligent Data Platform allows data engineers to adapt quickly to new big data innovations like Tez with no changes to existing jobs, reducing risk and saving considerable development time for our customers. With Informatica and Tez integration, our customers gain benefits in performance while deriving greater value from Hadoop.

Hortonworks and the open source community have made Tez fully compatible with Hive. This makes Tez easy to integrate with applications that already run on Hive, without any change to existing logic. Using Informatica Big Data Edition, customers can utilize the power of Tez by simply specifying the following:

[Image: inf_tez_1, screenshot of the setting]

With this simple setting, Informatica can utilize the power of Tez for data integration and data quality jobs running on Hadoop.
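
The Informatica-side setting is not reproduced in this text. For comparison, in plain Hive the corresponding switch is the `hive.execution.engine` property. The Python sketch below only builds that session statement as a string. The helper name `engine_statement` is hypothetical, and how the statement is submitted (Beeline, JDBC, and so on) is deliberately left out.

```python
# A subset of the values Hive accepts for hive.execution.engine.
VALID_ENGINES = {"mr", "tez"}


def engine_statement(engine: str) -> str:
    """Build the Hive session statement that selects the execution engine.

    Hypothetical helper for illustration; it is not part of Hive or Informatica.
    """
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    return f"SET hive.execution.engine={engine};"


print(engine_statement("tez"))
```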

Performance Gains with Tez

Hortonworks has published various recommendations to tune jobs running with Tez. Similar to Hortonworks’ guidelines, we see major performance benefits with Tez under the following scenarios:

  1. ORC is used as the file format.
  2. Use cases involve complex transformation logic that would otherwise require multiple MapReduce stages (multiple joins, aggregations, etc.).
  3. Containers are pre-warmed, which further improves Tez performance.
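
As a rough illustration of the first and third points, the Python sketch below assembles Hive session statements for an ORC default file format and container pre-warming. The source does not name specific properties; `hive.default.fileformat`, `hive.prewarm.enabled`, and `hive.prewarm.numcontainers` are the Hive properties assumed here to correspond to them. The helper name and the container count are hypothetical, not recommended values.

```python
def tuning_statements(prewarm_containers: int) -> list[str]:
    """Build Hive session statements for ORC storage and container pre-warming.

    Hypothetical helper for illustration; the property names are assumptions
    about which Hive settings match the scenarios listed above.
    """
    if prewarm_containers < 1:
        raise ValueError("prewarm_containers must be at least 1")
    return [
        "SET hive.default.fileformat=ORC;",   # new tables default to ORC
        "SET hive.prewarm.enabled=true;",     # start Tez containers ahead of time
        f"SET hive.prewarm.numcontainers={prewarm_containers};",
    ]


for statement in tuning_statements(prewarm_containers=10):
    print(statement)
```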

For more information, please see this blog.

After applying these tuning guidelines, we have seen performance gains of up to 18 times over the default processing option for Hadoop.

Conclusion

Apache Tez is built to express complex query logic in an efficient manner and execute it with high performance. With Informatica Big Data Edition, customers can now utilize the power of Tez with no change to their existing work.

Having Tez fully compatible with Hive makes it easy for applications to integrate with it. Hive is the de facto SQL engine of Hadoop, supporting a broad ecosystem of applications and vendors.

As Hortonworks and the Hadoop open source community continue to innovate on Hive, with ongoing work on Hive LLAP and on Apache Spark and Apache Flink integrations, we at Informatica are committed to supporting the latest enhancements to deliver unparalleled business value and accelerate organizational transformation.

About the Author

Sumeet Kumar Agrawal is a Principal Product Manager for the Big Data Edition product at Informatica. Based in the Bay Area, Sumeet has over 8 years of experience working on different Informatica technologies. Sumeet is responsible for defining Informatica’s big data product strategy and roadmap, and for working with customers to define their big data platforms. Sumeet’s expertise includes the Hadoop ecosystem and security, as well as development-oriented technologies such as Java and web services. Sumeet is also responsible for evaluating Hadoop partner integration technologies for Informatica.

 
