This guest blog post is from Syncsort, a Hortonworks Technology Partner and certified on HDP 2.0, by Keith Kohl, Director, Product Management, Syncsort (@keithkohl)
Several years ago, Syncsort set out on a journey to contribute to the Apache Hadoop projects in order to open and extend Hadoop, and specifically the MapReduce processing framework. One of the contributions was to open up the sort (both the map-side sort and the reduce-side sort) and make it pluggable. This not only allows other sort implementations to be inserted, it also allows the sort to be skipped altogether.
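To make the idea of a pluggable sort concrete, here is a toy, single-process sketch in Python. It is not the Hadoop API and not how Syncsort's contribution is implemented; `run_job`, `default_sort`, and the `sorter` parameter are hypothetical names made up for illustration. It only shows the shape of the idea: the sort step is a replaceable component, and it can be left out entirely when a job does not need grouped keys.

```python
from itertools import groupby
from operator import itemgetter


def default_sort(pairs):
    """Stand-in for a framework's built-in sort: order (key, value) pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def run_job(records, mapper, reducer, sorter=default_sort):
    """Toy map/reduce run with a pluggable sort step (illustrative only).

    sorter: any callable that takes a list of (key, value) pairs and returns
            them in the order the reducer should see them. Pass None to skip
            sorting; each pair is then handed to the reducer on its own.
    """
    pairs = [kv for record in records for kv in mapper(record)]

    if sorter is None:
        # Sort avoided: no grouping by key, pairs flow through in map order.
        return [reducer(key, [value]) for key, value in pairs]

    ordered = sorter(pairs)
    # Group adjacent pairs that share a key, as a reducer would expect.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


def reverse_sort(pairs):
    """An alternative sort plugged in place of the default."""
    return sorted(pairs, key=itemgetter(0), reverse=True)


lines = ["b a", "a"]
with_default = run_job(lines, word_mapper, sum_reducer)
with_plugin = run_job(lines, word_mapper, sum_reducer, sorter=reverse_sort)
without_sort = run_job(lines, word_mapper, sum_reducer, sorter=None)
```

With the default sort the word counts come back in key order, with `reverse_sort` they come back in reverse key order, and with `sorter=None` the pairs pass through ungrouped in the order the mapper produced them.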
With the General Availability of Hadoop 2, pluggable sort is now a reality for all Hadoop 2-based distributions. With the GA of the Hortonworks Data Platform 2.0 (HDP 2.0), Syncsort is announcing that we are extending our partnership with Hortonworks, including support and certification of HDP 2.0 with YARN.
So what is YARN and why is it important? There is a lot of information on YARN on the Hortonworks web site, but in essence it separates the processing components (for instance, MapReduce) from resource management. It also enables a broader set of use cases for Hadoop and for data stored in HDFS, beyond MapReduce. But MapReduce is still there, and it sits on top of YARN.
I heard a quote the other day that really made me think about the experiences I hear from our customers and partners: 2013 was the year companies tried to find budget for Hadoop, 2014 is the year they ARE budgeting for Hadoop projects. But what are people doing with Hadoop and HDP?
ETL is a common use case for Hadoop, even though many people don’t realize they are doing “ETL”. Some call it data refinement or data management. At the end of the day, it’s ETL. If you’re joining data together and then doing some grouping, counting, averaging, etc. (that is, aggregations): yup, ETL. If you’re processing web logs to understand users’ behavior on your web site: ETL. Some estimate that 40% to 70% of Hadoop use cases today are ETL.
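As a small illustration of the "join, then aggregate" pattern described above, here is a minimal Python sketch. The data, field names, and the `join_and_aggregate` function are all made up for the example; a real job would read from HDFS and run across the cluster rather than in one process.

```python
from collections import defaultdict

# Hypothetical sample data: page views as (user_id, page, seconds on page)
# and a lookup of user_id -> country.
views = [
    ("u1", "/home", 12.0),
    ("u2", "/home", 8.0),
    ("u1", "/pricing", 30.0),
    ("u3", "/pricing", 10.0),
    ("u9", "/home", 5.0),  # no matching user record
]
users = {"u1": "US", "u2": "DE", "u3": "US"}


def join_and_aggregate(views, users):
    """Join views to users on user_id, then count views and average
    time on page per country."""
    totals = defaultdict(lambda: [0, 0.0])  # country -> [count, total seconds]
    for user_id, _page, seconds in views:
        country = users.get(user_id)
        if country is None:
            continue  # inner join: drop views without a matching user
        totals[country][0] += 1
        totals[country][1] += seconds
    return {
        country: {"views": count, "avg_seconds": total / count}
        for country, (count, total) in totals.items()
    }


result = join_and_aggregate(views, users)
```

The join, the grouping, the count, and the average are each a step that people often hand-code in Java, Pig, or HiveQL without calling it ETL.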
I’m personally excited about our relationship because of what our combined offering can bring to organizations. With our support and certification, we can now not only accelerate the sort for HDP 2.0 applications; our DMX-h offering also delivers an easy-to-use graphical interface with a fully functional ETL tool running natively in MapReduce on HDP 2.0 on YARN. And, BTW, that means no code generation. No Java. No HiveQL. No Pig. Yes, ETL.
Hortonworks did something pretty cool by providing users with a VM of a completely installed version of HDP called the Hortonworks Sandbox. A DMX-h test drive is now available in the Hortonworks Sandbox. We also include some sample job templates, like the use cases above, along with sample data.