Expanding Hadoop’s Reach with Microsoft
Microsoft and Hortonworks share a passion not just for delivering Apache Hadoop to the Microsoft Windows and Azure environments, but also for enabling the world of Microsoft users to gain value from Hadoop. Together, our engineers partner to deliver integration so that:
- Analysts can use Hadoop with tools like Microsoft Excel and Power BI,
- Developers can build Hadoop-enabled apps with .NET,
- Operators can manage Hadoop environments using Microsoft System Center.
What We Have Accomplished So Far
Microsoft and Hortonworks have collaborated in the open to expand the reach of Apache Hadoop and its ecosystem components.
Apache Hadoop on Windows
It all started with the HADOOP-8079 JIRA. Since then, Microsoft, Hortonworks and the Apache community have worked to make Windows a first-class operating system for Apache Hadoop. This work reached a significant milestone when Apache Hadoop release 2.1.0-beta officially supported Windows as a platform OS. Going forward, Windows support is part of every Apache Hadoop release.
PIG-2793 and HIVE-2998 are similar JIRAs that represent the work done to enable the Hadoop ecosystem projects to support the Windows OS. The Hortonworks Data Platform 2.0 for Windows includes the latest version of each Apache component, running on Windows.
Stinger: Hive and Interactive Query
Stinger is the initiative to reduce query execution time and increase SQL functionality for Apache Hive. Microsoft and Hortonworks have both been active participants in the Apache community's work on this initiative.
Microsoft joined forces with Hortonworks to form the founding committee of the Apache Tez project. Tez is a DAG (directed acyclic graph) engine that runs on YARN and provides an enhanced execution engine for Hive, improving query performance.
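To make the DAG idea concrete, here is a minimal sketch of how a query plan expressed as a directed acyclic graph can be ordered for execution. It is an illustration only and does not use the Tez API; the stage names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: each stage maps to the set of stages it depends on.
query_plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def execution_order(plan):
    """Return the stages in an order where every stage follows its inputs."""
    return list(TopologicalSorter(plan).static_order())


if __name__ == "__main__":
    for stage in execution_order(query_plan):
        print(stage)
```

A DAG engine can also run stages with no mutual dependency (here, the two scans) in parallel, which this sequential sketch does not attempt.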
Microsoft architects lent their expertise to the original design of the Optimized Row Columnar (ORC) format. ORC offers excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings, and bitmap encoding.
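Two of the techniques named above can be sketched in a few lines. This is a simplified illustration of run-length and dictionary encoding in general, not the ORC file format or its actual on-disk layout; the function names and sample column are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with an integer index into a table of distinct values."""
    dictionary = {}
    indices = []
    for s in strings:
        # A new string gets the next free index; a repeated one reuses its index.
        indices.append(dictionary.setdefault(s, len(dictionary)))
    return list(dictionary), indices


if __name__ == "__main__":
    country_column = ["US", "US", "US", "DE", "DE", "US"]  # hypothetical data
    table, indices = dictionary_encode(country_column)
    print(table, indices)
    print(run_length_encode(indices))
```

The two encodings compose well: dictionary encoding turns repeated strings into small integers, and run-length encoding then shrinks the long runs that sorted or low-cardinality columns tend to produce.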
Apache Hive 0.13 introduces vectorized query execution. Microsoft architects contributed to the design and delivery of this vital piece of Hive functionality. With vectorized query execution, Hive queries process batches of about a thousand rows at a time instead of a single row at a time. This substantially reduces CPU time and achieves a high instructions-per-cycle rate (i.e., better processor pipeline utilization).
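The difference between the two execution styles can be sketched as follows. This is a conceptual illustration rather than Hive's implementation: the batch size of 1024, the `price` column, and the function names are assumptions made for the example, and pure Python will not show the CPU-level gains described above.

```python
BATCH_SIZE = 1024  # assumption standing in for "about a thousand rows"


def sum_row_at_a_time(rows, threshold):
    """Row-oriented style: interpret the filter and sum once per row object."""
    total = 0
    for row in rows:
        if row["price"] > threshold:
            total += row["price"]
    return total


def column_batches(column, size=BATCH_SIZE):
    """Yield consecutive slices of a column, each holding at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def sum_vectorized(price_column, threshold):
    """Batch-oriented style: run a tight loop over each batch of one column."""
    total = 0
    for batch in column_batches(price_column):
        total += sum(price for price in batch if price > threshold)
    return total


if __name__ == "__main__":
    prices = list(range(3000))  # hypothetical column values
    rows = [{"price": p} for p in prices]
    print(sum_row_at_a_time(rows, 2500), sum_vectorized(prices, 2500))
```

In an engine written in a compiled language, the inner loop over a batch of primitive values avoids per-row virtual calls and branching, which is where the pipeline utilization benefit comes from.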
“The Microsoft data platform and SQL Server 2014 demonstrate Microsoft’s commitment to and innovation in the area of Big Data. A key element of our Big Data strategy can be seen in our work with Hortonworks: Together we are helping customers embrace Hadoop because it has become the Big Data standard. In this effort, Microsoft has logged over 6,000 engineering hours in the last year, committing code and jointly driving innovation across a range of open source projects including the Hive/Stinger initiative. In addition, we have committers on Hadoop, and Microsoft employee Chris Douglas is the Apache Working Group Chair for Hadoop.”
David Campbell, Microsoft Fellow and CTO.
Microsoft REEF delivers a machine learning framework for Hadoop 2
REEF (Retainable Evaluator Execution Framework) is a set of libraries that runs on top of YARN to simplify the creation and execution of machine learning jobs. Data science and data exploration are unusual in that jobs are often run multiple times, loading and reloading data while modifying details of the execution until an expected outcome is found. REEF allows a job to maintain state across runs and simplifies data versioning. It is a critical toolset for data scientists in the Hadoop 2 world.