Expanding Hadoop’s Reach with Microsoft

Collaboration in the open to expand the reach of Apache Hadoop and its ecosystem components

Microsoft and Hortonworks share a passion not just for delivering Apache Hadoop to the Microsoft Windows and Azure environments, but also for bringing Hadoop’s value to Microsoft users. Our engineers work together to integrate the platforms so that:

  • Analysts can use Hadoop with tools like Microsoft Excel and Power BI,
  • Developers can build Hadoop-enabled apps with .NET,
  • Operators can manage Hadoop environments using Microsoft System Center.

Initiative Goals

  • Focus on integration across the deployment, usage and management of Hadoop.
  • Ensure Hadoop works seamlessly on premises or in the cloud with Microsoft Azure.
  • Open community: provide leadership within the broad community and engage it to work with us.

What We Have Accomplished So Far

Microsoft and Hortonworks have collaborated in the open to expand the reach of Apache Hadoop and its ecosystem components.

Apache Hadoop on Windows

It all started with the HADOOP-8079 JIRA. Since then, Microsoft, Hortonworks and the Apache community have worked to make Windows a first-class operating system for Apache Hadoop. This work reached a significant milestone when Apache Hadoop release 2.1.0-beta officially supported Windows as a platform OS. Going forward, Windows support will be part of every Apache Hadoop release.

PIG-2793 and HIVE-2998 are similar JIRAs that track the work done so that Hadoop ecosystem projects support the Windows OS. Hortonworks Data Platform 2.0 for Windows supports the latest version of each Apache component, running on Windows.

Stinger: Hive and Interactive Query

Stinger is the initiative to improve query execution time and increase SQL functionality for Apache Hive. Microsoft and Hortonworks worked actively in the Apache community towards completing Stinger.

For example, Microsoft joined forces with Hortonworks to form the founding committee of the Apache Tez project. Tez is a DAG engine that runs on YARN and gives Hive an enhanced execution engine with better performance.

Microsoft architects also lent their expertise to the original design of the Optimized Row Columnar (ORC) format. ORC offers excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.
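To make two of these techniques concrete, here is a minimal Python sketch of dictionary encoding followed by run-length encoding on a single string column. It is only an illustration of the general ideas; the function names and the sample column are hypothetical, and this is not ORC's actual on-disk layout or writer API.

```python
from itertools import groupby


def dictionary_encode(strings):
    """Replace each string with an integer id into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    ids = {}          # value -> position in `dictionary`
    codes = []        # one small integer per input row
    for s in strings:
        if s not in ids:
            ids[s] = len(dictionary)
            dictionary.append(s)
        codes.append(ids[s])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


# Hypothetical sample column: a low-cardinality string field.
column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

# Decoding reverses both steps and recovers the original column.
restored = [dictionary[code] for code in run_length_decode(runs)]
```

Columns with few distinct values and long runs of repeats benefit most: the strings are stored once in the dictionary, and the integer codes shrink to a handful of runs.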

Apache Hive 0.13 introduces vectorized query execution. Microsoft architects contributed to the design and delivery of this vital piece of Hive functionality. With vectorized query execution, Hive processes rows in batches of about one thousand at a time, instead of a single row at a time. This substantially reduces CPU time and improves processor pipeline utilization.
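The difference between the two processing styles can be sketched in a few lines of Python. This is an illustration of the batching idea only, not Hive's implementation: the function names, the `price` and `qty` columns, and the batch size of 1024 are assumptions chosen for the example, and pure Python will not show the CPU gains that a vectorized engine gets from tight loops over primitive arrays.

```python
BATCH_SIZE = 1024  # assumed batch size, roughly "one thousand rows"


def sum_qty_row_at_a_time(rows, threshold):
    """Row-at-a-time style: evaluate the filter and aggregate one row per step."""
    total = 0
    for row in rows:
        if row["price"] > threshold:
            total += row["qty"]
    return total


def column_batches(columns, batch_size=BATCH_SIZE):
    """Yield slices of every column, batch_size rows at a time."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}


def sum_qty_vectorized(columns, threshold):
    """Batch style: filter a whole column slice, then aggregate the selected rows."""
    total = 0
    for batch in column_batches(columns):
        prices, qtys = batch["price"], batch["qty"]
        # The filter produces a list of selected row positions for the batch.
        selected = [i for i, price in enumerate(prices) if price > threshold]
        # The aggregate then touches only the selected positions.
        total += sum(qtys[i] for i in selected)
    return total


# Hypothetical data: 2,500 rows stored both row-wise and column-wise.
rows = [{"price": i % 50, "qty": i % 7} for i in range(2500)]
columns = {
    "price": [row["price"] for row in rows],
    "qty": [row["qty"] for row in rows],
}

row_result = sum_qty_row_at_a_time(rows, threshold=25)
batch_result = sum_qty_vectorized(columns, threshold=25)
```

Both functions return the same total. The batch version does its per-operator setup once per batch rather than once per row, which is where an engine written against primitive column arrays saves CPU time.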

“The Microsoft data platform and SQL Server 2014 demonstrate Microsoft’s commitment to and innovation in the area of Big Data. A key element of our Big Data strategy can be seen in our work with Hortonworks: Together we are helping customers embrace Hadoop because it has become the Big Data standard. In this effort, Microsoft has logged over 6,000 engineering hours in the last year, committing code and jointly driving innovation across a range of open source projects including the Hive/Stinger initiative. In addition, we have committers on Hadoop, and Microsoft employee Chris Douglas is the Apache Working Group Chair for Hadoop.”
David Campbell, Microsoft Fellow and CTO

Microsoft REEF delivers a machine learning framework for Hadoop 2

REEF (Retainable Evaluator Execution Framework) is a set of libraries that runs on top of YARN and simplifies the creation and execution of machine learning jobs. Data science differs from data exploration because data science jobs often run multiple times, loading and reloading data while the details of the execution are modified until the job reaches the expected outcome. The REEF framework allows a job to maintain state across multiple runs, which simplifies data versioning. It is a critical toolset for data scientists who use Hadoop 2.

Try HDP on Windows

Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise-grade, having been built, tested and hardened with enterprise rigor.
Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower-cost, higher-capacity infrastructure.