Microsoft’s Contributions to the Stinger Initiative and Apache Hive

Guest blog post from Eric Hanson, Principal Program Manager, Microsoft

Hadoop had a crazy and collaborative beginning as an OSS project, and that legacy continues. There have been over 1,200 contributors across 80 companies since its beginning. Microsoft has been contributing to Hadoop since October 2011, and we’re committed to giving back and keeping it open.

Our first wave of contributions, in collaboration with Hortonworks, has been to port Hadoop to Windows, to enable it both for our HDInsight service on Windows Azure and for on-premises Big Data installations on Windows. Now, we’re starting to contribute to the Stinger initiative to dramatically speed up Hive and make it more enterprise-ready.

Contribution to the core of Apache Hadoop through Stinger

Our main activity in Stinger right now is around Tez and vectorized query execution. One of our developers, Mike Liddell, has experience with DAG-based computations in Microsoft’s internal Dryad-LINQ effort, and has just joined Tez as a founding committer. I kick-started and helped guide our project to introduce columnstore data formats and vectorized (a.k.a. “batch mode”) query execution into SQL Server 2012. After moving to the SQL Server Big Data team, I’ve been collaborating with Hortonworks developers since late last fall on how to make Hive faster. We heard about the ORC project, led by Owen O’Malley of Hortonworks, to improve the RCFile columnstore format. I’ve had several productive design discussions with Owen about ORC, and we really like the way it’s shaping up.

Based on our experience, we knew that a great columnstore format is only part of the story when it comes to making data warehouse-style queries run really fast. A good process and communication architecture is another part, and Tez is a great step there. A third is fast query execution (QE): research and field experience have shown that vectorized query execution can speed up queries on the order of 10X-100X.

Some people were saying that fast QE required a total rewrite in C++. I didn’t buy that, so I prototyped vectorized scan and filter operators in Java and shared them with Hortonworks. For simple conditions like column = constant, we’ve seen the ability to filter about 150 million rows per second on one thread in Java. We now have a two-company team introducing vectorized QE to Hive, consisting of two Hortonworks folks (Jitendra Pandey and Owen) and several Microsoft engineers. We’re going to take it in small steps, adding vectorized scans over ORC and basic filter operations first. Then we’ll move on to vectorized aggregates and joins.
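
To give a flavor of what a vectorized filter for a condition like column = constant can look like, here is a minimal sketch in Java. It is not the prototype mentioned above and not Hive code: the class name VectorizedFilterSketch, the LongBatch type, and the filterEquals method are all hypothetical names made up for illustration. The idea it assumes is the common one of processing a batch of column values in a tight loop and recording the surviving row indexes in a selection vector, rather than evaluating the predicate one row object at a time.

```java
import java.util.Arrays;

// Illustrative sketch only: these names are hypothetical, not Hive's API.
final class VectorizedFilterSketch {

    /** One batch of values for a single long-typed column, plus a selection vector. */
    static final class LongBatch {
        final long[] values;   // column values for every row in the batch
        final int[] selected;  // indexes of rows that have passed all filters so far
        int size;              // number of valid entries in 'selected'

        LongBatch(long[] values) {
            this.values = values;
            this.selected = new int[values.length];
            for (int i = 0; i < values.length; i++) {
                selected[i] = i; // initially every row is selected
            }
            this.size = values.length;
        }
    }

    /**
     * Keeps only the selected rows where column == constant.
     * The selection vector is compacted in place; the new row count is returned.
     */
    static int filterEquals(LongBatch batch, long constant) {
        final long[] values = batch.values;
        final int[] selected = batch.selected;
        int out = 0;
        // Tight loop over primitive arrays: no per-row objects or virtual calls.
        for (int i = 0; i < batch.size; i++) {
            int row = selected[i];
            if (values[row] == constant) {
                selected[out++] = row; // out <= i, so this never clobbers unread entries
            }
        }
        batch.size = out;
        return out;
    }

    public static void main(String[] args) {
        LongBatch batch = new LongBatch(new long[] {7, 3, 7, 9, 7});
        int kept = filterEquals(batch, 7L);
        System.out.println(kept + " rows: "
                + Arrays.toString(Arrays.copyOf(batch.selected, kept)));
    }
}
```

Because the filter only rewrites the selection vector, a second filter on another column of the same batch could reuse it and touch only the rows that survived the first one. That is one plausible reason this style keeps per-row overhead low, though real performance depends on batch size, data types, and the JVM.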

We think that the functional surface area of Hive, including its SQL query language, the open, extensible storage model over HDFS, and its easy programmer extensibility with Java UDFs, is quite compelling. It gives non-procedural access to Big Data, with the ability for programmers to create custom Java add-ins that let them do complex calculations more easily than they can with Map-Reduce programs. Hive also has a strong community of OSS developers and users. It works on ultra-scale clusters on data sets vastly bigger than total cluster memory. Stinger aims to boost the speed of Hive to complement its rich functionality in a way that users will love.

An active participant in the open community

We’ve been part of the OSS Big Data world for about a year and a half now. Through the combined efforts of the overall Hadoop community, Microsoft, and Hortonworks, Hadoop is now accessible on Windows Server and Windows Azure. We’ve gained so much from the community. Now we’re helping return the favor by contributing to Stinger, with our eye on 100X performance gains.

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside)

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third-party data sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out, but to give you the idea of what the first proper interface on Hadoop feels like…

There was a comedian who had a bit about… remember when you saw Jurassic Park for the first time? No matter how old you were, your child-like response was, “DINOSAURS ARE REAL!!!!!!$!!$##!” Our reaction to Jurassic Park was CGI technology disrupting cinema, provoking the same kind of reaction early cinema provoked in viewers who felt real concern that the horse or train approaching would run them over. At least that’s what I learned wasting a lottery-funded academic scholarship on film classes at a state university before having the good sense to fail out and use my time productively.

That feeling you got when you saw your first CGI raptor is what Microsoft’s demo was like, except it went… “HADOOP IS IN EXCEL!!$%!%!%!$????!!!”

This is a serious thing for me, because I hooked up Pig and Excel years ago:

Which is a crappy demo of Hadoop connecting to Excel, but which gives me mucho moral authority to state that Microsoft’s demo was the right way to hook data to Excel. Take it from someone that spent half of his twenties trying to build web applications that could compete against Excel: until data is in Excel… it ain’t real. With Microsoft’s new offering… big data just got real.

To put this into perspective:

And just so you know I’m not bullshitting you about Hadoop and Big Data and Raptors and next thing you know you’re checking for your wallet and nodding awkwardly and trying to find a pause in this lunatic rant to get the hell out of here, I’ll just come out and tell you:

I have a raptor named lame-o-saurus in a Cowboy Curtis hat permanently tattooed on my body. Again, we resort to visualization (mind the hair):

To summarize:

  1. I am the world’s primary authority on the wrong way to hook Hadoop to Excel.
  2. I have strange tattoos which affirm the validity of my metaphors.
  3. Microsoft has fundamentally altered Big Data with their HDInsight offering.
  4. Yesterday, a breakthrough happened in the Regent Parlor of the Hilton, NYC.

Visicalc… we’ve come such a long way.

Enabling Big Data Insight for Millions of Windows Developers

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure that time and energy are spent on deriving insights from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

Since one of the world’s most widely used business intelligence tools is Microsoft Excel, and since Microsoft is arguably one of the best companies at enabling and mobilizing large and vibrant developer communities, needless to say we at Hortonworks are excited and bullish on the expansion of our partnership with Microsoft.

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premises and cloud deployments that is fully compatible with Apache Hadoop.

This news represents a significant inflection point for the big data market in general and for the importance of open source Apache Hadoop in particular. Unlocking the Windows Server and Windows Azure markets for Hadoop means more businesses will be able to tap into its benefits.

Moreover, these new offerings represent months of joint engineering work across both the Microsoft and Hortonworks engineering and product teams. Microsoft’s commitment to doing this work in a way that improves open source Apache Hadoop and related Apache projects has been unwavering, which translates into goodness for the open source community.

I encourage you to try out the fruits of our labors in one of two ways:

• Download Microsoft HDInsight Server and play with Hadoop on your own Windows machine.
• Access Windows Azure HDInsight Service and play with Hadoop in the cloud.

I encourage you to go to http://hortonworks.com/partners/microsoft/ in order to learn more and get started!

Finally, check out Microsoft’s announcement for more information! http://blogs.technet.com/b/dataplatforminsider/archive/2012/10/22/simplifying-big-data-for-the-enterprise.aspx