Week in Review: SQL IN Hadoop and Hive, Beyond Batch with YARN, NFS access to HDFS and HBase MTTR

Or as it’s more commonly being called: Week-ish in Review. Let’s recap on the latest – there’s some juicy technology goodness here.

Delivering on Stinger: Phase 1. Just this week, Hive 0.11 has been released. Owen (@owen_omalley) brought us the news that 55 – yes, fifty-five – developers from across the community have addressed 386 JIRA tickets and have delivered significant improvements to Hive along with an awesome demonstration of the power of community open-source development. Thanks to everyone! This release of Hive means that we’ve delivered on the first phase of the Stinger Initiative too – aiming to deliver 100x performance increases to Hive.

Taking Hadoop Beyond Batch with YARN. All of which means we step closer to delivering SQL-in-Hadoop and respond to the needs of enterprises for multi-application operating systems for their big data. Arun (@arunmurthy) gives a terrific update on Hadoop 2.0 and YARN and how that development will move Hadoop Beyond Batch. Stay tuned!

Delivering Enterprise Hadoop through MTTR for HBase and NFS access to HDFS. Meanwhile, Nicolas Liochon (@nkeywal) and Devaraj Das (@ddraj) provide an introduction on how HBase availability is being improved through work on Mean Time To Recover (MTTR) capabilities. And then Brandon Li (@brandonli11) and Suresh Srinivas (@suresh_m_s) updated us on progress to simplify data management through NFS access to HDFS. All critical stuff for the enterprise, and all driven through the community.

Microsoft love for .NET Hadoop fans. If you’re a .NET developer and have been missing out on a little Hadoop fun, then Microsoft has started pushing out SDKs and tutorials for its Hadoop-in-the-Cloud service – HDInsight – so you can fire up Visual Studio and get rocking on that big data.

Hadoop Summit Meetups. We only announced them this week, and they’re nearly full already. Still time to try and squeeze into one of the Meetups: Hive, Pig, HBase, YARN, Accumulo, Ambari, Oozie, Data Science and Architecture or maybe attend Big Data Camp or Machine Learning Evening on 25th June as part of Hadoop Summit.

Now it’s time to go play. Have a great weekend.

Week in Review: Hadoop Summit, Value of Big Data, and more Ambari

And we are just about done with this week. But not quite – dig into the conversation from the past few days.

Hadoop Summit. We published the vast majority of sessions (70 so far) for the Hadoop Summit in San Jose, 26-27 June. The sessions stretch across 7 tracks from Architecture to Economics and we hope you can join us for THE Hadoop community event of the year. You can register here, and the schedule is here.

Big Data Defined Part Deux: Value Definition. Jim picked up from the last Big Data definition and talked about it here. Regardless of your views on volume, variety and velocity there is one V to rule them all: Value.

Enterprise Data Analytics with Hortonworks and Datameer. I’ve been having a ton of fun with Datameer visualizations this week. If you want to learn a little more about enterprise analytics and how to better unlock the insights in your own data (with cool graphics) then take a look here.

Get Started with Ambari. We published a fun tutorial on setting up Ambari to provision, manage and monitor your Hadoop cluster. Better automation of management and monitoring means more time in the garden.

Until next week – stay frosty.

Field Report: OpenStack Summit – The Hadoop Bizarro World

PORTLAND – The Rose City is a great place, and this week it got even more interesting with the OpenStack Summit in town. I am more of a data geek and very rarely do I venture down the stack into infrastructure, but wow, there is something cool going on with the OpenStack community. I couldn’t help but get wrapped up in the excitement. Not only was the enthusiasm palpable, it was also very familiar. I don’t know if it was the organic buzz of Portland or not, but I felt a little like I was in Hadoop bizarro world.

Hadoop on OpenStack

Hortonworks was the only “app” vendor on the show floor and our story was well received.  When you partner with the leading code contributor (Red Hat) and the leading system integrator (Mirantis) and have existing relationships with the founders (Rackspace) of OpenStack, you get some relative street cred. But honestly, the attendees I spoke with were incredibly happy to see us at the event because they saw our joining the community was about contributing serious code and Hadoop experience to Project Savanna.  This is characteristic of a vibrant community of developers.

It really didn’t take a lot of explaining to open the eyes of the audience to the reality that “Hadoop is the Perfect App for OpenStack”. These guys and gals get it. They are looking for the right application to drive adoption of OpenStack, and Hadoop, with its new workloads for an enterprise, fits the bill. We look forward to seeing some crossover audience at Hadoop Summit when we roll out the first wave of our efforts by demonstrating the ease of deployment of Hadoop on OpenStack via the new Savanna project.

We were pretty busy on the show floor and were also invited by our friends at theCube (@furrier & @jefffrick ) to speak about Savanna and how Hadoop is good for OpenStack.  The video and corresponding article were great coverage.  Also, among a range of other press outlets picking up the story, the Register had a great summary of Project Savanna from the show floor.

Socialism v Capitalism

Being an Apache guy, I was curious as to how the OpenStack community is governed. With all these vendors in the building, it seemed there were a lot of powerful players involved. Who is in charge? I had a few conversations about this and it seemed to me that there is a healthy democracy with some very powerful parties and lobbyists involved. Sounded a bit like capitalism to me, which led me to a comparison with Apache…. Perhaps we are Socialism and OpenStack is Capitalism. ;)

I met and spoke to a few of the committee members for OpenStack, including Devin Carlen (Nebula) and Josh McKenty (@jmckenty & PistonCloud).  Both are founders of OpenStack, founders of companies and have contributed significantly to the project.  They were amused by my theory.

OpenStack Summit Growth: Enter Sales and Marketing into the Community

The show has historically been mostly a “real” summit where developers got together to discuss, design and code.  There is still a lot of that going on, but the influx of “business” was overwhelming.  The growth of the show demonstrates the importance of the project. To quote Rackspace, “Between OpenStack’s Folsom and Grizzly releases, OpenStack experienced a more than 50 percent growth in contributions. According to some of the businesses closest to the project, OpenStack isn’t just about writing code; it’s about creating an infrastructure everyone can use. It’s about creating something amazing.”  Enter business.

With some help from Chris Horne (@fpcguru) at CloudScaling and Fresh Perspective Consulting, I was able to analyze (no data science here, just marketing guy stuff) the attendee list. Out of 3000 registered, I would say close to one third were from the leading vendors in this space. This seems to be a pretty good mix for the show (and the community for that matter) and shows a vibrant range of adoption beyond the large players. There are some big names involved, and we can only expect that the countdown has started and OpenStack is set to take off.

The Third Coming

One of my most interesting conversations this week was with a financial analyst at the show who characterized OpenStack as “The Coming of The Third Generation of IT”. (Oh, I forgot to mention that financial analysts were all over the show as well. It seems everyone wants to know who this helps or hurts and which small company is gonna crush it.) This led me to explore what exactly gen 1 and 2 were. Perhaps the old world of mainframes and PCs in the 70s, 80s and early 90s was the first generation IT team. They were a group of pencil protected, flannel-shirt-wearing guys with big glasses who walked around with disks and screwdrivers. In the mid-nineties, we shifted into the second generation with client server and the Internet. Data centers grew up and a shift towards SaaS started.

Today, the third generation is becoming reality. The Cloud hype over the past few years provided us with PaaS and now with OpenStack, we may really see widespread adoption of IaaS. We know one thing: to fuel adoption of OpenStack and this new infrastructure, an application must come along to drive it. Funny enough, at the same time, Hadoop has established itself as the driver of net new workloads in an organization. This is the exact greenfield opportunity for the OpenStack enthusiast to help drive adoption. Hadoop is the Perfect App for OpenStack in this “Third Generation of IT”.

Hadoop, The Perfect App for OpenStack

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Big Data and Cloud Computing are Top Initiatives for IT Executives

A year ago, Barclays published its April 2012 CIO Survey where they stated “most CIOs rated the challenges of data growth and “Big Data” as the No. 1 trend driving IT spending decisions. We believe that “Big Data” is quickly becoming one of the biggest challenges within IT infrastructures given the shift to cloud computing and growth in unstructured data.”

Fast forward to CIO magazine’s 2013 State of the CIO Survey, and we see that cloud computing and big data continue as major themes. Of the IT executives surveyed, 39% expect to complete cloud computing initiatives and 37% expect to complete big data initiatives this year. On top of that, 59% of those surveyed classify their organization as late majority or laggards when it comes to adoption of big data initiatives.

Translation? Big data and Hadoop have crossed the chasm and are accelerating into the mainstream enterprise, which reinforces our assertion made back in January.

Hadoop and OpenStack Sitting in a Tree…

Hadoop is the leading open source platform for storing, processing and accessing large data sets across clusters of computers, and OpenStack is the leading open source framework for building and managing private, public and hybrid Infrastructure-as-a-Service (IaaS) clouds.

According to Gartner, big data will drive $232 billion in IT spending through 2016. The benefits to organizations for adding big data to their information management and analytics infrastructure will force a more rapid cycle of replacing existing solutions. Since Hadoop is a net-new workload for most organizations, Hadoop on OpenStack provides the perfect “greenfield” use case for those looking to start anew on a platform that makes sense.

Moreover, Hadoop and OpenStack are open source technologies designed for scale-out architectures that can be cost effectively deployed. Finally, since Hadoop can be complex to get started with, taking advantage of the operational agility and deployment choice that OpenStack enables will go a long way to jumpstarting those interested in deploying big data on cloud.

…First Comes Love, then Comes Marriage

At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software, and when it comes to platform technologies like Hadoop and OpenStack, we believe that community-driven open source will always outpace the innovation of a single group of people or single company.  We feel this is why both Hadoop and OpenStack have attracted major ecosystem players such as IBM, Red Hat, HP, Rackspace, Intel, and many others.

Our news today means we intend to marry, if you will, two of the largest open source movements in order to accelerate the perfect use case of big data on cloud. Specifically, we will be working within the OpenStack community on Project Savanna, originally introduced as an OpenStack project by Mirantis. The goal of the project is to enable OpenStack users to easily provision and manage elastic Hadoop clusters on OpenStack.

An important design point for Savanna is to provide an integration point for Hadoop provisioning and management frameworks, so we will focus on making sure Apache Ambari is well integrated. This will enable our enterprise customers to provision the Hortonworks Data Platform very quickly via OpenStack APIs and OpenStack’s Horizon Dashboard while managing their Hadoop cluster in a familiar way.

Next Steps?

For those who want to learn more about Savanna, Mirantis has a blog post and video that provides a great overview.

If you want to join the community effort, we encourage you to visit the Savanna project and get involved.

And finally, we will be demonstrating the initial fruits of our labor at the Hadoop Summit on June 26, 2013 in San Jose, California, so we encourage you to sign up for the conference!

Big Data Defined

‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.

Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:

In information technology, big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.

Specifically, a Big Data system has four properties:

  • It uses local storage to be fast but inexpensive
  • It uses clusters of commodity hardware to be inexpensive
  • It uses free software to be inexpensive
  • It is open source to avoid expensive vendor lock-in

Cheap storage means logging enormous volumes of data to many disks is easy. Processing this data is less so. Distributed systems which have the above four properties are disruptive because they are approximately 100 times cheaper than other systems for processing large volumes of data, and because they deliver high I/O performance for the buck.

Apache Hadoop is one such system. Hadoop ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of any other system.

SAN Storage    NAS Filers    Local Storage
$2-10/GB       $1-5/GB       $0.05/GB
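
To put rough numbers on that differential, here is a minimal back-of-the-envelope sketch in Python. It uses only the per-GB prices from the table above; the 1 PB dataset size and the function names are illustrative assumptions, not figures from any vendor.

```python
# Per-GB prices in US cents as (low, high) ranges, copied from the table above.
# Cents are used so the arithmetic stays in exact integers.
PRICE_CENTS_PER_GB = {
    "SAN Storage": (200, 1000),
    "NAS Filers": (100, 500),
    "Local Storage": (5, 5),
}


def ratio_vs_local(storage_type):
    """Return (low, high): how many times pricier than local storage per GB."""
    local_cents = PRICE_CENTS_PER_GB["Local Storage"][0]
    low, high = PRICE_CENTS_PER_GB[storage_type]
    return low / local_cents, high / local_cents


def total_cost_dollars(storage_type, gigabytes):
    """Return the (low, high) cost in dollars of storing `gigabytes` of data."""
    low, high = PRICE_CENTS_PER_GB[storage_type]
    return low * gigabytes / 100, high * gigabytes / 100


if __name__ == "__main__":
    dataset_gb = 1_000_000  # hypothetical dataset size (about 1 PB); an assumption
    for name in PRICE_CENTS_PER_GB:
        low_ratio, high_ratio = ratio_vs_local(name)
        low_cost, high_cost = total_cost_dollars(name, dataset_gb)
        print(f"{name}: {low_ratio:.0f}-{high_ratio:.0f}x local, "
              f"${low_cost:,.0f}-${high_cost:,.0f} for {dataset_gb:,} GB")
```

On these figures, SAN storage works out to 40 to 200 times the per-GB price of local disk, and NAS filers to 20 to 100 times, which is the same order of magnitude as the "approximately 100 times cheaper" claim above.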

It is out of this cost differential that our opportunity arises: to log every shred of data we can in the cheapest place possible. To provide access to this data across the organization. To mine our data for value. To undergo the transformative processes that unabridged access to data provides, enabling bigger, better, faster, more profound insight than ever before.

This is a working definition of Big Data.

What do you think? What is your definition of Big Data?

Hadoop Summit North America 2013: Community Choice Results

And the voting is over and the results are in for the Community Choice program of the Hadoop Summit San Jose 2013.

With over 300 sessions, and around 6000 users casting more than 15000 votes there was a lot of excitement to participate and influence the results - thanks to everyone for your contribution. At the end of the process, the selectees are:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

Congratulations to the selectees for each track, and a further honorable mention to Sears for winning the ‘Longest Session Title So Far’ which was a surprisingly hard fought contest!

The content selection committee will now be working hard to select the remaining sessions for the tracks, and we’ll cover those participants in more depth later.

With the Community Choice program complete we’re one step closer to a great event! Thanks again to everyone for taking part and stand by for more updates.

Hadoop Summit 2013 Amsterdam – It’s A Wrap!

We want to take a moment to thank everyone who attended the Hadoop Summit in Amsterdam - THANK YOU! With nearly 500 people registered for the event we think we can safely say it was a big success. We’ve had overwhelming support to do it again next year – so watch this space.

The awesome Beurs Van Berlage venue set us up for a series of fantastic conversations and really well attended sessions and talks as Hadoop continues to explode onto the enterprise scene. Outside of the main tracks, there was great attendance for NLHUG and BoF talks, and kudos to the 10 presenters who ran those lightning talks. Finally, the customer panel was also well received, with great practical advice on adopting Hadoop from HSBC, Neustar and eBay.

But of course it wouldn’t be an event without a party, and we had a great time at the Heineken Experience (from what we can remember).  We put some photos on our Facebook page, but @timoelliott did a much better job than us with this fantastic set on Flickr. This one shows the awesome venue:

hadoop summit exhibition hall

So did you enjoy the summit?  Head over to Facebook  and let us know your favorite part and why: keynotes, tracks, lightning talks, the sandbox experience in the dev cafe, or the party.

And here is a tiny selection of some of the most recent Tweets closing out the show:

Hadoop Summit Tweet

With the community voting for Hadoop Summit San Jose just about complete (you still have a few hours to take part), we are barely 3 months away from a whole bunch of new content and connections, and we hope you join us there too!

Thanks again!

Separating Open Source Signal from Enterprise Hadoop Noise

There have been many Apache Hadoop-related announcements the past few weeks, making it difficult to separate the signal from the marketing noise. One thing is crystal clear however… there is a large and growing appetite for Enterprise Hadoop because it helps unlock new insights and business opportunities in a way that was not previously technologically or economically feasible.

Enterprise and Open Source are NOT Mutually Exclusive

Dan Woods from Forbes recently penned an article entitled “Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD” where he paints a picture of enterprise and open source in opposite corners. As an example, he closes his article with:

 “If you are a CIO what do you choose? Open source ideology or products that are made to solve enterprise problems by enterprise companies?”

I take issue with that either/or stance; just look at Red Hat, JBoss, SpringSource, MySQL as well as the broad enterprise use of Apache Web Server and Apache Tomcat for examples of enterprise-class open source software. Our approach at Hortonworks is very much about providing a healthy mix of enterprise AND open source – with emphasis on the “AND”. Specifically, we identify and introduce enterprise requirements into the public domain (i.e. open source), we work with the community and partners to advance and incubate open source projects, and we apply enterprise rigor to provide the most stable and reliable distribution that our customers and partners can rely on.

While I take issue with the sentiment of the Forbes article, I agree with one of its thematic points: in order for Hadoop to flourish, it needs to factor in traditional enterprise “use-value participants”.

At Hortonworks, we work very closely with Teradata and Microsoft as “use-value participants” (to use the Forbes term) that are highly relevant to enterprise customers adopting big data strategies.

Why? For Enterprise Hadoop to be as impactful as it can be, our approach to the market needs to be BOTH direct and indirect. Working with partners like Teradata and Microsoft helps pull Enterprise Hadoop into the market in ways that are meaningful and valuable to enterprise customers.

Spotlight: Microsoft Adds Value By Working WITHIN The Community

Hortonworks and Microsoft engineers have worked side-by-side within the Apache community for the past 16 months. The focus has been on making Enterprise Hadoop easier to use and consume by mainstream enterprises. Specifically, the focus has been on Apache Hadoop and more recently Apache Hive (a la our Stinger Initiative aimed at making Hive 100X faster). We’ve also collaborated on making Hadoop applications faster and more secure by introducing new incubator projects such as Apache Tez and Apache Knox Gateway.

Moreover, a great example of the fruits of our joint efforts is our recent launch of the Hortonworks Data Platform for Windows, aimed at bringing the power of Hadoop to the large Windows ecosystem.

My point here is that Microsoft engineers have been spending serious time and energy working within the Apache Software Foundation on making various open source projects better.  A perfect example of this is a fact that many people may not be aware of. Chris Douglas, an engineer from Microsoft, was recently voted the V.P. of Hadoop. Chris earned this position by demonstrating leadership within the community.

We Feel One Of The Elephants Is Not Like The Others

By now, you’ve gotten the point that we believe enterprise and open source are NOT mutually exclusive. There are go-to-market approaches that can propagate or dispel this myth, however.

  1. Fork / Fragment: One approach is to forego working within the open source community and simply choose to harvest the open source work of others and then modify/bend that technology for specific commercial interests. Changes to the open source technology are intentionally done outside of the community and held back as “important enterprise value add”.  EMC and their Pivotal HD offering is an example of a strategy aimed at fragmenting the market in order to control a portion of the potential customers. See my recent blog post for more thoughts on this topic.
  2. Unite / Coalesce: Another approach is to work within the community on making the open source projects better and more capable of integrating seamlessly with enterprise-focused commercial offerings. Contributing all “value add” changes that should be in the open source projects directly into those projects helps ensure they become easier to use and consume by all. This approach is intended to enable a very large ecosystem to form around a common and consistent open source foundation. Hortonworks partnerships with Teradata and Microsoft are examples of how enterprise-focused solutions can be built on a common and interoperable base.

Both approaches are certainly valid…but with different consequences not only for the technology, but also the broader market / ecosystem. How so? Well, I will simply leave it as an exercise to you, the reader, to consider lessons learned from the UNIX wars (fragmented market) versus Linux (unified market on top of common Linux kernel).

At Hortonworks we are clearly encouraging the second approach, and we are excited to work with partners like Microsoft and others to add value directly into the open source projects in ways that make them easier to use and consume by enterprises.

We also believe that any company that thinks they are “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks who is focused on working with the ecosystem on ensuring Hadoop integrates and interoperates well with existing enterprise systems and tools.

There Is Still Much Work To Be Done…So Join Us On The Journey!

Hortonworks engineers have been privileged to help Hadoop mature from the domain of a small number of web monsters (including Yahoo!) to a technology that has crossed the chasm and made it onto a large number of CIOs’ agendas across mainstream enterprises. And as I noted in a recent blog post, there is an interesting road ahead of us.

The rise of Enterprise Hadoop offers a refreshing opportunity for our customers to benefit from a data platform that provides a compelling combination of technology, economic and business benefits. And delivering that enterprise value directly as well as indirectly through partners is what we are focused on.

Did EMC Just Say Fork You To The Hadoop Community?

 

In Derrick Harris’ article on GigaOM entitled “EMC to Hadoop competition: See ya, wouldn’t wanna be ya.”, EMC unveiled their new Pivotal HD offering which effectively re-architects the Greenplum analytic database so it sits on top of the Hadoop Distributed File System (HDFS). Scott Yara, Greenplum cofounder, is excited about the new product. Since a key focus for us at Hortonworks is to deeply integrate Hadoop with other data systems (a la our efforts with Teradata, Microsoft, MarkLogic, and others), I’m always excited to see data system providers like Greenplum decide to store their data natively in HDFS. And I can’t argue with Scott Yara’s sentiment that “I do think the center of gravity will move toward HDFS”.

Putting HDFS under a proprietary database does not make it Hadoop, however.

All in on Hadoop?

Glancing at the Pivotal HD diagram in the GigaOM article, they’ve made it easy to distinguish the EMC proprietary components in Blue from the Apache Hadoop-related components in Green. And according to Scott Yara, “We literally have over 300 engineers working on our Hadoop platform”.

Wow, that’s a lot of engineers focusing on Hadoop! Since Scott Yara admitted that “We’re all in on Hadoop, period.”, a large number of those engineers must be working on the open source Apache Hadoop-related projects labeled in Green in the diagram, right?

So a simple question is worth asking: How many of those 300 engineers are actually committers* to the open source projects Apache Hadoop, Apache Hive, Apache Pig, and Apache HBase?

John Furrier actually asked this question on Twitter and got a reply from Donald Miner from the Greenplum team. The thread is as follows:

Since I agree with John Furrier that understanding the number of committers is kinda related to the context of Scott Yara’s claim, I did a quick scan through the committers pages for Hadoop, Hive, Pig and HBase to seek out the large number of EMC engineers spending their time improving these open source projects. Hmmm….my quick scan yielded a curious absence of EMC engineers directly contributing to these Apache projects. Oh well, I guess the vast majority of those 300 engineers are working on the EMC proprietary technology in the blue boxes.

Why Do Committers Matter?

Simply put: Just because you can read Moby-Dick doesn’t make you talented enough to have authored it.

Committers matter because they are the talented authors who devote their time and energy to working within the Apache Software Foundation community adding features, fixing bugs, and reviewing and approving changes submitted by the other committers. At Hortonworks, we have over 50 committers, across the various Hadoop-related projects, authoring code and working with the community to make their projects better.

This is simply how the community-driven open source model works. And believe it or not, you actually have to be in the community before you can claim you are leading the community and authoring the code!

So when EMC says they are “all-in on Hadoop” but have nary a committer in sight, then that must mean they are “all-in for harvesting the work done by others in the Hadoop community”.  Kind of a neat marketing trick, don’t you think?

Scott Yara effectively says that it would take about $50 to $100 million and 300 engineers to do what they’ve done. Sounds expensive, hard, and untouchable, doesn’t it? Well, let’s take a close look at the Apache Hadoop community in comparison. Over the lifetime of just the Apache Hadoop project, there have been over 1200 people across more than 80 different companies or entities who have contributed code to Hadoop. Mr. Yara, I’ll see your 300 and raise you a community!

Are You Forking With Me?

So, assuming EMC has few or no committers on the relevant Apache open source projects, then one can only assume their strategy is to fork the Hadoop-related code and maintain their own proprietary version. If they are not actively authoring code within the community, then how else are they able to add important new features or fix critical bugs for enterprise customers?

Looking at the Pivotal HD diagram closely, I also wonder why the box at the foundation lists “HDFS or Isilon OneFS”. Doesn’t that just make you wonder how committed to HDFS EMC actually is? And how long it will take them to start throwing HDFS under the bus from a marketing perspective so they can sell more Isilon? They have to pay for those expensive 300 engineers somehow, right?

And Are They Forking EMC Customers and MapR Technologies While They’re At It?

For EMC customers, there is another important tidbit to note in the GigaOM article:

“Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.”

Yep, as confusing as it sounds, EMC had two Hadoop-like offerings, Greenplum HD and Greenplum MR. Pivotal HD appears to be a reswizzled rendition of Greenplum HD (with magical Hawq dust sprinkled on top).  And if you were one of those enterprise customers who EMC “appeased” by buying into Greenplum MR (with the OEM’d MapR distribution inside), then you’re either being abandoned and kicked to the curb or being presented with a fork in the road.

Either way, you are faced with a choice: do you ride out EMC’s changing course yet again or do you look for safer harbor elsewhere…

Choose Community Driven Open Source and Avoid Proprietary Lock-in

At Hortonworks, we believe in the relentless march of community driven open source as the fastest path to innovation and adoption of Apache Hadoop. We believe the most effective path is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs. I encourage you to read more.

We also believe that community driven open source offers the safest path forward since you’re not locked into the whims of a single vendor.

At Hortonworks, when we say “we are ALL IN on Hadoop”, we actually mean it!

And while my post may sound a little harsh, it’s important to note that we’d love to see EMC engineers, and anyone else for that matter, participate in the Apache community and make real contributions.  After all, at the end of the day, community rules!

 

NOTE: A committer is someone who has “earned their stripes” within the Apache community and has the ability to commit code directly to their corresponding Apache project source code tree. The Apache Hive project has a wiki page that provides a nice explanation of how this process works.

Apache HBase 0.94.5 is out!

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 JIRA issues resolved, with 61 bug fixes, 8 improvements, and 2 new features.

Most of the bug fixes went against the REST server, replication, region assignment, secure client, flaky unit tests, 0.92 compatibility and various stability improvements. Some of the interesting patches in this release are:
[HBASE-3996] – Support multiple tables and scanners as input to the mapper in map/reduce jobs
[HBASE-5416] – Improve performance of scans with some kind of filters.
[HBASE-7757] – Add web UI to REST server and Thrift server
[HBASE-7748] – Add DelimitedKeyPrefixRegionSplitPolicy
[HBASE-6669] – Add BigDecimalColumnInterpreter for doing aggregations using AggregationClient
[HBASE-7728] – Deadlock occurs between hlog roller and hlog syncer

The release candidate has been extensively tested by Hortonworks and many others in the community. You can roll out the 0.94.5 bits using rolling upgrade on top of 0.92 or 0.94 releases. In addition, Apache HBase 0.94.5 will be incorporated into an upcoming update to HDP 1.2.

You can download the new release from here, and find full release notes here.

Last, but not least, we would like to thank Lars Hofhansl, the release manager of the 0.94 branch, for driving the release train, and all 30 individuals who have contributed to this release.

Philosophy behind YARN Resource Management

YARN is part of the next generation Hadoop cluster compute environment. It creates a generic and flexible resource management framework to administer the compute resources in a Hadoop cluster. The YARN application framework allows multiple applications to negotiate resources for themselves and perform their application specific computations on a shared cluster. Thus, resource allocation lies at the heart of YARN.

YARN ultimately opens up Hadoop to additional compute frameworks, like Tez, so that an application can optimize compute for their specific requirements.

The YARN Resource Manager service is the central controlling authority for resource management and makes allocation decisions. It exposes a Scheduler API that is specifically designed to negotiate resources and not schedule tasks. Applications can request resources at different layers of the cluster topology such as nodes, racks etc. The scheduler determines how much and where to allocate based on resource availability and the configured sharing policy.

Currently, there are two sharing policies – fair scheduling and capacity scheduling. Thus, the API reflects the Resource Manager’s role as the resource allocator. This API design is also crucial for Resource Manager scalability because it limits the complexity of the operations to the size of the cluster and not the size of the tasks running on the cluster.

The actual task scheduling decisions are delegated to the application manager that runs the application logic. It decides when, where and how many tasks to run within the resources allocated to it. It has the flexibility to choose its locality, co-scheduling, co-location and other scheduling strategies.

 


 

Fundamentally, YARN resource scheduling is a 2-step framework with resource allocation done by YARN and task scheduling done by the application. This allows YARN to be a generic compute platform while still allowing flexibility of scheduling strategies. An analogy would be general purpose operating systems that allocate computer resources among concurrent processes.
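To make the two-step split concrete, here is a minimal toy model in Python. It is only a sketch: the class names `ToyResourceManager` and `ToyApplicationManager`, the container counts, and the first-come, first-served task policy are all assumptions made for illustration, and none of this is the real YARN API (which is Java). The point is simply that step 1 grants containers without knowing anything about tasks, and step 2 decides which task goes where.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rack: str
    free_containers: int


@dataclass
class ToyResourceManager:
    """Step 1 (hypothetical): hands out containers; knows nothing about tasks."""
    nodes: list

    def allocate(self, count, preferred_rack=None):
        # Try nodes on the requested rack first, then fall back to any node.
        ordered = sorted(self.nodes, key=lambda n: n.rack != preferred_rack)
        granted = []
        for node in ordered:
            while node.free_containers > 0 and len(granted) < count:
                node.free_containers -= 1
                granted.append(node.name)
        return granted


@dataclass
class ToyApplicationManager:
    """Step 2 (hypothetical): decides which task runs in which container."""
    pending_tasks: list
    assignments: dict = field(default_factory=dict)

    def schedule(self, containers):
        # Application-specific policy; here simply first-come, first-served.
        for container in containers:
            if not self.pending_tasks:
                break
            self.assignments[self.pending_tasks.pop(0)] = container
        return self.assignments


rm = ToyResourceManager([Node("n1", "rackA", 2), Node("n2", "rackB", 1)])
app = ToyApplicationManager(["map-0", "map-1", "reduce-0", "reduce-1"])

containers = rm.allocate(count=3, preferred_rack="rackB")
placement = app.schedule(containers)
print(containers)
print(placement)
```

Note that the allocator's work depends only on the number of nodes and containers, while everything task-related lives in the application side; swapping in a different `schedule` policy would not require touching the allocator.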

We envision YARN as the cluster operating system. It may be the case that this 2-step approach is slower than custom scheduling logic, but we believe that such problems can be alleviated by careful design and engineering. Having the custom scheduling logic reside inside the application allows the application to be run on any YARN cluster. This is important for creating a vibrant YARN application ecosystem (Tez is a good example of this) that can be easily deployed on any YARN cluster. Developing YARN scheduling libraries will reduce the developer effort needed to create application specific schedulers and YARN-103 is a step in that direction.

The Fastest Path to Innovation: Community Driven Open Source

 

Last week, we outlined our approach for delivering an enterprise viable Apache Hadoop distribution in the open. Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs.

In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software Foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications. These efforts focus on enterprise requirements that are essential to enable broad adoption across the Hadoop ecosystem.

  • The Stinger initiative aims to dramatically speed up Apache Hive in support of interactive query use cases.
  • The Knox Gateway addresses the need for a single point of authentication and secure access for Apache Hadoop services in a cluster.
  • The Tez framework provides an alternative next-generation runtime built on Hadoop YARN that significantly improves latency and throughput of Hadoop applications.

We feel these efforts are strong examples of our commitment to driving innovation from within the open source community, and as stated in our approach blog, we do this by:

  • identifying and articulating the enterprise requirements within the community,
  • taking an active role in addressing those requirements within the community, and
  • applying enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide are enterprise grade and interoperable with other elements in the enterprise.

Since it takes a community to build enterprise-class platforms like Hadoop, if you have interest in helping with Knox, Tez, or Stinger, we encourage you to work with us and the others in the Apache community!

Introducing… Tez: Accelerating processing of data stored in HDFS

 

MapReduce has served us well. For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created. While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns. A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce. A key step to enabling this new world was Apache YARN and today the community proposes the next step… Tez.

What is Tez?

Tez – Hindi for “speed” – (currently under incubation vote within Apache) provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application that can execute a complex DAG of tasks and that can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs, which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task. Thus, for example, any given SQL query can be expressed as a single job using Tez.
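As a rough illustration of what "a DAG of tasks within a single job" means, the following Python sketch runs a small four-stage query (two scans, a join, an aggregate) as one dependency graph, passing intermediate results in memory instead of writing them out between separate jobs. This is not the Tez API; the stage names, the sample data, and the `run_dag` helper are hypothetical and exist only to show the shape of the idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical stages of one query: scan two inputs, join them, aggregate.
def scan_orders(_inputs):
    return [("alice", 30), ("bob", 20), ("alice", 5)]


def scan_users(_inputs):
    return {"alice": "US", "bob": "DE"}


def join(inputs):
    users = inputs["scan_users"]
    return [(users[name], amount) for name, amount in inputs["scan_orders"]]


def aggregate(inputs):
    totals = {}
    for country, amount in inputs["join"]:
        totals[country] = totals.get(country, 0) + amount
    return totals


tasks = {
    "scan_orders": scan_orders,
    "scan_users": scan_users,
    "join": join,
    "aggregate": aggregate,
}

# Each task maps to the set of tasks whose output it consumes.
dependencies = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}


def run_dag(tasks, dependencies):
    """Run every task once, in dependency order, keeping results in memory."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies[name]}
        results[name] = tasks[name](inputs)
    return results


results = run_dag(tasks, dependencies)
print(results["aggregate"])
```

Expressed as chained MapReduce jobs, the join and the aggregate would typically be separate jobs with the join output materialized to the filesystem in between; in the sketch the whole graph is a single run and the join output never leaves memory.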

The below graphic illustrates the advantages provided by Tez for complex SQL queries in Apache Hive or complex Apache Pig scripts.

[Figure: pighivetez]

Tez is critical to the Stinger Initiative and goes a long way in helping Hive support both interactive queries and batch queries. Tez provides a single underlying framework to support both latency-sensitive and throughput-sensitive applications, thereby obviating the need for multiple frameworks and systems to be installed, maintained and supported. This is a key advantage to enterprises looking to rationalize their data architectures.

Essentially, Tez is the logical next step for Apache Hadoop after Apache Hadoop YARN. With YARN the community generalized Hadoop MapReduce to provide a general-purpose resource management framework (YARN) wherein MapReduce became merely one of the applications that could process data in your Hadoop cluster. With Tez, we build on YARN and our experience with MapReduce to provide a more general data-processing application to the benefit of the entire ecosystem, e.g. Apache Hive, Apache Pig etc.

What has been completed? Where can Tez go?

An early version of the project has been donated to the ASF as part of the initial code grant to establish the Incubation project. Through the work done in the Stinger initiative, it is already clear that Tez enables an order of magnitude increase in the performance of Apache Hive.

The community is also designing a re-usable set of libraries of data-processing primitives such as sorting, merging, data-shuffling, intermediate data management etc. which are necessary for Tez and may be used directly by other projects.  This is just the beginning.  It is an extensible architecture that will undoubtedly be contributed to widely.

For the community, by the community

At Hortonworks we believe that innovation happens fastest by working with a community of like-minded individuals to address the requirements for Hadoop without being constrained by artificial boundaries such as employment. As such, even though the Hortonworks MapReduce/Hive/Pig team seeded the project, we’ve had the benefit of both positive feedback and constructive criticism from several leading contributors and committers across the Apache Hadoop MapReduce, Apache Hive & Apache Pig projects. These inventors and peers are employed at Hortonworks, Yahoo, Facebook, Microsoft, Twitter and many others. The initial committer list has 22 participants with deep domain expertise in these unique challenges and comprises a who’s who in the Hadoop world. Of course, now that we are nearly in a position where we can co-develop via the Apache Software Foundation, where we have proposed Tez as an Incubator project, we expect a very quick acceleration of project development.

When will it be available?

We plan to donate the code from our internal repository to the ASF as part of the Incubator proposal.  Also, Hortonworks will ship Tez in the next alpha release of Hortonworks Data Platform 2 (HDP2), based on Apache Hadoop 2.0, very soon to showcase some of the very exciting advances we have made for Apache Hive via Project Stinger.

We are very excited by the reception Tez has received so far, and we do hope you can join us in this initiative via the Apache Software Foundation project to make Hadoop better!

The Stinger Initiative: Making Apache Hive 100 Times Faster

 

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface have become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or through a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Apache Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases.  These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

  • First, we are making Hive a more suitable tool for the decision support queries people want to perform on Hadoop. This includes adding analytics features like the OVER clause, support for subqueries in WHERE, and aligning Hive’s type system more with the standard SQL model.
  • Second, we are optimizing Hive’s query execution plans and based on our initial changes, we have already seen query time drop by 90% on some of our test queries. We are also looking at additional changes inside Hive’s execution engine that we believe will significantly increase the number of records per second that a Hive task can process.
  • Third, we have introduced a new columnar file format (i.e. ORCFile) within the Hive community to provide a more modern, efficient, and high performing way to store Hive data.
  • And lastly, we’ve introduced a new runtime framework, called Tez, which aims to eliminate Hive’s latency and throughput constraints that result from its reliance on MapReduce. Tez optimizes Hive job execution by eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. This optimizes the execution chain within Hadoop and drastically speeds Hive’s workload processing.
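For readers unfamiliar with the columnar idea behind the third point, here is a tiny Python sketch of the difference between row-oriented and column-oriented layout. It says nothing about how ORCFile is actually encoded on disk; the sample rows and the `to_columns` helper are assumptions for illustration only.

```python
# Row-oriented layout: every record carries all of its fields together.
rows = [
    {"user": "alice", "country": "US", "amount": 30},
    {"user": "bob", "country": "DE", "amount": 20},
    {"user": "carol", "country": "US", "amount": 5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query such as SUM(amount) now reads one contiguous list and never
# touches the "user" or "country" values at all.
total_amount = sum(columns["amount"])
print(columns["amount"], total_amount)
```

The same property is what lets a columnar format skip unneeded columns entirely when a query selects only a few fields from a wide table.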

All of these modifications to Hive are underway in the open and an initial preview will be available in advance of Hadoop Summit Amsterdam in March.

Embrace the community, Embrace Hive…

A diverse group of individuals within the Hive community is collaborating on these efforts, including contributors from SAP, Microsoft, Facebook and Hortonworks.

Harish Butani from SAP has led an effort to add analytics and windowing functions to Hive.  This will add the OVER clause for use with existing aggregate functions as well as adding analytics functions like RANK and NTILE and windowing functions like LEAD and LAG; you can see this work at HIVE-896.  Namit Jain from Facebook has been spending a lot of time lately optimizing Hive’s query execution planning so that it performs joins and other operations more efficiently and with less need for hints from the user.  Hortonworks engineers have been collaborating on these and other community efforts to improve Hive.
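To show what these functions compute, the sketch below emulates RANK, LAG and LEAD in plain Python over a small partitioned data set, roughly what a query of the form `RANK() OVER (PARTITION BY region ORDER BY total DESC)` would return. The sample data and the `windowed` helper are hypothetical; this illustrates the standard SQL semantics, not Hive's implementation.

```python
from itertools import groupby
from operator import itemgetter

sales = [
    {"region": "east", "rep": "ann", "total": 300},
    {"region": "east", "rep": "bo", "total": 300},
    {"region": "east", "rep": "cy", "total": 100},
    {"region": "west", "rep": "di", "total": 250},
]


def windowed(rows, partition_key, order_key):
    """Emulate RANK, LAG and LEAD over (PARTITION BY ... ORDER BY ... DESC)."""
    output = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        part = sorted(group, key=itemgetter(order_key), reverse=True)
        for i, row in enumerate(part):
            # RANK: ties share a rank, and the next distinct value skips ahead.
            if i > 0 and row[order_key] == part[i - 1][order_key]:
                rank = output[-1]["rank"]
            else:
                rank = i + 1
            output.append({
                **row,
                "rank": rank,
                # LAG / LEAD: previous / next value within the same partition.
                "lag": part[i - 1][order_key] if i > 0 else None,
                "lead": part[i + 1][order_key] if i + 1 < len(part) else None,
            })
    return output


result = windowed(sales, "region", "total")
for row in result:
    print(row)
```

Unlike a GROUP BY aggregate, every input row survives in the output; the window functions only add columns computed from neighbouring rows in the same partition.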

Owen O’Malley, a Hortonworks co-founder and early Hadoop developer, has been working with Facebook on the new ORCFile in order to greatly improve performance when Hive is reading, writing, and processing data; you can see this work at HIVE-3874. We are also working on further-reaching changes and optimizations such as reworking Hive’s operators to process records in blocks of a thousand or more and thus be much more efficient than they are today.
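The block-at-a-time idea can be sketched in a few lines of Python. The example below applies the same filter once per record and once per block of 1024 records, and counts how many times the "operator" is entered in each case. The block size, the filter, and the helper names are assumptions for illustration; this is not Hive's operator code and makes no claim about actual speedups.

```python
from itertools import islice


def batches(records, size=1024):
    """Yield successive lists of up to `size` records."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def filter_row_at_a_time(records, threshold):
    # One operator invocation per record.
    calls, kept = 0, []
    for value in records:
        calls += 1
        if value > threshold:
            kept.append(value)
    return kept, calls


def filter_in_blocks(records, threshold, size=1024):
    # One operator invocation per block; the inner loop stays tight.
    calls, kept = 0, []
    for block in batches(records, size):
        calls += 1
        kept.extend([v for v in block if v > threshold])
    return kept, calls


data = range(10_000)
row_result, row_calls = filter_row_at_a_time(data, 9_990)
block_result, block_calls = filter_in_blocks(data, 9_990)
print(row_calls, block_calls, row_result == block_result)
```

Both versions return the same records; the difference is that per-call overhead is paid once per block rather than once per record, which is the kind of saving the operator rework is aiming for.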

We believe the performance changes we are making today, along with the work being done in Tez, will transform Hive into a single tool that Hadoop users can use to do report generation, ad hoc queries, and large batch jobs spanning 10s or 100s of terabytes.

Why reinvent the wheel?
