Hortonworks Sandbox: Stinger, Visualizations and Virtualization

A couple of weeks ago, we released several new Hadoop tutorials showcasing real-life use cases, and you can read about them here. Today, we’re delighted to bring you the newest release of the Hortonworks Sandbox, version 1.3. The Hortonworks Sandbox allows you to go from Zero to Big Data in 15 Minutes through step-by-step hands-on Hadoop tutorials. The Sandbox is a fully functional single-node personal Hadoop environment, where you can add your own data sets, validate your Hadoop use cases and build a small proof-of-concept.

The new release, posted today, contains a number of enhancements:

Hortonworks Data Platform 1.3

A new release of the Hortonworks Sandbox will always follow the new release of the Hortonworks Data Platform (HDP). A few weeks ago, we released the Hortonworks Data Platform 1.3, and you can read about it here. The release reflects our belief in the relentless march of community-driven open source as the fastest path to innovation. It also includes our contributions to speeding up Hadoop queries through improvements to Hive 0.11 as part of the Stinger Initiative (SQL-in-Hadoop), which offers a 50x improvement in performance for queries.

Visualizations

New Visualization Functionality in Sandbox

We continue to improve the Sandbox user experience.

  1. With this release, we provide some basic visualizations for your Hive queries, built into the Sandbox. You can access this functionality from the Hive interface after you have run your query. You’ll see a new tab called “Visualizations”. This new feature will help ensure that your basic queries are correct before you surface your Sandbox data in other tools.
  2. When importing data into the Sandbox, the column type (string, float, etc.) and delimiter type are auto-detected. You can always override the default values.
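To give a feel for what auto-detection with overrides involves, here is a minimal Python sketch. It is not the Sandbox's actual implementation: the function names, the candidate delimiter list and the sample data are all hypothetical, and the delimiter check is a simple count that ignores quoted fields.

```python
import csv
import io

# Hypothetical list of delimiters to try; a real importer may support more.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def detect_delimiter(lines):
    """Pick the candidate that appears the same number of times on every
    line and yields the most fields. Falls back to a comma."""
    best, best_fields = ",", 1
    for delim in CANDIDATE_DELIMITERS:
        counts = {line.count(delim) for line in lines if line}
        if len(counts) == 1:
            fields = counts.pop() + 1
            if fields > best_fields:
                best, best_fields = delim, fields
    return best


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits all values."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def detect_schema(text, overrides=None):
    """Detect the delimiter and per-column types; user overrides win."""
    lines = text.strip().splitlines()
    delim = detect_delimiter(lines)
    rows = list(csv.reader(io.StringIO(text.strip()), delimiter=delim))
    header, data = rows[0], rows[1:]
    schema = {
        col: infer_type([row[i] for row in data])
        for i, col in enumerate(header)
    }
    schema.update(overrides or {})
    return delim, schema


# Made-up sample data for illustration only.
sample = "id|name|price\n1|widget|9.99\n2|gadget|12.50\n"
delimiter, schema = detect_schema(sample)
print(delimiter, schema)
```

Passing `overrides={"id": "string"}` to `detect_schema` mirrors the ability to override a detected default, for example when a numeric-looking identifier should be kept as text.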

New Virtualization Platforms

We’ve listened to your survey responses, tweets and emails and we’ve made some changes to make your experience better:

  1. HyperV Support. Running Windows 8 or Windows Server 2012 with a system that is enabled with virtualization support? We have a Sandbox for you!
  2. 32-bit Operating System Support. Have a Windows machine with a 32-bit OS? We’ve enabled the VirtualBox image to run on a 32-bit OS, including Windows 7, Windows 8 and Windows XP. The Sandbox still requires 4 GB of RAM and virtualization enabled in the BIOS, but you should be able to run the Sandbox in these environments.
  3. Improved VirtualBox implementation. The VirtualBox implementation has been modified so that the set up and installation is much easier. You no longer have to configure two network adapters. Simply accept the default settings when you import the appliance.

What do you need to do to get these new features? Download the Sandbox!

Looking for interesting datasets to play with? Check out these datasets:

As always, we’re eager to hear your feedback and uses of the Sandbox.

10 Reasons To Put “Hadoop Summit 2013” In Your Calendar

Hadoop Summit 2013 in San Jose is approaching quickly, and in just a few weeks attendees will have the opportunity to learn about all of the up-and-coming advances in the world of Apache Hadoop and Big Data. You can still register here!

Here are ten great reasons to pencil “Hadoop Summit 2013” into your calendar:

  1. Informative and exciting keynotes
    Keynotes will be given by Jer Thorpe, an artist and educator known for exploring the many-folded boundaries between science, data, art and culture and Merv Adrian, VP of research at Gartner who follows database, big data, NoSQL and adjacent technologies.
  2. Lightning talks
    These quick-hit informative talks will give you a broad perspective on the various applications of Hadoop.
  3. Expert panels
    Live panel discussions with industry leaders will include topics like SQL on Hadoop and Hadoop in the Enterprise.
  4. More than 90 sessions!
    The event will span two full days and will include 90+ sessions and speeches by over 50 organizations.
  5. Meet ups
    Socialize and learn at the pre-conference meet ups and attend the Birds of a Feather sessions, the Big Data Science “Machine Learning Evening”, or the Big Data Camp.
  6. Business use cases and reference architecture
    Get updated with the latest Hadoop use cases and reference architectures to gain insight for your business.
  7. Training classes
    Take a Hadoop Training class and become extra prepared for Hadoop Summit.
  8. First annual Hadoop Summit Bike Ride
    Cruise around Silicon Valley on two wheels and get some fun exercise in at the first-ever Summit bike ride.
  9. Hadoop Summit Party
    Celebrate your newly acquired knowledge and have some post-conference fun at the Tech Museum Summit party. Enjoy the museum’s many interactive exhibits as well as music, food and cocktails.
  10. Community
    Make new connections and share ideas with the rest of the Hadoop community.

Don’t miss this exciting opportunity and register now! See you there!

Week in Review: SQL IN Hadoop and Hive, Beyond Batch with YARN, NFS access to HDFS and HBase MTTR

Or as it’s more commonly being called: Week-ish in Review. Let’s recap on the latest – there’s some juicy technology goodness here.

Delivering on Stinger: Phase 1. Just this week, Hive 0.11 has been released. Owen (@owen_omalley) brought us the news that 55 – yes, fifty-five – developers from across the community have addressed 386 JIRA tickets and have delivered significant improvements to Hive along with an awesome demonstration of the power of community open-source development. Thanks to everyone! This release of Hive means that we’ve delivered on the first phase of the Stinger Initiative too, which aims to deliver 100x performance increases to Hive.

Taking Hadoop Beyond Batch with YARN. All of which means we step closer to delivering SQL-in-Hadoop and respond to the needs of enterprises for multi-application operating systems for their big data. Arun (@arunmurthy) gives a terrific update on Hadoop 2.0 and YARN and how that development will move Hadoop Beyond Batch. Stay tuned!

Delivering Enterprise Hadoop through MTTR for HBase and NFS access to HDFS. Meanwhile, Nicolas Liochon (@nkeywal) and Devaraj Das (@ddraj) provide an introduction on how HBase availability is being improved through work on Mean Time To Recover (MTTR) capabilities. And then Brandon Li (@brandonli11) and Suresh Srinivas (@suresh_m_s) updated us on progress to simplify data management through NFS access to HDFS. All critical stuff for the enterprise, and all driven through the community.

Microsoft love for .NET Hadoop fans. If you’re a .NET developer and have been missing out on a little Hadoop fun, then Microsoft has started pushing out SDKs and tutorials for its Hadoop-in-the-Cloud service – HDInsight – so you can fire up Visual Studio and get rocking on that big data.

Hadoop Summit Meetups. We only announced them this week, and they’re nearly full already. Still time to try and squeeze into one of the Meetups: Hive, Pig, HBase, YARN, Accumulo, Ambari, Oozie, Data Science and Architecture or maybe attend Big Data Camp or Machine Learning Evening on 25th June as part of Hadoop Summit.

Now it’s time to go play. Have a great weekend.

Week in Review: Hadoop Summit, Value of Big Data, and more Ambari

And we are just about done with this week. But not quite – dig into the conversation from the past few days.

Hadoop Summit. We published the vast majority of sessions (70 so far) for the Hadoop Summit in San Jose, 26-27 June. The sessions stretch across 7 tracks from Architecture to Economics and we hope you can join us for THE Hadoop community event of the year. You can register here, and the schedule is here.

Big Data Defined Part Deux: Value Definition. Jim picked up from the last Big Data definition and talked about it here. Regardless of your views on volume, variety and velocity there is one V to rule them all: Value.

Enterprise Data Analytics with Hortonworks and Datameer. I’ve been having a ton of fun with Datameer visualizations this week. If you want to learn a little more about enterprise analytics and how to better unlock the insights in your own data (with cool graphics) then take a look here.

Get Started with Ambari. We published a fun tutorial on setting up Ambari to provision, manage and monitor your Hadoop cluster. Better automation of management and monitoring means more time in the garden.

Until next week – stay frosty.

Field Report: OpenStack Summit – The Hadoop Bizarro World

PORTLAND – The Rose City is a great place and this week it got even more interesting with the OpenStack Summit in town. I am more of a data geek and very rarely do I venture down the stack into infrastructure, but wow, there is something cool going on with the OpenStack community. I couldn’t help but get wrapped up in the excitement. Not only was the enthusiasm palpable, it was also very familiar. I don’t know if it was the organic buzz of Portland or not, but I felt a little like I was in Hadoop bizarro world.

Hadoop on OpenStack

Hortonworks was the only “app” vendor on the show floor and our story was well received.  When you partner with the leading code contributor (Red Hat) and the leading system integrator (Mirantis) and have existing relationships with the founders (Rackspace) of OpenStack, you get some relative street cred. But honestly, the attendees I spoke with were incredibly happy to see us at the event because they saw our joining the community was about contributing serious code and Hadoop experience to Project Savanna.  This is characteristic of a vibrant community of developers.

It really didn’t take a lot of explaining to open the eyes of the audience to the reality that “Hadoop is the Perfect App for OpenStack”.  These guys and gals get it.  They are looking for the right application to drive adoption of OpenStack and Hadoop with its new workloads for an enterprise fits the bill. We look forward to seeing some crossover audience at Hadoop Summit when we roll out the first wave of our efforts by demonstrating the ease of deployment of Hadoop on OpenStack via the new Savanna project.

We were pretty busy on the show floor and were also invited by our friends at theCube (@furrier & @jefffrick ) to speak about Savanna and how Hadoop is good for OpenStack.  The video and corresponding article were great coverage.  Also, among a range of other press outlets picking up the story, the Register had a great summary of Project Savanna from the show floor.

Socialism v Capitalism

Being an Apache guy, I was curious as to how the OpenStack community is governed. With all these vendors in the building, it seemed there were a lot of powerful players involved. Who is in charge? I had a few conversations about this and it seemed to me that there is a healthy democracy with some very powerful parties and lobbyists involved. It sounded a bit like capitalism to me, which led me to a comparison with Apache…. Perhaps we are Socialism and OpenStack is Capitalism. ;)

I met and spoke to a few of the committee members for OpenStack, including Devin Carlen (Nebula) and Josh McKenty (@jmckenty & PistonCloud).  Both are founders of OpenStack, founders of companies and have contributed significantly to the project.  They were amused by my theory.

OpenStack Summit Growth: Enter Sales and Marketing into the Community

The show has historically been mostly a “real” summit where developers got together to discuss, design and code.  There is still a lot of that going on, but the influx of “business” was overwhelming.  The growth of the show demonstrates the importance of the project. To quote Rackspace, “Between OpenStack’s Folsom and Grizzly releases, OpenStack experienced a more than 50 percent growth in contributions. According to some of the businesses closest to the project, OpenStack isn’t just about writing code; it’s about creating an infrastructure everyone can use. It’s about creating something amazing.”  Enter business.

With some help from Chris Horne (@fpcguru) at CloudScaling and Fresh Perspective Consulting, I was able to analyze (no data science here, just marketing guy stuff) the attendee list. Out of 3000 registered, I would say close to one third were from the leading vendors in this space. This seems to be a pretty good mix for the show (and the community for that matter) and shows a vibrant range of adoption beyond the large players. There are some big names involved, and we can only expect that the countdown has started and OpenStack is set to take off.

The Third Coming

One of my most interesting conversations this week was with a financial analyst at the show who characterized OpenStack as “The Coming of The Third Generation of IT”. (Oh, I forgot to mention that financial analysts were all over the show as well. It seems everyone wants to know who this helps or hurts and which small company is gonna crush it.) This led me to explore what exactly gen 1 and gen 2 were. Perhaps the old world of mainframes and PCs in the 70s, 80s and early 90s was the first generation of IT. Its teams were groups of pencil-protected, flannel-shirt-wearing guys with big glasses who walked around with disks and screwdrivers. In the mid-nineties, we shifted into the second generation with client-server and the Internet. Data centers grew up and a shift towards SaaS started.

Today, the third generation is becoming reality. The Cloud hype over the past few years provided us with PaaS and now, with OpenStack, we may really see widespread adoption of IaaS. We know one thing: to fuel adoption of OpenStack and this new infrastructure, an application must come along that needs it. Funny enough, at the same time, Hadoop has established itself as the driver of net-new workloads in an organization. This is the exact greenfield opportunity for the OpenStack enthusiast to help drive adoption. Hadoop is the Perfect App for OpenStack in this “Third Generation of IT”.

Hadoop, The Perfect App for OpenStack

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. an OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Big Data and Cloud Computing are Top Initiatives for IT Executives

A year ago, Barclays published its April 2012 CIO Survey where they stated “most CIOs rated the challenges of data growth and “Big Data” as the No. 1 trend driving IT spending decisions. We believe that “Big Data” is quickly becoming one of the biggest challenges within IT infrastructures given the shift to cloud computing and growth in unstructured data.”

Fast forward to CIO magazine’s 2013 State of the CIO Survey, we see that cloud computing and big data continue as major themes. Of the IT executives surveyed, 39% expect to complete cloud computing initiatives and 37% expect to complete big data initiatives this year. On top of that, 59% of those surveyed classify their organization as late majority or laggards when it comes to adoption of big data initiatives.

Translation? Big data and Hadoop have crossed the chasm and are accelerating into the mainstream enterprise, which reinforces our assertion made back in January.

Hadoop and OpenStack Sitting in a Tree…

Hadoop is the leading open source platform for storing, processing and accessing large data sets across clusters of computers, and OpenStack is the leading open source framework for building and managing private, public and hybrid Infrastructure-as-a-Service (IaaS) clouds.

According to Gartner, big data will drive $232 billion in IT spending through 2016. The benefits to organizations of adding big data to their information management and analytics infrastructure will force a more rapid cycle of replacing existing solutions. Since Hadoop is a net-new workload for most organizations, Hadoop on OpenStack provides the perfect “greenfield” use case for those looking to start anew on a platform that makes sense.

Moreover, Hadoop and OpenStack are open source technologies designed for scale-out architectures that can be cost effectively deployed. Finally, since Hadoop can be complex to get started with, taking advantage of the operational agility and deployment choice that OpenStack enables will go a long way to jumpstarting those interested in deploying big data on cloud.

…First Comes Love, then Comes Marriage

At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software, and when it comes to platform technologies like Hadoop and OpenStack, we believe that community-driven open source will always outpace the innovation of a single group of people or single company.  We feel this is why both Hadoop and OpenStack have attracted major ecosystem players such as IBM, Red Hat, HP, Rackspace, Intel, and many others.

Our news today means we intend to marry, if you will, two of the largest open source movements in order to accelerate the perfect use case of big data on cloud. Specifically, we will be working within the OpenStack community on Project Savanna, originally introduced as an OpenStack project by Mirantis. The goal of the project is to enable OpenStack users to easily provision and manage elastic Hadoop clusters on OpenStack.

An important design point for Savanna is to provide an integration point for Hadoop provisioning and management frameworks, so we will focus on making sure Apache Ambari is well integrated. This will enable our enterprise customers to provision the Hortonworks Data Platform very quickly via OpenStack APIs and OpenStack’s Horizon Dashboard while managing their Hadoop cluster in a familiar way.

Next Steps?

For those who want to learn more about Savanna, Mirantis has a blog post and video that provides a great overview.

If you want to join the community effort, we encourage you to visit the Savanna project and get involved.

And finally, we will be demonstrating the initial fruits of our labor at the Hadoop Summit on June 26, 2013 in San Jose, California, so we encourage you to sign up for the conference!

 

Big Data Defined

‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.

Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:

In information technology, big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.

Specifically, a Big Data system has four properties:

  • It uses local storage to be fast but inexpensive
  • It uses clusters of commodity hardware to be inexpensive
  • It uses free software to be inexpensive
  • It is open source to avoid expensive vendor lock-in

Cheap storage means logging enormous volumes of data to many disks is easy. Processing this data is less so. Distributed systems which have the above four properties are disruptive because they are approximately 100 times cheaper than other systems for processing large volumes of data, and because they deliver high I/O performance for the buck.

Apache Hadoop is one such system. Hadoop ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of any other system.

Storage type     Cost per GB
SAN Storage      $2-10
NAS Filers       $1-5
Local Storage    $0.05
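To make the cost differential concrete, here is a rough Python sketch that applies the per-GB prices from the table above to a dataset. The 100 TB dataset size and the use of decimal units (1 TB = 1000 GB) are assumptions chosen for illustration. The figures cover raw capacity only and ignore replication, servers and operations.

```python
# Prices per GB in US cents (low, high), taken from the table above.
# Integer cents avoid floating-point rounding in the totals.
PRICE_CENTS_PER_GB = {
    "SAN Storage": (200, 1000),
    "NAS Filers": (100, 500),
    "Local Storage": (5, 5),
}


def cost_range_dollars(storage, gigabytes):
    """Return the (low, high) cost in whole dollars for raw capacity."""
    low, high = PRICE_CENTS_PER_GB[storage]
    return low * gigabytes // 100, high * gigabytes // 100


def ratio_to_local(storage):
    """Return how many times more expensive a tier is than local storage."""
    local = PRICE_CENTS_PER_GB["Local Storage"][0]
    low, high = PRICE_CENTS_PER_GB[storage]
    return low / local, high / local


# Hypothetical dataset size: 100 TB in decimal units.
DATASET_GB = 100 * 1000

for name in PRICE_CENTS_PER_GB:
    low, high = cost_range_dollars(name, DATASET_GB)
    low_ratio, high_ratio = ratio_to_local(name)
    print(f"{name}: ${low:,} to ${high:,} "
          f"({low_ratio:.0f}x to {high_ratio:.0f}x local storage)")
```

Under these assumptions, local storage for the dataset costs $5,000, against $200,000 to $1,000,000 on a SAN. That is a 40x to 200x gap, in line with the "approximately 100 times cheaper" figure.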

It is out of this cost differential that our opportunity arises: to log every shred of data we can in the cheapest place possible. To provide access to this data across the organization. To mine our data for value. To undergo the transformative processes that unabridged access to data provides, enabling bigger, better, faster, more profound insight than ever before.

This is a working definition of Big Data.

What do you think? What is your definition of Big Data?

Hadoop Summit North America 2013: Community Choice Results

And the voting is over and the results are in for the Community Choice program of the Hadoop Summit San Jose 2013.

With over 300 sessions, and around 6,000 users casting more than 15,000 votes, there was a lot of excitement to participate and influence the results. Thanks to everyone for your contribution. At the end of the process, the selectees are:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

Congratulations to the selectees for each track, and a further honorable mention to Sears for winning the ‘Longest Session Title So Far’ which was a surprisingly hard fought contest!

The content selection committee will now be working hard to select the remaining sessions for the tracks, and we’ll cover those participants in more depth later.

With the Community Choice program complete we’re one step closer to a great event! Thanks again to everyone for taking part and stand by for more updates.

Hadoop Summit 2013 Amsterdam – It’s A Wrap!

We want to take a moment to thank everyone who attended the Hadoop Summit in Amsterdam - THANK YOU! With nearly 500 people registered for the event, we think we can safely say it was a big success. We’ve had overwhelming support to do it again next year, so watch this space.

The awesome Beurs Van Berlage venue set us up for a series of fantastic conversations and really well-attended sessions and talks as Hadoop continues to explode onto the enterprise scene. Outside of the main tracks, there was great attendance for NLHUG and BoF talks, and kudos to the 10 presenters who ran those lightning talks. Finally, the customer panel was also well received, with great practical advice on adopting Hadoop from HSBC, Neustar and eBay.

But of course it wouldn’t be an event without a party, and we had a great time at the Heineken Experience (from what we can remember).  We put some photos on our Facebook page, but @timoelliott did a much better job than us with this fantastic set on Flickr. This one shows the awesome venue:

hadoop summit exhibition hall

So did you enjoy the summit?  Head over to Facebook  and let us know your favorite part and why: keynotes, tracks, lightning talks, the sandbox experience in the dev cafe, or the party.

And here is a tiny selection of some of the most recent Tweets closing out the show:

Hadoop Summit Tweet

Community voting for Hadoop Summit San Jose is just about complete (you still have a few hours to take part), and we are barely 3 months away from a whole bunch of new content and connections. We hope you join us there too!

Thanks again!

Separating Open Source Signal from Enterprise Hadoop Noise

There have been many Apache Hadoop-related announcements the past few weeks, making it difficult to separate the signal from the marketing noise. One thing is crystal clear however… there is a large and growing appetite for Enterprise Hadoop because it helps unlock new insights and business opportunities in a way that was not previously technologically or economically feasible.

Enterprise and Open Source are NOT Mutually Exclusive

Dan Woods from Forbes recently penned an article entitled “Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD”, where he paints a picture of enterprise and open source in opposite corners. As an example, he closes his article with:

 “If you are a CIO what do you choose? Open source ideology or products that are made to solve enterprise problems by enterprise companies?”

I take issue with that either/or stance; just look at Red Hat, JBoss, SpringSource and MySQL, as well as the broad enterprise use of Apache Web Server and Apache Tomcat, for examples of enterprise-class open source software. Our approach at Hortonworks is very much about providing a healthy mix of enterprise AND open source, with emphasis on the “AND”. Specifically, we identify and introduce enterprise requirements into the public domain (i.e. open source), we work with the community and partners to advance and incubate open source projects, and we apply enterprise rigor to provide the most stable and reliable distribution that our customers and partners can rely on.

While I take issue with the sentiment of the Forbes article, I agree with one of its thematic points: in order for Hadoop to flourish, it needs to factor in traditional enterprise “use-value participants”.

At Hortonworks, we work very closely with Teradata and Microsoft as “use-value participants” (to use the Forbes term) that are highly relevant to enterprise customers adopting big data strategies.

Why? For Enterprise Hadoop to be as impactful as it can be, our approach to the market needs to be BOTH direct and indirect. Working with partners like Teradata and Microsoft helps pull Enterprise Hadoop into the market in ways that are meaningful and valuable to enterprise customers.

Spotlight: Microsoft Adds Value By Working WITHIN The Community

Hortonworks and Microsoft engineers have worked side-by-side within the Apache community for the past 16 months. The focus has been on making Enterprise Hadoop easier to use and consume by mainstream enterprises. Specifically, the focus has been on Apache Hadoop and more recently Apache Hive (a la our Stinger Initiative aimed at making Hive 100X faster). We’ve also collaborated on making Hadoop applications faster and more secure by introducing new incubator projects such as Apache Tez and Apache Knox Gateway.

Moreover, a great example of the fruits of our joint efforts is our recent launch of the Hortonworks Data Platform for Windows, aimed at bringing the power of Hadoop to the large Windows ecosystem.

My point here is that Microsoft engineers have been spending serious time and energy working within the Apache Software Foundation on making various open source projects better.  A perfect example of this is a fact that many people may not be aware of. Chris Douglas, an engineer from Microsoft, was recently voted the V.P. of Hadoop. Chris earned this position by demonstrating leadership within the community.

We Feel One Of The Elephants Is Not Like The Others

By now, you’ve gotten the point that we believe enterprise and open source are NOT mutually exclusive. There are go-to-market approaches that can propagate or dispel this myth, however.

  1. Fork / Fragment: One approach is to forego working within the open source community and simply choose to harvest the open source work of others and then modify/bend that technology for specific commercial interests. Changes to the open source technology are intentionally done outside of the community and held back as “important enterprise value add”.  EMC and their Pivotal HD offering is an example of a strategy aimed at fragmenting the market in order to control a portion of the potential customers. See my recent blog post for more thoughts on this topic.
  2. Unite / Coalesce: Another approach is to work within the community on making the open source projects better and more capable of integrating seamlessly with enterprise-focused commercial offerings. Contributing all “value add” changes that should be in the open source projects directly into those projects helps ensure they become easier to use and consume by all. This approach is intended to enable a very large ecosystem to form around a common and consistent open source foundation. Hortonworks partnerships with Teradata and Microsoft are examples of how enterprise-focused solutions can be built on a common and interoperable base.

Both approaches are certainly valid…but with different consequences not only for the technology, but also the broader market / ecosystem. How so? Well, I will simply leave it as an exercise to you, the reader, to consider lessons learned from the UNIX wars (fragmented market) versus Linux (unified market on top of common Linux kernel).

At Hortonworks we are clearly encouraging the second approach, and we are excited to work with partners like Microsoft and others to add value directly into the open source projects in ways that make them easier to use and consume by enterprises.

We also believe that any company that thinks they are “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks who is focused on working with the ecosystem on ensuring Hadoop integrates and interoperates well with existing enterprise systems and tools.

There Is Still Much Work To Be Done…So Join Us On The Journey!

Hortonworks engineers have been privileged to help Hadoop mature from the domain of a small number of web monsters (including Yahoo!) to a technology that has crossed the chasm and onto a large number of CIOs’ agendas across mainstream enterprises. And as I noted in a recent blog post, there is an interesting road ahead of us.

The rise of Enterprise Hadoop offers a refreshing opportunity for our customers to benefit from a data platform that provides a compelling combination of technology, economic and business benefits. And delivering that enterprise value directly as well as indirectly through partners is what we are focused on.

Did EMC Just Say Fork You To The Hadoop Community?

 

In Derrick Harris’ article on GigaOM entitled “EMC to Hadoop competition: See ya, wouldn’t wanna be ya.”, EMC unveiled their new Pivotal HD offering which effectively re-architects the Greenplum analytic database so it sits on top of the Hadoop Distributed File System (HDFS). Scott Yara, Greenplum cofounder, is excited about the new product. Since a key focus for us at Hortonworks is to deeply integrate Hadoop with other data systems (a la our efforts with Teradata, Microsoft, MarkLogic, and others), I’m always excited to see data system providers like Greenplum decide to store their data natively in HDFS. And I can’t argue with Scott Yara’s sentiment that “I do think the center of gravity will move toward HDFS”.

But putting HDFS under a proprietary database does not make it Hadoop.

All in on Hadoop?

The Pivotal HD diagram in the GigaOM article makes it easy to distinguish the EMC proprietary components in Blue from the Apache Hadoop-related components in Green. And according to Scott Yara, “We literally have over 300 engineers working on our Hadoop platform.”

Wow, that’s a lot of engineers focusing on Hadoop! Since Scott Yara admitted that “We’re all in on Hadoop, period.”, a large number of those engineers must be working on the open source Apache Hadoop-related projects labeled in Green in the diagram, right?

So a simple question is worth asking: How many of those 300 engineers are actually committers* to the open source projects Apache Hadoop, Apache Hive, Apache Pig, and Apache HBase?

John Furrier actually asked this question on Twitter and got a reply from Donald Miner from the Greenplum team. The thread is as follows:

[Image: furrierTweet]

Since I agree with John Furrier that understanding the number of committers is kinda related to the context of Scott Yara’s claim, I did a quick scan through the committers pages for Hadoop, Hive, Pig and HBase to seek out the large number of EMC engineers spending their time improving these open source projects. Hmmm….my quick scan yielded a curious absence of EMC engineers directly contributing to these Apache projects. Oh well, I guess the vast majority of those 300 engineers are working on the EMC proprietary technology in the blue boxes.

Why Do Committers Matter?

Simply put: Just because you can read Moby-Dick doesn’t make you talented enough to have authored it.

Committers matter because they are the talented authors who devote their time and energy to working within the Apache Software Foundation community: adding features, fixing bugs, and reviewing and approving changes submitted by the other committers. At Hortonworks, we have over 50 committers, across the various Hadoop-related projects, authoring code and working with the community to make their projects better.

This is simply how the community-driven open source model works. And believe it or not, you actually have to be in the community before you can claim you are leading the community and authoring the code!

So when EMC says they are “all-in on Hadoop” but have nary a committer in sight, then that must mean they are “all-in for harvesting the work done by others in the Hadoop community”.  Kind of a neat marketing trick, don’t you think?

Scott Yara effectively says that it would take about $50 million to $100 million and 300 engineers to do what they’ve done. Sounds expensive, hard, and untouchable, doesn’t it? Well, let’s take a close look at the Apache Hadoop community in comparison.  Over the lifetime of just the Apache Hadoop project, there have been over 1200 people across more than 80 different companies or entities who have contributed code to Hadoop.  Mr. Yara, I’ll see your 300 and raise you a community!

Are You Forking With Me?

So, if EMC has few or no committers on the relevant Apache open source projects, then one can only assume their strategy is to fork the Hadoop-related code and maintain their own proprietary version. If they are not actively authoring code within the community, then how else are they able to add important new features or fix critical bugs for enterprise customers?

Looking at the Pivotal HD diagram closely, I also wonder why the box at the foundation lists “HDFS or Isilon OneFS”. Doesn’t that just make you wonder how committed to HDFS EMC actually is? And how long it will take them to start throwing HDFS under the bus from a marketing perspective so they can sell more Isilon? They have to pay for those expensive 300 engineers somehow, right?

And Are They Forking EMC Customers and MapR Technologies While They’re At It?

For EMC customers, there is another important tidbit to note in the GigaOM article:

“Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.”

Yep, as confusing as it sounds, EMC had two Hadoop-like offerings, Greenplum HD and Greenplum MR. Pivotal HD appears to be a reswizzled rendition of Greenplum HD (with magical Hawq dust sprinkled on top).  And if you were one of those enterprise customers who EMC “appeased” by buying into Greenplum MR (with the OEM’d MapR distribution inside), then you’re either being abandoned and kicked to the curb or being presented with a fork in the road.

Either way, you are faced with a choice: do you ride out EMC’s changing course yet again or do you look for safer harbor elsewhere…

Choose Community Driven Open Source and Avoid Proprietary Lock-in

At Hortonworks, we believe in the relentless march of community driven open source as the fastest path to innovation and adoption of Apache Hadoop. We believe the most effective path is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs. I encourage you to read more.

We also believe that community driven open source offers the safest path forward since you’re not locked into the whims of a single vendor.

At Hortonworks, when we say “we are ALL IN on Hadoop”, we actually mean it!

And while my post may sound a little harsh, it’s important to note that we’d love to see EMC engineers, and anyone else for that matter, participate in the Apache community and make real contributions.  After all, at the end of the day, community rules!

 

NOTE: A committer is someone who has “earned their stripes” within the Apache community and has the ability to commit code directly to their corresponding Apache project source code tree. The Apache Hive project has a wiki page that provides a nice explanation of how this process works.

Apache HBase 0.94.5 is out!

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release resolves 76 JIRA issues, including 61 bug fixes, 8 improvements, and 2 new features.

Most of the bug fixes went against the REST server, replication, region assignment, secure client, flaky unit tests, 0.92 compatibility and various stability improvements. Some of the interesting patches in this release are:
[HBASE-3996] – Support multiple tables and scanners as input to the mapper in map/reduce jobs
[HBASE-5416] – Improve performance of scans with some kind of filters.
[HBASE-7757] – Add web UI to REST server and Thrift server
[HBASE-7748] – Add DelimitedKeyPrefixRegionSplitPolicy
[HBASE-6669] – Add BigDecimalColumnInterpreter for doing aggregations using AggregationClient
[HBASE-7728] – Deadlock occurs between hlog roller and hlog syncer

The release candidate has been extensively tested by Hortonworks and many others in the community. You can roll out the 0.94.5 bits using a rolling upgrade on top of 0.92 or 0.94 releases. In addition, Apache HBase 0.94.5 will be incorporated into an upcoming update to HDP 1.2.

You can download the new release from here, and find full release notes here.

Last, but not least, we would like to thank Lars Hofhansl, the release manager of the 0.94 branch, for driving the release train, and all 30 individuals who have contributed to this release.

Philosophy behind YARN Resource Management

YARN is part of the next generation Hadoop cluster compute environment. It creates a generic and flexible resource management framework to administer the compute resources in a Hadoop cluster. The YARN application framework allows multiple applications to negotiate resources for themselves and perform their application specific computations on a shared cluster. Thus, resource allocation lies at the heart of YARN.

YARN ultimately opens up Hadoop to additional compute frameworks, like Tez, so that an application can optimize compute for their specific requirements.

The YARN Resource Manager service is the central controlling authority for resource management and makes allocation decisions. It exposes a Scheduler API that is specifically designed to negotiate resources and not schedule tasks. Applications can request resources at different layers of the cluster topology such as nodes, racks etc. The scheduler determines how much and where to allocate based on resource availability and the configured sharing policy.
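To make the idea of requesting resources at different layers of the cluster topology more concrete, here is a minimal toy sketch in Python. It is not the YARN Scheduler API: the `ToyCluster` class, its `allocate` method, and the node and rack names are all hypothetical, and a "resource" is simplified to one free container slot on a node. The sketch only shows the fallback order (requested node, then requested rack, then anywhere) under those assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical cluster model, not a YARN class."""
    free_slots: dict[str, int]  # node name -> free container slots
    rack_of: dict[str, str]     # node name -> rack name

    def allocate(self, node=None, rack=None):
        """Grant one slot: try the requested node, then its rack, then any node.

        Returns the name of the node chosen, or None if nothing is free.
        """
        candidate_groups = []
        if node is not None:
            candidate_groups.append([node])
        if rack is not None:
            candidate_groups.append(
                sorted(n for n, r in self.rack_of.items() if r == rack)
            )
        candidate_groups.append(sorted(self.free_slots))  # anywhere in the cluster

        for group in candidate_groups:
            for name in group:
                if self.free_slots.get(name, 0) > 0:
                    self.free_slots[name] -= 1
                    return name
        return None


# Hypothetical three-node cluster spread over two racks.
cluster = ToyCluster(
    free_slots={"n1": 1, "n2": 1, "n3": 1},
    rack_of={"n1": "r1", "n2": "r1", "n3": "r2"},
)
first = cluster.allocate(node="n1", rack="r1")   # node-local
second = cluster.allocate(node="n1", rack="r1")  # falls back to the rack
third = cluster.allocate(node="n1", rack="r1")   # falls back to any node
```

Note that the allocator only decides where a slot is granted. What runs in that slot is left to the application, which is the separation the Scheduler API is designed around.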

Currently, there are two sharing policies – fair scheduling and capacity scheduling. Thus, the API reflects the Resource Manager’s role as the resource allocator. This API design is also crucial for Resource Manager scalability because it limits the complexity of the operations to the size of the cluster and not the size of the tasks running on the cluster.

The actual task scheduling decisions are delegated to the application manager (the ApplicationMaster, in YARN’s terms) that runs the application logic. It decides when, where and how many tasks to run within the resources allocated to it. It has the flexibility to choose its locality, co-scheduling, co-location and other scheduling strategies.
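As a rough illustration of how the two sharing policies differ, the following Python sketch splits a pool of resource units in two ways. Both functions are hypothetical simplifications and not the code of either YARN scheduler: `capacity_shares` assumes each queue is simply given a configured fraction of the cluster, and `fair_shares` assumes a plain max-min fair split across applications with known demands. The real schedulers are considerably more involved.

```python
def capacity_shares(total, queue_fractions):
    """Capacity-style split: each queue gets its configured fraction of the cluster."""
    return {queue: total * fraction for queue, fraction in queue_fractions.items()}


def fair_shares(total, demands):
    """Max-min fair split: small demands are met in full, the rest is divided evenly."""
    shares = {}
    remaining = float(total)
    # Serve the smallest demand first so its unused share flows to the others.
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        equal_split = remaining / len(pending)
        app, demand = pending[0]
        grant = min(demand, equal_split)
        shares[app] = grant
        remaining -= grant
        pending = pending[1:]
    return shares


# Hypothetical inputs: 100 resource units, two queues, three applications.
by_capacity = capacity_shares(100, {"prod": 0.75, "dev": 0.25})
by_fairness = fair_shares(100, {"a": 10, "b": 50, "c": 80})
```

With these inputs, the capacity split depends only on the configured fractions, while the fair split gives application `a` everything it asked for and divides the remainder equally between `b` and `c`.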

 


 

Fundamentally, YARN resource scheduling is a 2-step framework with resource allocation done by YARN and task scheduling done by the application. This allows YARN to be a generic compute platform while still allowing flexibility of scheduling strategies. An analogy would be general purpose operating systems that allocate computer resources among concurrent processes.
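The second step of this framework, task scheduling inside the application, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not YARN code: `schedule_tasks` is a hypothetical helper, containers are represented only by the node they were granted on, and each task has a single preferred node (for example, where its data lives). The application first uses containers that match a task's preference and then fills the remaining containers with whatever tasks are left.

```python
def schedule_tasks(containers, tasks):
    """Application-side step: map tasks onto containers that were already granted.

    containers: list of node names, one entry per granted container.
    tasks: dict of task name -> preferred node.
    Returns (assignments, unscheduled), where assignments maps task -> node.
    """
    free = list(containers)
    assignments = {}

    # First pass: place a task on its preferred node when a container is there.
    for task, preferred in tasks.items():
        if preferred in free:
            free.remove(preferred)
            assignments[task] = preferred

    # Second pass: place the remaining tasks on any leftover container.
    unscheduled = []
    for task in tasks:
        if task in assignments:
            continue
        if free:
            assignments[task] = free.pop(0)
        else:
            unscheduled.append(task)
    return assignments, unscheduled


# Hypothetical grant of two containers for three tasks.
assignments, unscheduled = schedule_tasks(
    containers=["n1", "n2"],
    tasks={"t1": "n2", "t2": "n3", "t3": "n1"},
)
```

A different application could swap in its own strategy here (co-location, co-scheduling, and so on) without any change to how the containers were allocated in the first step.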

We envision YARN as the cluster operating system. It may be the case that this 2-step approach is slower than custom scheduling logic, but we believe that such problems can be alleviated by careful design and engineering. Having the custom scheduling logic reside inside the application allows the application to be run on any YARN cluster. This is important for creating a vibrant YARN application ecosystem (Tez is a good example of this) that can be easily deployed on any YARN cluster. Developing YARN scheduling libraries will reduce the developer effort needed to create application specific schedulers, and YARN-103 is a step in that direction.

The Fastest Path to Innovation: Community Driven Open Source

 

Last week, we outlined our approach for delivering an enterprise viable Apache Hadoop distribution in the open.  Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs.

In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software Foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications.  These efforts focus on enterprise requirements that are essential to enable broad adoption across the Hadoop ecosystem.

  • The Stinger initiative aims to dramatically speed up Apache Hive in support of interactive query use cases.
  • The Knox Gateway addresses the need for a single point of authentication and secure access for Apache Hadoop services in a cluster.
  • The Tez framework provides an alternative next-generation runtime built on Hadoop YARN that significantly improves latency and throughput of Hadoop applications.

We feel these efforts are strong examples of our commitment to driving innovation from within the open source community, and as stated in our approach blog, we do this by:

  • identifying and articulating the enterprise requirements within the community,
  • taking an active role in addressing those requirements within the community, and
  • applying enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide are enterprise grade and interoperable with other elements in the enterprise.

Since it takes a community to build enterprise-class platforms like Hadoop, if you are interested in helping with Knox, Tez, or Stinger, we encourage you to work with us and the others in the Apache community!
