Posts by Shaun Connolly:


Hadoop, The Perfect App for OpenStack

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Big Data and Cloud Computing are Top Initiatives for IT Executives

A year ago, Barclays published its April 2012 CIO Survey where they stated “most CIOs rated the challenges of data growth and “Big Data” as the No. 1 trend driving IT spending decisions. We believe that “Big Data” is quickly becoming one of the biggest challenges within IT infrastructures given the shift to cloud computing and growth in unstructured data.”

Fast forward to CIO magazine’s 2013 State of the CIO Survey, and cloud computing and big data continue as major themes. Of the IT executives surveyed, 39% expect to complete cloud computing initiatives and 37% expect to complete big data initiatives this year. On top of that, 59% of those surveyed classify their organization as late majority or laggards when it comes to adoption of big data initiatives.

Translation? Big data and Hadoop have crossed the chasm and are accelerating into the mainstream enterprise, which reinforces our assertion made back in January.

Hadoop and OpenStack Sitting in a Tree…

Hadoop is the leading open source platform for storing, processing and accessing large data sets across clusters of computers, and OpenStack is the leading open source framework for building and managing private, public and hybrid Infrastructure-as-a-Service (IaaS) clouds.

According to Gartner, big data will drive $232 billion in IT spending through 2016. The benefits to organizations for adding big data to their information management and analytics infrastructure will force a more rapid cycle of replacing existing solutions. Since Hadoop is a net-new workload for most organizations, Hadoop on OpenStack provides the perfect “greenfield” use case for those looking to start anew on a platform that makes sense.

Moreover, Hadoop and OpenStack are open source technologies designed for scale-out architectures that can be cost effectively deployed. Finally, since Hadoop can be complex to get started with, taking advantage of the operational agility and deployment choice that OpenStack enables will go a long way to jumpstarting those interested in deploying big data on cloud.

…First Comes Love, then Comes Marriage

At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software, and when it comes to platform technologies like Hadoop and OpenStack, we believe that community-driven open source will always outpace the innovation of a single group of people or single company.  We feel this is why both Hadoop and OpenStack have attracted major ecosystem players such as IBM, Red Hat, HP, Rackspace, Intel, and many others.

Our news today means we intend to marry, if you will, two of the largest open source movements in order to accelerate the perfect use case of big data on cloud. Specifically, we will be working within the OpenStack community on Project Savanna, originally introduced as an OpenStack project by Mirantis. The goal of the project is to enable OpenStack users to easily provision and manage elastic Hadoop clusters on OpenStack.


An important design point for Savanna is to provide an integration point for Hadoop provisioning and management frameworks, so we will focus on making sure Apache Ambari is well integrated. This will enable our enterprise customers to provision the Hortonworks Data Platform very quickly via OpenStack APIs and OpenStack’s Horizon Dashboard while managing their Hadoop cluster in a familiar way.

Next Steps?

For those who want to learn more about Savanna, Mirantis has a blog post and video that provide a great overview.

If you want to join the community effort, we encourage you to visit the Savanna project and get involved.

And finally, we will be demonstrating the initial fruits of our labor at the Hadoop Summit on June 26, 2013 in San Jose, California, so we encourage you to sign up for the conference!

 

Separating Open Source Signal from Enterprise Hadoop Noise

There have been many Apache Hadoop-related announcements the past few weeks, making it difficult to separate the signal from the marketing noise. One thing is crystal clear however… there is a large and growing appetite for Enterprise Hadoop because it helps unlock new insights and business opportunities in a way that was not previously technologically or economically feasible.

Enterprise and Open Source are NOT Mutually Exclusive

Dan Woods from Forbes recently penned an article entitled “Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD”, in which he paints a picture of enterprise and open source in opposite corners. As an example, he closes his article with:

 “If you are a CIO what do you choose? Open source ideology or products that are made to solve enterprise problems by enterprise companies?”

I take issue with that either/or stance; just look at Red Hat, JBoss, SpringSource, MySQL as well as the broad enterprise use of Apache Web Server and Apache Tomcat for examples of enterprise-class open source software. Our approach at Hortonworks is very much about providing a healthy mix of enterprise AND open source – with emphasis on the “AND”.  Specifically, we identify and introduce enterprise requirements into the public domain (i.e. open source), we work with the community and partners to advance and incubate open source projects, and we apply enterprise rigor to provide the most stable and reliable distribution that our customers and partners can rely on.

While I take issue with the sentiment of the Forbes article, I agree with one of its thematic points: in order for Hadoop to flourish, it needs to factor in traditional enterprise “use-value participants”.

At Hortonworks, we work very closely with Teradata and Microsoft as “use-value participants” (to use the Forbes term) that are highly relevant to enterprise customers adopting big data strategies.

Why? For Enterprise Hadoop to be as impactful as it can be, our approach to the market needs to be BOTH direct and indirect. Working with partners like Teradata and Microsoft helps pull Enterprise Hadoop into the market in ways that are meaningful and valuable to enterprise customers.

Spotlight: Microsoft Adds Value By Working WITHIN The Community

Hortonworks and Microsoft engineers have worked side-by-side within the Apache community for the past 16 months. The focus has been on making Enterprise Hadoop easier to use and consume by mainstream enterprises. Specifically, the focus has been on Apache Hadoop and more recently Apache Hive (a la our Stinger Initiative aimed at making Hive 100X faster). We’ve also collaborated on making Hadoop applications faster and more secure by introducing new incubator projects such as Apache Tez and Apache Knox Gateway.

Moreover, a great example of the fruits of our joint efforts is our recent launch of the Hortonworks Data Platform for Windows, aimed at bringing the power of Hadoop to the large Windows ecosystem.

My point here is that Microsoft engineers have been spending serious time and energy working within the Apache Software Foundation on making various open source projects better.  A perfect example of this is a fact that many people may not be aware of. Chris Douglas, an engineer from Microsoft, was recently voted the V.P. of Hadoop. Chris earned this position by demonstrating leadership within the community.

We Feel One Of The Elephants Is Not Like The Others

By now, you’ve gotten the point that we believe enterprise and open source are NOT mutually exclusive. There are go-to-market approaches that can propagate or dispel this myth, however.

  1. Fork / Fragment: One approach is to forgo working within the open source community and simply choose to harvest the open source work of others and then modify/bend that technology for specific commercial interests. Changes to the open source technology are intentionally done outside of the community and held back as “important enterprise value add”.  EMC and their Pivotal HD offering is an example of a strategy aimed at fragmenting the market in order to control a portion of the potential customers. See my recent blog post for more thoughts on this topic.
  2. Unite / Coalesce: Another approach is to work within the community on making the open source projects better and more capable of integrating seamlessly with enterprise-focused commercial offerings. Contributing all “value add” changes that should be in the open source projects directly into those projects helps ensure they become easier to use and consume by all. This approach is intended to enable a very large ecosystem to form around a common and consistent open source foundation. Hortonworks’ partnerships with Teradata and Microsoft are examples of how enterprise-focused solutions can be built on a common and interoperable base.

Both approaches are certainly valid…but with different consequences not only for the technology, but also the broader market / ecosystem. How so? Well, I will simply leave it as an exercise to you, the reader, to consider lessons learned from the UNIX wars (fragmented market) versus Linux (unified market on top of a common Linux kernel).

At Hortonworks we are clearly encouraging the second approach, and we are excited to work with partners like Microsoft and others to add value directly into the open source projects in ways that make them easier to use and consume by enterprises.

We also believe that any company that thinks it is “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks that is focused on working with the ecosystem to ensure Hadoop integrates and interoperates well with existing enterprise systems and tools.

There Is Still Much Work To Be Done…So Join Us On The Journey!

Hortonworks engineers have been privileged to help Hadoop mature from the domain of a small number of web monsters (including Yahoo!) to a technology that has crossed the chasm and onto a large number of CIOs’ agendas across mainstream enterprises. And as I noted in a recent blog post, there is an interesting road ahead of us.

The rise of Enterprise Hadoop offers a refreshing opportunity for our customers to benefit from a data platform that provides a compelling combination of technology, economic and business benefits. And delivering that enterprise value directly as well as indirectly through partners is what we are focused on.

Did EMC Just Say Fork You To The Hadoop Community?

 

In Derrick Harris’ article on GigaOM entitled “EMC to Hadoop competition: See ya, wouldn’t wanna be ya.”, EMC unveiled their new Pivotal HD offering which effectively re-architects the Greenplum analytic database so it sits on top of the Hadoop Distributed File System (HDFS). Scott Yara, Greenplum cofounder, is excited about the new product. Since a key focus for us at Hortonworks is to deeply integrate Hadoop with other data systems (a la our efforts with Teradata, Microsoft, MarkLogic, and others), I’m always excited to see data system providers like Greenplum decide to store their data natively in HDFS. And I can’t argue with Scott Yara’s sentiment that “I do think the center of gravity will move toward HDFS”.

But putting HDFS under a proprietary database does not make it Hadoop.

All in on Hadoop?

Glancing at the Pivotal HD diagram in the GigaOM article, they’ve made it easy to distinguish the EMC proprietary components in Blue from the Apache Hadoop-related components in Green. And according to Scott Yara, “We literally have over 300 engineers working on our Hadoop platform”.

Wow, that’s a lot of engineers focusing on Hadoop! Since Scott Yara admitted that “We’re all in on Hadoop, period.”, a large number of those engineers must be working on the open source Apache Hadoop-related projects labeled in Green in the diagram, right?

So a simple question is worth asking: How many of those 300 engineers are actually committers* to the open source projects Apache Hadoop, Apache Hive, Apache Pig, and Apache HBase?

John Furrier actually asked this question on Twitter and got a reply from Donald Miner from the Greenplum team. The thread is as follows:

Since I agree with John Furrier that understanding the number of committers is kinda related to the context of Scott Yara’s claim, I did a quick scan through the committers pages for Hadoop, Hive, Pig and HBase to seek out the large number of EMC engineers spending their time improving these open source projects. Hmmm….my quick scan yielded a curious absence of EMC engineers directly contributing to these Apache projects. Oh well, I guess the vast majority of those 300 engineers are working on the EMC proprietary technology in the blue boxes.

Why Do Committers Matter?

Simply put: Just because you can read Moby-Dick doesn’t make you talented enough to have authored it.

Committers matter because they are the talented authors who devote their time and energy to working within the Apache Software Foundation community adding features, fixing bugs, and reviewing and approving changes submitted by the other committers. At Hortonworks, we have over 50 committers, across the various Hadoop-related projects, authoring code and working with the community to make their projects better.

This is simply how the community-driven open source model works. And believe it or not, you actually have to be in the community before you can claim you are leading the community and authoring the code!

So when EMC says they are “all-in on Hadoop” but have nary a committer in sight, then that must mean they are “all-in for harvesting the work done by others in the Hadoop community”.  Kind of a neat marketing trick, don’t you think?

Scott Yara effectively says that it would take about $50 million to $100 million and 300 engineers to do what they’ve done. Sounds expensive, hard, and untouchable, doesn’t it? Well, let’s take a close look at the Apache Hadoop community in comparison.  Over the lifetime of just the Apache Hadoop project, there have been over 1200 people across more than 80 different companies or entities who have contributed code to Hadoop.  Mr. Yara, I’ll see your 300 and raise you a community!

Are You Forking With Me?

So, assuming EMC has few or no committers on the relevant Apache open source projects, then one can only assume their strategy is to fork the Hadoop-related code and maintain their own proprietary version. If they are not actively authoring code within the community, then how else are they able to add important new features or fix critical bugs for enterprise customers?

Looking at the Pivotal HD diagram closely, I also wonder why the box at the foundation lists “HDFS or Isilon OneFS”. Doesn’t that just make you wonder how committed to HDFS EMC actually is? And how long it will take them to start throwing HDFS under the bus from a marketing perspective so they can sell more Isilon? They have to pay for those expensive 300 engineers somehow, right?

And Are They Forking EMC Customers and MapR Technologies While They’re At It?

For EMC customers, there is another important tidbit to note in the GigaOM article:

“Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.”

Yep, as confusing as it sounds, EMC had two Hadoop-like offerings, Greenplum HD and Greenplum MR. Pivotal HD appears to be a reswizzled rendition of Greenplum HD (with magical Hawq dust sprinkled on top).  And if you were one of those enterprise customers who EMC “appeased” by buying into Greenplum MR (with the OEM’d MapR distribution inside), then you’re either being abandoned and kicked to the curb or being presented with a fork in the road.

Either way, you are faced with a choice: do you ride out EMC’s changing course yet again or do you look for safer harbor elsewhere…

Choose Community Driven Open Source and Avoid Proprietary Lock-in

At Hortonworks, we believe in the relentless march of community driven open source as the fastest path to innovation and adoption of Apache Hadoop. We believe the most effective path is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs. I encourage you to read more.

We also believe that community driven open source offers the safest path forward since you’re not locked into the whims of a single vendor.

At Hortonworks, when we say “we are ALL IN on Hadoop”, we actually mean it!

And while my post may sound a little harsh, it’s important to note that we’d love to see EMC engineers, and anyone else for that matter, participate in the Apache community and make real contributions.  After all, at the end of the day, community rules!

 

NOTE: A committer is someone who has “earned their stripes” within the Apache community and has the ability to commit code directly to their corresponding Apache project source code tree. The Apache Hive project has a wiki page that provides a nice explanation of how this process works.

The Fastest Path to Innovation: Community Driven Open Source

 

Last week, we outlined our approach for delivering an enterprise viable Apache Hadoop distribution in the open.  Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs.

In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software Foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications.  These efforts focus on enterprise requirements that are essential to enable broad adoption across the Hadoop ecosystem.

  • The Stinger initiative aims to dramatically speed up Apache Hive in support of interactive query use cases.
  • The Knox Gateway addresses the need for a single point of authentication and secure access for Apache Hadoop services in a cluster.
  • The Tez framework provides an alternative next-generation runtime built on Hadoop YARN that significantly improves latency and throughput of Hadoop applications.

We feel these efforts are strong examples of our commitment to driving innovation from within the open source community, and as stated in our approach blog, we do this by:

  • identifying and articulating the enterprise requirements within the community,
  • taking an active role in addressing those requirements within the community, and
  • applying enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide are enterprise grade and interoperable with other elements in the enterprise.

Since it takes a community to build enterprise-class platforms like Hadoop, if you have interest in helping with Knox, Tez, or Stinger, we encourage you to work with us and the others in the Apache community!

We Believe… in community driven Enterprise Apache Hadoop

 


At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software. In the spirit of openness, we think it’s important to share our perspectives around the broader context of how Apache Hadoop and Hortonworks came to be, what we are doing now, and why we believe our unique focus is good for Apache Hadoop, the ecosystem of Hadoop users, and for Hortonworks as well.

How did we get here? 

The core team here at Hortonworks started at Yahoo! where in 2005 Eric Baldeschwieler (aka “E14” and Hortonworks CTO) challenged Owen O’Malley (Hortonworks co-founder) and several others to solve a really hard problem: store and process the data on the internet in a simple, scalable and economically feasible way.  They looked at traditional storage approaches but quickly realized they just weren’t going to work for the type of data (much of it unstructured) and the sheer quantity Yahoo! would have to deal with.

The team’s first reaction, as is the norm, was to lock themselves in a room and come up with a prototype of a closed, proprietary system. With fantastic vision and oversight from E14 and Raymie Stata (former CTO, Yahoo), however, the team turned to the open-source community and in particular the Apache Software Foundation. This also included growing a large development team that included Doug Cutting, Arun Murthy (Hortonworks co-founder) and others who began to work with the community on what became known as Apache Hadoop – specifically HDFS and MapReduce.

The team quickly realized that by contributing their efforts into a community of like-minded individuals, the technology would innovate far faster.  At the same time, they’d enable other organizations to realize some of the same benefits that they were starting to see from their early efforts.  When organizations such as Facebook, LinkedIn, eBay, Powerset, Quantcast and others began picking up Hadoop and innovating in areas beyond the initial focus, it reinforced the fact that the choice of community driven open source was the right one.

A case in point: a small startup (Powerset) started working on a project to support tables on HDFS inspired by Google’s BigTable paper, and that effort turned into what’s now Apache HBase! Need more? Facebook started an effort to build a SQL layer on top of MapReduce, which became Apache Hive!

Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs. 

Like anything done in a big group, at times it can be a challenge, but it has proven time and again when it comes to platform technologies like Hadoop that community-driven open source will always outpace the innovation of a single group of people or single company.

Apache Hadoop usage at Yahoo! has grown to the point that today Hadoop is a foundational technology underlying a wide range of business-critical applications.  This is captured really well by Sumeet Singh, a Director of Product Management at Yahoo!, who recently outlined just how far their journey has come.

And as the team tasked with architecting and operating that infrastructure over many of those years, our Hortonworks engineers gained critical insights that have been diligently funneled back into the community to be addressed in the appropriate place: the open source projects at the Apache Software Foundation.  That process gave rise to a host of new projects that are now core to Hadoop (such as Apache Hadoop YARN, Apache HCatalog, and Apache Ambari, to go along with Apache Pig, Apache Hive, Apache HBase and many others).

What are we doing now?

After many years architecting and operating the Hadoop infrastructure at Yahoo! and contributing heavily to the open source community, E14 and 20+ Hadoop architects and engineers spun out of Yahoo! to form Hortonworks in 2011.  Having seen what it could do for Yahoo!, Facebook, eBay, LinkedIn and others, our singular objective is to focus on making Apache Hadoop into a platform that is easy to use and consume by the broader market of enterprise customers and partners.

And in doing so we maintain that same unwavering view as to how to approach the challenge:

  • identify and articulate the enterprise requirements within the community,
  • take an active role in addressing those requirements within the community, and
  • apply enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide are enterprise grade and interoperable with other elements in the enterprise.

To help us determine where to focus efforts, we spend a lot of time working with Hadoop users to understand the requirements for broader enterprise adoption, examples of which fall into the following categories:

  • Core Apache Hadoop
    Ensuring the core Apache Hadoop platform moves forward is a critical area of focus. All of the work happening on Apache Hadoop 2.0, including YARN, is aimed at ensuring Hadoop can continue to scale to meet the largest data processing needs as well as efficiently run a mix of workloads that serve batch, interactive, and online application needs. We are also working with others on some interesting incubating technologies in the community aimed at improving the latency and throughput characteristics of Hadoop workloads, so stay tuned!
  • Platform Services
    Addressing business continuity needs such as high availability, data mirroring, replication, and snapshots is critical to the mainstream enterprise.  We continue to invest aggressively in these areas across BOTH the stable Apache Hadoop 1.x line and the emerging Apache Hadoop 2.0 line. And we are also working with others on some interesting incubating technologies aimed at ensuring consistent and secure access to Hadoop services in order to address the critical security needs of enterprises, so we’ll have more to say there soon too!
  • Data Services
    Enabling Hadoop to exchange data with other systems is important, as is improving the performance and simplifying data access for end users of the data.  Apache HCatalog is an incubator project we sponsored in 2011 that is increasingly at the heart of solution architectures that require consistent table access to Hadoop data. Our focus has recently turned towards the need for “more SQL and better performance” for the large community of Apache Hive users. Over the coming weeks, I encourage you to take a look at the work happening in the Hive community to see how those needs are being addressed. Exciting work!
  • Operational Services
    We feel strongly that easy management and monitoring of Hadoop clusters should not be a commercial holdback: it is a core requirement of any Hadoop implementation and should be delivered in the open.  Apache Ambari was established about a year ago to enable operators to manage Hadoop clusters with familiar and easy to use tools. Ambari is as much an operational fabric with complete REST APIs as it is a tool for managing Hadoop clusters. If you need to integrate Ambari with your own “pane of glass”, then you can do so. If you want a modern user interface to simplify Hadoop management, then Ambari has that as well.
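
To make the “pane of glass” integration point a little more concrete, here is a minimal sketch of how an external dashboard might ask Ambari which clusters it manages. This is an illustration under stated assumptions rather than anything taken from this post: the endpoint path `/api/v1/clusters`, the default port 8080, HTTP basic authentication, and the `items` / `Clusters` / `cluster_name` response fields reflect the Ambari REST API as generally documented and may differ between versions, and the host name, credentials, and sample payload are hypothetical.

```python
import base64
import json
import urllib.request


def build_clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for Ambari's cluster listing.

    Assumption: the server exposes /api/v1/clusters over plain HTTP and
    accepts HTTP basic authentication.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


def cluster_names(payload):
    """Extract cluster names from a decoded cluster-listing response body."""
    return [item["Clusters"]["cluster_name"] for item in payload.get("items", [])]


# Hypothetical response body, shaped like the assumed cluster listing.
SAMPLE = json.loads('{"items": [{"Clusters": {"cluster_name": "demo"}}]}')

if __name__ == "__main__":
    req = build_clusters_request("ambari.example.com", "admin", "admin")
    print(req.full_url)
    print(cluster_names(SAMPLE))
```

Against a live server, the request object could be passed to `urllib.request.urlopen` and the body decoded with `json.load` before calling `cluster_names`; the sketch stops short of the network call so it can run anywhere.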

Applying Enterprise Rigor to Open Source

Today, eight years into its development, there are numerous open source projects that augment core Hadoop to address these critical operational, data and platform requirements.  Hortonworks Data Platform (HDP) packages up a dozen or so distinct open source projects into a single integrated distribution that provides the enterprise services businesses can rely on.  Not only do Hortonworkers play key roles in the test and release process for each of those various projects, but we also take great pains to test and certify a consolidated distribution on large and complex clusters running across a range of operating platforms.

In fact, before we release any version of HDP, we first work with our colleagues at Yahoo! to test it at scale on their infrastructure – every time.  This means that by the time HDP sees any customer environment it has been validated at Yahoo!, which has arguably the richest test suite for Hadoop on the planet. Case in point – with help from Yahoo!, YARN has been significantly battle-tested – to the tune of nearly 14 million applications and 80,000 jobs per day per cluster.

Good for the ecosystem

Our mission when we started Hortonworks was to accelerate the adoption of Hadoop by providing a 100% open source, enterprise grade distribution in order to provide a truly open platform. The key reason partners such as Microsoft and Teradata choose Hortonworks as their strategic partner for Hadoop is this: our engineers are committed to working within the 100% open source Apache Software Foundation projects with no commercial holdbacks.  This is really in contrast to other vendors who are taking a proprietary approach that can lead to closed interfaces and vendor lock-in.

And we ensure that the work we do with our partners makes it back into the community.  For instance, our work on the Apache HCatalog project has been adopted and extended by Teradata with their SQL-H offering.  And we have worked extensively with Microsoft to enable Hadoop to run on Windows, and contributed this work back to the broad community so that others can pick up and continue the work in ways that benefit everyone. Even better, it is really great to see partners like Microsoft contribute significantly to the open-source project to ensure Apache Hadoop is fully supported on key platforms like Microsoft Azure – another illustration of the rising tide that is the open-source model.

Good for Hortonworks

We are pretty passionate about the journey we are on.  By staying true to our 100% open source philosophy and applying Enterprise software rigor to the test and release process, we believe that we can accelerate the adoption of Hadoop in the ecosystem.

We love what we are doing, are committed to the approach, and can’t wait to see what the next chapter brings.

The Road Ahead for Hortonworks and Hadoop

I recently delivered a webinar entitled “Hortonworks State of the Union”. For those new to Apache Hadoop, I covered a brief history of Hadoop and Hortonworks’ role within the open source community. We also covered how the platform services, data services, and operational services required to enable Hadoop as an enterprise-viable platform evolved in 2012.

Finally, we discussed the important progress made on deeply integrating Hadoop within next-generation data architectures in a way that makes sense for the enterprise. Our partnership with Teradata provides a great example of how deep integration of BOTH the data services (via Apache HCatalog) AND the operational services (via Apache Ambari’s REST APIs) can deliver value in a way that addresses mainstream enterprise needs while preserving existing investments.

What’s next?

If 2012 was a big year for Hadoop and big data, then 2013 should be HUGE.

As we enter 2013, I believe Hadoop has “crossed the chasm” from a framework for early adopters and technology enthusiasts to a strategic data platform embraced by early majority and pragmatic adopters. CTOs and CIOs across mainstream enterprises want to improve the performance of their companies and unlock new business opportunities, and they realize that including Hadoop as a deeply integrated “plus 1” to their data architectures provides them the fastest path to their goals while maximizing their existing investments.

The other side of the chasm is where vertical solutions (or “bowling pins” as Geoffrey Moore refers to them in his book) emerge in earnest. While we, Hortonworks, are interested in serving the needs of these vertical solutions, as an open source software infrastructure company we are keenly interested in identifying and enabling horizontal patterns of use that unlock Hadoop’s value for the widest range of use cases.

Refine, Explore, Enrich

This graphic illustrates the Refine, Explore, and Enrich patterns of use that we have seen emerge in the market:

  • Refine is about capturing all sorts of data sources into a platform where that data can then be refined into formats that are more easily shared with downstream systems such as a Data Warehouse.
  • Explore is about interactively surfing through these new lakes of data and unlocking opportunities for business value through the use of new and existing Business Intelligence (BI) tools.
  • Enrich is about creating and deploying advanced analytics in a way that makes online applications, such as mobile commerce applications, more “intelligent” with respect to the experience delivered.
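As a rough illustration of the Refine pattern only, the sketch below turns raw, loosely structured log lines into tabular rows that a downstream system could load. It is plain Python standing in for what would normally be a Hadoop job; the log format, the field names, and the `refine` and `to_csv` helpers are all hypothetical and do not come from any Hortonworks or Hadoop API.

```python
import csv
import io
import re

# Hypothetical raw log format: <ip> <timestamp> "<method> <path>" <status>
LINE_RE = re.compile(r'^(\S+) (\S+) "(\S+) (\S+)" (\d{3})$')
FIELDS = ["ip", "timestamp", "method", "path", "status"]


def refine(raw_lines):
    """Parse raw lines into structured rows, skipping lines that do not match."""
    rows = []
    for line in raw_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # malformed input is dropped rather than passed downstream
        ip, timestamp, method, path, status = match.groups()
        rows.append({
            "ip": ip,
            "timestamp": timestamp,
            "method": method,
            "path": path,
            "status": int(status),
        })
    return rows


def to_csv(rows):
    """Serialize refined rows as CSV text, a format a warehouse loader could ingest."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


raw = [
    '10.0.0.1 2013-01-15T10:00:00Z "GET /cart" 200',
    'corrupted line',
    '10.0.0.2 2013-01-15T10:00:05Z "POST /checkout" 500',
]
refined = refine(raw)
print(to_csv(refined))
```

The point of the sketch is the shape of the work: messy input goes in, unusable records are filtered out, and what remains is in a format that existing downstream systems already understand.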

The key point to reiterate is that Hadoop is an important “plus 1” in next-generation data architectures powering these use cases.

So What’s in Store for 2013?

Our focus from 2012 continues into 2013: a) make Hadoop an enterprise-viable platform that’s easy for the enterprise to use and consume while b) ensuring the platform is interoperable with the broader data ecosystem. With that said, I outlined a range of initiatives that we, Hortonworks, will be focused on in our efforts within the open source community: Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase). We will be working in other areas, of course, but these are the key focus areas that our enterprise customers are interested in.

Since the topic of Interactive Query is fairly popular these days, let me share some quick thoughts. Over the past few years, Apache Hive has matured into the de facto SQL interface to Hadoop data. Many of the top BI vendors support Hive today, and based on our customer interactions, more than 50% of Hadoop use cases depend on Hive for operational data processing and BI. That said, Hive needs work to support human interactive BI use cases such as visualization and parameterized reporting.

Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Over the coming weeks, we will roll out webinars and blog posts that cover each of our initiatives in more detail. Also, we expect to demonstrate some of the fruits of the labor at the Hadoop Summit in Amsterdam in March.

2013 should prove to be a fun and productive year!

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness

Over the course of 2012, through Hortonworks’ leadership within the Apache Ambari community, we have seen the rapid creation of the enterprise-class management platform that Apache Hadoop needs in order to be an enterprise-viable data platform. Hortonworks engineers and the broader Ambari community have been working hard on their latest release, and we’d like to highlight the exciting progress made on Ambari, a 100% open and free solution that delivers the features required of an enterprise-class management platform for Apache Hadoop.

Why is the open source Ambari management platform important?

For Apache Hadoop to be an enterprise-viable platform, it not only needs the Data Services that sit atop core Hadoop (such as Pig, Hive, and HBase), but it also needs the Management Platform to be developed in an open and free manner. Ambari is a key operational component within the Hortonworks Data Platform (HDP), which helps make Hadoop deployments for our customers and partners easier and more manageable.

Stability and ease of management are two key requirements for enterprise adoption of Hadoop, and Ambari delivers on both. Moreover, the rate at which this project is innovating is very exciting. In under a year, the community has accomplished what has taken years to complete for other solutions. As expected, the “ship early and often” philosophy demonstrates innovation and helps encourage a vibrant and widespread following.

Recent and exciting enhancements to Apache Ambari include:

  • Simplified cluster provisioning with a step-by-step install wizard
  • Pre-configured key operational metrics for instant insight into the health of Hadoop Core (Hadoop Distributed File System and MapReduce) and related projects such as HBase, Hive and HCatalog
  • Visualization and analysis of job and task execution to gain a better view into dependencies and performance
  • A complete RESTful API for exposing monitoring information and integrating with existing operational tools
  • An intuitive user interface that makes viewing information and controlling a cluster easy and productive
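To make the RESTful API bullet above more concrete, here is a minimal sketch of how an operational tool might prepare a monitoring request. The post only says that Ambari exposes a RESTful API; the host name, port 8080, the `/api/v1/` path prefix, the `clusters` resource, the credentials, and the `build_ambari_request` helper are all assumptions made for illustration and should be checked against the documentation for the Ambari release in use. The request is built but deliberately not sent.

```python
import base64
import urllib.request


def build_ambari_request(host, path, user, password, port=8080):
    """Build (but do not send) a GET request for an Ambari-style REST resource.

    The port and the /api/v1/ prefix are assumptions, not taken from the post.
    """
    url = f"http://{host}:{port}/api/v1/{path.lstrip('/')}"
    # HTTP Basic authentication: base64 of "user:password"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Hypothetical host and credentials, for illustration only.
req = build_ambari_request("ambari.example.com", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
```

An existing monitoring tool could pass a request like this to `urllib.request.urlopen` and parse the JSON response, which is the kind of integration the bullet describes.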

Hortonworks Data Platform is all about enterprise-ready Hadoop, and Ambari is a key project included in our distribution. Our focus as an organization is to innovate throughout all of the Hadoop-related projects and then package the most stable and enterprise-ready components into HDP. Ambari is an important component for users of HDP who are betting their business on Hadoop.

The Ambari project is a perfect example of what is important to us. First and foremost, we are focused on a 100% open source development and delivery model for HDP and second, we are dedicated to making sure Hadoop is reliable and can be trusted by the enterprise and our ecosystem of partners.

We are committed to the mission that HDP is the MOST stable, reliable and enterprise-ready Apache Hadoop distribution available. And that is why we invest in community-driven and enterprise-focused projects such as Ambari.

 

To learn more about Apache Ambari and the latest project updates, or to download the source code, visit the Apache Ambari home page: http://incubator.apache.org/ambari/

Enabling Big Data Insight for Millions of Windows Developers

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

One of the world’s most widely used business intelligence tools is Microsoft Excel, and Microsoft is arguably one of the best companies at enabling and mobilizing large and vibrant developer communities. That is why we at Hortonworks are excited and bullish about the expansion of our partnership with Microsoft.

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premises and cloud deployments that is fully compatible with Apache Hadoop.

This news represents a significant inflection point for the big data market in general and for the importance of open source Apache Hadoop in particular. Unlocking the Windows Server and Windows Azure markets for Hadoop means more businesses will be able to tap into its benefits.

Moreover, these new offerings represent months of joint engineering work across both the Microsoft and Hortonworks engineering and product teams. Microsoft’s commitment to doing this work in a way that improves open source Apache Hadoop and related Apache projects has been unwavering, which translates into goodness for the open source community.

I encourage you to try out the fruits of our labors in one of two ways:

• Download Microsoft HDInsight Server and play with Hadoop on your own Windows machine.
• Access Windows Azure HDInsight Service and play with Hadoop in the cloud.

To learn more and get started, go to http://hortonworks.com/partners/microsoft/

Finally, check out Microsoft’s announcement for more information! http://blogs.technet.com/b/dataplatforminsider/archive/2012/10/22/simplifying-big-data-for-the-enterprise.aspx

Balancing Community Innovation and Enterprise Stability

Having worked at JBoss and Red Hat from 2004 to 2008 and SpringSource and VMware from 2008 to 2011, I’ve been focused on the world of open source software for a long while. I’ve been blessed to be able to serve enterprise customer needs with high quality open source software such as JBoss Application Server, Hibernate, Drools, Apache Web Server, Apache Tomcat, Spring … and now Apache Hadoop.

As specific open source technologies mature and their use becomes mainstream, it becomes increasingly important to understand and communicate the balancing act that needs to happen between community innovation and enterprise stability.

Community innovation needs to have a fast pace, where “ship early and often” is a key tenet. Open source projects need to visibly improve and keep innovating if they are to attract a vibrant following. As an open source project’s community grows, its members will expect big improvements and will be fine with early, buggy releases, etc. After all, that’s part of the process.

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each technology, new and existing alike, is pushing the meme of “all your data are belong to us”. It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general can use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled 7 Key Drivers for the Big Data Market, I asserted that the Big Data movement is not only about the classic world of transactions, but it factors in the new(er) worlds of interactions and observations. This new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

7 Key Drivers for the Big Data Market

I attended the Goldman Sachs Cloud Conference and participated on a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of questions, but kicked off covering two basic questions:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:

Solving the Data Problem in a Big Way

I recently joined Hortonworks as VP of Corporate Strategy, and I wanted to share my thoughts as to what attracted me to Hortonworks.

For me, it’s important to 1) work with a top-notch team and 2) focus on unique market-changing business opportunities.

Hortonworks has a strong team of technical founders (Eric14, Alan, Arun, Devaraj, Mahadev, Owen, Sanjay, and Suresh) doing impressive work within the Apache Hadoop community. Hortonworks also has an impressive Board of Directors that includes folks like Peter Fenton, Mike Volpi, Jay Rossiter, Rob Bearden, as well as our most recent board member Paul Cormier (Red Hat’s President of Products and Technology).
