At Hortonworks, our strategy is founded on an unwavering belief in the power of community-driven open source software. In the spirit of openness, we think it’s important to share our perspective on the broader context of how Apache Hadoop and Hortonworks came to be, what we are doing now, and why we believe our unique focus is good for Apache Hadoop, for the ecosystem of Hadoop users, and for Hortonworks as well.
How did we get here?
The core team here at Hortonworks started at Yahoo! where in 2005 Eric Baldeschwieler (aka “E14” and Hortonworks CTO) challenged Owen O’Malley (Hortonworks co-founder) and several others to solve a really hard problem: store and process the data on the internet in a simple, scalable and economically feasible way. They looked at traditional storage approaches but quickly realized they just weren’t going to work for the type of data (much of it unstructured) and the sheer quantity Yahoo! would have to deal with.
The team’s first reaction, as is the norm, was to lock themselves in a room and build a prototype of a closed, proprietary system. With fantastic vision and oversight from E14 and Raymie Stata (former CTO, Yahoo!), however, the team turned to the open-source community and in particular the Apache Software Foundation. This also meant growing a large development team, including Doug Cutting, Arun Murthy (Hortonworks co-founder), and others, who began to work with the community on what became known as Apache Hadoop – specifically HDFS and MapReduce.
The team quickly realized that by contributing their work to a community of like-minded individuals, they could drive the technology forward far faster. At the same time, they’d enable other organizations to realize some of the same benefits they were starting to see from their early efforts. When organizations such as Facebook, LinkedIn, eBay, Powerset, Quantcast, and others began picking up Hadoop and innovating in areas beyond the initial focus, it reinforced that the choice of community-driven open source was the right one.
A case in point: a small startup (Powerset) began a project to support tables on HDFS, inspired by Google’s BigTable paper; that effort turned into what’s now Apache HBase! Need more? Facebook started an effort to build a SQL layer on top of MapReduce, which became Apache Hive!
Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public forum, and work diligently to progress existing open source projects and incubate new ones to meet those needs.
Like anything done in a big group, it can be a challenge at times, but when it comes to platform technologies like Hadoop, community-driven open source has proven time and again that it will outpace the innovation of any single group of people or any single company.
Apache Hadoop usage at Yahoo! has grown to the point that today Hadoop is a foundational technology underlying a wide range of business-critical applications. This is captured really well by Sumeet Singh, a Director of Product Management at Yahoo!, who recently outlined just how far their journey has come.
And as the team tasked with architecting and operating that infrastructure over many of those years, our Hortonworks engineers gained critical insights that have been diligently funneled back into the community to be addressed in the appropriate place: the open source projects at the Apache Software Foundation. That process gave rise to a host of new projects that are now core to Hadoop (such as Apache Hadoop YARN, Apache HCatalog, and Apache Ambari, alongside Apache Pig, Apache Hive, Apache HBase, and many others).
What are we doing now?
After many years architecting and operating the Hadoop infrastructure at Yahoo! and contributing heavily to the open source community, E14 and 20+ Hadoop architects and engineers spun out of Yahoo! to form Hortonworks in 2011. Having seen what Hadoop could do for Yahoo!, Facebook, eBay, LinkedIn, and others, we have a singular objective: make Apache Hadoop a platform that the broader market of enterprise customers and partners can easily use and consume.
And in doing so we maintain that same unwavering view as to how to approach the challenge:
- identify and articulate the enterprise requirements within the community,
- take an active role in addressing those requirements within the community, and
- apply enterprise rigor to the build, test, and release process to ensure that the open source projects, as well as the larger product distribution we provide, are enterprise grade and interoperable with other elements in the enterprise.
To help us determine where to focus efforts, we spend a lot of time working with Hadoop users to understand the requirements for broader enterprise adoption, examples of which fall into the following categories:
- Core Apache Hadoop
Ensuring the core Apache Hadoop platform moves forward is a critical area of focus. All of the work happening on Apache Hadoop 2.0, including YARN, is aimed at ensuring Hadoop can continue to scale to meet the largest data processing needs as well as efficiently run a mix of workloads that serve batch, interactive, and online application needs. We are also working with others on some interesting incubating technologies in the community aimed at improving the latency and throughput characteristics of Hadoop workloads, so stay tuned!
- Platform Services
Addressing business continuity needs such as high availability, data mirroring, replication, and snapshots is critical to the mainstream enterprise. We continue to invest aggressively in these areas across BOTH the stable Apache Hadoop 1.x line and the emerging Apache Hadoop 2.0 line. We are also working with others on some interesting incubating technologies aimed at ensuring consistent and secure access to Hadoop services, addressing security needs that are critical to the enterprise, so we’ll have more to say there soon too!
- Data Services
Enabling Hadoop to exchange data with other systems is important, as is improving performance and simplifying data access for end users. Apache HCatalog is an incubator project we sponsored in 2011 that is increasingly at the heart of solution architectures requiring consistent table access to Hadoop data. Our focus has recently turned to the need for “more SQL and better performance” for the large community of Apache Hive users. Over the coming weeks, I encourage you to take a look at the work happening in the Hive community to see how those needs are being addressed. Exciting work!
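To make "consistent table access" concrete, here is a minimal sketch of querying HCatalog's table metadata over WebHCat, its REST front end (which typically listens on port 50111). The host, database, and table names below are hypothetical placeholders, and the sample response is a trimmed illustration of the JSON shape, not output from a real cluster:

```python
import json
from urllib.parse import urljoin

# Hypothetical WebHCat host; a real deployment would substitute its own.
WEBHCAT_BASE = "http://webhcat.example.com:50111"

def list_tables_url(base, database, user):
    """Build the WebHCat endpoint that lists the tables HCatalog knows about."""
    return urljoin(base, f"/templeton/v1/ddl/database/{database}/table?user.name={user}")

def table_names(response_text):
    """Pull table names out of a WebHCat 'list tables' JSON response."""
    return json.loads(response_text).get("tables", [])

# A trimmed sample of the JSON shape WebHCat returns when listing tables.
sample = '{"tables": ["clicks", "pageviews"], "database": "default"}'

print(list_tables_url(WEBHCAT_BASE, "default", "analyst"))
print(table_names(sample))  # ['clicks', 'pageviews']
```

Because Pig, Hive, and MapReduce jobs all resolve tables through this same catalog, any tool that speaks to it sees the same schemas, which is the point of putting HCatalog at the center of a solution architecture.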
- Operational Services
We feel strongly that easy management and monitoring of Hadoop clusters should not be a commercial holdback: it is a core requirement of any Hadoop implementation and should be delivered in the open. Apache Ambari was established about a year ago to enable operators to manage Hadoop clusters with familiar, easy-to-use tools. Ambari is as much an operational fabric, with complete REST APIs, as it is a tool for managing Hadoop clusters. If you need to integrate Ambari with your own “pane of glass,” you can do so. If you want a modern user interface to simplify Hadoop management, Ambari has that as well.
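As a rough sketch of what integrating your own "pane of glass" looks like, the snippet below targets Ambari's resource for listing clusters (`/api/v1/clusters`). The host, port, and cluster name are hypothetical, and the sample response is a trimmed illustration of the JSON shape rather than real cluster output:

```python
import json
from urllib.parse import urljoin

# Hypothetical Ambari server; 8080 is a common default port.
AMBARI_BASE = "http://ambari.example.com:8080"

def clusters_url(base):
    """Build the Ambari REST endpoint that lists managed clusters."""
    return urljoin(base, "/api/v1/clusters")

def cluster_names(response_text):
    """Extract cluster names from an Ambari /clusters JSON response."""
    doc = json.loads(response_text)
    return [item["Clusters"]["cluster_name"] for item in doc.get("items", [])]

# A trimmed sample of the JSON shape Ambari returns from /api/v1/clusters.
sample = """
{
  "items": [
    {"href": "http://ambari.example.com:8080/api/v1/clusters/prod",
     "Clusters": {"cluster_name": "prod", "version": "HDP-1.3.0"}}
  ]
}
"""

print(clusters_url(AMBARI_BASE))  # http://ambari.example.com:8080/api/v1/clusters
print(cluster_names(sample))      # ['prod']
```

A monitoring dashboard would issue an authenticated HTTP GET against that URL and render the parsed names; the same pattern extends to hosts, services, and alerts under the same API root.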
Applying Enterprise Rigor to Open Source
Today, eight years into its development, there are numerous open source projects that augment core Hadoop to address these critical operational, data and platform requirements. Hortonworks Data Platform (HDP) packages up a dozen or so distinct open source projects into a single integrated distribution that provides the enterprise services businesses can rely on. Not only do Hortonworkers play key roles in the test and release process for each of those various projects, but we also take great pains to test and certify a consolidated distribution on large and complex clusters running across a range of operating platforms.
In fact, before we release any version of HDP, we first work with our colleagues at Yahoo! to test it at scale on their infrastructure – every time. This means that by the time HDP sees any customer environment it has been validated at Yahoo!, which has arguably the richest test suite for Hadoop on the planet. Case in point – with help from Yahoo, YARN has been significantly battle-tested – to the tune of nearly 14 million applications and 80,000 jobs per day per cluster.
Good for the ecosystem
Our mission when we started Hortonworks was to accelerate the adoption of Hadoop by providing a 100% open source, enterprise grade distribution: a truly open platform. The key reason partners such as Microsoft and Teradata choose Hortonworks as their strategic partner for Hadoop is this: our engineers are committed to working within the 100% open source Apache Software Foundation projects with no commercial holdbacks. This stands in contrast to other vendors who take a proprietary approach that can lead to closed interfaces and vendor lock-in.
And we ensure that the work we do with our partners makes it back into the community. For instance, our work on the Apache HCatalog project has been adopted and extended by Teradata with their SQL-H offering. And we have worked extensively with Microsoft to enable Hadoop to run on Windows, contributing this work back to the broader community so that others can pick it up and continue it in ways that benefit everyone. Even better, it is really great to see partners like Microsoft contribute significantly to the open-source project to ensure Apache Hadoop is fully supported on key platforms like Microsoft Azure – another illustration of the rising tide that is the open-source model.
Good for Hortonworks
We are pretty passionate about the journey we are on. By staying true to our 100% open source philosophy and applying enterprise software rigor to the test and release process, we believe we can accelerate the adoption of Hadoop across the ecosystem.
We love what we are doing, are committed to the approach, and can’t wait to see what the next chapter brings.