Posts by Eric Baldeschwieler:


Thinking about the HDFS vs. Other Storage Technologies

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS.  Here is why…

To compare HDFS to other technologies, one must first ask what HDFS is good at:

  • Extreme low cost per byte
    HDFS uses commodity direct attached storage and shares the cost of the network & computers it runs on with the MapReduce / compute layers of the Hadoop stack. HDFS is open source software, so an organization that chooses to can use it with zero licensing and support costs. This cost advantage lets organizations store and process orders of magnitude more data per dollar than traditional SAN or NAS systems, which is the price point of many of these other systems.  In big data deployments, the cost of storage often determines the viability of the system.
  • Very high bandwidth to support MapReduce workloads
    HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per computer into the MapReduce layer, on a very low cost shared network. Hadoop can go much faster on higher speed networks, but 10GigE, IB, SAN and other high-end technologies double the cost of a deployed cluster. These technologies are optional for HDFS.  2+ gigabits per second per computer may not sound like a lot, but across the thousands of computers in today’s large Hadoop clusters it means they can easily read/write more than a terabyte of data per second continuously to the MapReduce layer.
  • Rock solid data reliability
    When deploying large distributed systems like Hadoop, the laws of probability are not on your side. Things will break every day, often in new and creative ways.  Devices will fail and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast. It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases.  The only way you can prove a system is reliable is to run it for years against a variety of production applications at full scale.  Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and governments.
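
To make the bandwidth arithmetic above concrete, here is a rough back-of-the-envelope sketch. It takes only the 2 gigabits per second per computer figure from the text; the decimal units (1 gigabit = 10^9 bits, 1 terabyte = 10^12 bytes), the function names and the 4,000-node cluster size are assumptions chosen for illustration, not measurements.

```python
import math

BITS_PER_GIGABIT = 10**9    # assumption: decimal gigabits
BYTES_PER_TERABYTE = 10**12  # assumption: decimal terabytes


def aggregate_bytes_per_second(nodes, gigabits_per_node=2.0):
    """Total bytes/s delivered if every node sustains the per-node rate."""
    return nodes * gigabits_per_node * BITS_PER_GIGABIT / 8


def nodes_for_target(target_bytes_per_second, gigabits_per_node=2.0):
    """Smallest node count whose combined rate reaches the target."""
    per_node_bytes = gigabits_per_node * BITS_PER_GIGABIT / 8
    return math.ceil(target_bytes_per_second / per_node_bytes)


if __name__ == "__main__":
    # Hypothetical cluster size, used only to illustrate the scaling.
    cluster_nodes = 4000
    total = aggregate_bytes_per_second(cluster_nodes)
    print(f"{cluster_nodes} nodes: {total / BYTES_PER_TERABYTE:.2f} TB/s")
    print("nodes needed for 1 TB/s:", nodes_for_target(BYTES_PER_TERABYTE))
```

Under these assumptions each computer contributes 250 megabytes per second, so a terabyte per second needs on the order of a few thousand computers, which is why the per-node figure only looks modest in isolation.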

How does the HDFS competition stack up?  
This is an article about Hadoop, so I’m not going to call out the other systems by name, but I assert that each of the systems listed in the “8 ways” article falls short of Hadoop in at least one of the above dimensions. Let me list some of the failure modes:

  • System not designed for Hadoop’s scale
    Many systems simply don’t work at Hadoop scale. They haven’t been designed or proven to work with very large data or many commodity nodes. They often will not scale up to petabytes of data or thousands of nodes. If you have a small use-case and value other attributes, such as integration with existing apps in your enterprise, maybe this is a good trade-off, but something that works well in a 10 node test system may fail utterly as your system scales up. Other systems don’t scale operationally or rely on non-scalable hardware. Traditional NAS storage is a simple example of this problem. A NAS can replace Hadoop in a small cluster. But as the cluster scales up, cost and bandwidth issues come to the fore.
  • Systems that don’t use commodity hardware or open source software
    Many proprietary software / non-commodity hardware solutions are well tested and great at what they were designed to do. But, these solutions cost more than free software on commodity hardware. For small projects, this may be ok, but most activities have a finite budget and a system that allows much more data to be stored and used at the same cost often becomes the obvious choice. The disruptive cost advantage of Hadoop & HDFS is fundamental to the current success and growing popularity of the platform. Many Hadoop competitors simply don’t offer the same cost advantage.  Vendor price lists speak for themselves in this area (where the prices are even published).
  • Not designed for MapReduce’s I/O patterns
    Many of these systems are not designed from the ground up for Hadoop’s big sequential scans & writes.  Sometimes the limitation is in hardware. Sometimes it is in software. Systems that don’t organize their data for large reads cannot keep up with MapReduce’s data rates. Many databases and NoSQL stores are simply not optimized for pumping data into MapReduce.
  • Unproven technology
    Hadoop is interesting because it is used in production at extreme scale in the most demanding big data use cases in the world. As a result thousands of issues have been identified and fixed. This represents several hundred person-centuries of software development investment. It is easy to design a novel alternative system. A paper, a prototype or even a history of success in a related domain or a small set of use cases does not prove that a system is ready to take on Hadoop. Tellingly, along with listing some new and interesting systems, the “8 ways” article says goodbye to some systems that have previously been considered HDFS contenders by vocal advocates. I’ve got a rolodex full of folks who used to work on such systems who are now major players in the Apache Hadoop community.

It is easy to find example use cases where some other storage system is a better choice than Hadoop. But I assert that HDFS is the best system available today to do exactly what it was built for: being Hadoop’s storage system. It delivers rock solid data reliability and very high sequential read/write bandwidth, at the lowest cost possible. As a result, HDFS is, and I predict will remain, THE storage infrastructure for the vast majority of Hadoop clusters.

~E14

Happy Birthday Hortonworks!

Last week was an important milestone for Hortonworks: our one year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. I returned to this point often throughout the manifesto. We committed to working with the community to accelerate the development and adoption of Apache Hadoop and we absolutely delivered on this promise. Over the past year, Apache Hadoop released Hadoop-1.0, the most stable line of Apache Hadoop ever. Hadoop-2.0, including the next-generation architectures for both MapReduce and HDFS, was also released in alpha form. Apache Hadoop continues to gain momentum as proven by every important metric (downloads, web traffic, press & analyst coverage, conference and Meetup attendance, etc.). It was a banner year for Apache Hadoop and we are proud to have played an important role in making it happen.

We are “committed to open source” and commit that “all core code will remain open source”. This commitment is as solid today as it was a year ago. All code developed by Hortonworks has been contributed back to open source. In addition to our significant contributions to core Hadoop projects (MapReduce and HDFS), we have also made significant contributions to other Hadoop ecosystem projects including Ambari, HCatalog, Pig and ZooKeeper. We will continue to be a leader in the Hadoop community process and will continue to contribute all of our Hadoop development efforts back into the Apache community development process.

We will “make Apache Hadoop easier to install, manage and use”. This was a key focus for Hortonworks over the past year. We quickly learned that it would be beneficial to the market to offer a Hortonworks distribution of Apache Hadoop that delivered on this promise. Hortonworks Data Platform, which we recently made available to the entire ecosystem, addresses each of these areas. We have included an installer that greatly simplifies the installation process for Apache Hadoop. We included, for the first time, Apache Ambari, which allows organizations to manage and monitor their Hadoop clusters. We also tightly integrated Hortonworks Data Platform with Talend Open Studio for Big Data, which provides a visual design environment for connecting Hadoop with hundreds of enterprise data systems in order to make Hadoop easier to use. The result is a greatly simplified process for organizations that are looking for a pure Apache Hadoop distribution.

We will “make Apache Hadoop more robust”. Again, I’m pleased that we delivered on this promise. We were instrumental in the re-architectures of MapReduce and HDFS to address the enterprise needs of each of these core components. Our team has written a number of blogs and presentations on these topics that I strongly recommend you read if you haven’t already. Among the most significant are the following: NextGen MapReduce presentation, NextGen MapReduce Hits Mainline, Delivering on Hadoop .NEXT, Benchmarking Performance, Apache Hadoop 2.0 (Alpha) Released, Data Integrity and Availability in Apache Hadoop HDFS, An Introduction to HDFS Federation, NameNode HA Reaches an Important Milestone, Snapshots for HDFS and High Availability and Hadoop 1.0 – Perfect Together. The last post covers the ability to add new HA capabilities to the stable and proven Hadoop-1.0 line.

We will “make Apache Hadoop easier to integrate and extend”. We have made some important advancements in this area that may have gone unnoticed. Much of this work is related to HCatalog, an Apache project that provides a metadata and table management system for Hadoop. We feel strongly that HCatalog is the preferred path for simplifying data sharing between Hadoop and other enterprise data systems and have invested heavily into advancing this project and related APIs for HCatalog. By tightly integrating Talend Open Studio for Big Data, we have also made it much easier for a broader audience to integrate Hadoop with hundreds of existing data systems. We have also formed important partnerships with leaders such as Microsoft and Teradata to ensure that their platforms and applications are tightly integrated and optimized to work with Apache Hadoop.

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. Over the past year, we have made available Hortonworks University, an exceptional Hadoop training program for developers, administrators and analysts; and Hortonworks Services, which leverages the deep domain experience of the Hortonworks technical staff to provide technical support to enterprises, systems integrators and technology vendors. Our training courses, in particular, have been very well received by students who have continually praised our hands-on lab exercises as the best in the industry. We have recently expanded our training schedule, so check it out.

There were certainly many other notable achievements over the past year, including:

  • The Hortonworks team grew significantly and now numbers around 90 people. We are hiring too!
  • We established partnerships with major enterprise software vendors including Microsoft and Teradata that are changing the way Hadoop will be consumed.
  • We hosted the 5th annual Hadoop Summit with great success and rave reviews and over 2250 attendees.

As you can see, we are very proud of our accomplishments in our first year. We were also glad to be recognized by Forrester as a leader in the Forrester Wave on Enterprise Hadoop Solutions. Really, how often do companies get recognized as leaders by Forrester in their very first year of existence?

While this blog took a look back at last year, stay tuned for another blog that looks forward to what we have planned for year two.

~ E14

 

High Availability and Hadoop 1.0 – Perfect Together

In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be buggy and functionally incomplete at first but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past 3 years and include such things as:

  1. Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
  2. New HDFS pipeline to support append and flush,
  3. HDFS federation and performance improvements that enable Hadoop to scale to thousands more nodes in a cluster, and
  4. High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.

In the case of high availability (HA), it can take many months or years to get these types of solutions rock solid. While Hadoop 2.0 contains important HA-related features such as HDFS hot standby, we want to make sure we give it time to complete its community release process and allow extra time after that for bugs to be found and fixed to harden it for broad enterprise production use.

Hortonworks Welcomes Citrix and CloudStack to the Apache Community

We are pleased to support today’s announcement from Citrix that they have contributed CloudStack to the Apache community. For those new to CloudStack, it is open source cloud computing software that helps organizations build and manage cloud infrastructures. It is similar to the Amazon Web Services EC2 environment except that it enables organizations to build public, private or hybrid cloud environments using their own pooled computing resources.

Citrix announced today that they were reaffirming their commitment to open source by working with the Apache Software Foundation to make CloudStack 3 an Apache project, released under the Apache License 2.0. This is yet further acknowledgement that Apache is the logical home for open source projects that are transforming the enterprise software industry. As a Gold Sponsor of the ASF and major contributor to Apache projects, Hortonworks is pleased that leading vendors such as Citrix are recognizing the value that Apache can provide in terms of accelerating development and innovation and driving adoption as the preferred destination for enterprise-class open source software.

Announcing the Hadoop Summit Community Choice Winners

Thank you to the community members who cast over 8,000 votes during the Hadoop Summit Community Choice voting process. The turnout far exceeded our expectations and is further evidence that the momentum behind Apache Hadoop has never been stronger.

As we announced, the sessions with the most votes in each track are automatically accepted into the Hadoop Summit agenda. As such, I am pleased to announce the winners of the Hadoop Summit Community Choice vote and the first confirmed sessions in the Hadoop Summit program:

Future of Apache Hadoop track: Dynamic Namespace Partitioning with Giraffa File System, Konstantin Shvachko (eBay)

Deployment and Operations track: Dynamic Reconfiguration of Apache Zookeeper, Alexander Shraer and Benjamin Reed (Yahoo!)

Enterprise Data Architecture track: iMStor: Hadoop Storage-based Tiering Platform, Vishal Malik (Cognizant Technology Solutions)

Applications and Data Science track: Hadoop & Cloud @Netflix: Taming the Social Data Firehose, Mohammad Sabah (Netflix)

Analytics and Business Intelligence track: Mapping and Reducing Passenger Turbulence using Big Data, Farhan Hussain and Saad Patel (Open Source Architect)

Hadoop in Action track: The Merchant Lookup Service at Intuit, Vrushali Channapattan (Intuit)

Hadoop Summit Community Choice

As I first mentioned when we announced Hadoop Summit 2012, we are focused on making Hadoop Summit the preeminent conference for the Apache Hadoop community. Today I’m pleased to tell you about Community Choice, a public online voting system that enables the entire Apache Hadoop community to have a say in the sessions chosen for Hadoop Summit. Anybody can vote and the top vote getters in each track will automatically be included in the Hadoop Summit agenda.

One of the things you will notice when you vote is the large number of abstracts that were submitted for the conference. In fact, there were 267 submissions for Hadoop Summit, more than double the number of submissions from last year’s highly successful event. There are six tracks, each of which has a wide selection of compelling topics. Another interesting fact is that there were submissions from 120 different organizations (companies, universities and government agencies). It’s becoming even clearer that Apache Hadoop is having a significant impact in the data industry.

In addition to Community Choice, there is also a content selection committee in place that will identify the other sessions for Hadoop Summit. This is also a community effort. The content selection committee is made up of 36 leaders from the ecosystem representing 27 different organizations (vendors, end users and universities). The committee is hard at work reviewing sessions and we expect to be able to publish the final agenda before the end of March.

Please remember to vote in the Community Choice process. If you ever wanted to have input into a conference, this is your chance. Voting ends March 20th, so please vote today.

~E14

Open Source Data Integration for Apache Hadoop

Today we announced an important strategic partnership with Talend, provider of the world’s most popular open source data integration platform. This is another win for both Hortonworks customers and the larger Apache Hadoop community. There were two key aspects of the announcement that I wanted to highlight:

Talend releases Talend Open Studio for Big Data

Based upon Talend’s very popular open source data integration platform, Talend Open Studio for Big Data adds connectors for HDFS, HBase, Pig, Sqoop and Hive. It allows organizations to move data into and out of Hadoop much more easily. It also leverages the MapReduce architecture to generate native Hadoop code and run data transformations directly inside Hadoop, in a highly scalable fashion. Talend Open Studio for Big Data will also be released with Apache licensing, which is a good match for the Apache Hadoop community.

Extending Apache Hadoop to Millions of New Microsoft Users

Today we announced that we were delivering on our earlier promise to help Microsoft bring Apache Hadoop to Windows. I’m pleased to share that Microsoft, with our collaboration and guidance, has now submitted a series of patches to Apache aimed at overcoming the challenges of running Apache Hadoop in Windows Server environments.

These patches, once vetted and approved by the community, will become part of the core Hadoop code base. They will also become available in the two major Apache Hadoop branches: hadoop-1.0 (the current stable branch, which is available as part of Hortonworks Data Platform v1.0) and hadoop-0.23 (the next generation of Apache Hadoop, which will be available as part of Hortonworks Data Platform v2.0).

Reaffirming our Commitment to 100% Pure Open Source

I’ve been surprised by a couple of recent articles highlighting our recent leadership change. These articles imply that our business model may be changing. Let me be clear: WE ARE NOT CHANGING OUR BUSINESS MODEL. We are committed to providing training and support of a 100% open source distribution of Apache Hadoop and related projects.

What has changed?

Rob Bearden has agreed to take on the role of CEO. I am moving from CEO to the role of CTO.

Hortonworks Recognized as a Leader in Forrester Wave Report

I am pleased to report that Hortonworks has been named a leader in the recently released Forrester Wave report on Enterprise Hadoop Solutions. We scored well across all three rating areas: current offering, market presence and strategy.

We appreciate the recognition, particularly this sentence that highlighted our role in the marketplace: “(Hortonworks) is the technology leader and ecosystem building for the entire Hadoop industry and has recently released its Hortonworks Data Platform, which incorporates purely open-source Apache Hadoop software.”

Being named a Leader in the Forrester Wave on Enterprise Hadoop Solutions is one of many achievements for Hortonworks over the past seven months (stay tuned for a blog on this topic). While we are proud of our past, we are much more focused on our future. We know that we must continue to drive innovation and work with the community to deliver high-quality Apache Hadoop releases. It’s important that the Apache Hadoop core remains strong in order to avoid forking. A strong code base, rapid innovation and a vibrant ecosystem will ensure Apache Hadoop remains unified and well positioned to become the foundation for the next generation data platform. This has always been our focus and we appreciate Forrester’s recognition for this commitment.

~E14

Paul Cormier Joins Hortonworks Board of Directors

I am pleased to announce that Paul Cormier has joined the Hortonworks Board of Directors. Paul is currently President, Products and Technologies at Red Hat, where he leads the company’s engineering and products business units. Paul has an exceptional background in building enterprise-class open source software. He also has helped Red Hat achieve tremendous revenue growth by enabling a rich ecosystem of partners. We are pleased to add such a talented and experienced open source professional to our board. His insights and guidance will play an important role in helping Hortonworks achieve our stated objective of enabling Apache Hadoop to become the foundation for the next generation enterprise data platform.

Welcome Paul!

~E14

Hadoop Summit 2012 is Coming

Hi Folks,

I’m happy to report that Hadoop Summit will be back for its 5th year. This year, Hortonworks and Yahoo are jointly hosting the conference, which will take place on June 13th and 14th at the San Jose Convention Center.

This year’s event promises to be bigger and better than ever. We have extended the conference to a second day, added additional session tracks and expect to showcase even more compelling and useful presentations. You will be really impressed when you see what we have planned.

Delivering the Next Generation of Apache Hadoop

Today we announced our plans to release a public preview of the Hortonworks Data Platform (HDP) version 2. HDP v2 will leverage Apache Hadoop 0.23, which is the first major update to Hadoop in more than three years. Among other advancements, HDP v2 will include the NextGen MapReduce architecture, HDFS NameNode HA and HDFS Federation. It will also include the most up-to-date stable components, including HCatalog, HBase, Hive and Pig, all fully integrated and tested at scale.

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop and is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available in the second half of 2012.

Shaun Connolly Joins Hortonworks

I’m pleased to announce that Shaun Connolly has joined our executive management team as VP of Corporate Strategy. Shaun is a veteran enterprise software and open source executive who comes to us from VMware and previously held positions at SpringSource and JBoss.

As VP of Corporate Strategy, Shaun will be responsible for helping us to achieve our business objectives by guiding corporate strategy and identifying new market opportunities for Apache Hadoop.  Shaun will also play a critical role in helping us position and grow the Hortonworks Data Platform (HDP) as a next-generation enterprise data management solution, helping organizations maximize the value from the wealth of data flowing throughout their enterprise.

Welcome aboard Shaun!

~E14

Good Times at ApacheCon 2011

I spent some time last week at ApacheCon NA 2011 in Vancouver, BC. It was a good experience and I enjoyed catching up with friends and colleagues involved in the Hadoop project and also meeting some of the executives of the Apache Software Foundation in person. It is clear that the Apache community is thriving and that interest in Hadoop remains very high.

Hortonworks is committed to supporting Apache and we are pleased to have been a gold sponsor of this event. I delivered the day two keynote at ApacheCon on the success of Apache Hadoop. To view my presentation please visit Slideshare.net.

~E14
@jeric14, @hortonworks 
