Spotlight on the early history of Hadoop

Today, Hadoop and big data are known by just about every IT professional and business executive, but the open source analytics tool was not always such a high-profile project. Stemming from test programs aimed at improving Yahoo's search engine, the earliest iterations of what would become Hadoop arose around 10 years ago. A recent GigaOM feature celebrated the anniversary by looking back on the platform's early days.

The original project that would become Hadoop was a web indexing software called Nutch, GigaOM explained. Following the publication of Google's papers on the Google File System and MapReduce, in 2003 and 2004 respectively, Nutch's developers realized they could rebuild it on top of a distributed processing framework that spread work across tens of computers rather than running on a single machine. With the end goal of rebuilding its web search infrastructure, Yahoo used Nutch's storage and processing ideas to form the backbone of Hadoop.
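The MapReduce model described in those papers, which Hadoop went on to implement at scale, breaks a job into a map phase that emits key-value pairs, a shuffle that groups pairs by key, and a reduce phase that combines each group. A minimal single-machine sketch of the idea, using the classic word-count example (illustrative only; real Hadoop distributes each phase across many nodes and persists intermediate data):

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group emitted values by key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: combine the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 2
```

Because the mapper and reducer operate independently on their slices of the data, the same program can run unchanged whether the cluster has five nodes or, as at Yahoo, tens of thousands.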

At the time, however, Hadoop was not ready to handle the kind of workloads required to power Yahoo's search functions. Eric Baldeschwieler, co-founder and CEO of Hortonworks and formerly VP of Hadoop software development at Yahoo, told GigaOM that Hadoop only worked on five to 20 nodes in its earliest implementations. According to former Yahoo CTO Raymie Stata, building Hadoop's scalability was a lengthy process.

"It was just an ongoing slog … every factor of 2 or 1.5 even was serious engineering work," he told GigaOM.

Reaching an effective multi-node implementation
Yahoo's decision to set up a "research grid" for its data scientists helped the research team gradually scale Hadoop clusters from dozens to hundreds of nodes. By 2008, Yahoo was ready to debut Hadoop as the engine of its web search. Using a Hadoop cluster with around 10,000 nodes, the company was able to increase its search speeds. More important, though, distributing the processing workload added a level of reliability the old system had lacked.

"When it's running perfectly, the old system does outperform the new one," Baldeschwieler explained at the time. "But of course hardware fails and there are all sorts of scenarios under which the old system doesn't perform perfectly. Hadoop gives us much more flexibility. It's built around the idea of running commodity hardware that runs all the time."

In 2011, when much of the original research team formed Hortonworks, Yahoo was running its search engine across 42,000 nodes. With many more players involved in the open source project than in its early days, Hadoop continues to evolve and branch off in new directions. However, building on the initial architecture innovations of its early years, its core promises of speed and reliability for handling enormous data sets remain the same.

