Securing Hadoop with Knox Gateway

 

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos.  Around 2008, Owen O’Malley and a team of committers led this first foray into security and today, Kerberos is still the primary way to secure a Hadoop cluster.

Fast-forward to today… Widespread adoption of Hadoop is upon us.  The enterprise has placed requirements on the platform to not only provide perimeter security, but to also integrate with all types of authentication mechanisms. Oh yeah, and all the while, be easy to manage and to integrate with the rest of the secured corporate infrastructure. Kerberos can still be a great provider of the core security technology but with all the touch-points that a user will have with Hadoop, something more is needed.

The time has come for Knox.

The only path to security in Hadoop is the community


The Knox Gateway aims to provide perimeter security that will integrate easily into existing security infrastructure.  Delivering this key component of the Apache Hadoop ecosystem is a critical community project.  Security is not an afterthought.  It needs to be woven into the very fabric of Hadoop in order to be effective. Being a part of the community will allow Knox to accomplish just that.

Already the community has rallied around the project and the vote has been positive thus far.  Tomorrow we should see community approval of a new incubation project in the Apache Software Foundation for Knox, a security layer for the Hadoop ecosystem.  The initial mentor list contains resources from Hortonworks, Microsoft and NASA among others.

What comprises the Knox Gateway?

The Knox Gateway (“Gateway” or “Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. those who access the cluster data and execute jobs) and operators (i.e. those who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.  It has a few key functions:

  • Provide perimeter security to make Hadoop security setup easier
  • Support authentication and token verification security scenarios
  • Deliver users a single cluster end-point that aggregates capabilities for data and jobs
  • Enable integration with enterprise and cloud identity management environments
  • Manage security across multiple clusters and multiple versions of Hadoop

Knox will be able to provide a security layer for multiple clusters and multiple versions of Hadoop simultaneously and will deliver a simple, intuitive management interface.  Playing nice with others is always a security imperative, so Knox will integrate with the existing frameworks for Active Directory/LDAP and will allow for extensions for custom authentication mechanisms.

Availability

The short term plan for the Knox team is to deliver a solid, working release in late March so that early adopters can begin to evaluate and provide valuable feedback.  This critical step will ensure that the gateway fits nicely into customers’ infrastructure and makes Hadoop easier to use… and more secure.

Announcing Apache Hadoop 2.0.3 Release and Roadmap

 

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers significant enhancements and improved stability over previous releases in the hadoop-2.x series. Notably, it includes:

  • QJM for HDFS HA for NameNode (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from folks at Yahoo! here)

Where is hadoop-2 and What is Left?

It is important to note that this release is still considered alpha, as there are a few items that still need to be addressed before we enter beta in the next couple of months. Most importantly, some of the APIs, particularly the HDFS & YARN protobuf-based protocols, aren’t fully baked. Also note that there are some API changes from the previous hadoop-2.0.2-alpha release and that your applications will need to be recompiled against the new hadoop-2.0.3-alpha. Please see the Hadoop 2.0.3-alpha release notes for details.

We are converging fast on ironing out the API issues (both in HDFS & YARN/MapReduce) and, currently, plan to cut a hadoop-2.0.4-beta release in the next couple of months after this effort. It also helps to have a major presence like Yahoo! test out hadoop-2 HDFS HA over the course of the coming months as they’ve noted in their blog. To this end, the code base has also gone through significant churn and as with any alpha we expect to uncover some further issues as we endure this ongoing test.

There is still a lot of work ahead of us, but we believe that hadoop-2.0.4-beta will be a major step toward a fully stable, supported hadoop-2 release. Exciting times! Stay tuned!

Acknowledgements

As always, it’s a pleasure to work with everyone in the community – thank *you*, this goes to everyone who has contributed to this release. A special mention for Todd Lipcon for his contributions to QJM for HDFS HA and the Yahoo Hadoop team (Robert Evans, Thomas Graves, Daryn Sharp, Jason Lowe and everyone else) for their efforts in getting YARN to stability and large-scale deployments on their clusters.

Arun C. Murthy

We Believe… in community driven Enterprise Apache Hadoop

 


At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software. In the spirit of openness, we think it’s important to share our perspectives around the broader context of how Apache Hadoop and Hortonworks came to be, what we are doing now, and why we believe our unique focus is good for Apache Hadoop, the ecosystem of Hadoop users, and for Hortonworks as well.

How did we get here? 

The core team here at Hortonworks started at Yahoo! where in 2005 Eric Baldeschwieler (aka “E14” and Hortonworks CTO) challenged Owen O’Malley (Hortonworks co-founder) and several others to solve a really hard problem: store and process the data on the internet in a simple, scalable and economically feasible way.  They looked at traditional storage approaches but quickly realized they just weren’t going to work for the type of data (much of it unstructured) and the sheer quantity Yahoo! would have to deal with.

The team’s first reaction, as is the norm, was to lock themselves in a room and come up with a prototype of a closed, proprietary system. With fantastic vision and oversight from E14 and Raymie Stata (former CTO, Yahoo), however, the team turned to the open-source community and in particular the Apache Software Foundation. This also included growing a large development team that included Doug Cutting, Arun Murthy (Hortonworks co-founder) and others who began to work with the community on what became known as Apache Hadoop – specifically HDFS and MapReduce.

The team quickly realized that by contributing their efforts into a community of like-minded individuals, the technology would innovate far faster.  At the same time, they’d enable other organizations to realize some of the same benefits that they were starting to see from their early efforts.  When organizations such as Facebook, LinkedIn, eBay, Powerset, Quantcast and others began picking up Hadoop and innovating in areas beyond the initial focus, it reinforced the fact that the choice of community driven open source was the right one.

A case in point: a small startup (Powerset) started working on a project to support tables on HDFS inspired by Google’s BigTable paper, and that effort turned into what’s now Apache HBase! Need more? Facebook started an effort to build a SQL layer on top of MapReduce, which became Apache Hive!

Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs. 

Like anything done in a big group, at times it can be a challenge, but it has proven time and again when it comes to platform technologies like Hadoop that community-driven open source will always outpace the innovation of a single group of people or single company.

Apache Hadoop usage at Yahoo! has grown to the point that today Hadoop is a foundational technology underlying a wide range of business-critical applications.  This is captured really well by Sumeet Singh, a Director of Product Management at Yahoo!, who recently outlined just how far their journey has come.

And as the team tasked with architecting and operating that infrastructure over many of those years, our Hortonworks engineers gained critical insights that have been diligently funneled back into the community to be addressed in the appropriate place: the open source projects at the Apache Software Foundation.  That process gave rise to a host of new projects that are now core to Hadoop (such as Apache Hadoop YARN, Apache HCatalog, Apache Ambari to go along with Apache Pig, Apache Hive, Apache HBase and many others).

What are we doing now?

After many years architecting and operating the Hadoop infrastructure at Yahoo! and contributing heavily to the open source community, E14 and 20+ Hadoop architects and engineers spun out of Yahoo! to form Hortonworks in 2011.  Having seen what it could do for Yahoo, Facebook, eBay, LinkedIn and others, our singular objective is to focus on making Apache Hadoop into a platform that is easy to use and consume by the broader market of enterprise customers and partners.

And in doing so we maintain that same unwavering view as to how to approach the challenge:

  • identify and articulate the enterprise requirements within the community,
  • take an active role in addressing those requirements within the community, and
  • apply enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide are enterprise grade and interoperable with other elements in the enterprise.

To help us determine where to focus efforts, we spend a lot of time working with Hadoop users to understand the requirements for broader enterprise adoption, examples of which fall into the following categories:

  • Core Apache Hadoop
    Ensuring the core Apache Hadoop platform moves forward is a critical area of focus. All of the work happening on Apache Hadoop 2.0, including YARN, is aimed at ensuring Hadoop can continue to scale to meet the largest data processing needs as well as efficiently run a mix of workloads that serve batch, interactive, and online application needs. We are also working with others on some interesting incubating technologies in the community aimed at improving the latency and throughput characteristics of Hadoop workloads, so stay tuned!
  • Platform Services
    Addressing business continuity needs such as high availability, data mirroring, replication, and snapshots is critical to the mainstream enterprise.  We continue to invest aggressively in these areas across BOTH the stable Apache Hadoop 1.x line and the emerging Apache Hadoop 2.0 line. And we are also working with others on some interesting incubating technologies aimed at ensuring consistent and secure access to Hadoop services in order to address the security needs that are critical to the enterprise, so we’ll have more to say there soon too!
  • Data Services
    Enabling Hadoop to exchange data from or to other systems is important as is improving the performance and simplifying data access for end users of the data.  Apache HCatalog is an incubator project we sponsored in 2011 that is increasingly at the heart of solution architectures that require consistent table access to Hadoop data. Our focus has recently turned towards the need for “more SQL and better performance” for the large community of Apache Hive users. Over the coming weeks, I encourage you to take a look at the work happening in the Hive community to see how those needs are being addressed. Exciting work!
  • Operational Services
    We feel strongly that easy management and monitoring of Hadoop clusters should not be a commercial holdback: it is a core requirement of any Hadoop implementation and should be delivered in the open.  Apache Ambari was established about a year ago to enable operators to manage Hadoop clusters with familiar and easy to use tools. Ambari is as much an operational fabric with complete REST APIs as it is a tool for managing Hadoop clusters. If you need to integrate Ambari with your own “pane of glass”, then you can do so. If you want a modern user interface to simplify Hadoop management, then Ambari has that as well.
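As a rough illustration of what integrating Ambari into your own “pane of glass” might look like, the sketch below builds an authenticated request against Ambari’s REST API using only the Python standard library. The host name, port and credentials are hypothetical placeholders, and the `/api/v1/clusters` path is an assumption based on the Ambari REST API as commonly documented; check the API reference for your Ambari version before relying on it.

```python
import base64
import urllib.request


def build_ambari_request(base_url, path, user, password):
    """Build (but do not send) a GET request for an Ambari REST resource.

    base_url, user and password are placeholders supplied by the caller;
    nothing here is specific to any particular Ambari deployment.
    """
    url = base_url.rstrip("/") + "/api/v1/" + path.lstrip("/")
    # HTTP Basic authentication: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", "Basic " + token)
    return request


# Hypothetical server address and credentials, for illustration only.
request = build_ambari_request(
    "http://ambari.example.com:8080", "clusters", "admin", "admin"
)
print(request.get_method(), request.full_url)

# To actually call the server, something like the following would be used:
#   import json
#   with urllib.request.urlopen(request) as response:
#       clusters = json.load(response)
```

The request is built separately from the network call so that a dashboard can add its own timeout, retry and error handling around it.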

Applying Enterprise Rigor to Open Source

Today, eight years into its development, there are numerous open source projects that augment core Hadoop to address these critical operational, data and platform requirements.  Hortonworks Data Platform (HDP) packages up a dozen or so distinct open source projects into a single integrated distribution that provides the enterprise services businesses can rely on.  Not only do Hortonworkers play key roles in the test and release process for each of those various projects, but we also take great pains to test and certify a consolidated distribution on large and complex clusters running across a range of operating platforms.

In fact, before we release any version of HDP, we first work with our colleagues at Yahoo! to test it at scale on their infrastructure – every time.  This means that by the time HDP sees any customer environment it has been validated at Yahoo!, which has arguably the richest test suite for Hadoop on the planet. Case in point – with help from Yahoo, YARN has been significantly battle-tested – to the tune of nearly 14 million applications and 80,000 jobs per day per cluster.

Good for the ecosystem

Our mission when we started Hortonworks was to accelerate the adoption of Hadoop by providing a 100% open source, enterprise grade distribution in order to provide a truly open platform. The key reason partners such as Microsoft and Teradata choose Hortonworks as their strategic partner for Hadoop is this: our engineers are committed to working within the 100% open source Apache Software Foundation projects with no commercial holdbacks.  This is really in contrast to other vendors who are taking a proprietary approach that can lead to closed interfaces and vendor lock-in.

And we ensure that the work we do with our partners makes it back into the community.  For instance, our work on the Apache HCatalog project has been adopted and extended by Teradata with their SQL-H offering.  And we have worked extensively with Microsoft to enable Hadoop to run on Windows, and contributed this work back to the broad community so that others can pick up and continue the work in ways that benefit everyone. Even better, it is really great to see partners like Microsoft contribute significantly to the open-source project to ensure Apache Hadoop is fully supported on key platforms like Microsoft Azure – another illustration of the rising tide that is the open-source model.

Good for Hortonworks

We are pretty passionate about the journey we are on.  By staying true to our 100% open source philosophy and applying Enterprise software rigor to the test and release process, we believe that we can accelerate the adoption of Hadoop in the ecosystem.

We love what we are doing, are committed to the approach, and can’t wait to see what the next chapter brings.

Hadoop Summit Europe 2013 Reveals Strong Ecosystem Support

Hadoop Summit Europe 2013, the European extension of the original and world’s largest Apache Hadoop community conference, today announced its official program, featuring a keynote address from 451 Group Analyst and Research Manager for Data Management and Analytics Matt Aslett and 40 use cases and educational sessions from leading industry and community experts. In addition, Hadoop Summit Europe 2013 boasts an impressive list of Platinum, Gold and Silver sponsors, demonstrating ecosystem support for Apache Hadoop from leading producers of software and services for the enterprise.

Hadoop Summit Europe will be the first and largest European conference focused exclusively on accelerating the enterprise adoption of Apache Hadoop, held at the historic Beurs van Berlage in Amsterdam on March 20-21, 2013. The event features sponsors ranging from traditional software companies to open source analytics vendors, confirming strong European interest in Hadoop.

Registration for Hadoop Summit Europe 2013 remains open, but the conference is filling up fast. Don’t miss the opportunity to attend; register here: http://hadoopsummit.org/amsterdam/register/. More information on the show’s program can be found at: http://hadoopsummit.org/amsterdam/schedule/. For press and analysts, please contact Kim Rose, Director of Corporate Marketing at krose@hortonworks.com.

Hortonworks Joins OpenStack Foundation

By contributing to the OpenStack ecosystem, Hortonworks is supporting the open source community and facilitating adoption of 100-percent open source Apache Hadoop-based solutions in the cloud.  Now customers will be able to access an enterprise-ready Hortonworks Data Platform built for the cloud that alleviates the time and complexities of manually deploying a big data solution.

The Road Ahead for Hortonworks and Hadoop

I recently delivered a webinar entitled “Hortonworks State of the Union”. For those new to Apache Hadoop, I covered a brief history of Hadoop and Hortonworks’ role within the open source community. We also covered how the platform services, data services, and operational services required to enable Hadoop as an enterprise-viable platform evolved in 2012.

Finally, we discussed the important progress made on deeply integrating Hadoop within next-generation data architectures in a way that makes sense for the enterprise. Our partnership with Teradata provides a great example of how deep integration of BOTH the data services (via Apache HCatalog) AND the operational services (via Apache Ambari’s REST APIs) can deliver value in a way that addresses mainstream enterprise needs while preserving existing investments.

What’s next?

If 2012 was a big year for Hadoop and big data, then 2013 should be HUGE.

As we enter 2013, I believe Hadoop has “crossed the chasm” from a framework for early adopters and technology enthusiasts to a strategic data platform embraced by early majority and pragmatic adopters. CTOs and CIOs across mainstream enterprises want to improve the performance of their companies and unlock new business opportunities, and they realize that including Hadoop as a deeply integrated “plus 1” to their data architectures provides them the fastest path to their goals while maximizing their existing investments.

The other side of the chasm is where vertical solutions (or “bowling pins” as Geoffrey Moore refers to them in his book) emerge in earnest. While we, Hortonworks, are interested in serving the needs of these vertical solutions, as an open source software infrastructure company we are keenly interested in identifying and enabling horizontal patterns of use that unlock Hadoop’s value for the widest range of use cases.

Refine, Explore, Enrich

This graphic illustrates the Refine, Explore, and Enrich patterns of use that we have seen emerge in the market:

  • Refine is about capturing all sorts of data sources into a platform where that data can then be refined into formats that are more easily shared with downstream systems such as a Data Warehouse.
  • Explore is about interactively surfing through these new lakes of data and unlocking opportunities for business value through the use of new and existing Business Intelligence (BI) tools.
  • Enrich is about creating and deploying advanced analytics in a way that makes online applications, such as mobile commerce applications, more “intelligent” with respect to the experience delivered.

The key point to reiterate is that Hadoop is an important “plus 1” in next-generation data architectures powering these use cases.

So What’s in Store for 2013?

Our focus from 2012 continues into 2013: a) make Hadoop an enterprise-viable platform that’s easy to use and consume by the enterprise while b) ensuring the platform is interoperable with the broader data ecosystem. With that said, I outlined a range of initiatives that we, Hortonworks, will be focused on in our efforts within the open source community: Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase). We will be working in other areas, of course, but these are the key focus areas that our enterprise customers are interested in.

Since the topic of Interactive Query is fairly popular these days, let me share some quick thoughts. Over the past few years, Apache Hive has matured into the de-facto SQL interface to Hadoop data. Many of the top BI vendors support Hive today, and based on our customer interactions, more than 50% of Hadoop use cases depend on Hive for operational data processing and BI use cases. That said, Hive needs work to support human interactive BI use cases such as visualization and parameterized reporting.

Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Over the coming weeks, we will roll out webinars and blog posts that cover each of our initiatives in more detail. Also, we expect to demonstrate some of the fruits of the labor at the Hadoop Summit in Amsterdam in March.

2013 should prove to be a fun and productive year!

Don’t be Tardy for This Hadoop BINGO Party!

Happy New Year, everyone!

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends, co-workers together and be prepared to geek out to Hadoop!

This 4-part series will have a mixture of amazing guest speakers covering topics such as Hortonworks’ 2013 vision and roadmaps for Apache Hadoop and Big Data, what’s new with Hortonworks Data Platform v1.2, how Luminar (an Entravision company) adopted Apache Hadoop, and a use case on Hadoop, R and GoogleVis. This series will give organizations an opportunity to gain a better understanding of the Apache Hadoop and Big Data landscape, along with practical guidance on how to leverage Hadoop as part of a Big Data strategy.

How is that a party?

We’re going to incorporate a game of BINGO! That’s right folks/potential attendees/registrants, a game of B-I-N-G-O for this webinar series.

Download your bingo card and join us! (instructions below in case you need them)

It’s easy as 1.2.3…4!

1) Pick a Hortonworks BINGO card and print one out: card 1 or card 2

2) When you hear a word that’s on your card, remember to mark it with a THICK PEN (I have bad eyes)

3) To win, you must make a horizontal, diagonal OR vertical line to get a BINGO

4) Take a picture of your winning card and email it to Kim@hortonworks.com where you will be entered to win a $50 gift card to newegg.com.

Easy enough, right?

So don’t be tardy for this Hadoop party y’all! We’re starting the first one next Tuesday, January 22 @ 10am PST with Shaun Connolly, VP of Strategy at Hortonworks, as he highlights key Hortonworks accomplishments from 2012, provides insight into upcoming initiatives and projects for 2013 and talks about our contribution to the Apache open source community.

Other featured webinars in this program include:

  • Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data Platform v1.2
  • Break Through the Traditional Advertisement Services with Big Data and Apache Hadoop
  • Process & Visualize Your Data with Revolution R, Hadoop and GoogleVis

Register now to reserve your seat. We would love to have you join us!

“State of the Union” Webinar Features Hortonworks Executive Delivering 2012 Year-in-Review, Mapping Out Strategic Direction for 2013 and Highlighting Key Product Offerings

What:             “Hortonworks State of the Union and Vision for Apache Hadoop in 2013” webinar

Who:               Shaun Connolly, Vice President of Corporate Strategy, Hortonworks

When:             Tuesday, January 22, 2013 at 1:00 p.m. ET/10:00am PT

Where:           http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html

The “State of the Union” webinar is the first in a four-part Hortonworks webinar series titled, “The True Value of Apache Hadoop,” designed to inform attendees of key trends, future roadmaps, best practices and the tools necessary for the successful enterprise adoption of Apache Hadoop.

During the “State of the Union,” Connolly will look at key company highlights from 2012, including the release of the Hortonworks Data Platform (HDP), the industry’s only 100-percent open source platform powered by Apache Hadoop, and the further development of the Hadoop ecosystem through partnerships with leading software vendors, such as Microsoft and Teradata. Connolly will also provide insight into upcoming initiatives and projects that the Company plans to focus on this year as well as topical advances in the Apache Hadoop community.

Attendees will learn:

  • How Hortonworks’ focus contributes to innovation within the Apache open source community while addressing enterprise requirements and ecosystem interoperability;
  • About the latest releases in the Hortonworks product offering; and
  • About Hortonworks’ roadmap and major areas of investment across core platform, data and operational services for productive operations and management.

For more information, or to register for the “State of the Union” webinar, please visit: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html.

Hortonworks Data Platform 1.2 Available Now!

Hortonworks Data Platform 1.2 is now available for download at: http://hortonworks.com/products/hortonworksdataplatform/.

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop, is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

“We are pleased to see the continued evolution of the Hortonworks Data Platform – a key component for capturing and refining data in the Teradata Unified Data Architecture, which provides deeper analytical insights across all data for any end-user or application,” said Scott Gnau, president, Teradata Labs. “The focus on system management with this release allows for seamless integration with Teradata Viewpoint, allowing our customers to have a single administration view across the Teradata data warehouse, Teradata Aster discovery platform, and Apache Hadoop in their enterprise architecture, resulting in a lower cost of operation.”

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

“The inclusion of the newest version of Apache Ambari in the Hortonworks Data Platform represents a major step forward in the open source community and the Apache Hadoop ecosystem,” said Herb Cunitz, president of Hortonworks. “With a number of new features designed to improve the ease of use, management and security of Apache Hadoop, our newest release of the Hortonworks Data Platform is helping solidify Hadoop’s position as the de facto next-generation enterprise data platform. Hortonworks remains solely committed to developing a 100-percent open source, stable and reliable Apache Hadoop-based platform that will help grow and expand the ecosystem around Hadoop, driving the collection, storage, management and analysis of big data at leading enterprise organizations worldwide.”

Proper Care and Feeding of Drives in a Hadoop Cluster: A Conversation with StackIQ’s Dr. Bruno

In a recent blog post, Hortonworks’ Steve Loughran discussed Apache Hadoop’s preference for JBOD-configured storage vs. the allure of RAID-0. As more enterprises move beyond the science experiment stage and begin deploying Hadoop into their production environments, they are learning that Hadoop is quite different from other services in their data centers, such as web, mail, and database servers. They are also learning that to achieve optimal performance, they need to pay particular attention to configuring the underlying hardware.

To find out more, we had a chat with Dr. Greg Bruno, VP of Engineering, and co-founder of StackIQ, a Hortonworks partner, about the real life implications of managing hard drives (HDDs) in a modern Hadoop cluster.

Q: Why isn’t it considered good practice to configure drives in Hadoop clusters as RAID-0 disk arrays?

A: Hadoop prefers a set of separate disks to the same set managed as a RAID-0 disk array. Read speeds are particularly important to the performance of a Hadoop cluster, and in his post, Steve makes the point that since drive speeds vary, and RAID-0 reads occur at the speed of the slowest disk in the array, a RAID-0 configuration may well be slower than a non-RAID configuration. The bigger issue, in my opinion, is reliability. If a set of disks is configured as a RAID-0 array, then one disk failure in that array will take that entire volume down, and if all the disks in a node are configured as a single RAID-0 array, then a single disk failure will take all the node’s data down. By configuring multiple disks in a RAID-0 array, you magnify the probability of that volume going offline due to a single disk failure and you maximize the amount of data that goes offline when that single failure occurs.
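
The reliability point can be put in rough numbers. The sketch below is only an illustration: it assumes drives fail independently, and the 5% per-drive failure probability is a made-up figure, not one quoted by Dr. Bruno. The function name is hypothetical as well.

```python
def volume_failure_probability(per_disk_failure_prob: float, disks_in_array: int) -> float:
    """Chance that a RAID-0 volume is lost, assuming independent disk failures.

    A RAID-0 stripe has no redundancy, so the volume survives only if
    every disk in the array survives.
    """
    return 1.0 - (1.0 - per_disk_failure_prob) ** disks_in_array


# Hypothetical per-drive failure probability over some period (an assumption).
p = 0.05
for n in (1, 6, 12):
    risk = volume_failure_probability(p, n)
    print(f"{n:2d} disks striped together: {risk:.1%} chance of losing the volume")
```

Under these assumptions the risk grows with every disk added to the stripe, and the amount of data behind that single point of failure grows with it.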

Q: Modern servers have a lot of disks. What’s the impact of losing a single disk when you have 12 3TB drives in each node?

A: When a single drive fails and Hadoop is in its default configuration, the ENTIRE NODE gets taken offline. Back when servers typically had 6 x 1.5TB drives in them, losing a single disk would cause the loss of 0.02% of total storage in a typical 10PB, three-replica setup. With today’s hardware, typically 12 x 3TB drives per node, losing a single disk results in the loss of five times as much data.

Q: Aren’t today’s HDDs much more reliable than they used to be? Is it worth the extra work to handle the rare cases when a drive fails?

A: While drives are much more reliable than they used to be, they are still the cause of the lion’s share of support tickets in a Hadoop cluster. In fact, according to Bharath Mundlapudi, a Core Hadoop Engineer while working at Yahoo, disk drive failures account for fully 50% of siteops trouble tickets. That’s more than three times the next highest source of tickets.

Q: What does that represent in real terms?

A: It represents a lot of work for systems administrators. How much depends on the size and age of the cluster in question. For example, Facebook, which has some very large clusters, reports that its failure detection and automated repair system is doing the work of approximately 200 full-time system administrators.

Q: OK, but not many organizations have clusters that large. What about a typical enterprise setup?

A: Our experience indicates that a 1,000 node cluster containing 12,000 drives, for a total raw storage capacity of 48 petabytes, can expect about 3 drive failures a day in its third year of operation. Drive failure rates rise as the devices age. For a 500 node cluster, you’re looking at a drive failure every 17 hours or so.
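
For readers who want to plug in their own cluster size, here is a back-of-the-envelope sketch of that arithmetic. It assumes the failure rate scales linearly with the number of drives and takes its rate from the 1,000 node figure above; the function name and parameters are illustrative only.

```python
def hours_between_failures(nodes: int, drives_per_node: int,
                           failures_per_drive_per_day: float) -> float:
    """Average hours between drive failures, assuming failures scale with drive count."""
    failures_per_day = nodes * drives_per_node * failures_per_drive_per_day
    return 24.0 / failures_per_day


# Rate implied by the quoted figure: about 3 failures a day across 12,000 drives.
rate = 3 / 12_000

for nodes in (1000, 500):
    hours = hours_between_failures(nodes, 12, rate)
    print(f"{nodes} nodes: roughly one drive failure every {hours:.0f} hours")

# The same rate expressed as an annualized failure rate per drive.
print(f"Implied annual failure rate: {rate * 365:.1%}")
```

With these inputs the 500 node case works out to about one failure every 16 hours, in the same range as the "every 17 hours or so" quoted above.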

Q: Doesn’t this make it hard for the cluster operator to manage? How do they keep up?

A: Without the right tools and methodology, it is very difficult for cluster operators to manage clusters at scale. They typically have to write scripts to scan the cluster, detect disk failures, and report them. Then, once the offending drive has been replaced, commands must be run for the controller to recognize the new drive, OS commands need to be executed to format the drive, and then some Hadoop commands are required to add the disk back to the configuration.
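
As an illustration of the first step, a home-grown detection script might look something like the sketch below. It is not StackIQ's tooling or anything shipped with Hadoop: the function name and the mount points are hypothetical, and it only checks that each data directory accepts a small write.

```python
import os
import tempfile


def find_failed_data_dirs(data_dirs):
    """Return the directories that are missing or cannot complete a small write."""
    failed = []
    for path in data_dirs:
        try:
            # A dead or unmounted disk usually surfaces as an OSError here.
            with tempfile.NamedTemporaryFile(dir=path) as probe:
                probe.write(b"ok")
                probe.flush()
                os.fsync(probe.fileno())  # force the write down to the device
        except OSError:
            failed.append(path)
    return failed


if __name__ == "__main__":
    # Hypothetical mount points; a real node would take these from its Hadoop configuration.
    print(find_failed_data_dirs(["/hadoop/disk1", "/hadoop/disk2"]))
```

Detection is the easy part. The controller, OS, and Hadoop steps that follow a drive swap are where per-node differences start to matter.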

Q: Presumably it’s not quite as challenging for StackIQ customers?

A: StackIQ’s mission is to make cluster operation as painless as possible, which is why we have developed tools to manage the entire lifecycle of the disk. While we haven’t figured out how to get our software to physically pull a bad drive and replace it with a new one, we automate the rest of it: the initial deployment of the drive, detecting and reporting the error, and re-integrating the replacement drive into the configuration.

One of the features we’ve developed in StackIQ’s management software automatically configures chassis with LSI MegaRAID controllers into “JBODs”; that is, every disk in the chassis will be configured as an individual device.

In addition, a user can specify which disk in the chassis should be the boot disk via an attribute (e.g., “bootdisk0”), and if an optional secondary boot disk attribute is specified (“bootdisk1”), then our code will configure both those disks as a “mirror” (RAID1) while still making all the other non-boot disks available to Hadoop as individual disks. A recent StackIQ customer made their purchasing decision on this feature alone, as they had recently gone through the painful exercise of changing a mid-size cluster’s RAID configuration by booting each server, one by one, catching a key press at the controller prompt, and fixing the configuration by hand. Not a fun exercise when you are under pressure from management to get a production cluster online.

Q: With that many drive failures, clusters will be chewing through disks at a brisk rate. That could get expensive. That works out to something like 1000 drives/year X $100/drive = $100k per year just for replacement drives.

A: True, which speaks to the need for software that makes the most efficient use of your resources. Intelligent, automated cluster management software can find faulty drives automatically and bring up a replacement drive quickly.
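
The estimate in the question can be reproduced from the failure rate quoted earlier. This sketch assumes 3 failures a day (the 1,000 node figure) and the $100 per drive used in the question; the function name is illustrative.

```python
def annual_replacement_cost(failures_per_day: int, cost_per_drive: int):
    """Return (drives replaced per year, total replacement cost in dollars)."""
    drives_per_year = failures_per_day * 365
    return drives_per_year, drives_per_year * cost_per_drive


drives, cost = annual_replacement_cost(failures_per_day=3, cost_per_drive=100)
print(f"{drives} drives a year, about ${cost:,} in replacement drives")
```

That covers hardware only; it leaves out the administrator time spent on each replacement.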

Q: Doesn’t automation take control out of the hands of the skilled cluster operators?

A: We believe it should be up to the cluster operator to set policies on how much automation to incorporate into their workflows. Our software reflects that philosophy, letting operators choose from a range of policies, from having the operator run all the commands manually to a fully automated repair where all the operator needs to do is push in the new drive and let StackIQ’s software do the rest.

Q: Can’t this be done with a simple command script that runs on all nodes?

A: That might be workable in a homogeneous environment, where all the nodes are the same. But in the real world, different nodes require different configurations. Even the disks are likely configured differently in nodes within the clusters. Handling those variables in a static script would be very difficult to accomplish. For example, if your cluster expands over time, you may be adding chassis with different drive configurations. Static scripts wouldn’t be able to deal with this situation. The StackIQ management software has intimate knowledge of the hardware and software in the cluster, so it knows exactly how to handle each drive in each node across the entire cluster, even in a heterogeneous environment.

Conclusion

So there you have it. The folks behind StackIQ cluster management software agree with Steve Loughran’s recommendation to forgo RAID-0 for Hadoop clusters. In fact, they provide the management tools to make it easier to do. So take the advice of our experts, and configure your cluster servers as “Just a Bunch of Disks.”

For more information on StackIQ, please visit their website or follow their Twitter handle (@StackIQ). You can also follow Dr. Greg Bruno directly on his Twitter handle (@itsDrBruno).

~ Lisa Sensmeier

Oldest and Largest Apache Hadoop Community Event in North America Opens Call for Papers

Hadoop Summit North America 2013, the premier Apache Hadoop community event, will take place at the San Jose Convention Center, June 26-27, 2013. Hosted by Hortonworks, a leading contributor to Apache Hadoop, and Yahoo!, Hadoop Summit brings together the community of developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending and implementing Apache Hadoop as the next-generation enterprise data platform.

This 6th Annual Hadoop Summit North America will feature seven tracks and more than 80 sessions focused on building, managing and operating Apache Hadoop from some of the most influential speakers in the industry. Growing 30 percent to more than 2,200 attendees last year, Hadoop Summit reached near sell-out crowds. This year, the Summit is expected to be even larger.

Apache Hadoop is the open source technology that enables organizations to more efficiently and cost-effectively store, process, manage and analyze the ever-increasing volume of data being created and collected every day. Yahoo! pioneered Apache Hadoop and is still a leading user of the big data platform. Hortonworks is a core contributor to the Apache Hadoop technology via the company’s key architects and engineers.

The Hadoop Summit tracks include the following:

  • Hadoop-Driven Business / Business Intelligence: Will focus on how Apache Hadoop is powering a new generation of business intelligence solutions, including tools, techniques and solutions for deriving business value and competitive advantage from the large volumes of data flowing through today’s enterprise.
  • Applications and Data Science: Will focus on the practice of data science using Apache Hadoop, including novel applications, tools and algorithms, as well as areas of advanced research and emerging applications that use and extend the Apache Hadoop platform.
  • Deployment and Operations: Will focus on the deployment, operation and administration of Apache Hadoop clusters at scale, with an emphasis on tips, tricks and best practices.
  • Enterprise Data Architecture: Will focus on Apache Hadoop as a data platform and how it fits within broader enterprise data architectures.
  • Future of Apache Hadoop: Will take a technical look at the key projects and research efforts driving innovation in and around the Apache Hadoop platform.
  • Apache Hadoop (Disruptive) Economics: Focusing on business innovation, this track will provide concrete examples of how Apache Hadoop enables businesses across a wide range of industries to become data-driven, deriving value from data in order to achieve competitive advantage and/or new levels of productivity.
  • Reference Architectures: Apache Hadoop impacts every level of the enterprise data architecture from storage and operating systems through end-user tools and applications. This track will focus on how the various components of the enterprise ecosystem integrate and interoperate with Apache Hadoop.

The Hadoop Summit North America 2013 call for papers is now open. The deadline to submit an abstract for consideration is February 22, 2013.  Track sessions will be voted on by all members of the Apache Hadoop ecosystem using a free voting system called Community Choice. The top ranking sessions in each track will automatically be added to the Hadoop Summit agenda. Remaining sessions will be chosen by a committee of industry experts using their experience and feedback from the Community Choice.

Discounted early bird registration is available now through February 1, 2013. To register for the event or to submit a speaking abstract for consideration, please visit: www.hadoopsummit.org/san-jose/

Sponsorship packages are also now available. For more information on how to sponsor this year’s event please visit: www.hadoopsummit.org/san-jose/sponsors/

 

Apache Hadoop: Seven Predictions for 2013

At Thanksgiving we took a moment to reflect on the past and give thanks for all that has happened to Hortonworks the past year.  With the New Year approaching we now take time to look forward and provide our predictions for the Hadoop community in 2013.  To compile this list, we queried and collected big data from our team of Hadoop committers and members of the community.

We asked a few luminaries as well and surfaced many expert opinions. While we had our hearts set on five predictions, we ended up with SEVEN. So, without further ado, here are the Top 7 Predictions for Hadoop in 2013.

1. “Big Data” becomes “data”

Over the past 18 months the term “big data” has emerged and has defined a space for a swath of new (and existing) technologies. It has been called transformative, and many have even said it will replace everything we do today. Well, we take a more realistic view of big data. We feel big data is just data. As Apache Hadoop has evolved it has become a standard platform for this new world, and by the end of 2013 the “big” moniker will no longer be necessary… it is all just data after all. Big data and all the predictions for this space will collapse into data management in the eyes of the analysts and all those following, including a lot of the “big” vendors.

2. Emergence of vertically aligned Apache Hadoop “solutions”

At the keynote of Hadoop Summit last year, Geoffrey Moore characterized Apache Hadoop as currently crossing the chasm, and said we would know it has landed on the other side and is enjoying mainstream adoption when vertical solutions arise. As more and more companies gain success we will see patterns and solutions arise that are custom-fit for a challenge found in a particular industry. As the system integrators and consultants become more and more expert on Apache Hadoop, they will wrap solutions in packages and we will see the emergence of these vertical solutions.

3. “Right-time” query of Apache Hadoop becomes reality

Much has been made about the batch nature of Apache Hadoop in the past few months.  This is understandable as it was, after all, architected this way.  In 2013 we will see Apache Hadoop v2 finally deemed stable and reliable and with this we will see advances in the surrounding Apache projects to make the platform more interactive.  The enterprise is asking for it and the community will naturally answer.  Some will try to “fix” this with proprietary extensions on Hadoop, but ultimately the community will resolve this challenge.  We will see technology emerge that allows you to get the “right” time applied to the “right” business requirements.

4. More Hadoop startups

There has been a lot of hype around Apache Hadoop, and for every new business idea it presents there are new companies popping up to support it. As vertically aligned solutions emerge, so too will a new batch of startups ready to take advantage of the mainstream adoption of Apache Hadoop.

5. Apache Hadoop v2 (YARN and MR2) becomes the standard for Hadoop data management

Hadoop has already established itself as the next-generation data platform; with Apache Hadoop v2, however, the enterprise will adopt it for more than pilots and small projects. It will become the data backbone for many because the advances in Apache Hadoop v2 make it more reliable and more stable. Personally, our Hortonworkers are excited about and proud of this new architecture, as our team has been busy building and testing it and can’t wait to see it prove its value.

6. The big data ecosystem expands

Related to prediction number four, existing application vendors will all clamor to make their products Hadoop-compatible. Led by Teradata, Microsoft and many others, application vendors are waking up to the reality that their applications must run on Hadoop. Already, it seems everyone is building reference architectures that incorporate Hadoop and HDP to leverage all the goodness they already provide around data lifecycle management, data governance, security, etc. Meanwhile the Hadoop community is doing everything it can to foster adoption by the ISVs. In 2013, nearly everyone will be speaking big data.

7. Apache Ambari sets the standard for Hadoop operations

This prediction is admittedly a little self-serving, as Hortonworks employs the founder and many of the contributors behind Apache Ambari, but we are believers. We believe that Ambari will set the standard for operational services for Enterprise Hadoop, as it allows organizations to more easily consume, deploy and manage a cluster. It has already reached parity with the proprietary solutions available, and with the power of the community it is accelerating and adding new features at an astonishing rate. Again, this fully dedicated open source approach not only provides the right tools but also showcases the extraordinary rate at which the democracy of the community can innovate.

Now this is just our opinion… What are YOUR predictions? Please comment!

Happy New Year, and we look forward to seeing all of these come true!

Community-driven Snapshot for HDFS – Part TWO

This blog is a follow-up to our previous blog, “Snapshots for HDFS.”

In June we posted an early prototype of snapshots that allowed us to experiment with a few ideas in HDFS-2802. Since then we have added more details to the design document and made significant progress on a brand new implementation (over 40 subtasks in HDFS-2802).

Some of the highlights of this new design include:

  • Read-Only Copy-on-Write (COW) snapshots (but can be extended RW later)
  • Snapshots for entire namespace or sub directories
  • Snapshots are managed by Admin, but users are allowed to take snapshots
  • Snapshots are efficient
    • Creation is instantaneous with O(1) cost.
    • Additional memory is used only when modifications are made relative to a snapshot (memory usage is O(M), where M is the number of modified files/directories)
    • Snapshots do not adversely affect regular HDFS operations
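
To make the O(1) creation and O(M) memory claims concrete, here is a toy copy-on-write model. It is only an illustration of the general idea, not the HDFS-2802 design or its data structures, and the class and method names are invented for this sketch.

```python
_ABSENT = object()  # marks "this file did not exist when the snapshot was taken"


class SnapshottableDir:
    """Toy copy-on-write directory mapping file names to contents."""

    def __init__(self):
        self.current = {}
        self.snapshots = {}  # snapshot name -> {file name: value at snapshot time}

    def create_snapshot(self, name):
        # O(1): nothing is copied, the snapshot starts with an empty diff.
        self.snapshots[name] = {}

    def write(self, filename, contents):
        # On the first change after a snapshot, remember the old value in its diff.
        # (Looping over all snapshots keeps the toy short; it is not meant to be efficient.)
        for diff in self.snapshots.values():
            if filename not in diff:
                diff[filename] = self.current.get(filename, _ABSENT)
        self.current[filename] = contents

    def read_snapshot(self, name, filename):
        diff = self.snapshots[name]
        # Unmodified files are read straight from the live state.
        value = diff[filename] if filename in diff else self.current.get(filename, _ABSENT)
        if value is _ABSENT:
            raise FileNotFoundError(filename)
        return value


d = SnapshottableDir()
d.write("a.txt", "v1")
d.create_snapshot("s1")
d.write("a.txt", "v2")
print(d.read_snapshot("s1", "a.txt"), d.current["a.txt"])
```

Each snapshot's diff holds one entry per file modified since it was taken, which is where the O(M) memory bound in the list above comes from.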

An initial implementation of snapshots with some tests is already completed.  We are now working on the improvements, some new tools for snapshots and adding more tests.

The major work-in-progress items are:

  • Persistent data structure based solution for efficient creation and memory usage
  • Snapshot diff tool
  • Restore/rollback snapshots

Meetup at Hortonworks: rallying the community
Recently we also held a Meetup at the Hortonworks office, where over 30 folks attended to discuss the design and some of the features in great detail. A wide range of topics was discussed, from snapshot usage by HBase, administration aspects of snapshots, and the overhead of creating and maintaining snapshots to lower-level details such as the length of files that are open for writing. We had representation from HDFS developers, HBase developers and engineers with deep experience in managing Hadoop and other storage systems. We thank the community for the valuable discussion and feedback on the feature requirements and the open questions.

 

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness

Over the course of 2012, through Hortonworks’ leadership within the Apache Ambari community, we have seen the rapid creation of the enterprise-class management platform required to make Apache Hadoop an enterprise-viable data platform. Hortonworks engineers and the broader Ambari community have been working hard on their latest release, and we’d like to highlight the exciting progress that’s been made on Ambari, a 100% open and free solution that delivers the features required from an enterprise-class management platform for Apache Hadoop.

Why is the open source Ambari management platform important?

For Apache Hadoop to be an enterprise viable platform it not only needs the Data Services that sit atop core Hadoop (such as Pig, Hive, and HBase), but it also needs the Management Platform to be developed in an open and free manner. Ambari is a key operational component within the Hortonworks Data Platform (HDP), which helps make Hadoop deployments for our customers and partners easier and more manageable.

Stability and ease of management are two key requirements for enterprise adoption of Hadoop and Ambari delivers on both of these. Moreover, the rate at which this project is innovating is very exciting.  In under a year, the community has accomplished what has taken years to complete for other solutions. As expected the “ship early and often” philosophy demonstrates innovation and helps encourage a vibrant and widespread following.

Recent and exciting enhancements to Apache Ambari include:

  • Simplified cluster provisioning with a step-by-step install wizard
  • Pre-configured key operational metrics for instant insight into the health of Hadoop Core (Hadoop Distributed File System and MapReduce) and related projects such as HBase, Hive and HCatalog
  • Visualization and analysis of job and task execution to gain a better view into dependencies and performance
  • A complete RESTful API for exposing monitoring information and integrating with existing operational tools
  • An intuitive user interface that makes viewing information and controlling a cluster easy and productive

Hortonworks Data Platform is all about enterprise-ready Hadoop and Ambari is a key project included in our distribution. Our focus as an organization is to innovate throughout all of the Hadoop-related projects and then package the most stable and enterprise ready components into HDP, and Ambari is an important component for users of HDP that are betting their business on Hadoop.

The Ambari project is a perfect example of what is important to us. First and foremost, we are focused on a 100% open source development and delivery model for HDP and second, we are dedicated to making sure Hadoop is reliable and can be trusted by the enterprise and our ecosystem of partners.

We are committed to the mission that HDP is the MOST stable, reliable and enterprise-ready Apache Hadoop distribution available. And that is why we invest in community-driven and enterprise-focused projects such as Ambari.

 

To learn more about Apache Ambari and the latest project updates, or to download the source code, visit the Apache Ambari home page: http://incubator.apache.org/ambari/

Hortonworks at Big Analytics 2012, New York City!

For the last couple of months, Hortonworks has been a proud sponsor of the Big Analytics 2012 roadshow. These roadshows have given us some great insights into the role of Apache Hadoop in this emerging Big Data market. We had some great discussions with attendees regarding their current and future plans for the use of Hadoop and other Big Data technologies. Another interesting insight was the need for data skills: people who know what to ask of that data and how to use tools like Hadoop to find patterns, answers and interpretations and to present the data.

On Wednesday, 12/12/12, Hortonworks will participate in the last leg (SOLD OUT SHOW) of the 4-city roadshow, in New York City. We, along with other Big Data experts, will discuss the new economics of data and will walk through an array of Hadoop and Big Data use cases. Our very own Jim Walker, Director of Product Marketing, will also be part of a very interesting keynote program, discussing Apache Hadoop’s role in your big data architecture.

If you’re attending, come by, visit us, we would love to meet you.

There will also be a live simulcast of a panel discussion at Big Analytics New York. This panel discussion will focus on the role of Data Scientists, provide some real-life examples of how Data Scientists can improve the business, and answer any questions you might have. You can ask questions and follow the discussion thread on www.twitter.com using the hashtag #BARS12, or follow along on TweetChat at: http://tweetchat.com/room/BARS12

You can register to be part of this live simulcast here.
