Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if every technology, new and existing alike, is pushing the meme of “all your data are belong to us.” It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general could use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled “7 Key Drivers for the Big Data Market,” I asserted that the Big Data movement is not only about the classic world of transactions but also factors in the new(er) worlds of interactions and observations. This new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

7 Key Drivers for the Big Data Market

I attended the Goldman Sachs Cloud Conference and participated in a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of topics, but kicked off with two basic questions:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:

Executive Video Series: Introduction to HCatalog

We just added a video to the Hortonworks Executive Video library that features Alan Gates, Hortonworks co-founder and Apache PMC member. In this video, Alan discusses HCatalog, one of the most compelling projects in the Apache Hadoop ecosystem.

HCatalog is a metadata and table management system that provides a consistent data model and schema for users of tools such as MapReduce, Hive and Pig. When you consider that users often access Hadoop clusters with different tools that, on their own, don’t agree on schema, data types, or how and where data is stored, you can understand the value of a tool such as HCatalog.

In this video, Alan does a good job of not only explaining the role of HCatalog, but also laying out the future direction of the project. He talks about improving the integration with HBase, improving information lifecycle management and expanding the HCatalog data model to address the challenges of unstructured data.

Record Support for Hadoop Summit

In case you didn’t see the news today, Hadoop Summit announced record ecosystem support for this year’s conference. The original and world’s largest Apache Hadoop conference, now in its fifth year, is being sponsored this year by more than 40 traditional and open source software and services companies.

Hortonworks and our co-host Yahoo! would like to thank the following companies for helping to make Hadoop Summit possible:

Executive Video Series: Apache Hadoop and Next Generation MapReduce

The third installment of the Hortonworks executive video series features Arun C. Murthy, co-founder of Hortonworks and VP of Apache Hadoop for the Apache Software Foundation. In this video, Arun shares his view of the power of Apache Hadoop and provides some insight into the future direction of MapReduce, including the ability to support alternate programming paradigms.

Hadoop Observations from the U.K.

As part of Big Data Week, Dan Harvey of the London Hadoop User Group organised an afternoon session for the user group, which we were glad to sponsor, along with Canonical and Facegroup. I had the pleasure of presenting my view of the current and future status of Apache Hadoop to an audience that ranged from those curious about Hadoop to heavy users.

Every talk of the day was excellent, from the use cases by DataSift, Mendeley and MusicMetric, to the talk by Francine Bennett of MastodonC on the CO2 footprint of different cloud computing infrastructures, including a live dashboard on the current CO2/hour of many cloud infrastructure sites.

In my discussions with attendees, I was impressed by how broadly Hadoop is starting to be adopted in the U.K. There is adoption from “pure data” companies like Mendeley, DataSift, MusicMetric and Last.fm, as well as media companies and financial organisations. London is a centre of finance and data and as such, from a Hadoop perspective, it is a source of data waiting to be stored and mined.

Executive Video Series: Overview of Hortonworks Data Platform

We just released the second video in the Hortonworks Executive Series. This one features Matt Foley, Test and Release Engineering Manager for Hortonworks.

In this video, Matt provides an overview of Hortonworks Data Platform (HDP), including a summary of the Apache Hadoop components included in the distribution and the testing processes involved in the release process. Matt also provides an overview of Apache Ambari, an open source project that is adding monitoring and management capabilities to Apache Hadoop.

Hortonworks Welcomes Citrix and CloudStack to the Apache Community

We are pleased to support today’s announcement from Citrix that they have contributed CloudStack to the Apache community. For those new to CloudStack, it is open source cloud computing software that helps organizations build and manage cloud infrastructures. It is similar to the Amazon Web Services EC2 environment, except that it enables organizations to build public, private or hybrid cloud environments using their own pooled computing resources.

Citrix announced today that they were reaffirming their commitment to open source by working with the Apache Software Foundation to make CloudStack 3 an Apache project, released under the Apache License 2.0. This is yet further acknowledgement that Apache is the logical home for open source projects that are transforming the enterprise software industry. As a Gold Sponsor of the ASF and a major contributor to Apache projects, Hortonworks is pleased that leading vendors such as Citrix are recognizing the value that Apache can provide in accelerating development and innovation and in driving adoption as the preferred destination for enterprise-class open source software.

New Features in Apache Pig 0.10

Another important milestone for Apache Pig was reached this week with the release of Pig 0.10. This post summarizes the new features in Pig 0.10.

Boolean Data Type

Pig 0.10 introduces the boolean data type as a first-class Pig data type. Users can use the keyword “boolean” anywhere a data type is expected, such as in a load-as clause or a type cast.

Here are some sample use cases:

a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);

b = foreach a generate a0, a1, (boolean)a2;

c = group b by a2; -- group by a boolean field

When loading boolean data using PigStorage, Pig expects the text “true” (case-insensitive) for a true value and “false” (case-insensitive) for a false value; any other value maps to null. When storing boolean data using PigStorage, a true value is written as the text “true” and a false value as the text “false”.
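
To make the load and store rules concrete, here is a minimal sketch in Python rather than Pig. It is not Pig's own code; the function names are hypothetical and only mirror the mapping described above.

```python
from typing import Optional


def parse_pigstorage_boolean(text: str) -> Optional[bool]:
    """Map field text to a boolean the way the post describes for loading.

    "true"/"false" match regardless of case; anything else becomes None,
    which stands in for a Pig null in this sketch.
    """
    lowered = text.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


def format_pigstorage_boolean(value: bool) -> str:
    """Map a boolean back to the text described for storing."""
    return "true" if value else "false"
```

Under these assumptions, "TRUE" and "True" both load as true, while a value such as "yes" loads as null.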

Snapshots for HDFS

This post covers our ongoing work on snapshots in Apache Hadoop HDFS. I will cover the motivations for the work, a high-level design and some of the design choices we made. Having seen snapshots in use with various filesystems, I believe that adding snapshots to Apache Hadoop will be hugely valuable to the Hadoop community. With luck this work will be available to Hadoop users in late 2012 or 2013.

A snapshot is a point-in-time image of the entire filesystem or a subtree of a filesystem. Some of the scenarios where snapshots are very useful:

  1. Protection against user errors: The admin sets up a process to take read-only (RO) snapshots periodically in a rolling manner, so that there are always x RO snapshots on HDFS. If a user accidentally deletes a file, the file can be restored from the latest RO snapshot that contains it.
  2. Backup: The admin wants to back up the entire file system, a subtree in the file system or just a file. Depending on the requirements, the admin takes an RO snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by doing a diff between two snapshots.
  3. Experimental/Test setups: A user wants to test an application against the main dataset. Normally, without doing a full copy of the dataset, this is a very risky proposition because the test setup can corrupt or overwrite production data. The admin creates a read-write (RW) snapshot of the production dataset and assigns it to the user for the experiment. Changes made to the RW snapshot are not reflected in the production dataset.
  4. Disaster Recovery: RO snapshots can be used to create a consistent point-in-time image for replication, which can be copied over to a remote site for disaster recovery.
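
The first two scenarios can be sketched with a toy in-memory model. This is only an illustration in Python, not HDFS code: a "namespace" here is a plain dict of path to contents, and every function name is hypothetical.

```python
from collections import deque


def take_rolling_snapshot(snapshots, namespace):
    """Append a point-in-time copy of the namespace.

    `snapshots` is a deque created with maxlen=x, so the oldest
    snapshot is dropped automatically once x snapshots exist.
    """
    snapshots.append(dict(namespace))


def restore_from_latest(snapshots, path):
    """Return the contents of `path` from the newest snapshot that has it."""
    for snapshot in reversed(snapshots):
        if path in snapshot:
            return snapshot[path]
    return None


def snapshot_diff(old, new):
    """Return (added, removed, modified) paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, modified
```

An incremental backup in this model would copy only the paths reported as added or modified by snapshot_diff, and record the removed ones.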

High Level Requirements

  1. Read-only (RO) snapshots: These are immutable copies of underlying elements of the file system.
  2. Read-write (RW) snapshots: RW snaps can be modified by a user.
  3. Support for taking snapshots of the entire namespace, or a subtree.
  4. Support for a reasonable number of snapshots in a single namenode.
  5. Snapshots should be easy to browse using standard commands and tools, and copying of data from a snapshot should work with standard Hadoop commands and API.

High Level approaches

We considered two options for snapshots.

Option #1: Both the datanodes and the namenode are aware of snapshots and keep internal state about them. A datanode knows that some of its blocks belong to snapshot files.

Option #2: Only the namenode is aware of snapshots. A datanode does not know that some of its blocks are owned by snapshots of the original file.

We selected option #2 to keep the design simple. Taking snapshots is also very fast with option #2. The datanode knows nothing about snapshots and is not involved in block ownership questions between the root file system and its snapshots. Restricting the changes to the namenode eliminates the need for distributed co-ordination, which simplifies the design immensely.
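
As a rough illustration of why a namenode-only design makes snapshots cheap, here is a toy model in Python. It is an assumption-laden sketch, not the HDFS implementation: the class, its methods and the use of integer block IDs are all hypothetical.

```python
class ToyNamespace:
    """Namenode-side metadata only: path -> tuple of block IDs."""

    def __init__(self):
        self.files = {}
        self.snapshots = {}

    def create_snapshot(self, name):
        # Only metadata is copied; the block IDs are shared with the
        # live namespace, so no block data moves and datanodes are
        # never contacted.
        self.snapshots[name] = dict(self.files)

    def delete_file(self, path):
        # Removes the file from the live namespace only.
        del self.files[path]

    def referenced_blocks(self):
        """Blocks still owned by the live namespace or by any snapshot."""
        blocks = set()
        for block_ids in self.files.values():
            blocks.update(block_ids)
        for snapshot in self.snapshots.values():
            for block_ids in snapshot.values():
                blocks.update(block_ids)
        return blocks
```

In this model, deleting a file after a snapshot leaves its blocks referenced, and only the namenode-side bookkeeping knows why.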

Creating and Deleting Snapshots

A key requirement is that it be very easy to create and delete snapshots. Snapshot creation and deletion is an admin-only capability. To create a snapshot, one specifies a snapshot name, a path to the root of the subtree whose snapshot is to be taken, and whether the snapshot is read-only or read-write. Deleting a snapshot requires just a snapshot name. A command to list all the snapshots in the filesystem will be provided.

Accessing Directories and Files in a Snapshot

Snapshots can be referenced with regular HDFS path names with a reserved string .snapshot_<name>:

 hdfs://host:port/pathOfSnapshot/.snapshot_<name>/restOfPathInSnapshot

This has the benefit that snapshots can be referenced with all existing Hadoop commands and APIs that take a pathname by adding a reserved snapshot string to the pathname.

Examples: Consider a directory structure of /a/b/c/foo.txt, where the admin has created a snapshot named hdfs1 at /a/b. To list foo.txt as it exists in snapshot hdfs1, the command would be:

hadoop dfs -ls /a/b/.snapshot_hdfs1/c/foo.txt 

To copy foo.txt from the snapshot branch to /foobar, the command would be:

hadoop dfs -cp /a/b/.snapshot_hdfs1/c/foo.txt /foobar/.
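
The path convention used in these examples can be sketched as a small helper. This is an illustrative Python function, not part of any Hadoop API; its name and parameters are hypothetical.

```python
import posixpath

SNAPSHOT_PREFIX = ".snapshot_"  # reserved string from the proposal above


def snapshot_path(snapshot_root, snapshot_name, rest_of_path=""):
    """Build <snapshot_root>/.snapshot_<name>/<rest_of_path>."""
    parts = [snapshot_root, SNAPSHOT_PREFIX + snapshot_name]
    rest = rest_of_path.lstrip("/")
    if rest:
        parts.append(rest)
    return posixpath.join(*parts)
```

With the directory structure above, snapshot_path("/a/b", "hdfs1", "c/foo.txt") yields the same path passed to the two commands.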

One caveat is that an RO snapshot is immutable, so operations such as creating a new file, deleting a file, creating a new directory, or renaming a file or directory will fail when executed on the snapshot branch.

Conclusion

Snapshots are a very useful feature to have in a mature filesystem. This is a work in progress, and we have implemented a functional prototype. The first version of this feature will support RO snapshots only; support for RW snapshots will be added in subsequent releases. Several further features can be incorporated into snapshots, such as time-to-live with automatic deletion, schedule-based creation, marking specific directories as snapshot-worthy, quota-based restrictions on the space used by RW snapshots, and delegating to users the authority to create and delete snapshots at specific locations.

To track the development of the snapshots feature in HDFS, please follow the JIRA HDFS-2802.

~ Hari Mankude

Executive Video Series: The Hortonworks Vision for Apache Hadoop

I’m pleased to announce the first in a series of videos featuring Hortonworks founders and executives sharing their thoughts on how Apache Hadoop is being extended to become the next generation enterprise data platform. Over the coming weeks and months, you will be hearing from folks such as Matt Foley, Arun Murthy, Sanjay Radia and Alan Gates, just to name a few.

The first video features Shaun Connolly, Hortonworks VP of Corporate Strategy, talking about the Hortonworks vision for Apache Hadoop. In this video, Shaun does a nice job of outlining our vision that Apache Hadoop will process or touch half of the world’s data by 2015. How is Hortonworks helping to make this happen? Click on the video image below to find out.

Announcing the Hadoop Summit Community Choice Winners

Thank you to the community members who cast over 8,000 votes during the Hadoop Summit Community Choice voting process. The turnout far exceeded our expectations and is further evidence that the momentum behind Apache Hadoop has never been stronger.

As we announced, the sessions with the most votes in each track are automatically accepted into the Hadoop Summit agenda. As such, I am pleased to announce the winners of the Hadoop Summit Community Choice vote and the first confirmed sessions in the Hadoop Summit program:

Future of Apache Hadoop track: Dynamic Namespace Partitioning with Giraffa File System, Konstantin Shvachko (eBay)

Deployment and Operations track: Dynamic Reconfiguration of Apache Zookeeper, Alexander Shraer and Benjamin Reed (Yahoo!)

Enterprise Data Architecture track: iMStor: Hadoop Storage-based Tiering Platform, Vishal Malik (Cognizant Technology Solutions)

Applications and Data Science track: Hadoop & Cloud @Netflix: Taming the Social Data Firehose, Mohammad Sabah (Netflix)

Analytics and Business Intelligence track: Mapping and Reducing Passenger Turbulence using Big Data, Farhan Hussain and Saad Patel (Open Source Architect)

Hadoop in Action track: The Merchant Lookup Service at Intuit, Vrushali Channapattan (Intuit)

Hadoop Summit Community Choice

As I first mentioned when we announced Hadoop Summit 2012, we are focused on making Hadoop Summit the preeminent conference for the Apache Hadoop community. Today I’m pleased to tell you about Community Choice, a public online voting system that enables the entire Apache Hadoop community to have a say in the sessions chosen for Hadoop Summit. Anybody can vote and the top vote getters in each track will automatically be included in the Hadoop Summit agenda.

One of the things you will notice when you vote is the large number of abstracts that were submitted for the conference. In fact, there were 267 submissions for Hadoop Summit, more than double the number of submissions from last year’s highly successful event. There are six tracks, each of which has a wide selection of compelling topics. Another interesting fact is that there were submissions from 120 different organizations (companies, universities and government agencies). It’s becoming even clearer that Apache Hadoop is having a significant impact in the data industry.

In addition to Community Choice, there is also a content selection committee in place that will identify the other sessions for Hadoop Summit. This is also a community effort. The content selection committee is made up of 36 leaders from the ecosystem representing 27 different organizations (vendors, end users and universities). The committee is hard at work reviewing sessions and we expect to be able to publish the final agenda before the end of March.

Please remember to vote in the Community Choice process. If you ever wanted to have input into a conference, this is your chance. Voting ends March 20th, so please vote today.

~E14

Namenode HA Reaches a Major Milestone

We reached a significant milestone in HDFS: the Namenode HA branch was merged into the trunk. With this merge, HDFS trunk now supports HOT failover.

Significant enhancements were completed to make HOT Failover work:

  • Configuration changes for HA
  • The notions of active and standby states were added to the Namenode
  • Client-side redirection
  • Standby processing of the journal from the Active
  • Dual block reports to Active and Standby
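
How these pieces fit together can be sketched with a toy model. The Python below is a simplified illustration under stated assumptions, not HDFS code: the class, its methods and the list used as a shared journal are all hypothetical.

```python
class ToyNamenode:
    def __init__(self, name, state):
        self.name = name
        self.state = state       # "active" or "standby"
        self.namespace = set()   # file paths known to this namenode
        self.block_map = {}      # block id -> datanode that reported it
        self.applied = 0         # journal entries applied so far

    def create_file(self, journal, path):
        # Only the active accepts writes; a client that reaches the
        # standby is expected to redirect to the active.
        if self.state != "active":
            raise RuntimeError(self.name + " is standby; retry on the active")
        journal.append(path)
        self.namespace.add(path)
        self.applied = len(journal)

    def catch_up(self, journal):
        # The standby replays journal entries written by the active.
        for path in journal[self.applied:]:
            self.namespace.add(path)
        self.applied = len(journal)

    def receive_block_report(self, datanode, block_ids):
        # Datanodes report to both namenodes, so the standby's block
        # map is already warm when it takes over.
        for block_id in block_ids:
            self.block_map[block_id] = datanode


def manual_failover(active, standby, journal):
    """Demote the active, let the standby catch up, then promote it."""
    active.state = "standby"
    standby.catch_up(journal)
    standby.state = "active"
```

The point of the sketch is that the standby needs no cold start at failover time: it has already replayed the journal and already holds block locations.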

We have extensively tested HOT manual failover in our labs over the last few months. The HDFS team is now working on completing automatic failover. Please see HDFS-1623 for more details.

~Jitendra Pandey

Open Source Data Integration for Apache Hadoop

Today we announced an important strategic partnership with Talend, provider of the world’s most popular open source data integration platform. This is another win for both Hortonworks customers and the larger Apache Hadoop community. There were two key aspects of the announcement that I wanted to highlight:

Talend releases Talend Open Studio for Big Data

Based upon Talend’s very popular open source data integration platform, Talend Open Studio for Big Data adds connectors for HDFS, HBase, Pig, Sqoop and Hive. It allows organizations to move data into and out of Hadoop much more easily. It also leverages the MapReduce architecture to generate native Hadoop code and run data transformations directly inside Hadoop, in a highly scalable fashion. Talend Open Studio for Big Data will also be released with Apache licensing, which is a good match for the Apache Hadoop community.
