This blog covers our ongoing work on snapshots in Apache Hadoop HDFS. In it, I will cover the motivation for the work, the high-level design, and some of the design choices we made. Having seen snapshots in use with various filesystems, I believe that adding snapshots to Apache Hadoop will be hugely valuable to the Hadoop community. With luck, this work will be available to Hadoop users in late 2012 or 2013.…
We reached a significant milestone in HDFS: the Namenode HA branch was merged into trunk. With this merge, HDFS trunk now supports hot failover.
Significant enhancements were completed to make hot failover work:
- Configuration changes for HA
- Notion of active and standby states added to the Namenode
- Client-side redirection
- Standby processing of the journal from the Active
- Dual block reports to Active and Standby
We have extensively tested manual hot failover in our labs over the last few months.…
Apache Hadoop provides a high-performance native protocol for accessing HDFS. While this is great for Hadoop applications running inside a Hadoop cluster, users often want to connect to HDFS from the outside. For example, some applications need to load data into or out of the cluster, or interact with data stored in HDFS from outside. They can of course do this using the native HDFS protocol, but that means installing Hadoop and a Java binding alongside those applications.…
I ran across an interesting problem in my attempt to implement a random forest using Apache Pig. In a random forest, each tree is trained on a bootstrap sample. That is, sample N cases at random, with replacement, out of a dataset of size N.
For example, here is the input data:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Here is one bootstrap sample drawn from input:
(5, 2, 3, 2, 3, 9, 7, 3, 0, 4)
Each element can appear 0 to N times.…
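The sampling step above can be sketched in plain Python (outside Pig); the `bootstrap_sample` helper is illustrative, not part of the original implementation:

```python
import random

def bootstrap_sample(data, seed=None):
    # Draw len(data) items uniformly at random, with replacement,
    # so each element can appear anywhere from 0 to N times.
    rng = random.Random(seed)
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

data = list(range(10))              # (0, 1, 2, ..., 9)
sample = bootstrap_sample(data, seed=42)
print(sample)                       # one possible size-10 bootstrap sample
```

Doing the same thing inside Pig is the interesting part, since a row-at-a-time operator does not naturally produce duplicated rows.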
We have some great news for developers and researchers who want to start using Apache Hadoop quickly. With the release of Apache Hadoop 0.20.204 today comes, for the first time, the availability of RPMs that make it much simpler to set up a basic Hadoop cluster. This will allow you to focus on how to use the features instead of having to learn how they were implemented.
Before we begin, I’d like to apologize for the fact that these instructions do not optimize Hadoop settings to make Hadoop fast.…
This was originally published on my blog; I’m re-posting it here on request from the fine people at Hortonworks.
One of the best features of embedding is how it simplifies writing UDFs and using them right away in the same script without superfluous declarations.…
In this post I’m going to give a very simple example of how to use Pig, embedded in Python, to implement the PageRank algorithm. It goes into a little more detail on the same example given in the presentation I gave at the Pig user meetup. On the same topic, Daniel published a nice K-Means implementation using the same embedding feature. This was originally published on my blog; I’m re-posting it here on request from the fine people at Hortonworks.…
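For reference, here is a minimal plain-Python sketch of the PageRank iteration itself (not the embedded-Pig version from the post); the `pagerank` function and the toy graph are assumptions for illustration:

```python
def pagerank(links, damping=0.85, iterations=10):
    # links: dict mapping each node to a list of nodes it links to.
    # For simplicity this sketch drops the rank mass of dangling nodes
    # (nodes with no outgoing links).
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the teleport share, then the link shares.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

The embedded-Pig version expresses the inner loop body as a Pig script over the link table and uses Python only for the surrounding iteration and convergence check.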
We are very excited to announce that NextGen Apache Hadoop MapReduce is getting close. We just merged the code base into the Apache Hadoop mainline, and Arun is about to create a hadoop-0.23 branch to prepare for a release!
We’ve talked about NextGen Apache Hadoop MapReduce and its advantages. The drawbacks of the current Apache Hadoop MapReduce are old and well understood. The proposed architecture has been in the public domain for over three years now.…
Data integrity and availability are important for Apache Hadoop, especially for enterprises that use Apache Hadoop to store critical data. This blog will focus on a few important questions about Apache Hadoop’s track record for data integrity and availability and provide a glimpse into what is coming in terms of automatic failover for HDFS NameNode.
What is Apache Hadoop’s Track Record for Data Integrity?
In 2009, we examined HDFS’s data integrity at Yahoo!…
A common use case we have seen is that people want to operate on certain columns and project other columns as-is, or pass a range of input columns to a user-defined function. In 0.9, you have project-range, which makes it easier to write statements that do just that.…
* Special note: the code discussed in this blog is available here *
A common complaint about Pig is the lack of control-flow statements: if/else, while loops, for loops, etc.
And now Pig has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features Python provides, including control flow.
The Pig embedding API is similar to the database embedding API.…
This is the first of three blogs that will highlight the new features in Pig 0.9.
When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A);
Compare that to an SQL command:
SELECT COUNT(*) FROM student;
That’s just not intuitive, especially for new users.…
I’d like to congratulate Arun Murthy on his very popular Hadoop Summit talk. SlideShare.net reports that his presentation has gone viral: it was first promoted as the most-discussed SlideShare.net presentation on LinkedIn, and yesterday as the most-tweeted-about presentation. In both cases, the presentation was moved up to the front page.
Arun is a Hortonworks founder and MapReduce expert. His talk does a great job of highlighting some of the current limitations in MapReduce and then outlining the roadmap for improving areas such as scalability, high availability, cluster utilization and support for paradigms other than MapReduce.…