As enterprises increasingly adopt Apache Hadoop for critical data, the need for high-quality releases of Apache Hadoop becomes even more crucial. Storage systems in particular require robustness and data integrity, since enterprises cannot tolerate data corruption or loss. Further, Apache Hadoop offers an execution engine for customer applications, which comes with its own challenges. Apache Hadoop handles failures of disks, storage nodes, compute nodes, networks and applications. The distributed nature, scale and rich feature set make testing Apache Hadoop non-trivial.…
The Hortonworks Blog
A common use case we have seen is that people want to operate on certain columns and project the other columns as is, or pass a range of input columns to a user-defined function. In 0.9, you have project-range, which makes it easier to write statements that do just that.…
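As a quick sketch of what this looks like (the relation layout and `myUDF` here are hypothetical; the `..` range syntax is the project-range operator added in 0.9):

```pig
-- Hypothetical relation with ten columns c0 through c9
A = load 'data.txt' as (c0, c1, c2, c3, c4, c5, c6, c7, c8, c9);

-- Transform c0, pass the range c3 through c6 to a (hypothetical) UDF,
-- and project everything from c7 onward as is
B = foreach A generate UPPER(c0), myUDF(c3..c6), c7..;
```

Before 0.9 you would have had to list every column in each of those positions by hand.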
* Special note: the code discussed in this blog is available here *
A common complaint about Pig is the lack of control-flow statements: if/else, while loops, for loops, etc.
Now Pig has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features Python provides, including control flow.
The Pig embedding API is similar to the database embedding API.…
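As a minimal sketch of the embedding API (this runs under Jython via the `pig` command and needs a Pig 0.9+ installation; the input file, field names, GPA thresholds and output paths are made up for illustration):

```python
#!/usr/bin/python
# Run with:  pig thisscript.py   (Pig detects the Jython script and embeds it)
from org.apache.pig.scripting import Pig

# Compile a parameterized Pig script once; $input, $minimum_gpa and $output
# are bound per run below.
P = Pig.compile("""
A = load '$input' as (name:chararray, gpa:double);
B = filter A by gpa >= $minimum_gpa;
store B into '$output';
""")

# The control flow lives in Python: run the same Pig script
# once per threshold, something Pig Latin alone cannot express.
for minimum_gpa in [2.0, 2.5, 3.0, 3.5]:
    bound = P.bind({'input': 'student.txt',
                    'minimum_gpa': str(minimum_gpa),
                    'output': 'honor_roll_%s' % minimum_gpa})
    result = bound.runSingle()
    if not result.isSuccessful():
        raise RuntimeError('Pig job failed at threshold %s' % minimum_gpa)
```

`compile`, `bind` and `runSingle` are the core of the embedding API; `bind` substitutes the `$`-parameters, so the same compiled script can be reused across iterations.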
This is the first of three blogs that will highlight the new features in Pig 0.9.
When I first started using Pig, the thing I hated most was that I needed to write four lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A);
Compare that to an SQL command:
SELECT COUNT(*) FROM student;
That’s just not intuitive, especially for new users.…
For the first time in its history, OSCON, the premier open-source conference, had a special OSCON Data sub-conference. Apache Hadoop had a full track dedicated to it at OSCON Data. This clearly was indicative of the interest in Big Data and the central role Apache Hadoop plays in the space. A special shout out to Bradford Stephens and Sarah Novotny, the program chairs, who did a fantastic job with OSCON Data.…
I’d like to congratulate Arun Murthy on his very popular Hadoop Summit talk. SlideShare.net reports that his presentation has gone viral. They originally promoted it as the most-discussed SlideShare.net presentation on LinkedIn, and yesterday they promoted it as the most-tweeted-about presentation. In both cases, the presentation was moved up to the front page.
Arun is a Hortonworks founder and MapReduce expert. His talk does a great job of highlighting some of the current limitations in MapReduce and then outlining the roadmap for improving areas such as scalability, high availability, cluster utilization and support for paradigms other than MapReduce.…
Things are going really well at Hortonworks. We’re in our new office, connected to our data center of nearly 1000 nodes (thanks Yahoo!) and working away on our new computers. We’ve gotten a lot done in a very short amount of time. Along with our excellent G&A team, a key reason we’ve gotten so much done is that our founders have really stepped up and are taking responsibility for getting their teams moving.…
More news. We’ve put the Hortonworks slides from the Hadoop Summit on slideshare.net for those that are interested in seeing them:
Hortonworks Hadoop Summit 2011 Keynote – Eric14 (my keynote)
Crossing the Chasm: Hadoop for the Enterprise – Sanjay Radia
Next Generation Apache Hadoop MapReduce – Arun C. Murthy
Introducing HCatalog (Hadoop Table Manager) – Alan Gates
HDFS Federation and Other Features – Suresh Srinivas and Sanjay Radia…
Wow, Hortonworks day one! Our first day of being “on the record”. It’s been a busy but very productive day. Now that we are talking publicly about Hortonworks, there has been a LOT of interest in what we’re doing from analysts and journalists. So far the feedback we’ve received has been very positive.
I haven’t been able to read every article but a few have caught my eye that I wanted to share.…
We’re glad to have finally launched Hortonworks after months of planning and speculation. I thought I’d use the opportunity of my first Hortonworks blog to lay out who we are and what we’re all about.
Hortonworks was formed by Yahoo! and Benchmark Capital in June 2011 in order to accelerate the development and adoption of Apache Hadoop. We believe that Apache Hadoop will become the de facto platform for storing, managing and analyzing “big data,” namely the exploding volume of data being generated daily by organizations around the globe.…