The fifth annual Hadoop Summit drew to a close last week, with over 2,200 Hadoopniks in attendance. While many innovations were demonstrated, for me the most exciting developments centered on Pig, HCatalog and Hive from Hortonworks and Twitter.
At the Hadoop Summit Pig Meetup, Twitter announced Ambrose, which now includes an excellent graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand your Pig scripts.
Jimmy Lin’s sold-out talk about Large Scale Machine Learning at Twitter (paper available) (slides available) described the use of Pig to train machine learning algorithms at scale on Hadoop. Interestingly, learning was achieved through a Pig StoreFunc UDF (documentation available). Some interesting related work by Ted Dunning can be found on github (source available).
There was much excitement about Dmitriy Ryaboy’s talk about Flexible Indexing in Hadoop (slides available). Twitter has created a novel indexing system atop Hadoop to avoid “Looking for needles in haystacks with snowplows,” that is, running MapReduce over lots of data just to pick out a few records. Twitter Analytics’ new tool, Elephant Twin, goes beyond the folder/subfolder partitioning schemes many teams use, for instance bucketizing data by /year/month/week/day/hour. Elephant Twin is a framework for creating indexes in Hadoop using Lucene. This lets you push filtering down into Lucene, returning only the matching records and dramatically reducing both the data streamed and the time spent on jobs that parse only a small subset of your overall data. A huge boon for the Hadoop community from Twitter!
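To make the contrast concrete, here is a hypothetical sketch in Pig Latin. The path, schema and user value are illustrative assumptions, not Elephant Twin’s actual API:

```pig
-- Partition pruning alone: even with hour-level folders, Pig must still scan
-- every record in the folder just to find a handful of matches.
logs = LOAD '/logs/2012/06/15/00' AS (user_id:chararray, action:chararray);
mine = FILTER logs BY user_id == 'russell';

-- With a Lucene-backed index in the style of Elephant Twin, that filter can
-- instead be pushed down into the loader, so only the blocks containing
-- 'russell' are ever read from HDFS.
```

The folder scheme prunes at the granularity of a partition; the index prunes at the granularity of a record, which is what makes the “snowplow” unnecessary.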
Alan Gates’ talk, Web Services in Hadoop (slides available), covered HCatalog as a RESTful front-end to Hadoop resources, enabling applications to integrate with Hadoop and extend the stack upwards. HCatalog helps users of Pig, Hive and MapReduce effectively discover and share resources. At version 0.4, HCatalog is not quite ready for production, but it has been in the Apache Incubator for a year and is improving fast. HCatalog is available in the new Hortonworks Data Platform.
Daniel Dai and Thejas Nair’s talk, Pig programming is more fun: New features in Pig (slides available), covered the new features in Pig 0.10, which you can read more about here. Improvements to Piggybank, including Pig macros, were discussed. ILLUSTRATE has been fixed in Pig 0.10 and now works with AvroStorage. Pig’s ability to cast single-record relations as scalars is a great addition to the language. UDFs in JRuby greatly simplify extending Pig, bringing many JRuby-compatible gems into Pig. Pig embedding enables iterative Pig scripts, such as training statistical models. Pig’s HCatalog integration enables sharing resources with Hive and MapReduce users. MongoStorage integration enables simple data publishing to MongoDB, a popular NoSQL database. Finally, Talend integration allows graphical programming of Pig.
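Of these, the scalar feature is easy to show in a few lines of Pig. In this sketch (the relation and field names are illustrative), a single-row relation is projected as a scalar inside another FOREACH:

```pig
events = LOAD 'events' AS (user:chararray, n:int);

-- GROUP ... ALL yields exactly one record, so the SUM below is a
-- single-row, single-field relation.
totals = FOREACH (GROUP events ALL) GENERATE SUM(events.n) AS sum_n;

-- Because totals has one row, totals.sum_n can be used as a scalar,
-- giving each user's share of the grand total without an explicit join.
shares = FOREACH events GENERATE user, (double)n / (double)totals.sum_n;
```

Before this feature, the same computation required a CROSS or a replicated join against the one-row relation, which obscured a very simple idea.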
The Hadoop market continues to mature and grow, but “you ain’t seen nothing, yet!” Every shred of data on earth is going on HDFS, and we’ve only just begun the big data journey. I can’t wait till next year’s Hadoop Summit to find out more!
~ Russell Jurney