There will be a Pig meetup at Strata NYC/Hadoop World, at 6:30PM on Wed, Oct 24th in the Bryant Room of the Hilton New York. This will also be the inaugural meeting of the NYC Pig User Group, which Doug Daniels of Pig contributor Mortar Data was good enough to organize. We look forward to future Pig meetups in NYC!
The Hortonworks Blog
YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo.…
In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.
In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:
my_tf_idf_scores = tf_idf(id_body, ‘message_id’, ‘body’);
Our macro, in filename tfidf.macro looks just like our pig script, with a couple of new lines. Note the use of macro variables for input and output preceded with the ‘$’ character: $in_relation, $out_relation, $id_field and $text_field.…
Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great
turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.
On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups.…
The need for a ToJson EvalFunc
When integrating Pig with different NoSQL ‘databases,’ or when publishing data from Hadoop, it can be convenient to JSONize your data. Although Pig has JsonStorage, there hasn’t been a ToJson EvalFunc. This has been inconvenient, as in our post about Pig and ElasticSearch, such that for creating JSON for ElasticSearch to index, tricks like this were necessary:…
store enron_emails into ‘/tmp/enron_emails_elastic’ using JsonStorage();
json_emails = load ‘/tmp/enron_emails_elastic’ AS (json_record:chararray);
/* Now we can store our email json data to elasticsearch for indexing with message_id.
As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. Thats why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search lucene!
Search-Hadoop.com searches across projects – JIRAs, source code, mailing lists, wikis, etc. so you can see design and API docs, as well as questions, answers and general documentation.…
Apache ZooKeeper release 3.4.4 is now available. This is a bug fix release including 50 bug fixes. Following is a summary of the critical issues fixed in the release.
ZOOKEEPER-1419 Leader Election never settles for a 5 node cluster
ZOOKEEPER-1489 Data loss after truncate on transaction log
ZOOKEEPER-1412 java client watches inconsistently triggered on reconnect
ZOOKEEPER-1344 ZooKeeper client multi-update command is not considering the
ZOOKEEPER-1496 Ephemeral node not getting cleared even after client has exited
ZOOKEEPER-1437 Client uses session before SASL authentication complete
Stability of 3.4.4
As you might have noticed we have been marking all the previous 3.4.* releases as Alpha and beta.…
I hope you had fun pigging out to Hadoop with Alan Gates. We had interesting questions during the webinar and as always, your participation in these discussions will help us understand different use cases of Apache Pig and the growing community around this project. The recording is now available on our webinar site.
For the next installation of “Future of Apache Hadoop” webinar series, I would like to introduce to you Matt Foley and Ambari. …
Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
A central theme was using HCatalog to enable sharing and use of legacy data and diverse formats like TSV, JSON, RCFile, Protobuf, Thrift and Avro, among diverse tools like Pig, Hive, Cascading, SQL-H and JAQL.…
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.…
Hadoop featured prominently at Stanford’s annual XLDB conference last week, as representatives from academia and industry gathered to discuss Extremely Large Databases. The conference program, with slides are available: http://www-conf.slac.stanford.edu/xldb2012/ProgramC.asp. A highly technical lineup presented on Big Data in biology and physics, and cloud computing and Hive in particular were topic areas.
Hortonworks Summer Internship 2012
As a first time intern, I can undoubtedly say that Hortonworks was the perfect place for me to gain real world work experience and have the chance to team up with many incredibly talented, driven people. Of course, I didn’t get to fully interact with everyone in the company in the three months that I was here but even after such a short time it is clear to me that it is the welcoming atmosphere and the determined team here that have allowed Hortonworks to achieve so many goals in just over a year.…
Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager
Apache Hadoop YARN – NodeManager
The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to date with the ResourceManager (RM), overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers, tracking node-health, log’s management and auxiliary services which may be exploited by different YARN applications.…
Twitter Analytics presented their distributed infrastructure, including Hadoop and Pig, at a UC Berkeley iSchool special course called INFO 290: Analyzing Big Data with Twitter. Twitter is a major contributor to many Apache projects. The course was over-subscribed and was a great success, as students got to learn from practicing data scientists using Hadoop on truly massive datasets. The entire lecture series is available here.
Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.
What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.…