YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo.…
The Hortonworks Blog
In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.
In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');
Our macro, in the file tfidf.macro, looks just like our Pig script, with a couple of new lines. Note the use of macro variables for input and output, each preceded by the ‘$’ character: $in_relation, $out_relation, $id_field and $text_field.…
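For readers who want to see what the TF-IDF scoring itself computes, here is a minimal sketch in plain Python rather than Pig. It is not the macro from the post: the function name tf_idf, the whitespace tokenizer, and the unsmoothed log(N / df) weighting are all assumptions made for illustration, and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score terms per document. `docs` maps a document id to its text.

    Hypothetical helper for illustration only: it uses naive whitespace
    tokenization and the plain tf * log(N / df) weighting.
    """
    # Lowercase and split each document into tokens.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the term's share of the document's tokens.
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


if __name__ == "__main__":
    sample = {"a": "pig hadoop pig", "b": "hadoop hive"}
    for doc_id, term_scores in tf_idf(sample).items():
        print(doc_id, term_scores)
```

A term that appears in every document scores zero under this weighting, which is why TF-IDF is handy for summarizing: only the terms that set a document apart survive.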
Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.
On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups.…
The need for a ToJson EvalFunc
When integrating Pig with different NoSQL ‘databases,’ or when publishing data from Hadoop, it can be convenient to JSONize your data. Although Pig has JsonStorage, there hasn’t been a ToJson EvalFunc. This has been inconvenient: as in our post about Pig and ElasticSearch, creating JSON for ElasticSearch to index required tricks like this:…
store enron_emails into '/tmp/enron_emails_elastic' using JsonStorage();
json_emails = load '/tmp/enron_emails_elastic' AS (json_record:chararray);
/* Now we can store our email json data to elasticsearch for indexing with message_id.
Apache Hadoop enables a revolution in how organizations process data. The freedom and scale Hadoop provides enable new kinds of applications that build new kinds of value and deliver results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us?…
As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search lucene!
Search-Hadoop.com searches across projects – JIRAs, source code, mailing lists, wikis, etc. so you can see design and API docs, as well as questions, answers and general documentation.…
Apache ZooKeeper release 3.4.4 is now available. This is a bug fix release including 50 bug fixes. Following is a summary of the critical issues fixed in the release.
ZOOKEEPER-1419 Leader Election never settles for a 5 node cluster
ZOOKEEPER-1489 Data loss after truncate on transaction log
ZOOKEEPER-1412 java client watches inconsistently triggered on reconnect
ZOOKEEPER-1344 ZooKeeper client multi-update command is not considering the
ZOOKEEPER-1496 Ephemeral node not getting cleared even after client has exited
ZOOKEEPER-1437 Client uses session before SASL authentication complete
Stability of 3.4.4
As you might have noticed, we have been marking all the previous 3.4.* releases as alpha and beta.…
I hope you had fun pigging out to Hadoop with Alan Gates. We had interesting questions during the webinar and as always, your participation in these discussions will help us understand different use cases of Apache Pig and the growing community around this project. The recording is now available on our webinar site.
For the next installment of the “Future of Apache Hadoop” webinar series, I would like to introduce you to Matt Foley and Ambari. …
Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
A central theme was using HCatalog to enable sharing and use of legacy data and diverse formats like TSV, JSON, RCFile, Protobuf, Thrift and Avro, among diverse tools like Pig, Hive, Cascading, SQL-H and JAQL.…
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.…
Hadoop featured prominently at Stanford’s annual XLDB conference last week, as representatives from academia and industry gathered to discuss Extremely Large Databases. The conference program and slides are available: http://www-conf.slac.stanford.edu/xldb2012/ProgramC.asp. A highly technical lineup presented on Big Data in biology and physics, and cloud computing and Hive in particular were topic areas.
Partner Webinar Series
On September 18 at 10am PT/1pm ET we join our partner Datameer in a webcast aimed at providing answers to some common questions we hear in the industry. Specifically, what are some of the use cases that big data analytics is perfect for?
By looking at some common uses we are seeing, you’ll be able to envision how you can leverage the analytics results from your own data.…
Hortonworks Summer Internship 2012
As a first-time intern, I can undoubtedly say that Hortonworks was the perfect place for me to gain real-world work experience and have the chance to team up with many incredibly talented, driven people. Of course, I didn’t get to fully interact with everyone in the company in the three months that I was here, but even after such a short time it is clear to me that the welcoming atmosphere and the determined team here are what have allowed Hortonworks to achieve so many goals in just over a year.…
Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop
It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…
- Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
- Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
- And, our Hortonworks team grew by leaps and bounds!
Partner Webinar Series
Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.
Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen.…