Hortonworks on Apache Hadoop


Big Data in London – Thoughts From the Tube

Hortonworks sponsored the O’Reilly Strata conference earlier this month at the Hilton Metropole in London. It was great meeting big data enthusiasts at the conference. We had fun giving away our little green mascot and came away pleasantly surprised at the state of interest in Big Data in the UK and Europe. There were over 500 attendees, which for a first-time conference is a very good result. Conversations ranged from the introductory “What is Apache Hadoop?” to deep discussions of how Hadoop is being used in production today. After talking to other vendors, attendees and organizers, it appears that the market is somewhere between 12 and 18 months less mature than the Big Data market in the US. That said, we think adoption could occur more quickly in the US as the state of the technology and ecosystem evolves heading into 2013.…


Apache Hadoop 2.0.2-alpha Released!

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second alpha release of the next-generation Apache Hadoop 2.x line, and it comes with significant enhancements to both major components of Hadoop:

  • HDFS HA (NameNode High Availability) has undergone significant enhancements since the previous release.
  • YARN has undergone significant testing, stabilization and validation, and has been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us.…


Hortonworks Data Platform 2.0 Alpha is Now Available for Preview!

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.…


Teradata Webinar: Business Value with Big Analytics

Back in June we joined Teradata Aster in a webcast “Back to the Future – MapReduce, Hadoop and the Data Scientist” to highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. You can check out the replay here. The discussion focused around how big data architectures could bring more value to businesses using relational DBMS technology and Hadoop, and how the two can coexist.

On October 17th at 10am PDT, Teradata will host a webcast that raises the level and builds on the important theme of Hadoop and business value, recognizing that many are deeply involved with discovering the easiest and best way to bring their data to life. Teradata Aster plans to show how executives, analysts and IT managers can leverage breakthrough enterprise class big analytics solutions to inject innovative analytics into business processes for better data-driven decisions.…


Big Data Security Part One: Introducing PacketPig

Series Introduction

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)’ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on GitHub), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

Introducing Packetpig

Intrusion detection is the analysis of network traffic to detect intruders on your network. Most intrusion detection systems (IDS) look for signatures of known attacks and identify them in real-time. Packetpig is different.…


Meet the Committer: Mahadev Konar

We had another amazing turnout for our Ambari webinar with Matt Foley a couple of weeks back. This series is meant to educate Hadoop enthusiasts and help them gain a better understanding of the value of Hadoop, and I think we’re on the right track. If you missed our last two webinars (Pig and Ambari) or would like a refresher, you can find the recordings here: https://hortonworks.com/webinars/

We’re starting the third installment of the “Future of Apache Hadoop” series next Wednesday with “Scaling Apache Zookeeper to the Next Generation Applications,” presented by Mahadev Konar (@mahadevkonar), Hortonworks co-founder and a core contributor and PMC member of Apache ZooKeeper.

Get to know Mahadev in this third installment of our “Meet the Committer” series.

Kim: Tell us about your current role and how you interact with Apache Hadoop?

Mahadev: Currently I am leading the effort on Apache Ambari.…


Insights from DataWeek: San Francisco

I spent some time at the first ever DataWeek in San Francisco last week.  It is a brand new show and it was very well-run, spread across a few cool spaces with an interesting mix of novice to experienced data professionals.  They had a good blend of labs, speakers, panels and great networking opportunities.  In all, it was great and a big thanks and kudos to the organizers.

I took part in a panel and also presented a three-hour overview of Hadoop.  There were some good questions thrown at the panel, but more interesting was the discussion over the three sessions.  Before each presentation, I ran an informal survey of the room to get a sense of the audience, and there was an even mix of complete novices, those new to Hadoop and experienced practitioners.

Each session had lively discussion and great engagement.  …


Miss Piggy Takes Manhattan: Pig Meetup at Strata NYC on Wed, Oct 24th

There will be a Pig meetup at Strata NYC/Hadoop World, at 6:30PM on Wed, Oct 24th in the Bryant Room of the Hilton New York. This will also be the inaugural meeting of the NYC Pig User Group, which Doug Daniels of Pig contributor Mortar Data was good enough to organize. We look forward to future Pig meetups in NYC!

Hortonworks’ own Daniel Dai @daijy, VP of Apache Pig, will present on new features in Pig 0.11. You can view a summary of JIRA tickets for Pig 0.11 here. New features include the CUBE operator, a new RANK operator, the addition of a DateTime type, speed improvements via SchemaTuple, and many others.

More information is available on the Pig meetup page: http://www.meetup.com/PigUser/events/85047782/.

Those of you too young to understand the Miss Piggy reference should look here.…


YARN Meetup at Hortonworks on Friday, Oct 12

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve covered YARN before in a four-part series: parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open source and otherwise, such as Storm and S4, are being ported to run on YARN, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN: its current status and its medium-term and long-term roadmap.

Agenda includes:

  • YARN committers from Yahoo will present on current YARN deployments at Yahoo, including lessons learned, stability, etc.


Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig

In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.

In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

Our macro, in the file tfidf.macro, looks just like our Pig script, with a couple of new lines. Note the use of macro variables for input and output, preceded by the ‘$’ character: $in_relation, $out_relation, $id_field and $text_field. These let us apply the macro to any relation with a unique identifier field and a text body field. You can get it on GitHub here. The file which tests the macro is here. The code that the macro generates is here.

DEFINE tf_idf(in_relation, id_field, text_field) RETURNS out_relation {
  token_records = foreach $in_relation generate $id_field, FLATTEN(TOKENIZE($text_field)) as tokens;
 
  /* Calculate the term count per document */
  doc_word_totals = foreach (group token_records by ($id_field, tokens)) generate 
    FLATTEN(group) as ($id_field, token), 
    COUNT_STAR(token_records) as doc_total;
 
  /* Calculate the document size */
  pre_term_counts = foreach (group doc_word_totals by $id_field) generate
    group AS $id_field,
    FLATTEN(doc_word_totals.(token, doc_total)) as (token, doc_total), 
    SUM(doc_word_totals.doc_total) as doc_size;
 
  /* Calculate the TF */
  term_freqs = foreach pre_term_counts generate $id_field as $id_field,
    token as token,
    ((double)doc_total / (double)doc_size) AS term_freq;
 
  /* Get count of documents using each token, for idf */
  token_usages = foreach (group term_freqs by token) generate
    FLATTEN(term_freqs) as ($id_field, token, term_freq),
    COUNT_STAR(term_freqs) as num_docs_with_token;
 
  /* Get document count */
  just_ids = foreach $in_relation generate $id_field;
  ndocs = foreach (group just_ids all) generate COUNT_STAR(just_ids) as total_docs;
 
  /* Note the use of Pig Scalars to calculate idf */
  $out_relation = foreach token_usages {
    idf    = LOG((double)ndocs.total_docs/(double)num_docs_with_token);
    tf_idf = (double)term_freq * idf;
    generate $id_field as $id_field,
      token as score,
      (chararray)tf_idf as value:chararray;
  };
};
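
For readers who do not have Pig at hand, the arithmetic the macro performs can be sketched in plain Python. This is only an illustrative analog, not the macro itself: the function name tf_idf, the (doc_id, text) input shape and the whitespace tokenizer are assumptions made for the example (Pig's TOKENIZE also splits on some punctuation), and document ids are assumed to be unique.

```python
import math
from collections import Counter


def tf_idf(records):
    """Score tokens for an iterable of (doc_id, text) pairs.

    Returns a list of (doc_id, token, score) tuples. Hypothetical helper
    that mirrors the macro's steps; doc ids are assumed to be unique.
    """
    # Assumption: split on whitespace only, which is simpler than Pig's TOKENIZE.
    doc_tokens = [(doc_id, text.split()) for doc_id, text in records]
    ndocs = len(doc_tokens)

    # Term frequency: occurrences of a token in a document / document size.
    term_freqs = {}
    for doc_id, tokens in doc_tokens:
        doc_size = len(tokens)
        for token, doc_total in Counter(tokens).items():
            term_freqs[(doc_id, token)] = doc_total / doc_size

    # Number of documents that use each token, needed for the idf.
    num_docs_with_token = Counter(token for (_, token) in term_freqs)

    # tf-idf = tf * log(total documents / documents containing the token).
    return [
        (doc_id, token, tf * math.log(ndocs / num_docs_with_token[token]))
        for (doc_id, token), tf in term_freqs.items()
    ]


# Tiny made-up corpus: 'pig' appears in every document, so it scores 0.
emails = [("m1", "hadoop pig hadoop"), ("m2", "pig hive")]
for doc_id, token, score in tf_idf(emails):
    print(doc_id, token, round(score, 4))
```

As in the macro, a token that appears in every document gets an idf of log(1) = 0, so only tokens that distinguish a document end up with a non-zero score.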

Note that to debug macros, we can use the -r flag, which will expand the code the macro generates into a .expanded file.…


Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group)

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup.…


JSONize Anything in Pig with ToJson

The need for a ToJson EvalFunc

When integrating Pig with different NoSQL ‘databases,’ or when publishing data from Hadoop, it can be convenient to JSONize your data. Although Pig has JsonStorage, there hasn’t been a ToJson EvalFunc. This has been inconvenient: in our post about Pig and ElasticSearch, creating JSON for ElasticSearch to index required tricks like this:

store enron_emails into '/tmp/enron_emails_elastic' using JsonStorage();
json_emails = load '/tmp/enron_emails_elastic' AS (json_record:chararray);
 
/* Now we can store our email json data to elasticsearch for indexing with message_id. */
store json_emails into 'es://enron/email?json=true&size=1000' USING
  com.infochimps.elasticsearch.pig.ElasticSearchStorage('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Note how we store as JSON via JsonStorage, then load as a chararray to get the entire record as JSON. It would be more convenient to convert Pig bags and tuples to JSON directly. This would let us retain an ID field as key, and only JSONize our record for that key as a string.…
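
To make the intended result concrete, here is a minimal Python sketch of the transformation described above: keep one field as the key and serialize the record to a JSON string for that key. It is not the ToJson EvalFunc and does not run inside Pig; the helper name to_json_pair and the sample field names are hypothetical, chosen only to illustrate the shape of the output.

```python
import json


def to_json_pair(record, key_field):
    """Return (key, json_string) for a dict-like record.

    Hypothetical helper: nested lists of dicts stand in for Pig bags
    of tuples; the key is kept as a plain value next to the JSON string.
    """
    key = record[key_field]
    # sort_keys makes the output deterministic, which helps when comparing results.
    return key, json.dumps(record, sort_keys=True)


# Made-up email record with a nested 'bag' of recipients.
email = {
    "message_id": "m1",
    "subject": "hello",
    "tos": [{"address": "a@example.com"}, {"address": "b@example.com"}],
}
key, payload = to_json_pair(email, "message_id")
print(key, payload)
```

The point of the pair is that the key stays available for routing or indexing while the whole record travels as a single string, which avoids the store-then-reload round trip shown above.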


InfoQ: Hadoop and Metadata (Removing the Impedance Mis-match)

InfoQ has an article out today on HCatalog by Hortonworks’ own Alan Gates and Russell Jurney.

Apache Hadoop enables a revolution in how organizations process data, with the freedom and scale Hadoop provides enabling new kinds of applications building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make up the data center as computer?

Check out the article at InfoQ: http://www.infoq.com/articles/HadoopMetadata


Search Hadoop with Search-Hadoop.com

As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search Lucene!

Search-Hadoop.com searches across projects – JIRAs, source code, mailing lists, wikis, etc. so you can see design and API docs, as well as questions, answers and general documentation. Filtering by project is a big help – but search-hadoop also lets you see the similarities between projects.

Search Hadoop runs on Solr 3.6.1, but will be moving to Solr 4.0 this Fall. Solr 4.0, aka SolrCloud, is a fully distributed version of Solr (indices are sharded and replicated) that uses ZooKeeper for coordination.

The autocomplete feature is particularly cool. It offers several groups of suggestions separated by a lovely thin pink line, so one can easily pick the suggestion to follow.…


ZooKeeper 3.4.4 is Now Available

Apache ZooKeeper release 3.4.4 is now available. This is a bug fix release including 50 bug fixes. Following is a summary of the critical issues fixed in the release.

ZOOKEEPER-1419 Leader Election never settles for a 5 node cluster

ZOOKEEPER-1489 Data loss after truncate on transaction log

ZOOKEEPER-1412 java client watches inconsistently triggered on reconnect

ZOOKEEPER-1344 ZooKeeper client multi-update command is not considering the Chroot request

ZOOKEEPER-1496 Ephemeral node not getting cleared even after client has exited

ZOOKEEPER-1437 Client uses session before SASL authentication complete

Stability of 3.4.4

As you might have noticed, we have been marking all the previous 3.4.* releases as alpha or beta. After having several releases out (3.4.0, 3.4.1, 3.4.2, 3.4.3), we in the ZooKeeper community have decided to designate 3.4.4 as a stable release.

Acknowledgements

Thanks to everyone who contributed towards the release including our users who reported the bugs in 3.4.4.…

