Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig

In a recent post, we used Pig to summarize documents via the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.

In this post, we’re going to turn that code into a Pig macro that can be imported and called in two lines of code:

  import 'tfidf.macro';
  my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

Our macro, in the file tfidf.macro, looks just like our Pig script, with a couple of new lines. Note the macro parameters for input and output, each referenced with the ‘$’ character: $in_relation, $out_relation, $id_field and $text_field. These let us apply the macro to any relation with a unique identifier field and a text body field. You can get it on GitHub here. The file which tests the macro is here. The code that the macro generates is here.

  DEFINE tf_idf(in_relation, id_field, text_field) RETURNS out_relation {
    token_records = foreach $in_relation generate $id_field, FLATTEN(TOKENIZE($text_field)) as tokens;

    /* Calculate the term count per document */
    doc_word_totals = foreach (group token_records by ($id_field, tokens)) generate
      FLATTEN(group) as ($id_field, token),
      COUNT_STAR(token_records) as doc_total;

    /* Calculate the document size */
    pre_term_counts = foreach (group doc_word_totals by $id_field) generate
      group AS $id_field,
      FLATTEN(doc_word_totals.(token, doc_total)) as (token, doc_total),
      SUM(doc_word_totals.doc_total) as doc_size;

    /* Calculate the TF */
    term_freqs = foreach pre_term_counts generate $id_field as $id_field,
      token as token,
      ((double)doc_total / (double)doc_size) AS term_freq;

    /* Get count of documents using each token, for idf */
    token_usages = foreach (group term_freqs by token) generate
      FLATTEN(term_freqs) as ($id_field, token, term_freq),
      COUNT_STAR(term_freqs) as num_docs_with_token;

    /* Get document count */
    just_ids = foreach $in_relation generate $id_field;
    ndocs = foreach (group just_ids all) generate COUNT_STAR(just_ids) as total_docs;

    /* Note the use of Pig Scalars to calculate idf */
    $out_relation = foreach token_usages {
      idf    = LOG((double)ndocs.total_docs/(double)num_docs_with_token);
      tf_idf = (double)term_freq * idf;
      generate $id_field as $id_field,
        token as score,
        (chararray)tf_idf as value:chararray;
    };
  };
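
For readers who want the arithmetic behind the macro in one place, the three quantities it computes can be written as below. The symbols are shorthand introduced here, not names from the script: c_{t,d} stands for doc_total, the sum in the denominator for doc_size, N for total_docs, and n_t for num_docs_with_token. Pig's LOG is the natural logarithm, hence \ln.

```latex
\mathrm{tf}(t, d) = \frac{c_{t,d}}{\sum_{t'} c_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

As a quick check with made-up numbers: a token that occurs 2 times in a 4-token document and appears in 1 of 2 documents has tf = 0.5 and idf = ln 2 ≈ 0.693, for a score of about 0.347. A token that appears in every document gets idf = ln 1 = 0, so its score is 0 no matter how often it occurs.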

To debug macros, we can run Pig with the -r (dry run) flag, which expands the code the macro generates into a .expanded file without executing the script. It is worth pointing out that this takes us from 37 lines of Pig to 2 lines of Pig. Macros facilitate code modularization, re-use and sharing.
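
To make the re-use point concrete, here is a minimal sketch that applies the same macro to two relations with different field names. The file paths, relation names and field names (emails.tsv, posts.tsv, post_id, content and so on) are hypothetical, and the sketch assumes tab-delimited input and that tfidf.macro can be found on Pig's import path.

```pig
-- Hypothetical inputs: tab-delimited files with an id column and a text column
IMPORT 'tfidf.macro';

emails = LOAD 'emails.tsv' AS (message_id:chararray, body:chararray);
posts  = LOAD 'posts.tsv'  AS (post_id:chararray, content:chararray);

-- Same macro, different relations and field names
email_scores = tf_idf(emails, 'message_id', 'body');
post_scores  = tf_idf(posts, 'post_id', 'content');

-- Hypothetical output locations
STORE email_scores INTO 'email_tfidf';
STORE post_scores  INTO 'post_tfidf';
```

Each call costs one line once the macro is imported, and the relations defined inside the macro are renamed during expansion, so the two calls should not collide with each other.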

Are you sharing enough Hadoop code in your enterprise?
