New Apache Pig 0.9 Features – Part 1 (Macros)

This is the first of three blog posts highlighting the new features in Pig 0.9.

When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, age, gpa);
B = group A all;
C = foreach B generate COUNT(A); **
dump C;

Compare that to an SQL command:
SELECT COUNT(*) FROM student;

That’s just not intuitive, especially for new users.

Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this:

DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
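With that macro defined earlier in the same script, the four-line count from the beginning of this post shrinks to a single call. This is a small usage sketch; the alias names A and C are only illustrative:

```pig
-- Assumes the row_count macro above is already defined in this script.
A = LOAD 'student.txt' AS (name, age, gpa);
C = row_count(A);   -- expands to the GROUP ... ALL / COUNT pair
DUMP C;
```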

You can also pass a column name and a constant as macro parameters:

DEFINE row_count_by(A, col, par) RETURNS C {
    B = group $A by $col parallel $par;
    $C = foreach B generate group, COUNT($A);
};

X = LOAD 'student.txt' AS (name, age, gpa);
Y = row_count_by(X, name, 3);
dump Y;

One of the best things about macros is that they let you organize your code in a meaningful way. Have you tried the Pig tutorial? Here is how you can rewrite the sample script:

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
houred = get_hour_query(raw);
ngram_by_hour = generate_ngram_count_by_hour(houred);
scored = generate_score(ngram_by_hour);

How does that compare to the original script?

Macros can be kept in separate files. Use "import" to pull those files into your script:
import 'common.macro';
import 'ngram.macro';
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
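As a minimal sketch of that layout, suppose the row_count macro from earlier is saved in a file named common.macro. The file name is reused from above, but its real contents are not shown in this post, so this is an assumption. The script name count.pig and the alias names are also hypothetical, and Pig needs to be able to find the macro file, for example in the directory you launch it from:

```pig
-- count.pig (hypothetical script name)
-- Assumes common.macro contains the row_count definition shown earlier.
IMPORT 'common.macro';

students = LOAD 'student.txt' AS (name, age, gpa);
total = row_count(students);
DUMP total;
```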

This is how you modularize your Pig script.

If there is something wrong in your macro, there is an easy way to debug it. Simply use the -r option and you can see the expanded code:

pig -x local -r tutorial.pig
[main] INFO  org.apache.pig.Main - Dry run completed. Expanded pig script is at tutorial.expanded.pig.

More information about Pig macros can be found in the Pig documentation.

Special thanks to Richard Ding for all of his great work in helping to make this happen.

** More precisely, it should be COUNT_STAR. COUNT skips rows whose first field is null, while COUNT_STAR counts every row. Thanks to Raghu for pointing this out!
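Following that footnote, a COUNT_STAR variant of the macro might look like the sketch below. The name row_count_star is made up for illustration; only the aggregate function differs from row_count:

```pig
-- Variant of row_count that uses COUNT_STAR, so rows with a null
-- first field are included in the total.
DEFINE row_count_star(X) RETURNS Z {
    Y = GROUP $X ALL;
    $Z = FOREACH Y GENERATE COUNT_STAR($X);
};

A = LOAD 'student.txt' AS (name, age, gpa);
C = row_count_star(A);
DUMP C;
```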

— Daniel Dai
