July 29, 2011

New Apache Pig 0.9 Features – Part 1 (Macros)

This is the first of three blogs that will highlight the new features in Pig 0.9.

When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A); -- ** see note below
dump C;

Compare that to an SQL command:
Select COUNT(*) from student;

That’s just not intuitive, especially for new users.

Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this:

DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
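As a usage sketch, the macro is called like any other statement that produces a relation. The version below swaps in COUNT_STAR, following the note at the end of this post, and assumes the student.txt schema used in the next example; the alias names are only illustrative.

```pig
-- Same macro as above, but using COUNT_STAR so that rows with null
-- fields are counted as well (see the note at the end of the post).
DEFINE row_count(X) RETURNS Z {
    Y = GROUP $X ALL;
    $Z = FOREACH Y GENERATE COUNT_STAR($X);
};

-- Assumed input: student.txt with three tab-separated columns.
A = LOAD 'student.txt' AS (name, age, gpa);
C = row_count(A);   -- replaces the separate group and foreach statements
DUMP C;
```

Once the macro is defined, the count is down to a load, a macro call, and a dump.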

You can also pass a column name and a constant as macro parameters:

DEFINE row_count_by(A, col, par) RETURNS C {
    B = group $A by $col parallel $par;
    $C = foreach B generate group, COUNT($A);
};

X = LOAD 'student.txt' AS (name, age, gpa);
Y = row_count_by(X, name, 3);
dump Y;

One of the best things about macros is that they let you organize your code in a meaningful way. Have you tried the Pig tutorial? Here is how you can rewrite the sample script:

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
houred = get_hour_query(raw);
ngram_by_hour = generate_ngram_count_by_hour(houred);
scored = generate_score(ngram_by_hour);

How does that compare to the original script?

Macros can be housed in separate files. Use IMPORT to pull those files into your script:
IMPORT 'common.macro';
IMPORT 'ngram.macro';
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

This is how you modularize your Pig script.
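The contents of the macro files are not shown in this post. As a rough sketch only, a file such as common.macro might define get_hour_query along the following lines. The body is an assumption adapted from the first steps of the Pig tutorial script (filter out URL queries, lowercase the query, extract the hour) and relies on the UDFs shipped in tutorial.jar; it is not necessarily the macro used here.

```pig
-- common.macro (hypothetical contents; the real file is not shown in the post)
-- Assumes tutorial.jar has been REGISTERed by the importing script and that
-- the input relation has the fields (user, time, query).
DEFINE get_hour_query(raw) RETURNS houred {
    -- Drop queries that are just URLs.
    clean1 = FILTER $raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- Normalize the query text to lower case.
    clean2 = FOREACH clean1 GENERATE user, time,
             org.apache.pig.tutorial.ToLower(query) AS query;
    -- Keep only the hour portion of the timestamp.
    $houred = FOREACH clean2 GENERATE user,
              org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
};
```

Aliases defined inside the macro body, such as clean1 and clean2, are renamed during expansion, so they should not clash with aliases of the same name in the importing script.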

If there is something wrong in your macro, there is an easy way to debug it. Simply use the -r option and you can see the expanded code:

pig -x local -r tutorial.pig
[main] INFO  org.apache.pig.Main - Dry run completed. Expanded pig script is at tutorial.expanded.pig.

More information about Pig macros can be found in the Pig documentation.

Special thanks to Richard Ding for all of his great work in helping to make this happen.

** More precisely, it should be COUNT_STAR. COUNT skips rows whose first field is null. Thanks to Raghu for pointing this out!

— Daniel Dai




It would be nice if macros could also be re-defined, or at least UNDEFINEd and DEFINEd again.
