New Apache Pig 0.9 Features – Part 1 (Macros)
This is the first of three blog posts highlighting the new features in Pig 0.9.
When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A); **
dump C;
Compare that to an SQL command:
Select COUNT(*) from student;
Four statements for a single count is just not intuitive, especially for new users.
Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this:
DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
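With the macro defined, the count shrinks to a single call. The following is a minimal sketch of how row_count might be invoked; it assumes the same student.txt file and borrows the (name, age, gpa) schema from the next example, and the alias names A and C are only illustrative.

```pig
-- Assumes the row_count macro is defined earlier in the same script.
A = load 'student.txt' as (name, age, gpa);

-- The macro call expands into the group-all and foreach statements shown above.
C = row_count(A);
dump C;
```

The load and dump statements are still there, but the group and foreach boilerplate now lives in one reusable place.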
You can also pass a column name and a constant as parameters to a macro:
DEFINE row_count_by(A, col, par) RETURNS C {
    B = group $A by $col parallel $par;
    $C = foreach B generate group, COUNT($A);
};
X = LOAD 'student.txt' AS (name, age, gpa);
Y = row_count_by(X, name, 3);
dump Y;
One of the best things about macros is that they let you organize your code in a meaningful way. Have you tried the Pig tutorial? Here is how you can rewrite the sample script:
REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
houred = get_hour_query(raw);
ngram_by_hour = generate_ngram_count_by_hour(houred);
scored = generate_score(ngram_by_hour);
post_process_and_save(scored);
How does that compare to the original script?
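To give a sense of what sits behind one of those calls, here is a rough sketch of how get_hour_query might be written. The body is an assumption modeled on the opening steps of the tutorial's sample script, and the UDF names (NonURLDetector, ToLower, ExtractHour) are the ones expected to ship in tutorial.jar, so check them against your copy before relying on this.

```pig
-- Hypothetical macro body; assumes tutorial.jar is registered and that the
-- input relation has the fields (user, time, query).
DEFINE get_hour_query(raw) RETURNS houred {
    -- Drop records whose query is empty or is a URL.
    clean1 = FILTER $raw BY org.apache.pig.tutorial.NonURLDetector(query);

    -- Normalize the query text to lower case.
    clean2 = FOREACH clean1 GENERATE user, time,
             org.apache.pig.tutorial.ToLower(query) AS query;

    -- Keep only the hour portion of the timestamp.
    $houred = FOREACH clean2 GENERATE user,
              org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
};
```

The other three macros would wrap the remaining stages of the script in the same way, each taking the previous relation as its parameter.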
Macros can be kept in separate files. Use "import" to pull those files into your script:
import 'common.macro';
import 'ngram.macro';
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
This is how you modularize your Pig script.
If there is something wrong in your macro, there is an easy way to debug it. Run the script with the -r (dry run) option and Pig writes out the expanded code instead of executing it:
pig -x local -r tutorial.pig
[main] INFO org.apache.pig.Main - Dry run completed. Expanded pig script is at tutorial.expanded.pig.
More information about Pig macros can be found in the Pig documentation.
Special thanks to Richard Ding for all of his great work in helping to make this happen.
Notes:
** More precisely, it should be COUNT_STAR. COUNT skips rows whose first field is null, so it does not count empty rows. Thanks to Raghu for pointing this out!
– Daniel Dai