New Apache Pig 0.9 Features – Part 1 (Macros)

This is the first of three blog posts that will highlight the new features in Pig 0.9.

When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A); **
dump C;

Compare that to an SQL command:
Select COUNT(*) from student;

That’s just not intuitive, especially for new users.

Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this:

DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
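
Here is a minimal usage sketch. It assumes the row_count macro above is defined earlier in the same script and reuses the student.txt file from the first example; the alias names A and C are only illustrative.

```pig
-- Assumes the row_count macro shown above is defined earlier in this script.
A = load 'student.txt' as (name, student, gpa);
-- One call replaces the explicit "group ... all" and "foreach ... COUNT" steps.
C = row_count(A);
dump C;
```

Once the macro is defined, the count itself is a single call.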

You can also pass a column name and a constant as macro parameters:

DEFINE row_count_by(A, col, par) RETURNS C {
    B = group $A by $col parallel $par;
    $C = foreach B generate group, COUNT($A);
};

X = LOAD 'student.txt' AS (name, age, gpa);
Y = row_count_by(X, name, 3);
dump Y;

One of the best things about macros is that they let you organize your code in a meaningful way. Have you tried the Pig tutorial? Here is how you can rewrite the sample script:

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
houred = get_hour_query(raw);
ngram_by_hour = generate_ngram_count_by_hour(houred);
scored = generate_score(ngram_by_hour);
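
The macro bodies are not shown here. As a rough sketch, get_hour_query might wrap the cleaning steps of the tutorial's original script. The body below is an assumption, not the actual macro, and the UDF names (NonURLDetector, ToLower, ExtractHour) are recalled from the tutorial jar and should be checked against your copy.

```pig
-- Hypothetical body for get_hour_query; an illustration only.
-- It must be defined (or imported) before the script calls it.
DEFINE get_hour_query(raw) RETURNS houred {
    -- Drop rows whose query field is empty or a URL.
    clean1 = FILTER $raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- Normalize the query text to lower case.
    clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
    -- Keep only the hour portion of the timestamp.
    $houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
};
```

Aliases such as clean1 and clean2 are local to the macro, so they do not clash with aliases of the same name in the calling script.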

How does that compare to the original script?

Macros can be kept in separate files. Use "import" to pull these files into your script:
import 'common.macro';
import 'ngram.macro';
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

This is how you modularize your Pig script.

If there is something wrong in your macro, there is an easy way to debug it. Run Pig with the -r (dry run) option, which expands the macros without executing the script, and you can inspect the expanded code:

pig -x local -r tutorial.pig
[main] INFO  org.apache.pig.Main - Dry run completed. Expanded pig script is at tutorial.expanded.pig.

More information about Pig macros can be found in the Pig documentation.

Special thanks to Richard Ding for all of his great work in helping to make this happen.

** More precisely, it should be COUNT_STAR. COUNT skips rows whose first field is null, while COUNT_STAR counts every row. Thanks to Raghu for pointing this out!
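
As a sketch of that correction, a variant of the macro that counts every row could look like this. The name row_count_star is made up for illustration:

```pig
-- Hypothetical variant of row_count that uses COUNT_STAR instead of COUNT.
DEFINE row_count_star(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT_STAR($X);
};
```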

— Daniel Dai
