New Apache Pig 0.9 Features – Part 1 (Macros)

This is the first of three blog posts that will highlight the new features in Pig 0.9.

When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, age, gpa);
B = group A all;
C = foreach B generate COUNT(A); -- ** see Notes
dump C;

Compare that to an SQL command:
Select COUNT(*) from student;

That’s just not intuitive, especially for new users.

Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this:

DEFINE row_count(X) RETURNS Z {
    Y = group $X all;
    $Z = foreach Y generate COUNT($X);
};
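
To call the macro, pass it a relation and assign the result to an alias. A minimal sketch, assuming the same student.txt file as above (the field names are illustrative):

A = load 'student.txt' as (name, age, gpa);
C = row_count(A);
dump C;

The grouping boilerplate is now written once in the macro, and every later count in the script is a single line.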

You can also pass a column name and a constant as macro parameters:

DEFINE row_count_by(A, col, par) RETURNS C {
    B = group $A by $col parallel $par;
    $C = foreach B generate group, COUNT($A);
};

X = LOAD 'student.txt' AS (name, age, gpa);
Y = row_count_by(X, name, 3);
dump Y;

One of the best things about macros is that they let you organize your code in a meaningful way. Have you tried the Pig tutorial? Here is how you can rewrite the sample script:

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
houred = get_hour_query(raw);
ngram_by_hour = generate_ngram_count_by_hour(houred);
scored = generate_score(ngram_by_hour);
post_process_and_save(scored);

How does that compare to the original script?
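
To make the comparison concrete, here is one way the first macro could be written. This is only a sketch, not necessarily the macro used for the rewrite above: it assumes the NonURLDetector, ToLower and ExtractHour UDFs that ship in tutorial.jar, and the intermediate alias names clean1 and clean2 are arbitrary.

DEFINE get_hour_query(raw) RETURNS houred {
    -- drop rows whose query is a URL (UDF from tutorial.jar, assumed)
    clean1 = FILTER $raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- normalize the query text to lower case
    clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
    -- keep only the hour portion of the timestamp
    $houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
};

The three cleaning steps are hidden behind one descriptive name, so the main script reads as a list of stages rather than a list of statements.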

Macros can be kept in separate files. Use the IMPORT statement to pull those files into your script:
IMPORT 'common.macro';
IMPORT 'ngram.macro';
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

This is how you modularize your Pig script.

If there is something wrong in your macro, there is an easy way to debug it. Simply use the -r option and you can see the expanded code:

pig -x local -r tutorial.pig
[main] INFO  org.apache.pig.Main - Dry run completed. Expanded pig script is at tutorial.expanded.pig.
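
As an illustration of what to expect, the row_count_by example from earlier would expand to something like the following. This is a sketch based on the naming scheme described in the Pig documentation, where aliases defined inside a macro are prefixed with the macro name and suffixed with an instance number; the exact text and formatting Pig writes to the expanded file may differ.

X = LOAD 'student.txt' AS (name, age, gpa);
macro_row_count_by_B_0 = group X by name parallel 3;
Y = foreach macro_row_count_by_B_0 generate group, COUNT(X);
dump Y;

The parameters $A, $col and $par have been substituted, and the return alias $C has been replaced by the alias the caller assigned to, which makes it easy to spot a bad substitution.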

More information about Pig macros can be found in the Pig documentation.

Special thanks to Richard Ding for all of his great work in helping to make this happen.

Notes:
** More precisely, it should be COUNT_STAR. COUNT skips rows whose first field is null, so it does not count empty rows. Thanks to Raghu for pointing this out!

– Daniel Dai

