New Apache Pig 0.9 Features – Part 2 (Embedding)

* Special note: the code discussed in this blog is available here *

A common complaint about Pig is its lack of control-flow statements: if/else, while loops, for loops, and so on.

Now Pig has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features Python provides, including control flow.

The Pig embedding API is similar to a database embedding API: you compile a statement, bind parameters to it, execute it, and then iterate through a cursor over the results. The Pig embedding document provides an excellent guide to how the API works.
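To make the flow concrete, here is a minimal sketch of those four steps; the query, the alias names, and 'input.txt' are placeholders for illustration, not part of the k-means example below:

# Minimal sketch of the embedding flow: compile, bind, run, iterate.
from org.apache.pig.scripting import Pig

P = Pig.compile("""
A = load '$input' as (x:double);
B = group A all;
C = foreach B generate AVG(A.x);
store C into 'avg_out';
""")
Q = P.bind({'input': 'input.txt'})     # bind: fill in the $input parameter
stats = Q.runSingle()                  # execute: run the script and wait
it = stats.result("C").iterator()      # cursor: iterate over the stored result
while it.hasNext():
    print it.next()                    # each row comes back as a Pig Tuple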

In this blog, I will show an example of the classic k-means algorithm implemented with Pig embedding. The goal of k-means is to partition data into k clusters in which each observation belongs to the cluster with the nearest mean. It is an iterative process, and it converges when the cluster centroids stop moving. To simplify the example, we only calculate k-means over one-dimensional data: the gpa field in student.txt.
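Concretely, each iteration assigns every point to its nearest centroid and then recomputes each centroid as the mean of the points assigned to it: c_j \leftarrow \frac{1}{|S_j|} \sum_{x \in S_j} x, where S_j is the set of points currently nearest to centroid c_j. In our example, the points x are the gpa values.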

We compile the Pig script outside the loop, since we run the same query in every iteration:

P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
""")
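The find_centroid UDF itself ships as Java in udf.jar and receives the current centroids through its constructor argument. Purely to illustrate the logic it implements (this is a sketch, not the actual UDF), a plain-Python equivalent would be:

# Illustrative only: the real FindCentroid is a Java UDF in udf.jar.
def find_centroid(centroids, gpa):
    # return the centroid nearest to this gpa value
    return min(centroids, key=lambda c: abs(c - gpa))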

Then we enter the loop:

while iter_num < MAX_ITERATION:
    Q = P.bind({'centroids': initial_centroids})  # bind parameter: centroids from the last iteration
    results = Q.runSingle()                       # run this Pig script
    iter = results.result("result").iterator()    # open the iterator of the result
    for i in range(k):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))   # get the ith centroid of this iteration
    # ... calculate the moving distance
    # if the moving distance < tolerance then converge
    # otherwise, use the new centroids and loop again

Inside the loop, we bind the program to a Python variable: the centroids from the previous iteration. Then we run the Pig script with "runSingle", which kicks off the script and waits for its completion. That is exactly what we need for k-means: each iteration depends on the result of the previous one, so we cannot run multiple iterations at the same time. We then open an iterator and walk through it to collect the new centroids. This process repeats until convergence, that is, until the centroids stop moving.
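For completeness, the convergence test elided at the end of the loop body could look something like the sketch below; the tolerance value, the max-movement distance measure, and the ':' separator used to rebuild the '$centroids' parameter string are all illustrative assumptions, not fixed by the API:

    # Sketch of the elided convergence test, placed at the end of the
    # while body (assumptions: 'tolerance', max per-centroid movement,
    # and ':' as the parameter-string separator).
    distance_move = max(abs(centroids[i] - last_centroids[i]) for i in range(k))
    if distance_move < tolerance:
        break                                         # centroids stopped moving: converged
    last_centroids = list(centroids)                  # remember this iteration's centroids
    initial_centroids = ":".join(str(c) for c in centroids)  # parameter for the next bind
    iter_num += 1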

This covers the major steps of a typical Pig embedding script.

The complete Python script, udf and input data are available here.

To run this Python script in local mode, use the commands:

export PIG_CLASSPATH=build/ivy/lib/Pig/jython-2.5.0.jar
bin/pig -x local kmeans.py

By the way, Pig embedding is also available in JavaScript.

Thanks to Richard Ding and Julien Le Dem for building this great feature.

Additional information can be found in the Pig embedding document and the slides created by Julien Le Dem.

– Daniel Dai

 
