Posts by Daniel Dai:


Apache Pig 0.10.1 Released

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Some of the notable changes include:

  • Source code-only distribution

In the download section for Pig 0.10.1, you will now find a source-only tarball (pig-0.10.1-src.tar.gz) alongside the traditional full tarball, rpm and deb distributions.

  • Better support for Apache Hadoop 0.23.x/2.x

Starting with Pig 0.10.1, the Pig team publishes Maven artifacts for Hadoop 0.23.x/2.x (PIG-2907). Note that if you are using Hadoop 0.23.x/2.x, you need different Pig Maven artifacts than for Hadoop 0.20.x/1.x. Use the following dependency to retrieve the Pig Maven artifacts for Hadoop 0.23.x/2.x:

<dependency>
  <groupId>org.apache.pig</groupId>
  <artifactId>pig</artifactId>
  <version>0.10.1</version>
  <classifier>h2</classifier>
</dependency>

In addition, the Pig team fixed a number of bugs specific to Hadoop 0.23.x/2.x (including PIG-3035, PIG-2783, PIG-2761, PIG-2912, and PIG-2791).

  • Better support for Oracle JDK 7

All unit tests for Pig 0.10.1 now pass with Oracle JDK7 (PIG-2908).

  • End-to-end (e2e) test and unit test fixes

We continue to improve Pig e2e testing. With the latest enhancements, we are able to significantly reduce runtime for Pig e2e tests (PIG-2711). We are trying hard to make e2e tests pass on all platforms (PIG-2859, PIG-2783, PIG-2745).

We have also included some fixes for unit tests (PIG-2908, PIG-2650, PIG-3099, PIG-2960) to make sure unit tests pass on all currently supported platforms.

  • Other fixes

There are a number of other important bug fixes in the core Pig code, UDFs, and documentation. Details can be found in this document.

Special thanks to the Apache Pig community for doing all of this great work to make these improvements happen!

~ Daniel Dai

New Features in Apache Pig 0.10

Another important milestone for Apache Pig was reached this week with the release of Pig 0.10. This post summarizes the new features in Pig 0.10.

Boolean Data Type

Pig 0.10 introduces boolean as a first-class Pig data type. Users can use the keyword “boolean” anywhere a data type is expected, such as in a load-as clause or a type cast.

Here are some sample use cases:

a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);
b = foreach a generate a0, a1, (boolean)a2;
c = group b by a2; -- group by a boolean field

When loading boolean data with PigStorage, Pig expects the text “true” for a true value and “false” for a false value, ignoring case in both; any other value maps to null. When storing boolean data with PigStorage, a true value is written as the text “true” and a false value as the text “false”.
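
The load and store rule described here can be sketched in a few lines of Python. This only illustrates the mapping as stated in this post; it is not PigStorage's actual code, the function names `parse_boolean_field` and `format_boolean_field` are hypothetical, and details the post does not cover (such as whitespace handling or how nulls are stored) are deliberately left out.

```python
def parse_boolean_field(text):
    """Map a text field to True, False, or None (null), per the rule above."""
    # Hypothetical helper: mirrors the rule in the post, not Pig's source code.
    if text is None:
        return None
    lowered = text.lower()      # the comparison ignores case
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None                 # any other text maps to null


def format_boolean_field(value):
    """Write True as the text "true" and False as the text "false"."""
    if value is True:
        return "true"
    if value is False:
        return "false"
    # The post does not say how null is stored, so this sketch refuses to guess.
    raise ValueError("only True and False are covered by this sketch")
```

For instance, `parse_boolean_field("TRUE")` gives `True`, while `parse_boolean_field("yes")` gives `None`.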

Bootstrap Sampling with Apache Pig

I ran across an interesting problem in my attempt to implement random forest using Apache Pig. In random forest, each tree is trained using a bootstrap sample. That is, sample N cases at random out of a dataset of size N, with replacement.

For example, here is the input data:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

Here is one bootstrap sample drawn from input:
(5, 2, 3, 2, 3, 9, 7, 3, 0, 4)

Each element can appear 0 to N times.
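
Outside of Pig, the definition is easy to state in code. The following Python sketch draws N items with replacement from the example input and counts how often each element was picked. The function name `bootstrap_sample` and the seed are illustrative choices, and the sketch says nothing about how the sampling is done in Pig.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


data = list(range(10))      # the input from the example above
rng = random.Random(42)     # arbitrary fixed seed so that runs are repeatable
sample = bootstrap_sample(data, rng)
counts = Counter(sample)    # how many times each element was drawn

print(sample)
print({x: counts[x] for x in data})  # elements that were never drawn show 0
```

Because every draw is independent, some elements typically show up several times while others are missing from the sample entirely.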

How does one get it done in Pig? I explored a few options and wanted to share my findings.


New Apache Pig 0.9 Features – Part 2 (Embedding)

* Special note: the code discussed in this blog is available here *

A common complaint about Pig is the lack of control flow statements: if/else, while loops, for loops, etc.

Pig now has an answer: Pig embedding. You can write a Python program and embed Pig scripts inside it, leveraging all the language features Python provides, including control flow.

The Pig embedding API is similar to a database embedding API. You compile a statement, bind its parameters, execute it, and then iterate through a cursor. The Pig embedding document provides an excellent guide on how the Pig embedding API works.
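
For readers who have not used a database embedding API, the pattern that this comparison refers to looks roughly like the following in Python's standard `sqlite3` module. This shows the database side of the analogy only; it does not use or reproduce the Pig embedding API, and the table, rows, and thresholds are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("alice", 3.9), ("bob", 2.7), ("carol", 3.4)],  # made-up sample rows
)

# A statement with one parameter placeholder; sqlite3 compiles it internally.
query = "SELECT name FROM student WHERE gpa >= ? ORDER BY name"

results = {}
for threshold in (3.0, 3.5):                        # host-language control flow
    cursor = conn.execute(query, (threshold,))      # bind the parameter and execute
    results[threshold] = [row[0] for row in cursor]  # iterate through the cursor

conn.close()
print(results)
```

The point of the sketch is the loop: the query text stays fixed, and the surrounding Python decides how often it runs and with which parameters, which is the kind of control flow the embedding approach brings to Pig scripts.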


New Apache Pig 0.9 Features – Part 1 (Macros)

This is the first of three blog posts that will highlight the new features in Pig 0.9.

When I first started to use Pig, the one thing that I hated the most was that I needed to write 4 lines of code to get a simple count:
A = load 'student.txt' as (name, student, gpa);
B = group A all;
C = foreach B generate COUNT(A);
dump C;

Compare that to an SQL command:
Select COUNT(*) from student;

That’s just not intuitive, especially for new users.

Things are now changing for the better. With the 0.9 macro feature, you can write a macro to do this.