New Apache Pig 0.9 Features – Part 3 (Additional Features)

In addition to the new Macros and Embedding features described earlier by Daniel Dai, here are several additional features in Apache Pig 0.9:

Project-range expression
A common use case we have seen is that people want to operate on certain columns while projecting the rest as-is, or pass a range of input columns to a user-defined function. In 0.9, the new project-range expression makes it easier to write statements that do just that. It is similar to the previously available star expression, except that it lets you specify a start and end column to project.

For example, using previous versions of Pig, if you wanted to replace the IP address field in your input with city and state, the query would look like the following:

input_city_state = FOREACH input GENERATE user, age, gender, flatten(getCityState(ip)), start_date, rank, activity_summary, friend_list, privacy_setting;

This assumes the schema has columns in the following order – (user, age, gender, ip, start_date, rank, activity_summary, friend_list, privacy_setting).

In Pig 0.9, the query can now be written using a project-range expression as:

input_city_state = FOREACH input GENERATE user .. gender, flatten(getCityState(ip)), start_date .. ;

Here ‘user .. gender’ represents all fields in the schema from user through gender, and ‘start_date ..’ represents all fields from start_date to the last column.

You can also use it to specify arguments for a user-defined function:

nonRobotUsers = filter input by IsRobotConfidence(start_date .. friend_list) < 0.5 ;

See the Pig expression documentation, PIG-1693 and PIG-1938 for more information.

Combiner optimization in more cases
The MapReduce combiner phase is now used in more situations because of improvements in combiner optimizer rules.

For example, the combiner can now be used even when algebraic UDFs are part of another expression:

dup_info = FOREACH grouped_rel generate ufunc(SUM(bagCol.$0) , COUNT(bagCol.$0)) AS dups;

It also fixes the case where accessing elements of the group column in the output of a (co)group would prevent the combiner from being used. For example:

info = FOREACH grouped_rel generate group.$0 , group.$1, COUNT(bagCol);

In earlier versions of Pig, you needed a workaround that used flatten on the group column in order to enable the combiner.
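For reference, that earlier workaround would have looked something like the following sketch (reusing the same hypothetical relation and column names as the query above):

info = FOREACH grouped_rel generate flatten(group), COUNT(bagCol);

In Pig 0.9 the original form of the query is combinable as written, so the flatten is no longer needed just to enable the combiner.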

See the combiner use documentation, PIG-750 and PIG-946 for more information.

Better error messages
Thanks to Xuefu Zhang, Pig now has better error messages for syntax and semantic errors. Pig 0.9 has a new parser generated using ANTLR. Error messages produced after the parsing phase, such as those from type checking, now include line number information as well.

For example, line number information can now be printed for the following query, which tries to add a chararray to an integer:

$ cat t.pig
A = load 'x';
B = foreach A generate 'a' + 1;
dump B;

$ bin/pig -x local t.pig
file t.pig, line 2, column 27 (Name: Add Type: null Uid: null) incompatible types in Add Operator left hand side:chararray right hand side:int.

This work is in PIG-1618 and several other JIRAs.

Illustrate (example generator) issues fixed
Illustrate is an awesome, albeit generally unknown, feature of Pig (a paper on this topic received the ACM SIGMOD 2009 best paper award). Essentially, it is a great tool for debugging your Pig script. It has been revamped to work with all of the new features added in recent versions of Pig, the result of major work done by Yan Zhou.
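As a quick reminder of how it is used, here is a minimal sketch in the grunt shell (the data file and field names are made up for illustration):

grunt> visits = LOAD 'visits.txt' AS (user:chararray, url:chararray);
grunt> by_user = GROUP visits BY user;
grunt> counts = FOREACH by_user GENERATE group, COUNT(visits);
grunt> illustrate counts;

Illustrate then prints a small, carefully chosen set of example rows showing how data flows through each operator in the plan.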

See the Illustrate documentation and the Illustrate fix tracking JIRA for more information.

Cleanup of pig semantics
Daniel Dai worked on several JIRAs that fix issues with Pig semantics.

One example is the output schema of a FOREACH operator that has flatten on a column with unknown schema.

For example:

> describe in;
in: {c1: (),c2: bytearray}
> res = FOREACH in generate FLATTEN(c1), c2;

In Pig 0.8, you would see this:

> describe res;
res: {bytearray,c2: bytearray}

This is incorrect because the output of FLATTEN(c1) can result in several columns depending on how many columns the tuple c1 has.

In Pig 0.9, you will see this:

> describe res;
Schema for res unknown.

As you can see, the output schema is now reported as unknown.
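If you want the flattened output schema to stay known, one option is to give c1 an explicit schema up front, for example at load time. A minimal sketch (the file and field names are hypothetical):

in = LOAD 'data' AS (c1:(a:int, b:chararray), c2:bytearray);
res = FOREACH in generate FLATTEN(c1), c2;
-- res now has a known schema, since Pig knows c1 has exactly two typed fields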

See PIG-1938, PIG-1745, PIG-1188, PIG-1112, PIG-749, PIG-435 and PIG-1627 for more information.

Typed Map
You can now specify the value type for a map field in the schema. This lets you get rid of the casts that were previously needed to convert map values to their actual type.
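For example, a minimal sketch (the file and field names are hypothetical):

logs = LOAD 'logs' AS (userid:chararray, stats:map[int]);
-- the values of stats are declared as int, so no cast is needed before doing arithmetic
clicks = FOREACH logs GENERATE userid, stats#'clicks' + 1;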

See the map schema documentation and PIG-1876 for more information.

New udfs
There are several new Piggybank and built-in UDFs, including:
* AvroStorage is a Pig load/store function for Avro data, contributed by Lin Guo and Jakob Homan. See PIG-1748 for more information.
* AllLoader is the Swiss army knife of load functions. It can handle multiple file formats by looking at file extensions, or it can guess the format from the bytes in the file header. It also supports Hive/HCatalog-style partition filtering. This feature was contributed by Gerrit Jansen van Vuuren. See PIG-1722 for more information.
* The TOMAP built-in UDF creates a map from other columns. Contributed by Olga Natkovich, this UDF, along with TOTUPLE and TOBAG, makes it easy to create Pig complex data types from other columns; a short usage sketch follows this list. See PIG-1809 for more information.
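Here is a minimal sketch of TOMAP in use (the relation and field names are hypothetical):

users = LOAD 'users' AS (name:chararray, city:chararray, state:chararray);
-- build a map from literal keys and column values
user_info = FOREACH users GENERATE name, TOMAP('city', city, 'state', state) AS info;
-- downstream, info#'city' retrieves the city value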

There are also a number of bug fixes and improvements to HBaseStorage in Pig 0.9, including the ability to load data by column family, thanks to contributions from Dmitriy V. Ryaboy, Bill Graham and Jacob Perkins.
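For example, a minimal sketch of loading an entire column family into a map (the table name, column family and loader options here are assumptions for illustration, not taken from the JIRAs):

raw = LOAD 'hbase://users' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:*', '-loadKey') AS (rowkey:bytearray, info:map[]);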

See PIG-1886, PIG-2008, PIG-1782, PIG-1769 and PIG-1870 for more information.

Penny
This is an innovative feature contributed by Pig’s friends from Yahoo! Research. It is going to be a great debugging tool. It deserves a blog post of its own, and we will have one on that soon. Read the Penny documentation for more information.

Others
There are several other improvements in the 243 JIRAs that have been resolved as part of the Pig 0.9 release and the blog posts thus far have highlighted only a few of the most interesting ones.

– Thejas Nair


