New Features in Apache Pig 0.10

Another important milestone for Apache Pig was reached this week with the release of Pig 0.10. The purpose of this blog is to summarize the new features in Pig 0.10.

Boolean Data Type

Pig 0.10 introduces boolean as a first-class Pig data type. Users can use the keyword “boolean” anywhere a data type is expected, such as in the load-as clause, in a type cast, etc.

Here are some sample use cases:

a = load 'input' as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);

b = foreach a generate a0, a1, (boolean)a2;

c = group b by a2; -- group by a boolean field

When loading boolean data with PigStorage, Pig expects the text “true” (case-insensitive) for a true value and “false” (case-insensitive) for a false value; any other value maps to null. When storing boolean data with PigStorage, a true value is written as the text “true” and a false value as “false”.
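For example, here is a minimal sketch of loading boolean data; the file name and contents are illustrative:

-- hypothetical tab-delimited file 'bools.txt':
--   TRUE     1
--   false    2
--   yes      3
a = load 'bools.txt' as (flag:boolean, id:int);
b = filter a by flag == true;   -- keeps only the first record
c = filter a by flag is null;   -- 'yes' is neither true nor false, so it loads as null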

Nested Cross/Foreach

In Pig 0.10, you can use nested cross and nested foreach statements inside a foreach statement’s nested plan. Here is one example:

C = cogroup user by uid, session by uid;
D = foreach C {
    crossed = cross user, session;
    filtered = filter crossed by user::region == session::region;
    result = foreach filtered generate processSession(user::age, user::gender, session::ip); -- processSession is a UDF
    generate result;
}

Note that the maximum nesting level is 2; that is, the nested foreach statement cannot itself contain a nested plan.

For more information, please refer to the Foreach section of the Pig documentation.

JRuby UDF

In addition to Python and JavaScript, you can now write UDFs in JRuby in Pig 0.10.

To write a JRuby UDF, create a new JRuby class that extends PigUdf and add your UDFs as methods of that class. Here is one example:

require 'pigudf'
class Myudfs < PigUdf
    def concat *input
        input.inject(:+)
    end
end

There are two ways to define the output schema of a UDF: an annotation or a schema function. Either works for the previous sample. Using an annotation:

class Myudfs < PigUdf
    outputSchema "word:chararray"
    def concat *input
        input.inject(:+)
    end
end

Or, using a schema function:

class Myudfs < PigUdf
    outputSchemaFunction :concatSchema
    def concat *input
        input.inject(:+)
    end
    def concatSchema input
        input
    end
end
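To call the UDF from a Pig script, save the class to a Ruby file and register it with the jruby keyword; the file name myudfs.rb and the relation below are illustrative:

register 'myudfs.rb' using jruby as myfuncs;
a = load 'words' as (w1:chararray, w2:chararray);
b = foreach a generate myfuncs.concat(w1, w2);
dump b;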

You can also write algebraic and accumulative UDFs in JRuby, which is not yet possible with other scripting languages. For more information, please refer to the Pig documentation on writing Ruby UDFs.

Hadoop 0.23 (a.k.a. Hadoop 2.0) Support

Pig 0.10.0 supports Hadoop 0.23.x. All unit and end-to-end tests pass with hadoop-0.23. To run Pig with hadoop-0.23, you need to recompile Pig with the hadoopversion flag set to 23:

ant -Dhadoopversion=23

You also need to set up all of the environment variables necessary to run the hadoop-0.23 client, point HADOOP_HOME to HADOOP_COMMON_HOME, and make sure $HADOOP_HOME/bin/hadoop exists.

Performance Improvements

Map Aggregation

Map aggregation aggregates records in the map task before sending them to the combiner. It reduces the serialization/deserialization cost of the combiner by sending it fewer records, and it is especially useful for group-by statements with very few group keys. In our experiments, map aggregation reduced the map-task runtime of a group-by query by up to 50%.

Map aggregation is turned off by default. To turn it on, set the “pig.exec.mapPartAgg” property to true.
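For example, here is a minimal sketch of a script with map aggregation enabled; the input path and field names are illustrative:

set pig.exec.mapPartAgg true;
a = load 'pageviews' as (url:chararray, visits:int);
b = group a by url;
c = foreach b generate group, SUM(a.visits);
store c into 'visit_counts';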

For more information about map aggregation, read the PIG-2228 JIRA.

Push Limit into Loader

Pig now optimizes limit queries by automatically pushing the limit into the loader, so only a fraction of the input needs to be scanned.
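For example, in a hypothetical script like the one below, the limit is pushed into the loader, so Pig stops reading input once enough records have been produced:

a = load 'very_large_input';
b = limit a 100;
dump b;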

Language Enhancements

Re-aliasing

In a Pig script, you can now assign an alias to a new name and refer to it by that new name:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

B = A;

DUMP B;

Limit/Sample by Expression

The limit and sample statements now take an expression in addition to a constant. For example:

a = load 'a.txt';

b = group a all;

c = foreach b generate COUNT(a) as sum;

d = order a by $0;

e = limit d c.sum/100;

Default Split Destination

You can specify an “otherwise” destination for the split statement. Split will automatically direct records that do not match any of the other branches to the “otherwise” destination:

split a into b if id > 3, c if id < 5, d otherwise;

TOMAP/TOTUPLE/TOBAG Syntax Support

You can now construct a map, tuple, or bag inline within a Pig script:

B = foreach A generate (name, age);  -- generate tuple

B = foreach A generate [name, age];  -- generate map

B = foreach A generate {name, age};  -- generate bag of single item tuples

Globbing in Register

Pig now supports globbing in register statements:

register lib/*.jar

UDF Enhancements

Improvements to PigStorage

We added a couple of options to PigStorage:

* -schema

This option stores a .pig_schema file alongside the data when storing with PigStorage. When loading with PigStorage, Pig checks for the .pig_schema file and uses it automatically:

store a into 'output_dir' using PigStorage('\t', '-schema');
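A later load of the same directory can then pick the schema up without declaring it; this is a sketch, assuming the .pig_schema file written by the store above is still in place:

b = load 'output_dir' using PigStorage('\t');
describe b;  -- the schema comes from the stored .pig_schema file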

* -tagsource

With this option, PigStorage adds a new column, INPUT_FILE_NAME, which contains the name of the input file each record came from:

a = load 'input_dir' using PigStorage('\t', '-tagsource');

The first column of the output will be INPUT_FILE_NAME.
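Here is an illustrative follow-up that uses the tagged column, for example to count records per input file:

b = group a by $0;                       -- group records by their source file
c = foreach b generate group, COUNT(a);  -- records per input file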

Turn off the Write Ahead Log for HBaseStorage

You can now use the “-noWAL” option with HBaseStorage to turn off the write-ahead log while doing bulk loads into HBase:

STORE myalias INTO 'MyTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycolumnfamily:field1 mycolumnfamily:field2','-noWAL');

JsonLoader/JsonStorage

We added a new pair of load/store functions for the JSON format. Note that JsonLoader does not auto-detect the schema of your input data; you still need to tell JsonLoader the schema, for example:

a = load 'input.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');

However, if you store the data using JsonStorage, a schema file is stored along with the data. In that case, you don’t have to specify the schema for JsonLoader; JsonLoader will detect the schema file and use it.
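For example, here is a sketch of the round trip; the output path is illustrative:

store a into 'output.json' using JsonStorage();
b = load 'output.json' using JsonLoader();  -- no schema argument; the stored schema file is used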

Bloom Filters

Bloom filters are a common way to select a limited set of records before moving data for a join or other heavyweight operation. Pig 0.10 includes two UDFs: BuildBloom, to build a bloom filter, and Bloom, to use the bloom filter in a filter statement. At present, users need to call both UDFs explicitly to get the full benefit of the bloom filter. In the future, we plan to include them in the optimizer so that large join queries can use bloom filters automatically.
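Here is a sketch loosely based on the example in the Pig documentation; the BuildBloom constructor arguments (hash function, expected number of elements, desired false-positive rate) and the 'mybloom' path are illustrative:

define bb BuildBloom('jenkins', '100', '0.1');
small = load 'S' as (x, y);
grpd = group small all;
fltrd = foreach grpd generate bb(small.x);
store fltrd into 'mybloom';
exec;                                  -- run the pending statements so the filter file exists

define bloom Bloom('mybloom');
large = load 'L' as (a, b);
flarge = filter large by bloom(a);
joined = join small by x, flarge by a;
store joined into 'results';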

Please read the PIG-2328 JIRA for more information about bloom filters.

Implement UDF by Simulation

In the chain EvalFunc -> Accumulator -> Algebraic, if you implement one of the more complex interfaces (on the right-hand side), you get the simpler ones (on the left-hand side) for free by simulation. You can achieve this by extending AlgebraicEvalFunc or AccumulatorEvalFunc. See the AlgebraicEvalFunc and AccumulatorEvalFunc Javadoc for details.

Other Improvements

Sparse Joins

Pig 0.10 introduces a new join type: ‘merge-sparse’. This is useful when both joined tables are pre-sorted and indexed and the right-hand table has few (< 1% of its total) matching keys. Further detail on sparse joins is available in the Pig documentation.
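For example, assuming both inputs are pre-sorted on the join key (the relation and field names are illustrative):

big = load 'sorted_big' as (id, val);       -- pre-sorted on id
small = load 'sorted_small' as (id, name);  -- pre-sorted and indexed, few matching keys
c = join big by id, small by id using 'merge-sparse';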

Complete S3 Support

In Pig 0.10, every component of a Pig script can live in HDFS or in Amazon Web Services’ S3. This includes the Pig script file, dependent jars, parameter files, macros, scripting UDFs, etc.

Kill Hadoop Job

If you kill a Pig job using Ctrl-C or the “kill” command, Pig will now kill all of the currently running Hadoop jobs associated with it. This applies to both grunt mode and non-interactive mode.

Conclusion

Remember to check out Pig 0.10 when you get a chance. Also, don’t forget to register for Hadoop Summit 2012. There will be some useful Pig presentations including my session with Thejas Nair, Pig Programming is More Fun: New Features in Pig.

~ Daniel Dai


