New Features in Apache Pig 0.10
Another important milestone for Apache Pig was reached this week with the release of Pig 0.10. The purpose of this blog is to summarize the new features in Pig 0.10.
Boolean Data Type
Pig 0.10 introduces boolean data type as a first-class Pig data type. Users can use the keyword “boolean” anywhere where a data type is expected, such as load-as clause, type cast clause, etc.
Here are some sample use cases:
a = load ‘input’ as (a0:boolean, a1:tuple(a10:boolean, a11:int), a2);
b = foreach a generate a0, a1, (boolean)a2;
c = group b by a2; — group by a boolean field
When loading boolean data using PigStorage, Pig expects the text “true” (ignore case) for a true value, and “false” (ignore case) for a false value; while other values map to null. When storing boolean data using PigStorage, true value will emit text “true” and false value will emit text “false”.
Nested Cross/Foreach
You can use nested cross and nested foreach statements inside foreach nested plan in Pig 0.10. Here is one example:
C = cogroup user by uid, session by uid;
D = foreach C {
crossed = cross user, session;
filtered = filter crossed by user::region == session::region;
result = foreach filtered generate processSession(user::age, user::gender, session::ip); -- processSession is a UDF
generate result;
}
Note the maximum level of nested plan is 2, that is, the nested foreach statement cannot have a nest plan.
For more information, please refer to the Foreach section of the Pig documentation.
JRuby UDF
In addition to Python/JavaScript, in Pig 0.10, you can now use JRuby UDFs as well.
To write a JRuby UDF, you need to create a new JRuby class and extend PigUdf, and add your UDFs as methods of the new class. Here is one example:
require 'pigudf'
class Myudfs < PigUdf
def concat *input
input.inject(:+)
end
end
There are two ways to define output schema for the UDF: annotation or schema function. Either is fine for defining the output schema in the previous sample:
class Myudfs < PigUdf
outputSchema "word:chararray"
def concat *input
input.inject(:+)
end
end
or,
schema function:
class Myudfs < PigUdf
outputSchemaFunction :concatSchema
def concat *input
input.inject(:+)
end
def squareSchema input
input
end
end
You can also write algebraic and accumulative UDFs in JRuby, which is not yet the case for other scripting languages. For more information, please refer to the Pig documentation for Writing Ruby UDFs.
Hadoop 0.23 (a.k.a. Hadoop 2.0) Support
Pig 0.10.0 supports Hadoop 0.23.X. All unit and end-to-end tests passed with hadoop-0.23. To run Pig with hadoop-0.23, you need to recompile Pig with hadoopversion flag set to 23:
ant -Dhadoopversion=23
You also need to set up all of the environment variables necessary to run the hadoop -23 client, plus, point HADOOP_HOME to HADOOP_COMMON_HOME, and make sure $HADOOP_HOME/bin/hadoop exists.
Performance Improvements
Map Aggregation
Map aggregation will aggregate records before it sends them to combiner. It reduces the serializing/deserializing costs of using combiner by sending fewer records to the combiner. It is especially useful in a group-by statement with very few group keys. In our experiments, map aggregation reduces the runtime for a map task for a group by clause by up to 50%.
Map aggregation is turned off by default. To turn it on, set “pig.exec.mapPartAgg” property to true.
For more information about map aggregation, read the PIG-2228 JIRA.
Push Limit into Loader
Pig optimizes limit query by pushing limit automatically to the loader, thus requiring only a fraction of the entire input to be scanned.
Language Enhancements
Re-aliasing
In the Pig script, you can rename an alias, and refer to the new name:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float); B = A; DUMP B;
Limit/Sample by Expression
The Limit/Sample statement takes expression in addition to constant. For example:
a = load 'a.txt'; b = group a by all; c = foreach b generate COUNT(*) as sum; d = order a by $0; e = limit d c.sum/100;
Default Split Destination
You can specify an “otherwise” destination for split statement. Split will automatically identify inputs that don’t belong to any of the other branches and direct those inputs to the “otherwise” destination:
split a into b if id > 3, c if id < 5, d otherwise;
TOMAP/TOTUPLE/TOBAG Syntax Support
You can compose a map/tuple/bag within a Pig script:
B = foreach A generate (name, age); -- generate tuple
B = foreach A generate [name, age]; -- generate map
B = foreach A generate {name, age}; -- generate bag of single item tuples
Globbing in Register
Pig now supports globbing in register statements:
register lib/*.jar
UDF Enhancements
Improvements to PigStorage
We added a couple of options to PigStorage:
*-schema
This is for storing a .pig_schema along a data file when when using PigStorage. When loading data from PigStorage, Pig will check the existence of .pig_schema and use it automatically:
store a into 'output_dir' using PigStorage('\t', '-schema');
* -tagsource
PigStorage now adds a new column INPUT_FILE_NAME, which indicates the input file name of that input.
a = load 'input_dir' using PigStorage('\t', '-tagsource');
The first column of the output will be INPUT_FILE_NAME
Turn off the Write Ahead Log for HBaseStorage
You can now use the “-noWAL” option in HBaseStorage to turn off write ahead log while doing bulk loads into HBase:
STORE myalias INTO 'MyTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycolumnfamily:field1 mycolumnfamily:field2','-noWAL');
JsonLoader/JsonStorage
We added new pair of UDFs to the load/store Json format. Note JsonLoader does not auto detect the schema of your input data. You will still need to tell JsonLoader the schema of the data. Such as:
a = load 'input.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
However, if you are storing the data using JsonStorage, there will be a schema file stored along with the data. In this scenario, you don’t have to specify the schema for JsonLoader. JsonLoader will detect the schema file and use it.
Bloom Filters
Bloom filters are a common way to select a limited set of records before moving data for a join or other heavyweight operation. Pig includes two UDFs: BuildBloom to build a bloom filter and Bloom to use the bloom filter in a filter statement. At present, users will need to explicitly call both UDFs to get the full benefit of bloom filter. In the future, we will include them in the optimizer so that large join queries can use bloom filter automatically.
Please read the PIG-2328 JIRA for more information about bloom filters.
Implement UDF by Simulation
In the chain of EvalFunc -> Accumulator -> Algebraic, if you implement a more complex UDF (righthand side), you can use simulation to get a simpler UDF (lefthand side) for free. You can achieve this by using AlgebraicEvalFunc or AccumulatorEvalFunc. Check AlgebraicEvalFunc.html and AccumulatorEvalFunc.html for detail.
Other Improvements
Sparse Joins
Pig 0.10 introduces a new join type: ‘merge-sparse’. This is useful for cases when both joined tables are pre-sorted and indexed, and the right-hand table has few ( < 1% of its total) matching keys. Further detail on sparse joins is available in the Pig documentation.
Complete S3 Support
In Pig 0.10, every component of a Pig script can be in HDFS or Amazon Web Service’s S3. This includes the Pig script file, dependent jars, parameter files, macros, scripting UDFs, etc.
Kill Hadoop Job
If you kill a Pig job using Ctrl-C or “kill”, Pig will now kill all associated Hadoop jobs currently running. This is applicable to both grunt mode and non-interactive mode.
Conclusion
Remember to check out Pig 0.10 when you get a chance. Also, don’t forget to register for Hadoop Summit 2012. There will be some useful Pig presentations including my session with Thejas Nair, Pig Programming is More Fun: New Features in Pig.
~ Daniel Dai
Cool!!!!!
Waiting for these features for a long time.
Great job, guys. Many thanks.
Excellemt post, and cheers to Pig 0.10!
A few more things to note:
1) Pig Storage can now store (and subsequently read) its schema. This was previously available is a contrib in piggybank, but has now been greatly improved and moved into PigStorage itself. Prashant Kommireddi has a nice summary of how this works: http://hadoopified.wordpress.com/2012/04/22/pigstorage-options-schema-and-source-tagging/
2) The JRuby support is much greater than what has been done for Jython and JS. You can easily define efficient algebraic and accumulative UDFs in JRuby (not the case for other scripting languages yet).
3) Sparse joins. There is a new join type, called ‘merge-sparse’, useful for cases when both joined tables are pre-sorted and indexed, and the right-hand table has few ( < 1% of its total) matching keys. Details available here: http://pig.apache.org/docs/r0.10.0/perf.html#merge-sparse-joins . Note that so far this feature is very restricted.
Thanks for pointing out the missing piece. I updated the blog to include the comments.
Shouldn’t no WAL option should be ‘-noWAL true’ instead of just ‘-noWAL’?