Announcing Apache Pig 0.12: The Community Breeds a More Powerful Pig

Today we are proud to announce the general availability of Apache Pig 0.12!

If you are a Pig user who has been yearning for additional languages, more data validation tools, and more expressions, operators and data types, then read on. Version 0.12 includes all of those additions, and Pig now runs on Windows without Cygwin.

This was a great team effort over the past six months with over 30 engineers from Twitter, Yahoo, LinkedIn, Netflix, Microsoft, IBM, Salesforce, Mortardata, Cloudera and several others (including Hortonworks of course). Between Pig 0.11 and Pig 0.12, we resolved 305 Jira issues.

Improvements in Apache Pig 0.12

Assert operator

An assert operator can be used for data validation. For example, the following script will fail if any value of a0 is not greater than zero:

a = load 'something' as (a0:int, a1:int);
assert a by a0 > 0, 'a0 must be greater than 0';

Streaming UDF

Users can now write UDFs in languages that have no JVM implementation. In particular, this version implements CPython UDFs, so users can write Python UDFs that rely on CPython extensions, which is not possible in Jython.
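
As a rough sketch of what such a UDF might look like, the hypothetical file udfs.py below defines a function, word_count, and declares its output schema with the outputSchema decorator from the pig_util module that Pig makes available to streaming Python UDFs. The file name, function name and schema are illustrative assumptions, and the fallback decorator is only a stand-in so the file can also be run outside Pig.

```python
# udfs.py (hypothetical file name, for illustration only)
try:
    # Provided by Pig when the script runs as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Stand-in for local testing outside Pig: records the schema string
    # on the function and otherwise leaves it unchanged.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word_count:int")
def word_count(text):
    """Return the number of whitespace-separated words in a chararray."""
    if text is None:
        # Pass nulls through rather than failing on them.
        return None
    return len(text.split())
```

In a Pig script, a file like this would be registered with REGISTER 'udfs.py' USING streaming_python AS myfuncs; and the function called as myfuncs.word_count(line). Because the code runs under CPython rather than Jython, it could equally import a module backed by a C extension.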

Rewrite of AvroStorage

We completely rewrote AvroStorage, which is now one of Pig's built-in functions. It uses the latest version of Avro, is significantly faster, and includes many bug fixes.

IN operator

Previously, Pig had no IN operator. To mimic one, users had to chain several equality tests together with OR, as in this example:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY
   (i == 1) OR
   (i == 22) OR
   (i == 333) OR
   (i == 4444) OR
   (i == 55555);

Now, this type of expression can be rewritten more compactly using the IN operator:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1,22,333,4444,55555);

CASE expression

Previously, Pig had no support for a CASE expression. To mimic it, users often nested bincond operators, which could become unreadable when there were multiple levels of nesting.

Here’s an example of the type of CASE expression that Pig now supports:

bar = FOREACH foo GENERATE ( 
  CASE i % 3 
     WHEN 0 THEN '3n' 
     WHEN 1 THEN '3n+1' 
     ELSE '3n+2' 
  END 
);

BigInteger/BigDecimal data types

Some applications require calculations with a high degree of precision. In these cases, the new BigInteger and BigDecimal data types can be used.
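
To illustrate why this matters, here is a small Python sketch of the underlying problem. It is an analogy rather than Pig code: Python's float behaves like a 64-bit double, while decimal.Decimal and Python's unbounded int stand in for Java's BigDecimal and BigInteger. The variable names are made up for the example.

```python
from decimal import Decimal

# A 64-bit double cannot represent 0.1 exactly, so even a simple sum
# picks up a small rounding error.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)               # prints False

# An arbitrary-precision decimal keeps the digits exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # prints True

# A signed 64-bit long tops out at 2**63 - 1. An unbounded integer type
# can hold the square of that value without overflowing.
long_max = 2 ** 63 - 1
beyond_long = long_max * long_max
print(beyond_long > long_max)         # prints True
```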

Support for Microsoft Windows™

Changes that enable Apache Pig to run on Windows without Cygwin have now been committed to the trunk.

Parquet Support

Pig now wraps ParquetLoader and ParquetStorer as built-in functions, so users can load and store Parquet data easily.
