Pig Forum

Regex in Pig

  • #40275
    Lars Christoffersen
    Participant

    Hi
    I am trying to use a regex in Pig as follows:
    a = load 'Node_Meas_Outflows_Test.csv' using PigStorage ( ';' ) as ( nodeid:chararray, consumer:chararray, pstart:chararray, pend:chararray, p:DOUBLE );
    b = FOREACH a GENERATE nodeid, consumer, REGEX_EXTRACT_ALL(pstart, '(19|20)\\d{2}') as ystart, REGEX_EXTRACT_ALL(pend, '(19|20)\\d{2}') as yend, pstart, pend, p;
    c = limit b 10;
    dump c;

    The pstart and pend columns hold dates in different formats, but they all contain a four-digit year. However, if I use only one backslash in front of the “d” I get a syntax error about an unknown escape character. If I use two, in order to escape it, the query runs through, but it does not find the year. I am sure the regex itself is correct, as I have tested it in other regex tools against a sample file.
    Any ideas what I am doing wrong?


  • #40277
    Lars Christoffersen
    Participant

    Found the problem: you have to enclose the regex string in parentheses so that the year is captured as a group :-( Grrrrrrrrrr. Also, REGEX_EXTRACT_ALL returns all matched groups in a tuple rather than a single value, so REGEX_EXTRACT with a group index is the better fit here.
    This works as expected:
    a = load 'Node_Meas_Outflows_Test.csv' using PigStorage ( ';' ) as ( nodeid:chararray, consumer:chararray, pstart:chararray, pend:chararray, p:DOUBLE );
    b = FOREACH a GENERATE nodeid, consumer, REGEX_EXTRACT(pstart, '((19|20)\\d{2})', 1) as ystart, REGEX_EXTRACT(pend, '((19|20)\\d{2})', 1) as yend, pstart, pend, p;
    c = limit b 10;
    dump c;
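
    For anyone who runs into the same thing later, here is a minimal sketch of how the two built-ins differ. The file name and sample value below are made up for illustration, not taken from the CSV above. REGEX_EXTRACT returns the requested capture group as long as the pattern occurs somewhere in the string, while REGEX_EXTRACT_ALL returns a tuple with one field per capture group and, as far as I can tell, only produces a result when the pattern matches the entire input, which would also explain why the first attempt came back empty.

    -- year_samples.csv is a hypothetical one-column file of date strings such as '01-05-1998'
    samples = LOAD 'year_samples.csv' USING PigStorage(';') AS (d:chararray);
    years = FOREACH samples GENERATE
                d,
                -- the pattern only needs to occur somewhere in the string: '01-05-1998' -> '1998'
                REGEX_EXTRACT(d, '((19|20)\\d{2})', 1) AS year_only,
                -- the whole string must match; the result is a tuple of groups, e.g. ('1998','19')
                REGEX_EXTRACT_ALL(d, '.*?((19|20)\\d{2}).*') AS year_groups;
    dump years;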
