Regex in pig

This topic contains 1 reply, has 1 voice, and was last updated by  Lars Christoffersen 1 year, 1 month ago.

  • Creator
    Topic
  • #40275

    Hi,
    I am trying to use a regex in Pig as follows:
    a = load 'Node_Meas_Outflows_Test.csv' using PigStorage ( ';' ) as ( nodeid:chararray, consumer:chararray, pstart:chararray, pend:chararray, p:DOUBLE );
    b = FOREACH a GENERATE nodeid, consumer, REGEX_EXTRACT_ALL(pstart, '(19|20)\\d{2}') as ystart, REGEX_EXTRACT_ALL(pend, '(19|20)\\d{2}') as yend, pstart, pend, p;
    c = limit b 10;
    dump c;

    The pstart and pend columns contain dates in different formats, but all of them include a four-digit year. However, if I use only one backslash in front of the "d", I get a syntax error saying "D" is unknown. If I use two, in order to escape it, the query runs, but it does not find the year. I am sure the regex is correct, as I have tested it in other regex tools against a sample file.
    Any ideas what I am doing wrong?

  • Author
    Replies
  • #40277

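    Found the problem: you have to enclose the whole regex in parentheses, so that capture group 1 is the full year and not just the "19" or "20" prefix :-( Grrrrrrrrrr. Also, the REGEX_EXTRACT_ALL version returns all matched groups together (as a tuple, as far as I can tell from the Pig documentation) instead of a single string, so I switched to REGEX_EXTRACT with an explicit group index.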
    This works as expected:
    a = load 'Node_Meas_Outflows_Test.csv' using PigStorage ( ';' ) as ( nodeid:chararray, consumer:chararray, pstart:chararray, pend:chararray, p:DOUBLE );
    b = FOREACH a GENERATE nodeid, consumer, REGEX_EXTRACT(pstart, '((19|20)\\d{2})', 1) as ystart, REGEX_EXTRACT(pend, '((19|20)\\d{2})', 1) as yend, pstart, pend, p;
    c = limit b 10;
    dump c;
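
For anyone who wants to see why the extra parentheses matter, here is a small sketch in plain Java, since Pig's regex functions are built on java.util.regex. It is not Pig's actual implementation: the class name, the helper methods, and the sample value "01.03.2013" are made up for illustration. It compares group 1 with and without the outer parentheses, and also shows the difference between matching the whole string and searching inside it, which may be why the original REGEX_EXTRACT_ALL call found nothing (an assumption on my part, worth checking against the Pig docs for your version).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YearRegexDemo {
    // Returns the requested capture group of the first match found
    // anywhere in the input, or null if the regex does not occur at all.
    public static String findGroup(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Returns true only if the regex matches the ENTIRE input string.
    public static boolean matchesWhole(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String pstart = "01.03.2013"; // hypothetical sample date

        // Without outer parentheses, group 1 is only the century prefix.
        System.out.println(findGroup(pstart, "(19|20)\\d{2}", 1));      // 20

        // With outer parentheses, group 1 is the full four-digit year.
        System.out.println(findGroup(pstart, "((19|20)\\d{2})", 1));    // 2013

        // A whole-string match fails because the date has more than the year.
        System.out.println(matchesWhole(pstart, "(19|20)\\d{2}"));      // false

        // Padding the pattern with .* lets a whole-string match succeed.
        System.out.println(matchesWhole(pstart, ".*((19|20)\\d{2}).*")); // true
    }
}
```

Note that the double backslash in the Java string literal plays the same role as the double backslash in the Pig script: the regex engine itself only ever sees \d.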
