Join after multiple operations

This topic contains 0 replies, has 1 voice, and was last updated by Miha 10 months ago.

  • Creator
  • #50726


    Hi, I’m new to programming in Pig and I have a relation with multiple fields (I’m simplifying the schema in the example below). I run several calculations on it and then try to join the results at the end. The join returns no results, although describe shows a schema that looks correct. In the syntax check, the only thing that catches my eye is this warning: WARN org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_CHARARRAY.

    Desired output:

    c1 = load 'file.csv' using PigStorage(',') as (ID, LN, PAY_AMT:double, UNIT_QTY:int, PD_DT);
    c3 = group c2 by (ID, LN);
    c3agg = FOREACH c3 GENERATE FLATTEN(group) as (ID, LN),
        SUM(c2.PAY_AMT) as PdAmt, SUM(c2.UNIT_QTY) as Unit_qty;

    describe c3agg;
    > c3agg: {ID: bytearray,LN: bytearray,PdAmt: double,Unit_qty: long}

    So now I’m trying to get MAX(PD_DT). Using the actual MAX function doesn’t work (or at least I can’t get it to work), so I’m using the code below instead.

    c4 = foreach c1 generate ID, LN, PD_DT;
    c5 = group c4 by (ID, LN);
    c3dt = FOREACH c5 {
        c5ord = ORDER c4 by PD_DT DESC;
        c5lmt = LIMIT c5ord 1;
        GENERATE FLATTEN(c5lmt);
    };

    describe c3dt;
    > c3dt: {c5lmt::ID: bytearray,c5lmt::LN: bytearray,c5lmt::PD_DT:bytearray}

    Now trying for the join, which doesn’t return anything (even the log doesn’t show anything):

    cj = JOIN c3agg BY (ID, LN), c3dt BY (ID, LN);
    dump cj;

    I also tried joining by field position, with the same empty result: cj = join c3agg by ($0, $1), c3dt BY ($0, $1);

    describe cj;
    > cj: {c3agg::ID: bytearray,c3agg::LN: bytearray,c3agg::PdAmt: double,c3agg::Unit_qty: long,c3dt::c5lmt::ID: bytearray,c3dt::c5lmt::LN: bytearray,c3dt::c5lmt::PD_DT: bytearray}

    Also, I tried declaring the field types, for example ID:chararray and LN:int, but still got no results. I really can’t figure out what I’m doing wrong.

    Thank you!
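
    For comparison, here is a minimal sketch of the same aggregation done in a single pass, so that no JOIN is needed. It is not a diagnosis of the empty join. It assumes that PD_DT holds dates in a lexicographically sortable form such as yyyy-MM-dd and that ID and LN can be read as chararray; neither is stated in the post. The alias names g and agg and the field name MaxPdDt are made up for illustration. As far as I understand, Pig's built-in MAX accepts chararray input but casts an untyped (bytearray) field to double, which may be why MAX appeared not to work on the undeclared PD_DT.

    ```pig
    -- Sketch only. Assumptions: PD_DT is a sortable date string (e.g. yyyy-MM-dd),
    -- and ID / LN are declared as chararray purely for illustration.
    c1 = LOAD 'file.csv' USING PigStorage(',')
         AS (ID:chararray, LN:chararray, PAY_AMT:double, UNIT_QTY:int, PD_DT:chararray);

    -- Group the loaded relation directly on the two key fields.
    g = GROUP c1 BY (ID, LN);

    -- Compute both sums and the latest PD_DT for each group in one FOREACH,
    -- which avoids building two relations and joining them afterwards.
    agg = FOREACH g GENERATE
              FLATTEN(group)   AS (ID, LN),
              SUM(c1.PAY_AMT)  AS PdAmt,
              SUM(c1.UNIT_QTY) AS Unit_qty,
              MAX(c1.PD_DT)    AS MaxPdDt;

    DESCRIBE agg;
    DUMP agg;
    ```

    If PD_DT is stored in another format (for example MM/dd/yyyy), a string comparison will not order the dates correctly, and the value would need to be converted first.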
