Left inner join in pig


This topic contains 1 reply, has 2 voices, and was last updated by Alan Gates 1 year, 9 months ago.

  • #40558

    Dan Sadler
    Member

    Hi,
    I have two files I am loading into Pig. File A contains details on students, and File B is the list of students who are in detention.
    I would like to get the list of students that are in File A but not in File B. Filtering by name works, but File B has more than 200 names.
    A left inner join would work, but I cannot find any information on it and don’t think Pig supports this.
    Any help to solve this would be great!
    Thanks

Viewing 1 replies (of 1 total)


  • #40572

    Alan Gates
    Participant

    I’m not familiar with “left inner” joins, but an anti-join will do what you want. Maybe they’re the same thing.

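    -- The "as" schemas are an assumption; declare your real fields so studentname resolves.
    A = load 'file A' as (studentname:chararray);
    B = load 'file B' as (studentname:chararray);
    C = cogroup A by studentname, B by studentname;
    D = filter C by COUNT(B) == 0;
    E = foreach D generate flatten(A);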

    This will do an anti-join. It cogroups everything in A and B on the same key, filters out any groups that have records in B, and then flattens the grouping of A so that you again have individual records.
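
    The same anti-join can probably also be written as a left outer join followed by a null filter, which may read closer to the "left join" in the original question. This is only a sketch, not part of the reply above: the one-field schemas and the alias names J, N and R are assumptions for illustration, and Pig needs a declared schema on B to fill in nulls for an outer join.

```pig
-- Schemas are assumed for illustration; replace them with the real fields of each file.
A = load 'file A' as (studentname:chararray);
B = load 'file B' as (studentname:chararray);

-- Keep every record of A; B's fields come back null where no match exists.
J = join A by studentname left outer, B by studentname;

-- Records with no partner in B are the students who are not in detention.
N = filter J by B::studentname is null;

-- Drop the all-null B columns and restore the plain field name.
R = foreach N generate A::studentname as studentname;
```

    With this form, a record in A whose studentname is null never matches anything in B, so it is kept in the result.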
