Pig Forum

Left inner join in pig

  • #40558
    Dan Sadler
    Member

    Hi,
    I have two files I am loading into Pig. File A contains details on students, and File B lists the students who are in detention.
    I would like to get the list of students that are in File A but not in File B. Using a filter works well for a few names, but I have over 200 names in File B.
    A left inner join would work, but I cannot find any information on it and don’t think Pig supports this.
    Any help to solve this would be great!
    Thanks

  • #40572
    Alan Gates
    Moderator

    I’m not familiar with “left inner” joins. But an anti-join will do what you want. Maybe they’re the same thing.

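    -- Both relations need a studentname field; add an "as (...)" schema to each load to match your files.
    A = load 'file A';
    B = load 'file B';
    -- Group the records of A and B that share the same studentname.
    C = cogroup A by studentname, B by studentname;
    -- Keep only the groups that have no records from B.
    D = filter C by COUNT(B) == 0;
    -- Turn the bag of A records back into individual records.
    E = foreach D generate flatten(A);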

    This will do an anti-join. It cogroups everything in A and B on the same key, filters out any groups that have records in B, and then flattens the grouping of A so that you again have individual records.
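
    The same result can also be reached with a left outer join followed by a null check, which may be closer to what the question had in mind. The following is only a sketch under assumptions: the single-column schemas are hypothetical (extend them to match the real files), and the files are assumed to be readable by Pig's default loader.

```pig
-- Hypothetical schemas: replace with the actual fields of each file.
A = load 'file A' as (studentname:chararray);
B = load 'file B' as (studentname:chararray);
-- Keep every record of A; fields from B are null where no match exists.
J = join A by studentname left outer, B by studentname;
-- Records with no match in B are the students who are not in detention.
K = filter J by B::studentname is null;
-- Project back to the fields of A only.
E = foreach K generate A::studentname as studentname;
```

    Either approach should return the students in File A that do not appear in File B. The cogroup version does not depend on inspecting the joined fields for nulls, so it may be the simpler choice when File B's schema is not known.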
