Home Forums Hive / HCatalog HIVE GROUP BY issue

Tagged: ,

This topic contains 9 replies, has 5 voices, and was last updated by  Akki Sharma 1 year ago.

  • Creator
    Topic
  • #27240

    Hi,

    On running the following query I am getting multiple records with same value of F1

    SELECT F1, COUNT(*)
    FROM
    (
    SELECT F1, F2, COUNT(*)
    FROM TABLE1
    GROUP BY F1, F2
    ) a
    GROUP BY F1;

    As per what I understand there are multiple number of records based on number of reducers.

    Replicating the test scenario:
    STEP1: get the dataset as available in http://snap.stanford.edu/data/amazon0302.html

    STEP2: Open the file and delete the heading

    STEP3: hadoop fs -mkdir /test

    STEP4: hadoop fs -put amazon0302.txt /test

    STEP5: create external table test (f1 int, f2 int) row format delimited fields terminated by ‘\t’ lines terminated by ‘\n’ stored as textfile location ‘/test’;

    STEP6: create table test1 location ‘/test1′ as select left_table.* from (select * from test where f1<10000) left_table join (select * from test where f1 < 10000) right_table;

    STEP7: hadoop fs -mkdir /test2

    STEP8: create table test2 location '/test2' as select f1, count(*) from (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;

    STEP9: select * from test2 where f1 = 9887;

    ENVIRONMENT:
    HADOOP 2.0.4
    HIVE 0.11

    Please do let me know whether I am doing anything wrong.

    Thanks and Regards,
    Gourav Sengupta
    (PS: I have previously posted this issue in HIVE groups but am yet to receive any response. My apologies as the test data generation does take time)

Viewing 9 replies - 1 through 9 (of 9 total)

You must be logged in to reply to this topic.

  • Author
    Replies
  • #33713

    Akki Sharma
    Moderator

    Please try with set hive.optimize.reducededuplication = false;
    and see if it makes a difference.

    Collapse
    #31671


    Member

    Hi,

    I have run into this bug too, and would like to track the issue. Is there an Apache JIRA – I couldn’t find one, but may have just not searched well enough.

    Thank you

    Collapse
    #28938

    Carter Shanklin
    Participant

    Gourav,

    I meant to respond quite some time ago but got busy with Hadoop summit. I filed an internal bug for this back in June and alerted the dev team. It should appear as an Apache JIRA before too long and we plan to fix the issue.

    Collapse
    #28608

    tedr
    Moderator

    Hi Gourav,

    So are you then making the request to log this as a bug through this post?

    Thanks,
    Ted.

    Collapse
    #28078

    Hi Ted,

    thanks a ton for taking time to understand the issue and respond.

    I had a chance to talk with Alan while he was around in London for one of the meetups and he mentioned that this must be an issue with the optimizer and asked had then asked me to request this issue to be filed as a bug.

    In case this still appears as a bug then it would be great to know whether it is being considered for resolution in the upcoming releases or if there is any patch for the same. We are eagerly looking forward to use HIVE 0.11 in production environment and this issue is a major impediment.

    Please do let me know in case there is anything in particular that I can do, or any tests that you would want me to run.

    Thanks and Regards,
    Gourav

    Collapse
    #27885

    tedr
    Moderator

    Hi Gourav,

    I cannot find a specific bug for this issue. That doesn’t mean that there is not one filed, it just means that my search parameters were probably not what they should have been. I will keep looking.

    Ted.

    Collapse
    #27613

    Hi,

    can anyone please let me know whether there is any ticket for resolving this issue?

    Currently setting the reducers to 1 does not sound like an optimal solution.

    Thanks and Regards,
    Gourav Sengupta

    Collapse
    #27531

    Carter Shanklin
    Participant

    Let me clarify my note, use that reducer setting before you run STEP8.

    Collapse
    #27529

    Carter Shanklin
    Participant

    Gourav,

    This is a Hive bug. Hive is trying to be too clever with reducers, likely due to similarities in group by keys. It’s exacerbated by large numbers of reducers.

    To work around it set
    set hive.exec.reducers.max=1;

    Collapse
Viewing 9 replies - 1 through 9 (of 9 total)