Hive / HCatalog Forum




    When I run the following query, I get multiple records with the same value of F1:

    SELECT F1, COUNT(*)
    FROM (
    SELECT F1, F2, COUNT(*)
    FROM test1
    GROUP BY F1, F2
    ) a
    GROUP BY F1;

    As far as I understand, the number of duplicate records for a given F1 depends on the number of reducers.

    Replicating the test scenario:
    STEP1: get the dataset as available in

    STEP2: Open the file and delete the heading

    STEP3: hadoop fs -mkdir /test

    STEP4: hadoop fs -put amazon0302.txt /test

    STEP5: create external table test (f1 int, f2 int) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile location '/test';

    STEP6: create table test1 location '/test1' as select left_table.* from (select * from test where f1<10000) left_table join (select * from test where f1 < 10000) right_table;

    STEP7: hadoop fs -mkdir /test2

    STEP8: create table test2 location '/test2' as select f1, count(*) from (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;

    STEP9: select * from test2 where f1 = 9887;

    HADOOP 2.0.4
    HIVE 0.11

    Please do let me know whether I am doing anything wrong.

    Thanks and Regards,
    Gourav Sengupta
    (PS: I have previously posted this issue in HIVE groups but am yet to receive any response. My apologies as the test data generation does take time)

    Carter Shanklin


    This is a Hive bug. Hive is trying to be too clever with reducers, likely due to similarities in group by keys. It’s exacerbated by large numbers of reducers.

    To work around it, run:
    set hive.exec.reducers.max=1;

    Carter Shanklin

    Let me clarify my note: use that reducer setting before you run STEP8.
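
    The ordering described above could look like the following in a Hive session. This is only a sketch: it assumes the tables from STEP5 and STEP6 exist, test2_single_reducer is a hypothetical table name chosen so that the original test2 is left untouched, and the final duplicate check is an addition for illustration, not part of the original steps.

    ```sql
    -- Workaround: cap this session's jobs at a single reducer.
    set hive.exec.reducers.max=1;

    -- Same query as STEP8, written to a hypothetical new table.
    create table test2_single_reducer as
    select f1, count(*)
    from (select f1, f2, count(*) from test1 group by f1, f2) a
    group by f1;

    -- Added check: every row returned is an f1 value that appears more than once.
    select f1, count(*) as n
    from test2_single_reducer
    group by f1
    having count(*) > 1;
    ```

    If the workaround takes effect, the last query would be expected to return no rows. Running the same check against the original test2 should list the duplicated F1 values.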



    Can anyone please let me know whether there is a ticket for resolving this issue?

    Limiting the job to a single reducer does not sound like an optimal solution.

    Thanks and Regards,
    Gourav Sengupta


    Hi Gourav,

    I cannot find a specific bug for this issue. That doesn’t mean that one has not been filed; it just means that my search parameters were probably not what they should have been. I will keep looking.



    Hi Ted,

    Thanks a ton for taking the time to understand the issue and respond.

    I had a chance to talk with Alan while he was in London for one of the meetups. He mentioned that this must be an issue with the optimizer and asked me to request that it be filed as a bug.

    If this is still considered a bug, it would be great to know whether it is being considered for resolution in an upcoming release, or whether there is a patch for it. We are eagerly looking forward to using HIVE 0.11 in a production environment, and this issue is a major impediment.

    Please do let me know in case there is anything in particular that I can do, or any tests that you would want me to run.

    Thanks and Regards,


    Hi Gourav,

    So are you then making the request to log this as a bug through this post?


    Carter Shanklin


    I meant to respond quite some time ago but got busy with Hadoop Summit. I filed an internal bug for this back in June and alerted the dev team. It should appear as an Apache JIRA before too long, and we plan to fix the issue.



    I have run into this bug too, and would like to track the issue. Is there an Apache JIRA? I couldn’t find one, but I may just not have searched well enough.

    Thank you

    Akki Sharma

    Please try with set hive.optimize.reducededuplication = false;
    and see if it makes a difference.
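
    A sketch of how this suggestion might be tried, assuming the tables from STEP5 and STEP6 are still in place and the reducer limit is left at its default. The table name test2_no_dedup is hypothetical and is used only so that the earlier test2 remains available for comparison.

    ```sql
    -- Turn off the reduce deduplication optimization for this session.
    set hive.optimize.reducededuplication=false;

    -- Same query as STEP8, written to a hypothetical new table.
    create table test2_no_dedup as
    select f1, count(*)
    from (select f1, f2, count(*) from test1 group by f1, f2) a
    group by f1;

    -- Same check as STEP9: compare the number of rows returned with the earlier result.
    select * from test2_no_dedup where f1 = 9887;
    ```

    If the setting makes a difference, the last query should return a single row for f1 = 9887 rather than several.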
