Please try with set hive.optimize.reducededuplication = false; and see if it makes a difference. ...
Please try with set hive.optimize.reducededuplication = false;
and see if it makes a difference.
On running the following query I am getting multiple records with same value of F1
SELECT F1, COUNT(*)
SELECT F1, F2, COUNT(*)
GROUP BY F1, F2
GROUP BY F1;
As per what I understand there are multiple number of records based on number of reducers.
Replicating the test scenario:
STEP1: get the dataset as available in http://snap.stanford.edu/data/amazon0302.html
STEP2: Open the file and delete the heading
STEP3: hadoop fs -mkdir /test
STEP4: hadoop fs -put amazon0302.txt /test
STEP5: create external table test (f1 int, f2 int) row format delimited fields terminated by ‘\t’ lines terminated by ‘\n’ stored as textfile location ‘/test';
STEP6: create table test1 location ‘/test1′ as select left_table.* from (select * from test where f1<10000) left_table join (select * from test where f1 < 10000) right_table;
STEP7: hadoop fs -mkdir /test2
STEP8: create table test2 location '/test2' as select f1, count(*) from (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;
STEP9: select * from test2 where f1 = 9887;
Please do let me know whether I am doing anything wrong.
Thanks and Regards,
(PS: I have previously posted this issue in HIVE groups but am yet to receive any response. My apologies as the test data generation does take time)