Please try with set hive.optimize.reducededuplication = false; and see if it makes a difference. ...
Please try with set hive.optimize.reducededuplication = false;
and see if it makes a difference.
On running the following query I am getting multiple records with same value of F1
SELECT F1, COUNT(*)
SELECT F1, F2, COUNT(*)
GROUP BY F1, F2
GROUP BY F1;
As per what I understand there are multiple number of records based on number of reducers.
Replicating the test scenario:
STEP1: get the dataset as available in http://snap.stanford.edu/data/amazon0302.html
STEP2: Open the file and delete the heading
STEP3: hadoop fs -mkdir /test
STEP4: hadoop fs -put amazon0302.txt /test
STEP5: create external table test (f1 int, f2 int) row format delimited fields terminated by ‘\t’ lines terminated by ‘\n’ stored as textfile location ‘/test';
STEP6: create table test1 location ‘/test1′ as select left_table.* from (select * from test where f1<10000) left_table join (select * from test where f1 < 10000) right_table;
STEP7: hadoop fs -mkdir /test2
STEP8: create table test2 location '/test2' as select f1, count(*) from (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;
STEP9: select * from test2 where f1 = 9887;
Please do let me know whether I am doing anything wrong.
Thanks and Regards,
(PS: I have previously posted this issue in HIVE groups but am yet to receive any response. My apologies as the test data generation does take time)
You must be logged in to reply to this topic.