I have two tables:
tbl1: 81m rows
tbl2: 4m rows
tbl1 is partitioned on one column and tbl2 has none.
I’m attempting the following query:
JOIN tbl1 ON (tbl1.col_pk=tbl2.col_pk)
WHERE tbl1.partitioned_col IN (‘2011′,’2012′,’2013’)
I get this error:
OutOfMemoryError: GC overhead limit exceeded
So, I followed the suggestion at the end of the error output (Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e ‘set hive.map.aggr.hash.percentmemory = 0.25;’) through several iterations, eventually getting my hive.map.aggr.hash.percentmemory setting down to something like .0165 and it still failed.
I did some searching and found some convoluted recommendations of what to try next. Some mentioned upping my heap size, some mentioned re-writing my query, etc. I upped my Hadoop maximum Java heap size to 4096mb ,re-ran, and got the same results.
Currently, some relevant settings are:
NameNode Heap Size: 4096mb
DataNode maximum Java heap size: 4096mb
Hadoop maximum Java heap size: 4096mb
Java Options for MapReduce tasks: 768mb
I have 16 map slots and 8 reduce slots available (5 node cluster, 4 data and one name)
Thanks in advance for the help,