Hive / HCatalog Forum

GC Error/OOM w/ Hive Query

  • #46160
    Nick Martin

    Hi all,

    I have two tables:

    tbl1: 81m rows
    tbl2: 4m rows

    tbl1 is partitioned on one column; tbl2 is not partitioned.

    I’m attempting the following query:

    FROM tbl2
    JOIN tbl1 ON (tbl1.col_pk=tbl2.col_pk)
    WHERE tbl1.partitioned_col IN ('2011','2012','2013')

    I get this error:

    OutOfMemoryError: GC overhead limit exceeded

    So, I followed the suggestion at the end of the error output (Currently is set to 0.5. Try setting it to a lower value. i.e. 'set = 0.25;') through several iterations, eventually getting the setting down to something like 0.0165, and it still failed.
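
    The error message is truncated above; the setting it refers to is presumably hive.map.aggr.hash.percentmemory, the fraction of task heap the map-side aggregation hash table may use (that is the parameter named in Hive's standard hash-aggregation warning). A minimal sketch of the two usual knobs, run in the Hive session before the query:

    -- assumed parameter name, from Hive's standard hash-aggregation warning;
    -- lowers how much heap the map-side hash table may fill before flushing
    set hive.map.aggr.hash.percentmemory = 0.25;
    -- or sidestep the hash table entirely by disabling map-side aggregation
    set hive.map.aggr = false;

    Disabling map-side aggregation trades the memory pressure for more shuffle traffic.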

    I did some searching and found some convoluted recommendations for what to try next. Some mentioned upping my heap size, some mentioned rewriting my query, etc. I upped my Hadoop maximum Java heap size to 4096 MB, re-ran, and got the same results.

    Currently, some relevant settings are:

    NameNode Heap Size: 4096 MB
    DataNode maximum Java heap size: 4096 MB
    Hadoop maximum Java heap size: 4096 MB
    Java Options for MapReduce tasks: 768 MB

    I have 16 map slots and 8 reduce slots available (5-node cluster: 4 data nodes and 1 name node).
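
    A quick bit of arithmetic on those numbers, for anyone reading along: the 4096 MB figures are daemon heaps and do not apply to the tasks themselves; each map or reduce task runs in its own child JVM capped by the 768 MB task option. With 16 map slots spread over 4 data nodes, that is 4 slots per node, or roughly 4 x 768 MB = 3 GB of task heap per node, so 768 MB per task is the binding limit here.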

    Thanks in advance for the help,

  • #46763
    Yi Zhang

    Hi Nick,

    If it is the task that is hitting the OOM, try increasing the mapred task JVM heap.
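
    A minimal sketch of doing that from the Hive session, assuming Hadoop 1.x-era property names (it affects only the tasks launched by this session):

    -- assumed property name for pre-YARN clusters; sizes the child task JVMs
    set mapred.child.java.opts=-Xmx2048m;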

    Also, since this query is mainly a sum aggregation, I suggest giving an ORC table a try.
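
    For illustration, a sketch of copying tbl1 into ORC; the name tbl1_orc is made up here, and the partitioning is omitted because CTAS cannot create a partitioned table:

    -- hypothetical ORC copy of tbl1 to test the query against
    CREATE TABLE tbl1_orc STORED AS ORC
    AS SELECT * FROM tbl1;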


    Nick Martin

    I increased the mapred task JVM heap by 2x and am still seeing the same results.

    Carter Shanklin

    What options did you set? My guess is that your OOMs happened in the reducers. 768 MB is a really small amount of memory; make sure you increased heap space for the reducers as well as the mappers.
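
    A sketch of raising both sides explicitly, assuming Hadoop 2.x-style property names (on older clusters the single mapred.child.java.opts above covers both):

    -- assumed property names; the reduce-side one matters if the OOM is in the reducers
    set mapreduce.map.java.opts=-Xmx2048m;
    set mapreduce.reduce.java.opts=-Xmx2048m;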
