Overhead of queries executed via beeswax

This topic contains 2 replies, has 3 voices, and was last updated by  Noam Cohen 3 months, 3 weeks ago.

    Michael
    Participant

    We have set the following property in the /etc/hive/conf/hive-site.xml of our HDP Sandbox 2.1:

    <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
    </property>
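
To confirm which engine a given hive-site.xml actually selects, the property can be read back programmatically. The following is only a minimal sketch: the inline XML stands in for /etc/hive/conf/hive-site.xml, and `get_property` is a hypothetical helper name, not part of Hive.

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for /etc/hive/conf/hive-site.xml (assumption:
# the real file wraps its <property> entries in a <configuration> root).
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


engine = get_property(SAMPLE_HIVE_SITE, "hive.execution.engine")
print(engine)
```

To check the real file, read its contents and pass the text to the same helper.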

    A simple query, "select avg(salary) from sample_07;", takes a very different amount of time depending on whether it is executed from the command line or via beeswax.

    The first execution via the Hive CLI usually takes around 20 seconds, sometimes 40 seconds:

    hive> select avg(salary) from sample_07;
    [...]
    Status: Running (application id: application_1399099087178_0040)
    [...]
    Map 1: 1/1 Reducer 2: 1/1
    Status: Finished successfully
    OK
    47963.62637362637
    Time taken: 18.808 seconds, Fetched: 1 row(s)

    A second execution within the same CLI session reuses the same Tez container (application_1399099087178_0040) and usually takes around 2 seconds, sometimes 10 seconds:

    hive> select avg(salary) from sample_07;
    [...]
    Status: Running (application id: application_1399099087178_0040)
    [...]
    Map 1: 1/1 Reducer 2: 1/1
    Status: Finished successfully
    OK
    47963.62637362637
    Time taken: 2.613 seconds, Fetched: 1 row(s)
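
The gap between the two CLI runs above gives a rough estimate of the one-time startup cost. A small sketch that pulls the "Time taken" values out of the CLI output and compares them (the helper name `time_taken_seconds` is made up for illustration, and the estimate assumes the query work itself costs the same in both runs):

```python
import re

# Matches the summary line the Hive CLI prints after each query.
TIME_TAKEN = re.compile(r"Time taken: ([0-9.]+) seconds")


def time_taken_seconds(cli_output):
    """Return every 'Time taken' value found in the CLI output, in seconds."""
    return [float(value) for value in TIME_TAKEN.findall(cli_output)]


# Summary lines copied from the two runs shown above.
first_run = "Time taken: 18.808 seconds, Fetched: 1 row(s)"
second_run = "Time taken: 2.613 seconds, Fetched: 1 row(s)"

cold = time_taken_seconds(first_run)[0]
warm = time_taken_seconds(second_run)[0]

# Difference attributed to starting the Tez application (an assumption).
startup_overhead = cold - warm
speedup = cold / warm

print(f"startup overhead: {startup_overhead:.3f} s, speedup: {speedup:.1f}x")
```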

    When the CLI is closed, the Tez container is also shut down within seconds.

    Executing the same query via beeswax also takes around 20 seconds, sometimes 40 seconds, but it starts three different Tez containers:

    14/05/03 04:33:16 INFO tez.TezSessionState: User of session id 2261bf26-0ea6-4245-8e54-9a93fa773b60 is hue
    [...]
    14/05/03 04:33:17 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0041
    [...]
    14/05/03 04:33:18 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0042
    [...]
    14/05/03 04:33:41 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0043
    14/05/03 04:33:41 INFO mapred.FileInputFormat: Total input paths to process : 1

    The run took 25 seconds. The Tez containers (application_1399099087178_0041, _0042, _0043) keep running even after the query has finished.

    Executing the query again does not reuse any of these still-running containers, and therefore shows no significant speedup:

    14/05/03 04:34:08 INFO tez.TezSessionState: User of session id 0aa02a71-d6aa-4c42-a9d9-97f0f6b8957d is hue
    [...]
    14/05/03 04:34:09 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0044
    [...]
    14/05/03 04:34:10 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0045
    [...]
    14/05/03 04:34:31 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0046
    14/05/03 04:34:31 INFO mapred.FileInputFormat: Total input paths to process : 1

    This run took 23 seconds. Now we have Tez containers still running from two finished beeswax queries.

    A third run from beeswax takes much longer (56 seconds), as it has to wait for the old Tez containers to time out.
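
To make the pattern in the two beeswax log excerpts easier to see, the submitted applications can be grouped by session id. This is only a sketch: it assumes that every "Submitted application" line belongs to the most recent "User of session id" line, which holds for the excerpts above but may not hold in a log where several sessions interleave. The function name `applications_by_session` is hypothetical.

```python
import re

SESSION = re.compile(r"User of session id (\S+) is \S+")
SUBMITTED = re.compile(r"Submitted application (application_\d+_\d+)")

# Log lines copied from the two beeswax runs above (elided lines omitted).
LOG = """\
14/05/03 04:33:16 INFO tez.TezSessionState: User of session id 2261bf26-0ea6-4245-8e54-9a93fa773b60 is hue
14/05/03 04:33:17 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0041
14/05/03 04:33:18 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0042
14/05/03 04:33:41 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0043
14/05/03 04:34:08 INFO tez.TezSessionState: User of session id 0aa02a71-d6aa-4c42-a9d9-97f0f6b8957d is hue
14/05/03 04:34:09 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0044
14/05/03 04:34:10 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0045
14/05/03 04:34:31 INFO impl.YarnClientImpl: Submitted application application_1399099087178_0046
"""


def applications_by_session(log_text):
    """Map each session id to the applications submitted after it appeared."""
    result = {}
    current = None
    for line in log_text.splitlines():
        session = SESSION.search(line)
        if session:
            current = session.group(1)
            result[current] = []
            continue
        submitted = SUBMITTED.search(line)
        if submitted and current is not None:
            result[current].append(submitted.group(1))
    return result


apps = applications_by_session(LOG)
for session_id, app_ids in apps.items():
    print(session_id, len(app_ids), app_ids)
```

Each beeswax query shows up as a new session id with three freshly submitted applications, whereas the CLI runs reused a single application.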

    Can beeswax use only one Tez container per query?
    Can beeswax reuse existing containers?

    Noam Cohen
    Participant

    I'm experiencing the same issue. I tried using HiveServer2 instead of beeswax to work around it, but apparently this is not supported: http://hortonworks.com/community/forums/topic/hue-questions/

    This is a major problem for us. Hue is the main access point for our Hadoop users, and because of this problem they are forced to use MapReduce instead of Tez. Any ideas on how to solve this?

    Thejas Nair
    Participant

    I think the current version of Hue is not using HiveServer2. Once that changes, you should see similar speed improvements.
    If you want a GUI, you could also use a JDBC-based application such as 'SQuirreL SQL'. As that would use HiveServer2, you would see the speed improvements.