Home Forums Hive / HCatalog Hive Stinger

This topic contains 12 replies, has 3 voices, and was last updated by Thejas Nair 10 months, 3 weeks ago.

  • Creator
    Topic
  • #44430

    I have HDP 2 installed on RHEL 5.8 on a 2-node cluster. Everything is good,
    but I see poor performance on queries, so I would like to know more about Stinger.
    Is this a separate install on top of HDP 2? If so, can someone please provide the instructions to install it?

    Please shed some light.
    Thanks

Viewing 12 replies - 1 through 12 (of 12 total)


  • Author
    Replies
  • #44684

    Thejas Nair
    Participant

    FYI, there is work going on in Hive so that the join algorithm selection logic uses row and column statistics instead of file size; with that, out-of-memory errors like this can be avoided.
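    Until then, statistics have to be gathered explicitly if you want the optimizer to use them. A rough sketch of how that looks in Hive 0.12 (the table and column names are made-up examples):

    ```sql
    -- Gather basic table statistics (row counts, sizes); table name is hypothetical
    ANALYZE TABLE my_table COMPUTE STATISTICS;
    -- Gather column-level statistics as well (available since Hive 0.10)
    ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS id, name;
    ```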

    #44683

    Thejas Nair
    Participant

    Try asking the firewall question on the Ambari forum page or mailing list.
    ORC performance compared to text will depend on the query and the state of the cluster. It should be faster in general.
    Regarding the out-of-memory error in the join: you can try increasing the memory available to map tasks. If that does not help, try disabling the conversion to map join, either by setting hive.auto.convert.join=false or by setting hive.auto.convert.join.noconditionaltask.size to a smaller value (in bytes).
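    These can be set per session before running the query. A sketch (the values below are illustrative examples, not tuned recommendations):

    ```sql
    -- Increase heap for map tasks (property name varies by Hadoop version:
    -- mapred.child.java.opts on MR1, mapreduce.map.java.opts on YARN)
    SET mapreduce.map.java.opts=-Xmx2048m;
    -- Or disable automatic conversion to map join entirely:
    SET hive.auto.convert.join=false;
    -- Or lower the size threshold (in bytes) below which map-join conversion is attempted:
    SET hive.auto.convert.join.noconditionaltask.size=10000000;
    ```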

    #44670

    I see that this happens when the firewall is up; when it's down, the query completes fine.
    A little history: we had the firewall down for the install, and after everything was stable and working well, we brought it back up after configuring the ports.
    I have configured the firewall as per the Ambari install doc, and the Ambari dashboard looks good.
    No alerts or issues, but the Hive queries are running very slowly.
    I am running them from Hue.
    Not sure why. I will check the logs. Once the firewall is brought down, the queries run fast.

    Another issue:
    With the firewall down, I also did not see much performance gain on a join using an ORC table.
    I tried joining 2 tables of a million records each (5 columns); it failed on memory just like when they were in text format. I have about 14 GB of free memory on each node.
    Any reason why? Is it because I did not add the custom property hive.optimize.ppd=true?

    #44668

    Thejas Nair
    Participant

    Check the logs of the map tasks through the JobTracker web UI. It looks like tasks are failing for some reason (maybe an issue with some nodes?).

    #44667

    I am seeing very slow performance on all Hive queries; even a select count(*) from a 1-million-row table ran for 40 minutes.

    What I noticed is that the map task goes to 100%, then back to 0%, then to 100% and back to 0% again, and keeps running like that for several minutes.
    Any idea why?
    This is happening with both ORC and ordinary tables.

    13/12/02 14:12:31 INFO exec.Task: 2013-12-02 14:12:31,594 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:13:31 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    2013-12-02 14:13:31,724 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:13:31 INFO exec.Task: 2013-12-02 14:13:31,724 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:14:31 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    2013-12-02 14:14:31,841 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:14:31 INFO exec.Task: 2013-12-02 14:14:31,841 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:15:31 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    2013-12-02 14:15:31,902 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:15:31 INFO exec.Task: 2013-12-02 14:15:31,902 Stage-1 map = 0%, reduce = 0%
    2013-12-02 14:15:45,139 Stage-1 map = 100%, reduce = 0%
    13/12/02 14:15:45 INFO exec.Task: 2013-12-02 14:15:45,139 Stage-1 map = 100%, reduce = 0%
    2013-12-02 14:16:45,429 Stage-1 map = 100%, reduce = 0%
    13/12/02 14:16:45 INFO exec.Task: 2013-12-02 14:16:45,429 Stage-1 map = 100%, reduce = 0%
    2013-12-02 14:17:45,527 Stage-1 map = 100%, reduce = 0%
    13/12/02 14:17:45 INFO exec.Task: 2013-12-02 14:17:45,527 Stage-1 map = 100%, reduce = 0%
    2013-12-02 14:17:50,620 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:17:50 INFO exec.Task: 2013-12-02 14:17:50,620 Stage-1 map = 0%, reduce = 0%
    13/12/02 14:18:50 INFO exec.Task: 2013-12-02 14:18:50,697 Stage-1 map = 0%, reduce = 0%

    #44615

    Thejas Nair
    Participant

    Yes, the ORC format is not tab-separated. It has a specific binary encoding scheme, so you cannot load tab-separated files directly into ORC format (or any other binary format, including RC, Avro, etc.).
    The blog posts have details on the format.
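    The usual way to get tab-separated data into an ORC table is to load the file into a plain text-format staging table first and then insert from it, letting Hive do the ORC encoding. A sketch, with hypothetical table names, columns, and path:

    ```sql
    -- Text-format staging table matching the tab-delimited file (names/path are examples)
    CREATE TABLE staging_tsv (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    LOAD DATA INPATH '/user/hive/data.tsv' INTO TABLE staging_tsv;

    -- ORC table, populated via INSERT ... SELECT so that Hive writes real ORC files
    CREATE TABLE events_orc (id INT, name STRING) STORED AS ORC;
    INSERT INTO TABLE events_orc SELECT * FROM staging_tsv;
    ```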

    #44594

    I got it. The last time, I loaded the table from a tab-delimited file using the LOAD option in Hive.
    This time, I dropped the table, re-created the ORC table, and loaded it with an insert statement as select * from a regular table.
    So this time I was able to query the ORC table.
    I guess that when an ORC table is loaded directly from a file, the data is not formatted correctly.

    #44568

    I am not sure if I understand correctly. Are you saying that a select * from the table is not possible with an ORC table? That's what gave me the exception.

    #44554

    Yi Zhang
    Moderator

    Hi Siva,

    The ORC table format stores the table as ORC files. The viewer you are using probably can't parse the ORC data, which is why you can't view the content of the files. You can try 'hive --service orcfiledump' to view metadata about the ORC files.
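    For reference, the invocation looks like this (the file path is a hypothetical example of a data file under the table's warehouse directory):

    ```shell
    # Print ORC metadata (compression, stripe/row counts, column statistics)
    hive --service orcfiledump /apps/hive/warehouse/events_orc/000000_0
    ```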

    In general, if a property is not exposed in Ambari, you can add it through the custom section.

    Thanks,
    Yi

    #44539

    Hi,
    My initial format was text. I created a table stored as ORC and loaded it from a tab-delimited file using the LOAD path option in Hive.
    It completed fine, but I get an exception while trying to view or browse the data.
    Any idea why?
    Also, I see that the blog recommends setting hive.optimize.ppd=true. I don't see this property in Hive through Ambari; should I add it as a custom one?
    Is the exception due to this property not being set?
    Please shed some light.

    #44531

    Hi Thejas,
    Thanks a lot for the information. I am new to Hadoop, so pardon me: I don't use any specific format in Hive, just the default. I create external and internal tables, and it was while querying them that I found the performance issues.
    The blog you referred to was also very helpful; I will try storing as ORC.

    #44500

    Thejas Nair
    Participant

    Parts of the Stinger release are available in HDP 2 (via Hive 0.12). This includes optimizer improvements and the ORC file format. No separate install on top of HDP 2.0 is required.
    What format are you using? You would want to try the ORC file format to get a big portion of the improvements.
    You can check out our blog on ORC – http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

    The next set of improvements will come through the use of Apache Tez and vectorization. Vectorization is already available in the Apache Hive trunk codebase (if you want to build it and use it on top of HDP 2.0). The Apache Tez integration work is available on a branch of the Apache Hive codebase.
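    If a table already exists in text format, one quick way to try ORC on it is CREATE TABLE AS SELECT, which copies the data into ORC in a single statement (table names here are made-up examples):

    ```sql
    -- Copy an existing text-format table into a new ORC table via CTAS
    CREATE TABLE orders_orc STORED AS ORC AS SELECT * FROM orders;
    ```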
