HDP 2.0 /Hive/Stinger

This topic contains 2 replies, has 2 voices, and was last updated by  Dipti Joshi 1 year ago.

  • Creator
    Topic
  • #34559

    Dipti Joshi
    Member

    I have HDP 2.0 beta installed, including Hive. Where can I find the benchmark database and queries used to profile the Hive improvements in Stinger? I have previously benchmarked Hive on HDP 1.3 using my own data and queries, but I do not see any performance improvement for those queries on HDP 2.0. Has Hortonworks established a benchmark database and query set to measure the performance of Hive in HDP 2.0 vs. HDP 1.3?

Viewing 2 replies - 1 through 2 (of 2 total)

  • Author
    Replies
  • #36455

    Dipti Joshi
    Member

    We are using Piwik data on a 6-node cluster (2 NameNode/Secondary NameNode and 4 DataNodes) of AWS m1.xlarge instances, 250 G.

    An example table being queried is:

    CREATE TABLE log_link_visit_action${hiveconf:tbl_version} (
    idlink_va INT,
    idsite INT,
    server_time TIMESTAMP,
    idvisit INT,
    visitor_idcookie STRING,
    visitor_idcookie_a BIGINT,
    visitor_idcookie_b BIGINT,
    idaction_url INT,
    idaction_url_ref INT,
    idaction_name INT,
    idaction_name_ref INT,
    time_spent_ref_action INT
    )
    STORED AS ORC;

    and an example query is:
    set st_time=1293840000;
    set en_time=1293926399;

    SELECT
    idaction_url idaction,
    count(distinct idvisit) dis_visit ,
    count(*) subquery_count
    FROM log_link_visit_action
    WHERE ( (server_time >= ${hiveconf:st_time}) AND
    (server_time 0) )
    GROUP BY idaction_url
    ORDER BY subquery_count DESC LIMIT 10000;

    I have tried ORC files both with and without compression, but there was not much difference. I have tuned the number of mappers and reducers, as well as mapreduce.map.java.opts and mapreduce.reduce.java.opts, with memory per task up to 1024m. This has given only a small improvement over the HDP 1.3 version.
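
    For reference, a setup like the one described above might look roughly like the following in a Hive session. This is only a sketch: the -Xmx1024m form of the JVM options and the table names log_link_visit_action_nocomp and log_link_visit_action_zlib are assumptions made for illustration, not the exact settings used in this test.

    ```sql
    -- Per-task JVM heap for map and reduce tasks.
    -- The -Xmx1024m form is an assumption about how "1024m" was applied.
    SET mapreduce.map.java.opts=-Xmx1024m;
    SET mapreduce.reduce.java.opts=-Xmx1024m;

    -- Copy of the table stored as ORC without compression
    -- (hypothetical table name, for comparison only).
    CREATE TABLE log_link_visit_action_nocomp
    STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE")
    AS SELECT * FROM log_link_visit_action;

    -- Copy of the table stored as ORC with ZLIB compression
    -- (hypothetical table name, for comparison only).
    CREATE TABLE log_link_visit_action_zlib
    STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB")
    AS SELECT * FROM log_link_visit_action;
    ```

    Running the same query against both copies would be one way to check whether ORC compression affects the runtime on this data set.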

    #36003

    Carter Shanklin
    Participant

    Hi Dipti,

    We extensively use the TPC-DS suite of queries to benchmark: http://www.tpc.org/tpcds/
    This benchmark simulates a data warehousing environment with a dimensionalized schema.

    Can you give details on your data and the types of queries you run? If so, we might be able to give some pointers that can make your queries faster.
