Hive / HCatalog Forum

HDP 2.0 / Hive / Stinger

  • #34559
    Dipti Joshi
    Member

    I have the HDP 2.0 beta installed, including Hive. Where can I find the benchmark database/queries used to profile the Hive improvements in Stinger? In the past I benchmarked Hive on HDP 1.3 using my own data and queries, but I do not see an improvement in the performance of my queries on HDP 2.0. Is there an established benchmark database/query set that Hortonworks uses to compare Hive performance in HDP 2.0 vs. HDP 1.3?

  • #36003
    Carter Shanklin
    Participant

    Hi Dipti,

    We extensively use the TPC-DS suite of queries to benchmark: http://www.tpc.org/tpcds/
    This benchmark simulates a data warehousing environment with a dimensionalized schema.

    Can you give details on your data and the types of queries you run? If so, we might be able to offer some pointers to make your queries faster.

    #36455
    Dipti Joshi
    Member

    We are using Piwik data on a 6-node AWS cluster of m1.xlarge instances (2 nodes for the NameNode/SecondaryNameNode and 4 DataNodes), 250 GB.

    An example table being queried is:

    CREATE TABLE log_link_visit_action${hiveconf:tbl_version} (
      idlink_va INT,
      idsite INT,
      server_time TIMESTAMP,
      idvisit INT,
      visitor_idcookie STRING,
      visitor_idcookie_a BIGINT,
      visitor_idcookie_b BIGINT,
      idaction_url INT,
      idaction_url_ref INT,
      idaction_name INT,
      idaction_name_ref INT,
      time_spent_ref_action INT
    )
    STORED AS ORC;
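
    The query in this thread filters on server_time, so one common Hive optimization is to partition the table so that the filter can prune whole partitions instead of scanning everything. A rough sketch (the `dt` partition column, its granularity, and the table name are illustrative assumptions, not part of the original schema):

    ```sql
    -- Hypothetical day-partitioned variant of the table above.
    -- 'dt' would be derived from server_time when loading the data.
    CREATE TABLE log_link_visit_action_by_day (
      idlink_va INT,
      idsite INT,
      server_time TIMESTAMP,
      idvisit INT
      -- remaining columns as in the table above
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;
    ```

    With this layout, adding a `dt` predicate alongside the server_time range lets Hive read only the matching partitions.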

    and an example query is:

    set st_time=1293840000;
    set en_time=1293926399;

    SELECT
      idaction_url idaction,
      count(distinct idvisit) dis_visit,
      count(*) subquery_count
    FROM log_link_visit_action
    WHERE ( (server_time >= ${hiveconf:st_time}) AND
            (server_time <= ${hiveconf:en_time}) )
    GROUP BY idaction_url
    ORDER BY subquery_count DESC LIMIT 10000;
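
    Much of the Stinger speedup in HDP 2.0 only appears when the new execution paths are switched on; plain MapReduce execution over ORC behaves much like HDP 1.3. As a sketch (assuming an HDP 2.0 build that ships Tez and vectorization; availability of these properties can vary between preview releases):

    ```sql
    -- Run the query on Tez instead of classic MapReduce.
    set hive.execution.engine=tez;
    -- Vectorized execution processes rows in batches; it requires ORC storage.
    set hive.vectorized.execution.enabled=true;
    -- Allow ORC to skip stripes/row groups that cannot match the WHERE clause.
    set hive.optimize.index.filter=true;
    ```

    Re-running the same aggregation with these settings enabled is a quick way to see whether the Stinger code paths are actually being exercised.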

    I have tried ORC both with and without compression, but there is not much difference. I have tuned the number of mappers and reducers, as well as mapreduce.map.java.opts and mapreduce.reduce.java.opts, with memory per task up to 1024m. It has given only a small improvement over HDP 1.3.
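
    When comparing compressed and uncompressed ORC, it can help to set the ORC table properties explicitly rather than relying on defaults, so the two runs differ only in the knob being measured. A sketch (the table name and the specific values are illustrative assumptions, not recommendations):

    ```sql
    -- Hypothetical variant of the table with explicit ORC settings.
    CREATE TABLE log_link_visit_action_tuned (
      idlink_va INT,
      idvisit INT,
      server_time TIMESTAMP
      -- remaining columns as in the original table
    )
    STORED AS ORC
    TBLPROPERTIES (
      "orc.compress"     = "SNAPPY",    -- or "ZLIB" / "NONE"
      "orc.stripe.size"  = "268435456", -- 256 MB stripes
      "orc.create.index" = "true"       -- row-group indexes for predicate pushdown
    );
    ```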
