HDP 2.0 /Hive/Stinger

This topic contains 2 replies, has 2 voices, and was last updated by  Dipti Joshi 1 year, 9 months ago.

  • Creator
  • #34559

    Dipti Joshi

    I have HDP 2.0 beta installed, including Hive. Where can I find the benchmark database and queries used to profile the Hive improvements in Stinger? In the past I benchmarked Hive on HDP 1.3 using my own data and queries, but I do not see any performance improvement for those queries on HDP 2.0. Has Hortonworks established a benchmark database and query set to measure the performance of Hive in HDP 2.0 vs. HDP 1.3?

  • Author
  • #36455

    Dipti Joshi

    We are using Piwik data on a 6-node cluster (2 nodes for the NameNode and Secondary NameNode, plus 4 DataNodes) of AWS m1.xlarge instances, 250 GB.

    The example table being queried is:

    CREATE TABLE log_link_visit_action${hiveconf:tbl_version} (
    idlink_va INT,
    idsite INT,
    server_time TIMESTAMP,
    idvisit INT,
    visitor_idcookie STRING,
    visitor_idcookie_a bigint,
    visitor_idcookie_b bigint ,
    idaction_url INT,
    idaction_url_ref INT,
    idaction_name INT,
    idaction_name_ref INT,
    time_spent_ref_action INT
    );

    and the example query is:
    set st_time=1293840000;
    set en_time=1293926399;

    SELECT
    idaction_url idaction,
    count(distinct idvisit) dis_visit ,
    count(*) subquery_count
    FROM log_link_visit_action
    WHERE ( (server_time >= ${hiveconf:st_time}) AND
    (server_time <= ${hiveconf:en_time}) AND
    (... > 0) )
    GROUP BY idaction_url
    ORDER by subquery_count desc LIMIT 10000 ;

    I have tried ORCFile both with and without compression, but it made little difference. I have tuned the number of mappers and reducers, as well as mapreduce.map.java.opts and mapreduce.reduce.java.opts, with memory per task up to 1024m. This gave only a small improvement over the HDP 1.3 version.
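
    For reference, the setup described above could look roughly like the following in HiveQL. This is only a sketch under assumptions: the table name log_link_visit_action_orc, the choice of ZLIB, and the exact -Xmx values are illustrative and are not taken from the actual cluster configuration.

    ```sql
    -- Hypothetical ORC copy of the table; "orc.compress" accepts NONE, ZLIB or SNAPPY.
    CREATE TABLE log_link_visit_action_orc
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB")
    AS SELECT * FROM log_link_visit_action;

    -- Per-task JVM heap, matching the "up to 1024m" mentioned above (assumed values).
    SET mapreduce.map.java.opts=-Xmx1024m;
    SET mapreduce.reduce.java.opts=-Xmx1024m;
    ```

    Setting "orc.compress"="NONE" instead gives the uncompressed variant for comparison.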


    Carter Shanklin

    Hi Dipti,

    We extensively use the TPC-DS suite of queries to benchmark: http://www.tpc.org/tpcds/
    This benchmark simulates a data warehousing environment with a dimensionalized schema.

    Can you give details on your data and types of queries? If so we might be able to give some pointers that can make your queries faster.
