The legacy Hortonworks Forum is now closed. The site will be taken offline on January 31, 2016.

Hive / HCatalog Forum

HDP 2.0 /Hive/Stinger

  • #34559
    Dipti Joshi
    Member

    I have HDP 2.0 beta installed, including Hive. Where can I find the benchmark database and queries used to profile the Hive improvements in Stinger? In the past I benchmarked Hive on HDP 1.3 using my own data and queries, but I do not see any performance improvement for those queries on HDP 2.0. Is there an established benchmark database and query set that Hortonworks uses to compare Hive performance in HDP 2.0 against HDP 1.3?

  • #36003
    Carter Shanklin
    Participant

    Hi Dipti,

    We extensively use the TPC-DS suite of queries to benchmark: http://www.tpc.org/tpcds/
    This benchmark simulates a data warehousing environment with a dimensionalized schema.

    Can you give details on your data and the types of queries you run? If so, we might be able to give some pointers to make your queries faster.

  • #36455
    Dipti Joshi
    Member

    We are using Piwik data on a 6-node cluster (2 NameNode/Secondary NameNode nodes and 4 DataNodes) of AWS m1.xlarge instances, 250 GB.

    An example of a table being queried is:

    CREATE TABLE log_link_visit_action${hiveconf:tbl_version} (
        idlink_va             INT,
        idsite                INT,
        server_time           TIMESTAMP,
        idvisit               INT,
        visitor_idcookie      STRING,
        visitor_idcookie_a    BIGINT,
        visitor_idcookie_b    BIGINT,
        idaction_url          INT,
        idaction_url_ref      INT,
        idaction_name         INT,
        idaction_name_ref     INT,
        time_spent_ref_action INT
    )
    STORED AS ORC;

    An example query is:

    set st_time=1293840000;
    set en_time=1293926399;

    SELECT
        idaction_url idaction,
        count(distinct idvisit) dis_visit,
        count(*) subquery_count
    FROM log_link_visit_action
    WHERE ( (server_time >= ${hiveconf:st_time}) AND
            (server_time <= ${hiveconf:en_time}) AND
            (... > 0) )
    GROUP BY idaction_url
    ORDER BY subquery_count DESC LIMIT 10000;

    I have tried ORCFile both with and without compression, but it made little difference. I have tuned the number of mappers and reducers, as well as mapreduce.map.java.opts and mapreduce.reduce.java.opts, with memory per task up to 1024m. This has given only a small improvement over the HDP 1.3 version.
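
    For concreteness, the settings described above might be applied along these lines in a Hive session. This is only a sketch: the reducer count, the table name orc_compress_demo and the choice of SNAPPY are assumptions for illustration, not the values actually used on this cluster.

```sql
-- JVM heap per map and reduce task, matching the 1024m ceiling mentioned above
SET mapreduce.map.java.opts=-Xmx1024m;
SET mapreduce.reduce.java.opts=-Xmx1024m;

-- Number of reducers (8 is an assumed value, for illustration only)
SET mapreduce.job.reduces=8;

-- ORC compression is selected per table through a table property;
-- commonly used orc.compress values are NONE, ZLIB and SNAPPY.
-- The table name orc_compress_demo is hypothetical.
CREATE TABLE orc_compress_demo (
    idvisit      INT,
    idaction_url INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```

    The compression property only affects data written after the table is created with it, so comparing compressed and uncompressed ORC means loading the same data into two separately created tables.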

The forum ‘Hive / HCatalog’ is closed to new topics and replies.
