Hive / HCatalog Forum

Can hive-testbench run on Hive 0.12.0 ?

  • #58221
    Hank Jakiela
    Participant

    I’d like to run the Hive benchmarks used in the recently published report comparing Hive 13 to Hive 10:

    http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/

    I’m trying to use the testbench at:

    https://github.com/cartershanklin/hive-testbench

    The readme file says Hive 13 is required, but the published report compares results from Hive 10, so I hope at least some of the queries will work on older Hives.

    Our cluster is currently running HDP 2.0.6 with Hive 0.12.0. Can I run any of the queries on Hive 12? If I have to skip some queries that aren't supported on older versions of Hive, that's fine. But if none of the queries will run on Hive 12, I won't waste time trying. And if none of the testbench queries run on Hive 12, how were the queries run on Hive 10 for the published report?


  • #58230
    Carter Shanklin
    Participant

    Hank, it's not strictly true that you need Hive 13 to run the benchmark. The problem is that large-scale data generation is extremely difficult without Hive 13, so the benchmark pushes you toward it. If you attempt data generation at any meaningful scale (1 TB+) with Hive 10 or 12, it will crash and be very difficult to tune around. I've seen people generate smaller datasets (100 GB or so) without problems.

    I had a customer go through this a few weeks ago, and what they ended up doing was comparing the performance of Hive 12 using textfile versus Hive 13 using textfile, and then Hive 13 using ORCFile. All of the data was generated in Hive 13. If you want to test at 1 TB+, this is the way to go. You can install a Hive 13 package on your cluster without interfering with Hive 12 and use it to generate the data.

    As an example, to generate 1 TB of text data use 'FORMAT=textfile ./tpcds-setup.sh 1000'. The database that is generated can be queried from both Hive 12 and Hive 13. When using textfile you won't get the benefits of vectorization or ORCFile, but you will get the benefit of Tez. In addition, you could compare both against Hive 13 using ORCFile / vectorization to see the added benefit.
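
    A minimal sketch of that flow, assuming generation is run with the Hive 13 client and that the database name follows the testbench's usual <benchmark>_bin_partitioned_<format>_<scale> pattern (the exact name may differ between testbench versions):

    # generate ~1 TB of TPC-DS data as plain text (run with the Hive 13 client)
    FORMAT=textfile ./tpcds-setup.sh 1000

    # the resulting database can then be queried from either Hive 12 or Hive 13
    hive -e "use tpcds_bin_partitioned_textfile_1000; show tables;"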

    #58242
    Hank Jakiela
    Participant

    Carter, thanks for your response. I’m sure that if I get to the point of scaling this up, this information will be very useful. At this point, I’m still trying to get things to run at any scale (I’m starting with a scale of 10).

    tpch-setup.sh works, but tpcds-setup.sh has failed in several ways. Then, when I try to run a query:

    # cd sample-queries-tpch
    # hive -i testbench.settings
    hive> use tpch_bin_partitioned_orc_100;
    hive.exec.pre.hooks Class not found:org.apache.hadoop.hive.ql.hooks.ATSHook
    FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook)
    java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.hooks.ATSHook
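
    The ATSHook class only ships with Hive 0.13, so this error is expected when a Hive 13-oriented testbench.settings is loaded on Hive 0.12. A possible workaround, assuming nothing else in the run depends on that hook, is to clear it for the session (or comment out the corresponding line in testbench.settings):

    hive> set hive.exec.pre.hooks=;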

    Anyway, I’ve dropped back to v1.1 of the testbench for the time being. So far, so good.

    Thanks
