Can hive-testbench run on Hive 0.12.0?

This topic contains 2 replies, has 2 voices, and was last updated by  Hank Jakiela 4 months, 3 weeks ago.

  • Creator
    Topic
  • #58221

    Hank Jakiela
    Participant

    I’d like to run the Hive benchmarks used in the recently published report comparing Hive 13 to Hive 10:

    http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/

    I’m trying to use the testbench at:

    https://github.com/cartershanklin/hive-testbench

    The readme file says Hive 13 is required, but the published report compares results against Hive 10, so I hope at least some of the queries will work on older versions of Hive.

    Our cluster is currently running HDP 2.0.6 with Hive 0.12.0. Can I run any of the queries on Hive 12? If I have to skip some of the queries not supported on older versions of Hive, that’s fine. But if none of the queries will run on Hive 12, I won’t waste time trying. If none of the queries in the testbench will run on Hive 12, how were queries run on Hive 10 for the published report?

Viewing 2 replies - 1 through 2 (of 2 total)

  • Author
    Replies
  • #58242

    Hank Jakiela
    Participant

    Carter, thanks for your response. I’m sure that if I get to the point of scaling this up, this information will be very useful. At this point, I’m still trying to get things to run at any scale (I’m starting with a scale of 10).

    tpch-setup.sh works, but tpcds-setup.sh has failed in several ways. Then, when I try to run a query, I get:

    # cd sample-queries-tpch
    # hive -i testbench.settings
    hive> use tpch_bin_partitioned_orc_100;
    hive.exec.pre.hooks Class not found:org.apache.hadoop.hive.ql.hooks.ATSHook
    FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook)
    java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.hooks.ATSHook

    Anyway, I’ve dropped back to v1.1 of the testbench for the time being. So far, so good.

    Thanks
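
The ClassNotFoundException above suggests that a pre-execution hook naming ATSHook is configured, and that class is not on the classpath of this Hive installation. A possible workaround, assuming the hook is set in testbench.settings rather than in hive-site.xml (the post does not say which), is to run from a copy of the settings file with those lines removed. The function name and the output file name below are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: copy a Hive settings file, dropping every line that
# mentions ATSHook. Assumption: the hook is configured in the settings file.
# Note that a dropped line may also have listed other hooks.
strip_ats_hook() {
    src="$1"
    dst="$2"
    # grep -v exits non-zero when it prints nothing; that is not an error here.
    grep -v 'ATSHook' "$src" > "$dst" || true
}

# Example usage (file names are illustrative):
#   strip_ats_hook testbench.settings testbench-hive12.settings
#   hive -i testbench-hive12.settings
```

If the error persists after this, the hook is probably configured elsewhere, for example in hive-site.xml.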

    #58230

    Carter Shanklin
    Participant

    Hank, it’s not strictly true that you need Hive 13 to run the benchmark. The problem is that large-scale data generation is extremely difficult without Hive 13, so the benchmark tries to push you toward Hive 13. If you do data generation at any meaningful scale (1 TB+) with Hive 10 or 12, it will crash and be very difficult to tune around. I’ve seen people generate smaller datasets (100 GB or so) without problems.

    I had a customer go through this a few weeks ago and what they ended up doing was comparing performance of Hive 12 using textfile versus Hive 13 using textfile, and then Hive 13 using ORCFile. All of the data was generated in Hive 13. If you want to test at 1TB+ you should go this way. You can install a Hive 13 package on your cluster without interfering with Hive 12 and use it to generate the data.

    For example, to generate 1 TB of text data, use 'FORMAT=textfile ./tpcds-setup.sh 1000'. The database that is generated can be queried from both Hive 12 and Hive 13. When using textfile you won’t get the benefits of vectorization or ORCFile, but you will get the benefit of Tez. In addition, you could compare both against Hive 13 using ORCFile / vectorization to see the added benefit.
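
As a small sketch of the invocation described above, the following wrapper builds the setup command for a given format and scale and only prints it unless told to run it. The command shape comes from the post; the function names and the DRY_RUN variable are hypothetical, and the wrapper assumes it is run from the testbench directory that contains tpcds-setup.sh:

```shell
#!/bin/sh
# Build the data-generation command for a format and scale factor.
# The post's example uses 1000 for 1 TB of text data.
setup_cmd() {
    format="$1"   # storage format, e.g. textfile
    scale="$2"    # scale factor, e.g. 10 for a small trial run
    echo "FORMAT=$format ./tpcds-setup.sh $scale"
}

# Print the command by default; set DRY_RUN=0 to actually run it.
run_setup() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        setup_cmd "$1" "$2"
    else
        FORMAT="$1" ./tpcds-setup.sh "$2"
    fi
}

# Example usage: show the command for a scale-10 textfile dataset.
run_setup textfile 10
```

Starting with a small scale factor makes it cheap to confirm that generation and queries work before attempting 1 TB.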
