Hank, it’s not strictly true that you need Hive 13 to run the benchmark. The problem is that large-scale data generation is extremely difficult without Hive 13, so the benchmark pushes you toward it. If you generate data at any meaningful scale (1 TB+) with Hive 10 or 12, it will crash, and the failures are very difficult to tune around. I’ve seen people generate smaller datasets (100 GB or so) without problems.
I had a customer go through this a few weeks ago. They ended up comparing the performance of Hive 12 using textfile against Hive 13 using textfile, and then against Hive 13 using ORCFile. All of the data was generated in Hive 13. If you want to test at 1 TB+, I’d suggest the same approach. You can install a Hive 13 package on your cluster without interfering with Hive 12 and use it to generate the data.
For example, to generate 1 TB of text data, run 'FORMAT=textfile ./tpcds-setup.sh 1000'. The resulting database can be queried from both Hive 12 and Hive 13. With textfile you won’t get the benefits of vectorization or ORCFile, but on Hive 13 you will still get the benefit of Tez. You could then compare both of those runs against Hive 13 using ORCFile and vectorization to see the added benefit.
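
To make the comparison concrete, here is a rough dry-run sketch that only prints the commands for the first two legs (Hive 12 on text, Hive 13 on text with Tez). Only the tpcds-setup.sh line comes from the steps above. The database name, the query file, and the two install paths are placeholders I made up for illustration, so check them against your own cluster. hive.execution.engine is the Hive 13 setting that selects Tez.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for the comparison instead of running them.

SCALE=1000                          # scale factor in GB; 1000 = 1 TB
DB="tpcds_text_${SCALE}"            # ASSUMPTION: name of the generated text database
QUERY="query.sql"                   # HYPOTHETICAL: path to one benchmark query
HIVE12="/opt/hive-0.12/bin/hive"    # HYPOTHETICAL: side-by-side install locations
HIVE13="/opt/hive-0.13/bin/hive"

print_plan() {
    # 1. Generate the text data once, using Hive 13.
    echo "FORMAT=textfile ./tpcds-setup.sh ${SCALE}"
    # 2. Baseline: Hive 12 (MapReduce only) against the text tables.
    echo "${HIVE12} -e 'USE ${DB}; SOURCE ${QUERY};'"
    # 3. Hive 13 on Tez against the same text tables.
    echo "${HIVE13} --hiveconf hive.execution.engine=tez -e 'USE ${DB}; SOURCE ${QUERY};'"
}

print_plan
```

Once the printed commands look right for your setup, run them by hand and time each query. The ORCFile leg would follow the same pattern against a database generated in ORC format.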