Yahoo! JAPAN needed a data platform that could scale to generate 100,000 reports per day while processing large amounts of data. It needed to keep the last 13 months' worth of data, approximately 500 billion rows, organized and easily accessible. Relational database management systems (RDBMS) cannot scale to these levels from a cost and processing-power perspective. Yahoo! JAPAN explored Hadoop to achieve this and evaluated two platforms against our requirements: Hortonworks Hive with Tez on YARN, and Cloudera Impala. Hive with Tez on YARN scaled beyond 15,000 queries per hour, while Impala hovered at about 2,500 queries per hour.
Our business report systems generate 15,000 reports per hour from about 500 billion lines of time-series data covering the last 13 months, based on conditions predefined by a user. Because they relied on file-based processing, these report systems suffered from several problems.
Putting the data in an RDBMS and using SQL queries would have allowed more processing variations and filter functions. However, migrating to an RDBMS was not feasible given the large amount of data and the large number of jobs.
We reviewed Hadoop-based SQL engines from Cloudera and Hortonworks, and solved these challenges by adopting Hortonworks HDP 2.2 with Apache Hive 0.14 and Apache Tez.
When a user specifies query conditions, the system generates the report data asynchronously for later download.
The scheduled report generation option was by far the most challenging. Many users set up periodic reports, such as daily and weekly ones. A huge number of report-generation processes run immediately after each daily data update, which spikes the number of queries per unit of time.
This report system handles data files in the following format (gzip-compressed text):

Format: text, gzip-compressed
File size per day: 10 GB
Lines per day: 1.3 billion
Data retention period: 13 months
Total data volume: 6,000 GB
Total lines: 450 billion
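As a rough sanity check on these figures (assuming ~30-day months, since the source does not give exact day counts):

```python
# Back-of-envelope check of the totals in the table above.
LINES_PER_DAY = 1.3e9            # lines per day, from the table
RETENTION_DAYS = 13 * 30         # ~13 months, assuming ~30-day months
total_lines = LINES_PER_DAY * RETENTION_DAYS
print(f"{total_lines:.3g}")      # ~5.07e+11, on the order of 500 billion lines
```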
MapReduce was originally designed for batch processing. We thought it would be practical to use in the report system if we could control per-query latency and ensure parallel execution performance.
We tried Impala, which uses an execution engine different from MapReduce. Impala was promising because it executes a query in a relatively short time, typically several seconds to several tens of seconds.
If the response time for a single query is 15 seconds, about 240 queries can be executed per hour (3,600 / 15 = 240).
To increase the amount of processing per unit of time, we considered raising the number of concurrent queries, that is, increasing query multiplicity.
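The idea can be expressed as simple arithmetic. The sketch below assumes ideal linear scaling with parallelism (an assumption that, as it turned out, does not hold for every engine); the function name is ours, for illustration only:

```python
def queries_per_hour(latency_s: float, parallelism: int) -> int:
    """Ideal throughput, assuming perfect linear scaling across parallel streams."""
    return int(3600 / latency_s * parallelism)

print(queries_per_hour(15, 1))   # 240 queries/hour with a single stream
print(queries_per_hour(15, 64))  # 15360 queries/hour at a parallelism of 64
```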
However, we encountered a problem: Impala's processing time increases linearly as query parallelism increases. It could not handle all of the batch queries that run automatically every day after the data update, so we had to abandon the idea of employing Cloudera Impala.
With the advent of Hadoop 2.x and YARN, finer-grained control of parallel execution became possible, and the Tez engine significantly reduced the latency of MapReduce.
YARN and Tez add significant processing power, scaling data processing beyond 15,000 jobs per hour. This is a dramatic leap from the 100,000-jobs-per-day capability of our previous cluster setup based on Hadoop 1.0.
For testing purposes, we first picked a little fewer than 2,000 SQL queries from those actually used by the real report-generation processing.
Most of these queries return result sets of fewer than 1,000 rows, but some return large sets of more than 100,000 rows.
We first compared against Impala, which we had been planning to deploy. We ran these 2,000 queries at a parallelism of 32; Figure 2 shows the breakdown of processing times for all the queries.
Hive processed most queries within 20 seconds, and even the queries with large result sets, which take longer, completed within 70 seconds.
With Impala, on the other hand, many queries took 30 to 90 seconds, some, namely those with large result sets, took more than 10 minutes, and performance degradation under parallel execution was striking.
Impala can respond in milliseconds under low load, and it is a valid choice when no parallel SQL execution is involved.
In our use case, however, batch execution makes parallel SQL execution a mandatory requirement, which is why we could not choose Impala.
We then tested how many queries a single HiveServer2 instance can execute in parallel, and which degree of parallelism it processes most efficiently.
We increased the number of parallel SQL executions in steps of 16 and measured the processing times of all the sample queries.
Throughput rose rapidly up to a parallelism of 64, at which point it began to decline. We concluded that, in this environment, our needs would be best met by setting the parallelism to 64.
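A measurement of this kind can be sketched as follows. `run_query` here is a stub that sleeps to mimic query latency; in a real test it would submit SQL to HiveServer2 instead. The sleep time, sample size, and names are illustrative assumptions, not our actual benchmark code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> str:
    # Stub standing in for a real HiveServer2 call; sleep mimics query latency.
    time.sleep(0.01)
    return sql

def queries_per_second(parallelism: int, queries: list) -> float:
    # Run the whole sample at a fixed parallelism and return observed throughput.
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(run_query, queries))
    return len(queries) / (time.monotonic() - start)

sample = ["SELECT 1"] * 64              # placeholder for the sampled report SQL
for parallelism in (1, 16, 64):         # step the parallelism up, as in our test
    print(parallelism, round(queries_per_second(parallelism, sample), 1))
```

With a real engine behind `run_query`, throughput stops scaling once the server saturates, which is the knee we observed at a parallelism of 64.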
With Hive on Tez, single-query execution time is around 15 seconds, and throughput scales as the degree of parallelism is raised. (The upper limit of parallelism depends on the size and performance of the cluster.)
Impala returns very fast responses when the cluster is lightly loaded, but it is not suitable for workloads that run SQL in parallel.
We have decided to adopt Hive on Tez because it can process the 15,000 SQL queries per hour that our service requires.