Pig Performance and Optimization Analysis
In this post, Hortonworks Intern Jie Li talks about his work this summer on performance analysis and optimization of Apache Pig. Jie is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.
Pig Performance Analysis and Optimization
I am proud that I was among the first several interns at Hortonworks, one of the leaders in the Hadoop community. In this post, I want to summarize my project on Pig performance and also share my experience this summer.
I began working on Pig one year ago, when my classmates in CPS216 and I developed a TPC-H benchmark for Pig in order to compare the performance of Pig and Hive. TPC-H (specified here) consists of a set of complex queries and is a well-known benchmark for traditional data warehouses. Hive has used it for some time to develop new features and optimize performance. Our work is available in a paper here.
Although Pig is designed as a data flow language, it supports all the functionality required by TPC-H, so it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.
You can see that the relative performance of Hive and Pig depends on the query. During the comparison, we came up with a few best practice rules for writing Pig queries, recorded in PIG-2397. After several iterations of rewriting the Pig scripts according to these rules, we managed to make Pig competitive with Hive. However, there are still a few queries, such as Q1, for which Pig is noticeably slower than Hive and which need further investigation.
I wanted to incorporate these best practice rules into Pig itself, so I continued this project this summer as an intern at Hortonworks. I was excited to do so because Hortonworks is a major contributor to Apache Pig development. With the help of Pig and Hive committers here, I successfully identified some of the bottlenecks that contributed to the performance gap in the benchmark, and was able to implement initial solutions for them.
Pig TPC-H Bottlenecks
- Map Aggregation vs. Combiner
- Type Conversion
- Extra Jobs
When the Pig TPC-H benchmark was developed, Map Aggregation (PIG-2228) was not yet available. As a first step, I applied Map Aggregation to Q1, which is dominated by a group-by clause with only four distinct groups. This turned out to be very effective: simply enabling Map Aggregation gives a speedup of more than 20%. The improvement comes from the advantages of Map Aggregation, which are as follows:
| Combiner | Map Aggregation |
| --- | --- |
| multiple invocations | one invocation |
However, the current implementation of Map Aggregation is not aggressive enough.
First, it requires the combiner to be turned on, in the hope that the combiner can help further if Map Aggregation is not effective enough. But because the combiner cannot disable itself automatically, it makes sense to provide separate options so that Map Aggregation and the combiner can be turned on or off independently.
Second, the thresholds for auto-disabling are so conservative that Map Aggregation is easily disabled. Given that Hive has used Map Aggregation since the very beginning, we can be confident in its efficacy. I proposed and implemented these changes in PIG-2829. Below is the benchmark result comparing Map Aggregation and combiners for several queries.
The first query is TPC-H Q1, for which Map Aggregation improves performance by more than 20%. For the other three queries, the group-by keys are varied to achieve different record reduction rates (the number of groups divided by the number of input records). For example, S-1 means the reduction rate is 1, i.e. the number of groups equals the number of input records, so the combiner doesn’t help at all and should be turned off. We can observe the overhead of the combiner in S-1, where Map Aggregation is auto-disabled. For queries with sufficient reduction, Map Aggregation achieves better performance.
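To make the idea concrete, here is a minimal Python sketch of map-side hash aggregation with an auto-disable check. It is an illustration only, not Pig's implementation: the function name, the `check_after` and `max_group_ratio` parameters, and their default values are all assumptions made for this example. The ratio it checks follows the definition used above (groups divided by input records).

```python
from collections import defaultdict


def map_side_aggregate(records, key_fn, value_fn,
                       check_after=1000, max_group_ratio=0.1):
    """Partially sum values per key in an in-memory hash map.

    After `check_after` input records, compute groups / records. If that
    ratio exceeds `max_group_ratio` (hypothetical threshold), aggregation
    is not paying off: flush the map and pass later records through as-is.
    Returns (list of (key, partial_sum) pairs, whether aggregation stayed on).
    """
    partial = defaultdict(float)
    output = []
    enabled = True
    for seen, rec in enumerate(records, start=1):
        key, value = key_fn(rec), value_fn(rec)
        if not enabled:
            output.append((key, value))
            continue
        partial[key] += value
        if seen == check_after and len(partial) / seen > max_group_ratio:
            enabled = False
            output.extend(partial.items())
            partial.clear()
    # Emit whatever is still held in the hash map at the end of the split.
    output.extend(partial.items())
    return output, enabled


# A Q1-like input: many records, only four groups.
few_groups = [(i % 4, 1) for i in range(10000)]
out, stayed_on = map_side_aggregate(few_groups, lambda r: r[0], lambda r: r[1])
print(len(out), stayed_on)
```

With four groups, the mapper emits four partial sums instead of 10,000 records. With one group per record (the S-1 case), the check fires and the records pass through unaggregated, so the hash map stops costing memory and CPU for no benefit.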
Pig has a simple type conversion mechanism: if we declare types in the schema, Pig will immediately do type conversion for all columns that are used in the script. Otherwise, Pig will guess the types and do type conversion as late as possible. It’s clear that we want to take advantage of lazy type conversion. We can easily verify this with a simple query that loads the biggest table in TPC-H and then filters out all records with an always-false condition. This avoids writing any data, so the measurement focuses on the type conversion overhead.
We can observe from the above result that with lazy type conversion, Pig can save a lot of time loading the data. As a result, one of the best practices we recommended was to remove all types from the schema.
However, when benchmarking TPC-H Q1, we observed that even with Map Aggregation turned on, Pig still took four times as long as Hive. After a bit of profiling, we identified the bottleneck: type conversion, which took half the time of the entire query. The explanation was that when I removed all the types from the schema, Pig guessed that some columns were Integer although they were actually Double. When converting raw data to Integer, Pig has a fallback: if the conversion throws an exception, it retries by converting to Double first and then converting the result to Integer.
Therefore, for each such column, Pig went through exception handling for every tuple, which took 10x as much time as a successful conversion and thus dominated the running time of the whole query. The easiest solution for users is to explicitly declare types for those columns. For Pig itself, we can either change the default type Pig will guess, or replace the exception handling with a lightweight check, as implemented in PIG-2835. Below is the benchmark result.
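The difference between eager and lazy conversion can be sketched in a few lines of Python. This is a conceptual illustration under assumed names (`load_eager`, `LazyRow` and the `|` delimiter are made up for the example), not how Pig's loader is written.

```python
def load_eager(lines, types, delimiter="|"):
    """Convert every column of every row as soon as the row is loaded."""
    for line in lines:
        fields = line.split(delimiter)
        yield [convert(raw) for convert, raw in zip(types, fields)]


class LazyRow:
    """Keep the raw strings and convert a column only when it is first read."""

    def __init__(self, line, types, delimiter="|"):
        self._raw = line.split(delimiter)
        self._types = types
        self._cache = {}
        self.conversions = 0  # number of fields actually converted

    def __getitem__(self, index):
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw[index])
            self.conversions += 1
        return self._cache[index]


types = [int, float, str]
lines = ["1|2.5|abc", "2|3.5|def"]

# A filter that only looks at column 0 converts one field per row, not three.
rows = [LazyRow(line, types) for line in lines]
kept = [row for row in rows if row[0] < 0]  # always false for this data
print(len(kept), sum(row.conversions for row in rows))
```

The eager loader pays for three conversions per row even though the filter discards every record; the lazy rows pay for one.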
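The two conversion strategies described above can be sketched as follows. This is a Python illustration of the general idea only; the function names are hypothetical, and the actual check in PIG-2835 lives in Pig's Java code and may differ in detail.

```python
def to_int_with_exception_fallback(raw):
    """Try an integer parse first; on failure, parse as float and truncate."""
    try:
        return int(raw)
    except ValueError:
        # Taken for every decimal value such as "0.17", so a column of
        # doubles pays the exception cost on every single tuple.
        return int(float(raw))


def to_int_with_check(raw):
    """Inspect the text first so well-formed decimals never raise."""
    text = raw.strip()
    digits = text[1:] if text[:1] in ("+", "-") else text
    if digits.isascii() and digits.isdigit():
        return int(text)
    return int(float(text))


for value in ["17", "-3", "0.17", "-3.9", "1e3"]:
    assert to_int_with_exception_fallback(value) == to_int_with_check(value)
```

Both functions return the same results and both still raise `ValueError` on text that is not a number at all. The second one simply avoids using an exception as the normal path for decimal input.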
Reducing the number of MR jobs for a given Pig script is always effective for performance optimization. There are still many types of jobs compiled by Pig that can be removed. An extreme example is the Order-By query, which is implemented by three jobs in Pig while Hive only requires one job.
Note that Hive achieves one job by limiting the Order By to use a single reducer. This can become a bottleneck if you are sorting large data. But even without this limitation, we can also optimize Pig to use fewer jobs. Skew-Join, implemented in a similar way, can also benefit from the same optimization.
First, the map-only job can be merged into the sample job and the sort job. Pig used to do this in SampleOptimizer, but it was broken unintentionally. PIG-2661 tries not only to re-enable this optimization, but also to make it more aggressive so that it still works if the map-only job contains operations such as filters.
Second, the sample job can be safely removed if only one reducer is finally used for the sort job, as the partition file generated by the sample job is not useful at all. It looks like this:
However, there are some challenges for this second optimization. It needs to be done dynamically, as we need the final number of reducers, which is known only just before the sort job is submitted. In addition, it needs to modify the runtime query plan, which is a completely new challenge for Pig. As a first step, I implemented a lightweight solution in PIG-483, which introduces the notion of a SkipJob, a job that will be skipped. Of course, eventually we need a general framework for dynamic query optimization, which will open up a lot of new opportunities for optimizing Pig, such as automatically suggesting the type of join, automatic fail-over, etc. PIG-2784 can serve as a place to discuss more details.
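The reasoning behind both optimizations can be sketched in Python. Everything here is an assumption made for illustration: the job names, the function names and the sampling scheme are hypothetical and do not reproduce Pig's planner or its partition file format.

```python
import bisect
import random


def plan_order_by_jobs(num_reducers, merge_map_only=True):
    """Return the (illustrative) list of jobs needed for an Order By."""
    jobs = [] if merge_map_only else ["map-only"]
    if num_reducers > 1:
        jobs.append("sample")  # only needed to build partition boundaries
    jobs.append("sort")
    return jobs


def partition_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Sample keys (a list) and pick num_reducers - 1 split points."""
    if num_reducers <= 1:
        return []  # a single reducer receives everything
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def assign_reducer(key, boundaries):
    """Range-partition a key: reducer i gets keys between split points."""
    return bisect.bisect_right(boundaries, key)


print(plan_order_by_jobs(4, merge_map_only=False))  # the unoptimized plan
print(plan_order_by_jobs(1))                        # one reducer, one job
```

With a single reducer the boundary list is empty and every key maps to reducer 0, which is why the sample job contributes nothing in that case. The catch noted above still applies: the plan can only be chosen once the final reducer count is known.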
My Hortonworks Experience
Besides this project, I also developed some bug fixes requested by Pig users, such as PIG-2780, or required by the optimizations, such as PIG-2779. I also had the chance to experience the exciting moment of releasing our first product, Hortonworks Data Platform, and contributed some tests for that release!
I want to say Hortonworks is definitely a great place to work. There are lots of smart and knowledgeable people here who are all easily approachable, and I enjoyed the time spent with them during lunches, games and parties. I was also amazed by the flexible working environment, where we can customize our working schedule and even work from home. I really enjoyed myself this summer, and I’m looking forward to working with them again in the near future.