Enterprises are increasingly moving portions or entire datacenters to the cloud in order to minimize their physical footprint, minimize operational overhead, and shorten their infrastructure acquisition cycles. An incidental benefit is that cloud services, like cloud-based object storage, bring a new set of tools to a Hadoop architect. At Hortonworks, our customers use a number of different cloud architectures but we frequently recommend two when it comes to using Apache Hive for analytics and reporting. In this blog, we will present those two architectures, examine their advantages and disadvantages, and provide reproducible benchmarks to support our analysis.
Cloud storage enables the decoupling of storage and compute to varying degrees. To this effect, we propose that customers leverage this storage in one of two architectures:
There are advantages and disadvantages to each approach. The tiered approach has the most flexibility for an operator to tune the performance of the cluster while trading off size of the hot data zone for better performance or smaller resource footprint. The downside of this approach is that, having data on HDFS, resizing the cluster is a slow and tedious process due to HDFS needing to be rebalanced to achieve performance and fault-tolerance expectations. Thus this architecture is generally only used for statically sized clusters with steady, well-known workloads.
The decoupled architecture, on the other hand, enables maximum flexibility for cluster growth and reduction. For example, a cluster could run at 100 nodes during the day to support analytics and reporting and then shrink to 24 nodes overnight to support smaller ETL workloads. Historically, the disadvantage to decoupling is that cloud storage is not local and therefore could drastically affect runtime of the analytical workloads (hence the hybrid approach of tiered storage). However, the advent of LLAP in Hive 2.0 has limited this overhead making the approach far more attractive. The dynamic cache within LLAP also means that we do not need to statically define what data is hot. It can be inferred at query time (i.e., as users access the data, that data will become hot). We will look closer at how LLAP closes the runtime gap in the next section.
We evaluated the architectures in Amazon EC2 with S3 for the cloud storage. However, we expect similar performance in alternative cloud environments using their respective cloud storage options. We used a 5 node (1 master + 4 workers) HDP 2.6.4 cluster consisting of m4.2xlarge machines with two 500GB volumes each for HDFS. We generated a TPCDS dataset at 50GB and duplicated the ORC database to S3 for the S3 evaluation.
The three scenarios that we evaluated are:
Figure 2 illustrates what we observed. Notice that both architectures have significant performance improvements over classic Hive 1.0 accessing data in HDFS. It also appears that, in this scenario, the overhead for leveraging cloud storage for the persistence layer is a fixed cost at around 2 seconds per query. In many situations, this overhead is worth it to gain flexibility in adjusting the cluster size over time.
We have presented a pair of common architectures for cloud-based Hadoop deployments and analyzed the tradeoffs of each. In summary, an organization that is willing to sacrifice a small runtime overhead that desires to minimize storage management, maximize resource flexibility (expand/reduce compute), and empower users with an interactive analytic engine can do so with a Hive data warehouse backed by cloud storage and powered by LLAP.