Discardable Memory and Materialized Queries
Julian Hyde will present the following talks at the Hadoop Summit:
- “Discardable In-Memory, Materialized Query for Hadoop,” (June 3rd, 11:15-11:55 am)
- “Cost-based Query Optimization in Hive,” (June 4th, 4:35 pm-5:15 pm)
What to do with all that memory in a Hadoop cluster? The question is frequently heard. Should we load all of our data into memory to process it? Unfortunately the answer isn’t quite that simple.
The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. These capabilities should be available to all applications that use Hadoop, and should not require a lot of configuration to make them work efficiently.
These are ambitious goals. In this article, I propose a solution, a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ) and its three key underlying concepts:
Memory-resident data, and
I shall explain how these concepts build on existing Hadoop facilities, are important and useful individually, combine to exploit memory, and balance the needs of various applications. Taken together they radically alter the performance characteristics and manageability of a Hadoop cluster.
Implementing DIMMQs will require changes at the query level (optimization queries in SQL and other languages such as Pig and Cascading) and also at the storage level. In an companion blog, Sanjay Radia proposes a Discardable Distributed Memory (DDM) for HDFS, a mechanism intended to support DIMMQs.
By the way, don’t ask me how to pronounce DIMMQ! I won’t mind at all if the name goes away, and the concepts surface in mundane-sounding features like “CREATE MATERIALIZED VIEW”. The important part is the concepts, and how they fit together.
Before I describe the concepts and vision in detail, let’s look at the trends in hardware, workloads, and data lifecycles that are driving this.
The trend in hardware is towards heterogenous storage, a balance of memory, SSD and disk.
Table 1 shows the hardware configuration of a server in a Hadoop cluster, now versus 5 years ago. (The configurations are anecdotal, but fairly typical; both servers cost approximately $5,000.)
The data shows a gradual move towards memory, but there are no signs that disk is going away. We are not moving to an “all-memory architecture”; if anything, with the arrival of SSD, the storage hierarchy is becoming more complex, and memory will be an important part of it.
Table 1: Hardware Trends
|Year||Cores||Memory||SSD||Disk||Disk : memory ratio|
|2009||4-8||8 GB||None||4 x 1 TB||512:1|
|2014||24||128 GB||1 TB||12 x 4 TB||384:1|
Workloads and data lifecycles
Traditionally Hadoop has been used for storing data, and for batch analytics to retrieve that data. But increasingly it is used for other workloads, such as interactive analytics, machine learning, and streaming. There is a continuum of desired latency, from hours to milliseconds.
If we look at the life cycle of a particular data item, we also see a continuum. Some data might live on disk for years until it is structured and analyzed; other data might need to be acted upon within seconds or even milliseconds. An analyst or machine-learning algorithm might start using a subset of that data for analysis, creating derived or cleaned data sets and combining with the original data.
In general, fresh data is more likely to be read and modified, and activity drops off exponentially over time. But a subset of historic data might become “hot” for a few minutes or days, before becoming latent again. Clearly we would like the hot data to be in memory, but without rewriting our application or excessive tuning.
A pool of resources
Hadoop’s strength is that it brings all of the data in a cluster of computers, and the resources to process it, into a pool. This yields economies of scale. If you and I store our data in the same cluster, and I am not accessing my data right now, you can use my slice of the cluster’s resources to process your data faster.
Along with CPU cycles and network and disk I/O, memory is one of those key resources. Memory can help deliver high performance on a wide variety of workloads, but to maximize the use of memory, we need to add it to the pool, and make it easy for tasks to share memory according to their need, and make it easy for data to migrate between memory and other storage media.
What is the right model for memory?
Data management systems traditionally use memory in two ways. First, a buffer-cache stores copies of disk blocks in memory so that if a block is used multiple times, it only needs to be read from disk once. Second, query-processing algorithms need operating memory. Most algorithms use modest amounts of memory, but algorithms that work on collections (sort, aggregation and join) require large amounts of memory for short periods in order to operate efficiently.
For many analytic workloads, a buffer cache is not an effective use of memory. Consider a table that is 10x larger than available memory. A query that sums every row in the table will read every block once. If you run the query again, no matter how smartly you cache blocks, you will need at least 90% of the blocks again. If all the data is accessed uniformly, random reads will also experience a cache hit-rate of only 10%.
A Discardable, In-Memory, Materialized Query (DIMMQ) allows a new mode of memory use.
A materialized query is a dataset whose contents are guaranteed to be the same as executing a particular query, called the defining query of the DIMMQ. Therefore any query that could be satisfied using that defining query can also be satisfied using the DIMMQ, possibly a lot faster.
Discardable means that the system can throw it away.
In-memory means that the contents of the dataset reside in the memory of one or more nodes in the Hadoop cluster.
Let’s look at the advantages DIMMQs bring.
The defining query provides a link between the DIMMQ and the underlying dataset that the system can understand. The system can rewrite queries to use the DIMMQ without the application explicitly referencing it.
Because the rewrite to use the DIMMQ is performed automatically by the system, not by the application, it makes it OK for the system to discard the DIMMQ.
Because DIMMQs are discardable, we don’t have to worry about creating too many of them. Various sources (users, applications, agents) continually suggest DIMMQs, and the system continually throws out the ones that are not pulling their weight. This dynamic process optimizes the system without requiring any part of it be omniscient or prescient.
How to introduce these capabilities to Hadoop? The architectural approach is compelling because we can build the core concepts separately, and we can evolve existing Hadoop systems such as HDFS, HCatalog, Tez and Hive.
It is tempting to make HDFS intelligent enough to recognize patterns and to rebuild DIMMQs that it had previously discarded. But that would introduce too much coupling into the system — in particular, HDFS would become dependent on high level concepts like HCatalog and a query-optimizer.
Instead, the design is elegantly stupid. The low-level system, HDFS, stores and retrieves DIMMQ data sets, and is allowed to discard them. A query optimizer in the high-level system (such as Hive or Pig) processes incoming queries and rewrites them in terms of DIMMQs. A user or agent builds DIMMQs speculatively, not directly controlling whether they are discarded, but knowing that HDFS will discard them if they do not pull their weight.
The core concepts — materialized queries, discardable data, and in-memory data — are loosely coupled and can be developed and improved independently of each other.
Work is already underway in Optiq to support materialized views, or more specifically, materialized query recognition. Optiq is a query-optimization engine and is already used in Hive as part of the CBO project.
Support for in-memory data is being planned in JIRA HDFS-5851.
Discardable data is an extension to HDFS’s long-standing support for replication (multiple copies of data on disk) and caching (additional copies in memory).
Sanjay Radia describes HDFS Discardable Distributed Memory (DDM), a mechanism that combines in-memory data, replication and caching, in a an upcoming blog post.
The Stinger vectorization initiative makes memory access more efficient by organizing data in column-oriented ranges. This reduces memory usage and makes for more efficient use of processor cache.
Other components, such as agents to gather statistics, recommend, build and maintain DIMMQs, can be built around the system without affecting its core parts.
Queries, materialized views, and caching
When a data management system such as Hadoop loads a data set into memory for more efficient processing, it is doing something that databases have always done: create a copy of the data, organized in a way that is more efficient for the task at hand, and that can be added or removed without the application’s knowledge.
B-tree indexes are perhaps the most familiar example, but there are also hash clusters, aggregate tables, remote snapshots, projections. Sometimes the copy is in a different medium (memory versus disk); sometimes the copy is organized differently (a b-tree index is sorted on a particular key, whereas the underlying table is not sorted); and sometimes the copy is a derived data set, for example a subset over a given time range or a summary.
Why create these copies? If the system knows about the copies of a data set, then it can use a copy to answer a query rather than the original data set. Queries answered that way can be several orders of magnitude faster, especially when the copy is in-memory and/or significantly smaller than the original.
The major databases (Oracle, IBM DB2, Teradata, Microsoft SQL Server) all have a feature called (with a few syntactic variations) materialized views. A materialized view consists of a SQL query and a table that contains the result of that query. For instance, here is how a materialized view might be defined in Hive:
CREATE MATERIALIZED VIEW emp_summary AS
SELECT deptno, gender, COUNT(*) AS c, SUM(salary) AS s
GROUP BY deptno, gender;
A materialized view is a table, so you can query it directly:
SELECT deptno FROM emp_summary
WHERE gender = ‘M’ AND c > 20;
More importantly, it can be used to answer queries on other tables. Given a query on the emp table,
SELECT deptno, AVG(salary) AS average_sal
FROM emp WHERE gender = ‘F'
GROUP BY deptno;
The planner can rewrite to use the emp_summary table, as follows:
SELECT deptno, s / c AS average_sal
FROM emp_summary WHERE gender = ‘F’
GROUP BY deptno;
emp_summary has done much of the work required to answer the query, so the results come back faster. It is also significantly smaller, so the memory budget required to keep it in cache is smaller.
From materialized views to DIMMQs
DIMMQs are an extension to materialized views.
First, we need to make the materialized query accessible to all applications written in all languages, so we convert it to Optiq’s language-independent relational algebra and store its definition in HCatalog.
Next, we tell HDFS that the materialized query (a) should be kept in memory, (b) can be discarded. This can be accomplished using hints on the file that underlies the table.
Other possible hints might tell HDFS whether to consider copying a DIMMQ to disk before discarding it, and estimates of the number of reads over the next hour, day, and month, to predict the DIMMQ’s usefulness. A materialized view that is likely to be used very soon is a good candidate to be stored in memory; if after a few hours the reads decline to a trickle, it might be worth paging it to disk rather than discarding if it is much smaller than the original data.
Lastly, we need a mechanism to suggest, create and populate DIMMQs. Here are a few:
Materialized views can be created explicitly, using CREATE MATERIALIZED VIEW syntax.
Perform incremental updates to materialized views using Apache Falcon
The system caches query results (or intermediate results) as DIMMQs.
An automatic recommender (based on ideas such as Data Cubes ) could suggest and build DIMMQs.
Respectively, these provide control to an administrator, the ability to adapt to new workloads, and optimization of the system based on past activity. We would recommend that systems use a blend of all three.
Variations on a theme
There are many ways that core concepts behind DIMMQs can be used and extended. Here are a few initial ideas, and we trust that the Hadoop community will find many more.
Materialized queries don’t have to be in-memory. A materialized query stored on disk would still be useful if, for example, the source dataset rarely changes and the materialized query is much smaller.
Materialized queries don’t have to be discardable, especially if they are on disk, where space is not generally a scarce resource. They will typically be deleted if they are out of sync with their source data.
Materialized query recognition is just part of the problem. It would be useful if Hadoop helped maintain the materialized query as the underlying data changes, or if you could tell Hadoop that the materialized query was no longer valid. We can build on ACID transactions work already started.
Materialized queries allow a wide variety of derived data structures to be described: summary tables, b-tree indexes (basically sorted projections), partitioned tables and remote snapshots are a few examples. Using the materialized query mechanism, applications can design their own derived data structures and have them automatically recognized by the system.
In-memory tables don’t have to be materialized queries. There are other good reasons to support in-memory tables. In a streaming scenario, for instance, you would write to an in-memory table first, and periodically flush to disk.
Materialized queries can help with data aging. As data gets older, it is accessed less frequently, and so you might wish to store it in slower and cheaper storage, at a lower replication level, or with less coverage by aggregate tables.
Discardable In-memory Materialized Query (DIMMQ) data sets express how our applications use data in a way that allows Hadoop to automatically optimize and adapt how that data is stored, in particular, storing copies of that data in memory.
DIMMQs are superior to alternative uses of memory. Unlike low-level buffer cache, DIMMQ caches high-level results, which can be much smaller and are closer to the required result. And unlike non-declarative materializations like Spark RDDs, an application can use a DIMMQ even if it doesn’t know that it exists.
DIMMQ is built from three concepts: materialized query recognition, in-memory data sets, and discardable data sets. Together, they allow applications to seamlessly use heterogeneous storage — disk, SSD and memory — and quickly adapt to changing patterns of data use. And they provide a new level of data independence that will allow the Hadoop ecosystem to develop novel data organizations.
Materializations combined with HDFS Discardable Distributed Memory (DDM) storage are a major advance in Hadoop architecture that build on Hadoop’s existing strengths and make Hadoop as the place to store and process all of your enterprise’s data.
The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. More »