Discardable Memory and Materialized Queries

Put your memory into its right place in the storage hierarchy for efficient queries

Julian Hyde will present the following talks at the Hadoop Summit:

  1. “Discardable, In-Memory, Materialized Query for Hadoop” (June 3rd, 11:15-11:55 am)
  2. “Cost-based Query Optimization in Hive” (June 4th, 4:35 pm-5:15 pm)

What to do with all that memory in a Hadoop cluster? The question is frequently heard. Should we load all of our data into memory to process it? Unfortunately the answer isn’t quite that simple.

The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. These capabilities should be available to all applications that use Hadoop, and should not require a lot of configuration to make them work efficiently.

These are ambitious goals. In this article, I propose a solution, a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ) and its three key underlying concepts:

  1. Materialized queries,

  2. Memory-resident data, and

  3. Discardable data.

I shall explain how these concepts build on existing Hadoop facilities, are important and useful individually, combine to exploit memory, and balance the needs of various applications. Taken together they radically alter the performance characteristics and manageability of a Hadoop cluster.

Implementing DIMMQs will require changes at the query level (optimizing queries in SQL and other languages such as Pig and Cascading) and also at the storage level. In a companion blog post, Sanjay Radia proposes a Discardable Distributed Memory (DDM) for HDFS, a mechanism intended to support DIMMQs.

By the way, don’t ask me how to pronounce DIMMQ! I won’t mind at all if the name goes away, and the concepts surface in mundane-sounding features like “CREATE MATERIALIZED VIEW”. The important part is the concepts, and how they fit together.

Before I describe the concepts and vision in detail, let’s look at the trends in hardware, workloads, and data lifecycles that are driving this.

Hardware trends

The trend in hardware is towards heterogeneous storage: a balance of memory, SSD and disk.

Table 1 shows the hardware configuration of a server in a Hadoop cluster, now versus 5 years ago. (The configurations are anecdotal, but fairly typical; both servers cost approximately $5,000.)

The data shows a gradual move towards memory, but there are no signs that disk is going away. We are not moving to an “all-memory architecture”; if anything, with the arrival of SSD, the storage hierarchy is becoming more complex, and memory will be an important part of it.

Table 1: Hardware Trends

 Year | Cores | Memory | SSD  | Disk      | Disk : memory ratio
 2009 | 4-8   | 8 GB   | None | 4 x 1 TB  | 512:1
 2014 | 24    | 128 GB | 1 TB | 12 x 4 TB | 384:1

Workloads and data lifecycles

Traditionally Hadoop has been used for storing data, and for batch analytics over that data. But increasingly it is used for other workloads, such as interactive analytics, machine learning, and streaming. There is a continuum of desired latency, from hours to milliseconds.

If we look at the life cycle of a particular data item, we also see a continuum. Some data might live on disk for years until it is structured and analyzed; other data might need to be acted upon within seconds or even milliseconds. An analyst or machine-learning algorithm might start using a subset of that data for analysis, creating derived or cleaned data sets and combining with the original data.

In general, fresh data is more likely to be read and modified, and activity drops off exponentially over time. But a subset of historic data might become “hot” for a few minutes or days, before becoming latent again. Clearly we would like the hot data to be in memory, but without rewriting our application or excessive tuning.

A pool of resources

Hadoop’s strength is that it brings all of the data in a cluster of computers, and the resources to process it, into a pool. This yields economies of scale. If you and I store our data in the same cluster, and I am not accessing my data right now, you can use my slice of the cluster’s resources to process your data faster.

Along with CPU cycles and network and disk I/O, memory is one of those key resources. Memory can help deliver high performance on a wide variety of workloads, but to maximize its use we need to add it to the pool, make it easy for tasks to share it according to their needs, and make it easy for data to migrate between memory and other storage media.

What is the right model for memory?

Data management systems traditionally use memory in two ways. First, a buffer-cache stores copies of disk blocks in memory so that if a block is used multiple times, it only needs to be read from disk once. Second, query-processing algorithms need operating memory. Most algorithms use modest amounts of memory, but algorithms that work on collections (sort, aggregation and join) require large amounts of memory for short periods in order to operate efficiently.

For many analytic workloads, a buffer cache is not an effective use of memory. Consider a table that is 10x larger than available memory. A query that sums every row in the table will read every block once. If you run the query again, no matter how smartly you cache blocks, at least 90% of the blocks will have to be read from disk again. If the data is accessed uniformly, random reads will likewise see a cache hit rate of only 10%.

A Discardable, In-Memory, Materialized Query (DIMMQ) allows a new mode of memory use.

  • A materialized query is a dataset whose contents are guaranteed to be the same as executing a particular query, called the defining query of the DIMMQ. Therefore any query that could be satisfied using that defining query can also be satisfied using the DIMMQ, possibly a lot faster.

  • Discardable means that the system is free to throw the dataset away at any time.

  • In-memory means that the contents of the dataset reside in the memory of one or more nodes in the Hadoop cluster.

Let’s look at the advantages DIMMQs bring.

  • The defining query provides a link, which the system can understand, between the DIMMQ and the underlying dataset. The system can therefore rewrite queries to use the DIMMQ without the application explicitly referencing it.

  • Because the rewrite to use the DIMMQ is performed automatically by the system, not by the application, the system is also free to discard the DIMMQ: queries still return correct answers from the original data, just more slowly.

  • Because DIMMQs are discardable, we don’t have to worry about creating too many of them. Various sources (users, applications, agents) continually suggest DIMMQs, and the system continually throws out the ones that are not pulling their weight. This dynamic process optimizes the system without requiring any part of it to be omniscient or prescient.

Building DIMMQs

How do we introduce these capabilities to Hadoop? The architectural approach is compelling because we can build the core concepts separately, and we can do so by evolving existing Hadoop systems such as HDFS, HCatalog, Tez and Hive.

It is tempting to make HDFS intelligent enough to recognize patterns and to rebuild DIMMQs that it had previously discarded. But that would introduce too much coupling into the system; in particular, HDFS would become dependent on high-level concepts such as HCatalog and a query optimizer.

Instead, the design is elegantly stupid. The low-level system, HDFS, stores and retrieves DIMMQ data sets, and is allowed to discard them. A query optimizer in the high-level system (such as Hive or Pig) processes incoming queries and rewrites them in terms of DIMMQs. A user or agent builds DIMMQs speculatively, not directly controlling whether they are discarded, but knowing that HDFS will discard them if they do not pull their weight.

The core concepts — materialized queries, discardable data, and in-memory data — are loosely coupled and can be developed and improved independently of each other.

  • Work is already underway in Optiq to support materialized views, or more specifically, materialized query recognition. Optiq is a query-optimization engine and is already used in Hive as part of the CBO project.

  • Support for in-memory data is being planned in JIRA HDFS-5851.

  • Discardable data is an extension to HDFS’s long-standing support for replication (multiple copies of data on disk) and caching (additional copies in memory).

  • In an upcoming blog post, Sanjay Radia describes HDFS Discardable Distributed Memory (DDM), a mechanism that combines in-memory data, replication and caching.

  • The Stinger vectorization initiative makes memory access more efficient by organizing data in column-oriented ranges. This reduces memory usage and makes for more efficient use of processor cache.

Other components, such as agents that gather statistics and that recommend, build and maintain DIMMQs, can be built around the system without affecting its core parts.

Queries, materialized views, and caching

When a data management system such as Hadoop loads a data set into memory for more efficient processing, it is doing something that databases have always done: create a copy of the data, organized in a way that is more efficient for the task at hand, and that can be added or removed without the application’s knowledge.

B-tree indexes are perhaps the most familiar example, but there are also hash clusters, aggregate tables, remote snapshots and projections. Sometimes the copy is in a different medium (memory versus disk); sometimes the copy is organized differently (a b-tree index is sorted on a particular key, whereas the underlying table is not sorted); and sometimes the copy is a derived data set, for example a subset over a given time range or a summary.

Why create these copies? If the system knows about the copies of a data set, then it can use a copy to answer a query rather than the original data set. Queries answered that way can be several orders of magnitude faster, especially when the copy is in-memory and/or significantly smaller than the original.

The major databases (Oracle, IBM DB2, Teradata, Microsoft SQL Server) all have a feature called (with a few syntactic variations) materialized views. A materialized view consists of a SQL query and a table that contains the result of that query. For instance, here is how a materialized view might be defined in Hive:

CREATE MATERIALIZED VIEW emp_summary AS
SELECT deptno, gender, COUNT(*) AS c, SUM(salary) AS s
FROM emp
GROUP BY deptno, gender;

A materialized view is a table, so you can query it directly:

SELECT deptno FROM emp_summary
WHERE gender = 'M' AND c > 20;

More importantly, it can be used to answer queries on other tables. Given a query on the emp table,

SELECT deptno, AVG(salary) AS average_sal
FROM emp WHERE gender = 'F'
GROUP BY deptno;

The planner can rewrite it to use the emp_summary table, as follows:

SELECT deptno, SUM(s) / SUM(c) AS average_sal
FROM emp_summary WHERE gender = 'F'
GROUP BY deptno;

emp_summary has already done much of the work required to answer the query, so the results come back faster. (Because emp_summary is grouped by deptno and gender, the rows where gender = 'F' contain exactly the counts and sums needed to reconstruct the average per department.) emp_summary is also significantly smaller than emp, so the memory budget required to keep it in cache is smaller.

From materialized views to DIMMQs

DIMMQs are an extension to materialized views.

First, we need to make the materialized query accessible to all applications written in all languages, so we convert it to Optiq’s language-independent relational algebra and store its definition in HCatalog.

Next, we tell HDFS that the materialized query (a) should be kept in memory, and (b) can be discarded. This can be accomplished using hints on the file that underlies the table.

Other possible hints might tell HDFS whether to consider copying a DIMMQ to disk before discarding it, or might estimate the number of reads expected over the next hour, day, and month, to predict the DIMMQ’s usefulness. A materialized view that is likely to be used very soon is a good candidate to be kept in memory; if after a few hours the reads decline to a trickle, it might be worth paging it to disk rather than discarding it, especially if it is much smaller than the original data.
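For example, such hints might be recorded as table properties that Hive passes down to the file backing the materialization. This is only a sketch; the property names below are hypothetical and do not correspond to any existing Hive or HDFS feature:

-- Hypothetical hint properties (illustration only)
ALTER TABLE emp_summary SET TBLPROPERTIES (
  'dimmq.memory' = 'true',              -- keep the backing file in memory if possible
  'dimmq.discardable' = 'true',         -- HDFS may drop it under memory pressure
  'dimmq.spill.to.disk' = 'true',       -- consider copying to disk before discarding
  'dimmq.expected.reads.next.hour' = '500'  -- rough estimate of near-term usefulness
);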

Lastly, we need mechanisms to suggest, create and populate DIMMQs. Here are a few:

  1. Materialized views can be created explicitly, using CREATE MATERIALIZED VIEW syntax.

  2. Materialized views can be updated incrementally, for example using Apache Falcon (see the sketch following this list).

  3. The system caches query results (or intermediate results) as DIMMQs.

  4. An automatic recommender (based on ideas such as Data Cubes) could suggest and build DIMMQs.

Respectively, these provide control to an administrator, a way to keep materializations up to date, the ability to adapt to new workloads, and optimization of the system based on past activity. We would recommend that systems use a blend of all of them.
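To make item 2 concrete, here is a sketch of how a summary might be refreshed incrementally. It assumes, purely for illustration, that emp and emp_summary both carry a load_date partition column (the earlier example had no such column); a scheduled job, for example one orchestrated by Falcon, would then recompute only the summary rows derived from newly arrived data:

-- Illustration only: load_date is an assumed partition column.
-- Rebuild just the partition whose source data changed.
INSERT OVERWRITE TABLE emp_summary PARTITION (load_date = '2014-06-02')
SELECT deptno, gender, COUNT(*) AS c, SUM(salary) AS s
FROM emp
WHERE load_date = '2014-06-02'
GROUP BY deptno, gender;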

Variations on a theme

There are many ways that the core concepts behind DIMMQs can be used and extended. Here are a few initial ideas, and we trust that the Hadoop community will find many more.

Materialized queries don’t have to be in-memory. A materialized query stored on disk would still be useful if, for example, the source dataset rarely changes and the materialized query is much smaller.

Materialized queries don’t have to be discardable, especially if they are on disk, where space is not generally a scarce resource. They will typically be deleted if they are out of sync with their source data.

Materialized queries don’t have to be specified in SQL. Other languages, such as Pig, Cascading, and Scalding, and in fact any application that uses Tez, should be able to use this facility.

Materialized query recognition is just part of the problem. It would be useful if Hadoop helped maintain the materialized query as the underlying data changes, or if you could tell Hadoop that the materialized query was no longer valid. We can build on the ACID transactions work that has already started.

Materialized queries allow a wide variety of derived data structures to be described: summary tables, b-tree indexes (basically sorted projections), partitioned tables and remote snapshots are a few examples. Using the materialized query mechanism, applications can design their own derived data structures and have them automatically recognized by the system.
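For instance, a sorted projection playing the role of a b-tree index could be declared as just another materialized query. The following sketch reuses the speculative CREATE MATERIALIZED VIEW syntax from the earlier example:

CREATE MATERIALIZED VIEW emp_by_salary AS
SELECT salary, empno, deptno
FROM emp
ORDER BY salary;

A range query such as one filtering on salary BETWEEN 50000 AND 60000 could then be rewritten to read a narrow slice of emp_by_salary instead of scanning the whole emp table.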

In-memory tables don’t have to be materialized queries. There are other good reasons to support in-memory tables. In a streaming scenario, for instance, you would write to an in-memory table first, and periodically flush to disk.

Materialized queries can help with data aging. As data gets older, it is accessed less frequently, and so you might wish to store it in slower and cheaper storage, at a lower replication level, or with less coverage by aggregate tables.

Conclusion

Discardable, In-Memory, Materialized Query (DIMMQ) data sets express how our applications use data in a way that allows Hadoop to automatically optimize and adapt how that data is stored, in particular by storing copies of that data in memory.

DIMMQs are superior to alternative uses of memory. Unlike a low-level buffer cache, a DIMMQ caches a high-level result, which can be much smaller and is closer to the required answer. And unlike non-declarative materializations such as Spark RDDs, an application can use a DIMMQ even if it does not know that the DIMMQ exists.

DIMMQ is built from three concepts: materialized query recognition, in-memory data sets, and discardable data sets. Together, they allow applications to seamlessly use heterogeneous storage — disk, SSD and memory — and quickly adapt to changing patterns of data use. And they provide a new level of data independence that will allow the Hadoop ecosystem to develop novel data organizations.

Materializations combined with HDFS Discardable Distributed Memory (DDM) storage are a major advance in Hadoop architecture; they build on Hadoop’s existing strengths and make Hadoop the place to store and process all of your enterprise’s data.


Comments

Matei Zaharia | June 2, 2014 at 11:55 am

Hey Julian, while your points apply to RDDs stored on the Scala heap in Spark, Spark has long supported RDDs stored off-heap and shared across applications with Tachyon (http://tachyon-project.org). Is there a major difference with DIMMQ in that case? (In fact even within the JVM Spark can store data in serialized form, see http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.) In any case though, it’s good to see Hive and Pig adopting a similar model, and an effort to have a common manager for these.

    Julian Hyde | June 2, 2014 at 12:30 pm

    There is a lot in common – and Spark and Tachyon have been an inspiration. I don’t dispute that Spark supports both memory-organized and disk-organized data, and that data can be shared.

    The most important differences are that (1) you access RDDs by reference (or, I surmise, one could access them by constructing an identical expression in Spark’s algebra) whereas one accesses DIMMQs by writing an algebraically similar expression, and (2) Spark’s algebra is very closely linked with the Scala API.

    In Spark, it would be difficult to build an agent that continuously observes patterns of query activity and pre-populates RDDs that will be useful to queries not yet seen. This would require radical algebraic transformations of queries, and those are difficult in Spark’s algebra.

    Also, creating an adaptive system requires a richer discard policy than I believe Spark/Tachyon currently support.

    Most of the other differences are, frankly, just differences of emphasis. We could each adopt the better points of the other’s architecture if we put our minds to it. So I don’t want this to be framed as “RDDs versus DIMMQs”.

    I think there are immense benefits to bringing all data and resources into one pool, a pool that supports many compute models, and where data can be shared, regardless of the engine that created it. That pool is what I call Hadoop. The Spark-on-YARN efforts have already shown that Spark fits into Hadoop very elegantly. If every item in Tachyon can be accessed via HDFS, if every RDD can be accessed as a Hive table, and if RDDs/tables can be accessed algebraically, I have my wish.

      Matei Zaharia | June 2, 2014 at 10:05 pm

      I think these differences are somewhat cosmetic. Files in Tachyon have names, so you can access datasets by name as well as by reference. And the issue of knowing the expression that built a dataset is at a higher level than the execution engine — you also couldn’t figure out the expression from an arbitrary Tez vertex DAG, or something like a MapReduce job running on Tez, only from systems like Hive and Pig that expose higher-level semantics. I just want to make sure that people understand what the differences actually are; the main difference I see is that there’s a proposal here to integrate these concepts into Pig and Hive, though one other difference with this proposal is that it’s HDFS-specific (whereas Tachyon also works over other filesystems).

Julian Hyde | June 2, 2014 at 9:45 am

Because DIMMQs are built on existing infrastructure, data locality is in many ways a non-problem. DIMMQs are tables / partitioned files, and Hadoop is good at ensuring that processing occurs near the data.

Using Hive’s new cost-based optimizer, we can convert this into a costing question. Is it better to process using a local copy of the data on disk or remote data in memory? What if the in-memory data is 10x smaller? What if the remote machine is over-subscribed? These can all be answered by costing. When using a DIMMQ for a query is an option, we will still consider using the raw data, and choose whichever plan has the lowest cost.

Is it worth doing a direct memory-to-memory copy, so that one processor can read data in its peer’s memory? That’s something to look at. The HDFS and Tez folks are thinking about it.

DIMMQs will often (not always) be much smaller than the raw data. The raw data might be spread across all nodes of the cluster, while the DIMMQ might be on just 1 or 2 nodes. We need strategies for querying such “small” (relatively speaking) tables extremely quickly, sending the query to just where the data is and returning the results in a few milliseconds.

By the way, I tend to use the term locality in a different sense: is the data organized so that successive accesses of data items (from memory or disk) tend to come from the same place? Tables in a column-oriented format such as ORC or Parquet will tend to have good locality, even when mapped into memory. So that seems like an obvious format for DIMMQs.

June 2, 2014 at 4:31 am

Do you have any plans or thoughts around data locality for DIMMQs? How would the location of an in-memory materialized view be communicated to the optimiser so it can make a trade off between the underlying disk data locality and the in-memory summary locality?

Julian Hyde | June 1, 2014 at 12:47 am

There are quite a few similarities — not surprising, both concepts were born out of a need to manage distributed memory. But there are a few key differences.

(A couple of disclaimers. I have a fair knowledge of Spark’s architecture, but I don’t know every detail. If I am incorrect or out of date, someone please correct me. Also, I’m comparing Spark as it stands today with the vision for what could reasonably be put into the DIMMQ architecture, which is a bit unfair, like comparing oranges to next year’s Calvados, but you did ask.)

1. RDDs are explicitly accessed via reference, DIMMQs are accessed by algebraic rewrite

Spark is a language-based system, and the language is Scala (or other JVM-based languages). In order to access and use an RDD, you need a Scala reference to it. There are several implications of this. One is that you can only use RDDs that you created in this session; you can’t share RDDs between sessions.

Materialized queries are used by pattern matching. If there is a materialized view that satisfies your query, you can use it (provided it is up-to-date and you are authorized to read it.) You don’t need to know its name, or even that it exists. The match can be approximate — the system will figure out what needs to be done to convert from the materialized query to the result you need. There might be multiple matches, and a cost-based optimizer can choose the best of them, or piece together a result from several materialized queries (“tiles”).

2. DIMMQs are easier to share across sessions

The pattern-matching approach is a good deal more difficult to implement. But it means that you can use materialized queries independent of language, and across session boundaries, and without applications knowing they exist. This is especially important for the many tools out there that generate SQL.

One reason why we chose to store materialized queries in HDFS was that HDFS provides a single namespace over the disk, and now also memory, resources of a Hadoop cluster. This makes them very easy to share among sessions and applications.

3. RDDs are part of the query-execution engine, DIMMQs are data plus metadata

RDDs are Spark’s query-processing mechanism (in addition to representing the query expression, the final result, and being the recovery mechanism). Combining all of these functions into one concept is clever, but it might be a bit too clever. It introduces a language dependency and other forms of coupling that may make the architecture hard to evolve in future.

Hadoop has several processing models (including MapReduce, Tez, and indeed Spark) and a materialized query created by one can be used by another. Materialized queries are the end-result of processing a query, and a data source in a query, but they are just tables. After a query optimizer has decided to use a particular materialized query, the execution plan just uses a table (perhaps an in-memory table, but with the HDFS extensions, that’s still just a table). Nice and simple.

4. Recovery policy

RDDs are part of Spark’s recovery model. If an RDD is needed for the evaluation of a query, then it is re-created.

If a DIMMQ fails (i.e. its last in-memory or on-disk replica disappears), HDFS doesn’t attempt to rebuild it; that would get HDFS involved in query planning and execution, an ill-advised crossing of architectural layers. The query optimization layer will decide whether to rebuild the DIMMQ, or to execute the query without it.

5. Discard policy

The only mechanism (other than failure) for an RDD to be discarded is JVM garbage collection under memory pressure.

We plan a more deliberate discard policy for DIMMQs. HDFS throws away a DIMMQ based on a metric such as ratio of cost (in memory over time) to benefit (query processing effort saved).

Why go to the trouble of such a sophisticated policy? Our (ambitious) goal is an adaptive system. As a particular piece of data becomes hot (say the rows in a table in a particular time range and geographical region), the system should create summaries of that data in memory to speed queries, and reduce the memory devoted to summaries or copies of data that is becoming colder. A good way to solve this is to have agents continually suggesting and creating DIMMQs — too many to keep all of them — balanced by another process (HDFS) discarding the weaker ones.

6. In-memory versus on-disk organization

RDDs are in-memory structures that exist on the JVM’s heap and need to be serialized when written to disk and de-serialized on read. The contents of DIMMQs are data organized for storage on disk. The data does not need to be re-organized when it is written or read.

There is no question that, from a programmer’s perspective, objects are easier to work with than bytes.

But “disk-organized” data tends to have better memory access performance than the “memory-organized” objects in a JVM’s heap. In disk-organized data, related items are contiguous and therefore are likely to fall within the same L2 or L3 cache line. JVM objects are less compact and tend to be allocated all over the address space.

RDDs always originate in memory, and a few are paged to disk. But DIMMQs can originate either in memory or on disk. We think people will build materialized views during their batch load process; those materialized views are much smaller than the raw data, and can quickly be pulled into memory when users start querying that data.

7. Language-centrism

As I mentioned earlier, the only way to access an RDD is via a reference in a Spark (or other JVM language) program.

DIMMQs can be accessed from any language that speaks relational algebra. Obviously SQL. But you can also use an Optiq API to build the relational algebra directly, and that will allow, for instance, Pig, Cascading and other Tez-based applications to specify their pipelines in such a way that they can be automatically rewritten to take advantage of suitable DIMMQs.

Wes Mitchell | May 31, 2014 at 1:26 pm

How does this compare to Spark’s notion of RDD?
