Apache Hadoop Ozone is a distributed key-value store that can efficiently manage both small and large files alike. Ozone is designed to work well with the existing Apache Hadoop ecosystem and also fill the need for a new open-source object store that is designed for ease of operational use and scale to thousands of nodes and billions of objects in a single cluster.
Earlier articles Introducing Ozone and Ozone Overview introduced the Ozone design philosophy and key concepts. This developer-oriented article dives deeper into the system architecture. We examine the building blocks of Ozone and see how they can be put together to build an scalable distributed storage system.
To scale Ozone to billions of files, we needed to solve two bottlenecks that exist in HDFS.
We can no longer store the entire namespace in the memory of a single node. The key insight is that the namespace has locality of reference so we can store just the working set in memory. The namespace is managed by a service called the Ozone Manager.
This is a harder problem to solve. Unlike the namespace, the block map does not have locality of reference since storage nodes (DataNodes) periodically send block reports about each block in the system. Ozone delegates this problem to a shared generic storage layer called Hadoop Distributed DataStore (HDDS).
Ozone consists of the following key components.
In this section, we see how to put these building blocks together to create a distributed key-value store.
At its lowest layer, Ozone stores user data in Storage Containers. A container is a collection of key-value pairs of block name and its data. Keys are block names which are locally unique within the container. Values are the block data and can vary from 0 bytes up to 256MB. Block names need not be globally unique.
Each container supports a few simple operations:
Containers are stored on disk using RocksDB, with some optimizations for larger values.
Distributed filesystems must tolerate the loss of individual disks/nodes, hence we need a way to replicate containers over the network. To achieve this we introduce a few additional properties of containers.
Container replicas are stored on DataNodes.
Containers start out in the open state. Clients write blocks to open containers and then finalize the block data with an commit operation. There are two phases in which a block is written:
Now we know how to store blocks in containers and replicate containers over the network. The next step is to build a centralized service that knows where all the containers are stored in the cluster. This service is the SCM.
SCM gets periodic reports from all DataNodes telling about the container replicas on those nodes and their current state. The SCM can choose a set of three DataNodes to store a new open container and direct them to mutually form a RAFT replication ring.
SCM also learns when a container is becoming full and direct the leader replica to “close” the container. SCM also detects under/over replicated close containers and ensures three replicas exist for each closed container.
With the above three building blocks, we have all the pieces to create HDDS, a distributed block storage layer without a global namespace.
DataNodes are now organized into groups of three with each group forming a Ratis replication ring. Each ring can have multiple open containers.
SCM receives reports from each DataNode once every 30 seconds informing about open and closed container replicas on each node. Based on this reports SCM makes decisions such as allocating new containers, closing open containers, re-replicating closed containers on disk/data loss.
Clients of SCM can request the allocation of a new block, and then write the block data into the assigned container. Clients can also read blocks in open/closed containers and delete blocks. The key point is that HDDS itself does not care about the contents of individual containers. The contents are completely managed by the application that is using SCM.
With HDDS in place, the only missing ingredient is a global Key-Value Namespace. This is provided by the OzoneManager. OM is a mapping service from key names to the corresponding set of blocks.
Clients can write multiple blocks into HDDS and then commit they key->blocks mapping atomically into OM to make the key visible in the namespace.
OM stores its own state in a RocksDB database.
HDDS can be used as a block storage layer by other distributed filesystem implementations. Some examples that have been discussed and may be implemented in the future:
And more… Bring your own namespace!