Hoya (HBase on YARN): persistence
In the last Hoya article, we talked about its application architecture. Now let’s talk persistence. A key use case for Hoya is supporting long-lived clusters that can be started and stopped on demand. This lets a user start and stop an HBase cluster when they want, using CPU and memory resources only when they actually need them. For example, a specific MR job could use a private HBase instance as part of its join operations, or as an intermediate store of results in a workflow.
Hoya-created clusters support the operations stop and start. When a Hoya HBase cluster is stopped, the cluster is shut down and all its resources are returned to YARN. Neither the HBase data itself nor the information needed to bring the cluster back up is lost.
How is this done?
All the data about a cluster is stored in a well known location on HDFS, currently
Each cluster’s directory contains four entries:
A JSON file describing the cluster to create/recreate
The original configuration directory, a copy of the one supplied by the user
The directory containing a dynamically created cluster configuration for HBase
The HBase database itself
The cluster.json file describes the cluster, and it is saved when the cluster is created. When a user updates the cluster, a new version of the file is created.
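As a rough illustration of the layout just described, the sketch below creates the four entries and saves and reloads a cluster description. It uses the local filesystem as a stand-in for HDFS. The directory name `original` and the JSON fields `name` and `workers` are assumptions made for the example, not Hoya's actual schema.

```python
import json
import tempfile
from pathlib import Path


def create_layout(base: Path, name: str, spec: dict) -> Path:
    """Create one cluster's directory with its four entries."""
    root = base / name
    # "generated" and "hbase" are named in the article; "original" is assumed.
    for sub in ("original", "generated", "hbase"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # The description is saved once, when the cluster is created.
    (root / "cluster.json").write_text(json.dumps(spec, indent=2))
    return root


def load_spec(base: Path, name: str) -> dict:
    """Read back the saved description of a cluster."""
    return json.loads((base / name / "cluster.json").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        root = create_layout(base, "demo", {"name": "demo", "workers": 4})
        print(sorted(p.name for p in root.iterdir()))
        print(load_spec(base, "demo"))
```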
The two configuration directories (the copy of the original and generated) hold the cluster configuration supplied to HBase. When a cluster is created, the --confdir option names the directory containing the configuration. This can be in any Hadoop-supported filesystem. Hoya makes a copy of this directory for re-use when rebuilding a cluster.
When a cluster is being built up, Hoya creates a new configuration directory, generated, which holds a patched version of the configuration. In Hoya 0.1, this primarily consists of patching the site configuration with filesystem details, including the location of the HBase data files. Hoya also sets the properties naming the ports that the HBase master and region servers listen on to (by default) “0”, which instructs the HBase services to find a free port. We need to do this rather than rely on hard-coded port numbers, to ensure that one HBase cluster does not conflict with another running on the same servers.
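To make the patching step concrete, here is a minimal sketch of rewriting a Hadoop-style site file. The property names `hbase.rootdir`, `hbase.master.port` and `hbase.regionserver.port` are standard HBase keys, but the assumption that these are exactly the ones Hoya patches is mine. The function names and the HDFS URL in the usage lines are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed set of port properties; Hoya may patch more or different keys.
PORT_PROPERTIES = ("hbase.master.port", "hbase.regionserver.port")


def set_property(root: ET.Element, name: str, value: str) -> None:
    """Set a <property> entry, replacing the value if the name exists."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value


def patch_site_config(site_xml: str, data_dir: str) -> str:
    """Return the site XML with the data location and ports patched."""
    root = ET.fromstring(site_xml)
    set_property(root, "hbase.rootdir", data_dir)
    for name in PORT_PROPERTIES:
        set_property(root, name, "0")  # 0 asks the service to pick a free port
    return ET.tostring(root, encoding="unicode")


original = (
    "<configuration>"
    "<property><name>hbase.master.port</name><value>60000</value></property>"
    "</configuration>"
)
# Hypothetical data location, for illustration only.
print(patch_site_config(original, "hdfs://namenode/clusters/demo/hbase"))
```

The hard-coded port in the user's configuration is overwritten rather than duplicated, so two clusters built from the same original directory cannot end up competing for the same port.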
The generated configuration is passed to all HBase servers that Hoya starts. Hoya registers everything in the directory as data for YARN to copy to whichever host is given the server. YARN then handles the details of creating the destination directory, copying the files, and cleaning up afterwards.
The HBase data itself lives in the hbase/ subdirectory in HDFS. This is not copied to the local servers, as it must be accessible by all services in the HBase cluster.
Once the cluster is started, the HBase services work directly with the data in the cluster’s hbase/ directory. The generated configuration files are reused whenever a worker node is created. Again, YARN does the work of copying them to the destination servers.
The cluster configuration file is updated when the user flexes the cluster size by adding or removing nodes. This ensures that a recreated cluster is the same size as it was when it was last running.
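A flex operation can be pictured as a small read-modify-write of the cluster description. In this sketch, the `workers` field and the `flex` function are hypothetical, and the local filesystem again stands in for HDFS. The new version is written beside the old file and then swapped in, which is one way (not necessarily Hoya's) to avoid a half-written description.

```python
import json
import os
import tempfile
from pathlib import Path


def flex(spec_path: Path, workers: int) -> dict:
    """Record a new worker count so a later restart recreates that size."""
    if workers < 0:
        raise ValueError("worker count cannot be negative")
    spec = json.loads(spec_path.read_text())
    spec["workers"] = workers  # hypothetical field name
    # Write the new version next to the old one, then replace it in one step.
    tmp_path = spec_path.with_name(spec_path.name + ".tmp")
    tmp_path.write_text(json.dumps(spec, indent=2))
    os.replace(tmp_path, spec_path)
    return spec


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cluster.json"
        path.write_text(json.dumps({"name": "demo", "workers": 4}))
        flex(path, 6)
        print(json.loads(path.read_text()))
```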
When the cluster stop command is issued (stop <clustername>), the cluster is shut down simply by killing all the HBase service processes. The data remains on HDFS until the cluster is needed again, along with the cluster and HBase configuration files.
Restarting the cluster then becomes the simple matter of re-reading the cluster.json file, building a new generated/ configuration directory, and starting the cluster as normal. The Hoya AM doesn’t know or care whether a cluster is newly created or being restarted; it just starts the HBase master, then requests YARN containers for the workers and starts region servers on them when they are allocated.
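The point that creation and restart share one code path can be shown with a toy model. Everything here is hypothetical: `FakeYarn` grants containers immediately, whereas real YARN allocation is asynchronous, and `AppMaster` only records what it would do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeYarn:
    """Stand-in for YARN that grants every container request at once."""
    next_id: int = 1

    def allocate(self, count: int) -> List[str]:
        granted = [f"container_{n}" for n in range(self.next_id, self.next_id + count)]
        self.next_id += count
        return granted


@dataclass
class AppMaster:
    yarn: FakeYarn
    events: List[str] = field(default_factory=list)

    def start_cluster(self, spec: dict, conf_dir: str) -> None:
        # The same steps serve a brand-new and a restarted cluster.
        self.events.append(f"master started with {conf_dir}")
        for container in self.yarn.allocate(spec["workers"]):
            self.events.append(f"region server started on {container}")


am = AppMaster(FakeYarn())
am.start_cluster({"name": "demo", "workers": 2}, "generated")
print("\n".join(am.events))
```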
A cluster is destroyed by deleting the cluster’s directory under the ~/.hoya/cluster path. This loses all data in the cluster, so it is a drastic action. It should only be done if you are confident that the cluster and its data are never going to be needed again. As a safety check, you cannot destroy a running cluster; it must be stopped first. This is to reduce the risk of accidental data loss.
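The safety check can be sketched as a guard in front of a recursive delete. The names `destroy`, `ClusterRunningError` and the `is_running` callback are hypothetical. In Hoya the running state would come from YARN, and the delete would act on HDFS rather than on a local directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


class ClusterRunningError(RuntimeError):
    """Raised when asked to destroy a cluster that has not been stopped."""


def destroy(base: Path, name: str, is_running: Callable[[str], bool]) -> None:
    """Delete a stopped cluster's directory, configuration and data alike."""
    root = base / name
    if not root.is_dir():
        raise FileNotFoundError(f"no such cluster: {name}")
    if is_running(name):
        raise ClusterRunningError(f"stop cluster {name!r} before destroying it")
    shutil.rmtree(root)  # irreversible


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "demo" / "hbase").mkdir(parents=True)
        try:
            destroy(base, "demo", lambda name: True)
        except ClusterRunningError as err:
            print("refused:", err)
        destroy(base, "demo", lambda name: False)
        print("exists after destroy:", (base / "demo").exists())
```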
Because the information is all kept in a directory, you can play other tricks with it: copying it to make an entire snapshot of an HBase cluster (data and configuration), renaming the cluster by renaming its directory, or even copying it to a remote site. Be careful to stop the HBase clusters before manipulating their data: just as with classic HBase clusters, messing with the live data is considered harmful.
To summarize: Hoya saves all the cluster-specific information about a cluster into a well-known location under the user’s home directory: the Hoya cluster configuration, the HBase cluster configuration, and the data itself. This enables Hoya to restart previously created clusters, a feature that users reach through the cluster stop and start operations. Being able to stop and start HBase clusters lets you create clusters for specific workflows, starting them when needed and stopping them when not.