Hoya (HBase on YARN): persistence

In the last Hoya article, we talked about its Application Architecture. Now let's talk persistence. A key use case for Hoya is to support long-lived clusters that can be started and stopped on demand. This lets a user start and stop an HBase cluster at will, consuming CPU and memory resources only when they are actually needed. For example, a specific MR job could use a private HBase instance as part of its join operations, or as an intermediate store of results in a workflow.

Hoya-created clusters support the operations stop and start. When a Hoya HBase cluster is stopped, the cluster is shut down and all its resources are returned to YARN. What is not lost is the HBase data itself, nor the information needed to bring the cluster back up.

How is this done?

All the data about a cluster is stored in a well known location on HDFS, currently ${user.home}/.hoya/cluster/${clustername}.
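As a quick illustration of that layout (only the path pattern comes from Hoya; the helper and example values below are hypothetical):

```python
# Resolving the well-known Hoya cluster directory from the pattern
# ${user.home}/.hoya/cluster/${clustername}. The path pattern is Hoya's;
# this helper and the example values are purely illustrative.

def hoya_cluster_dir(user_home: str, clustername: str) -> str:
    """Return the filesystem path holding all persistent state for a cluster."""
    return f"{user_home}/.hoya/cluster/{clustername}"

print(hoya_cluster_dir("/user/alice", "demo1"))
# /user/alice/.hoya/cluster/demo1
```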

Each cluster's directory contains four entries:

- A JSON file describing the cluster to create or recreate
- The original configuration directory, a copy of the one supplied by the user
- The directory containing a dynamically created cluster configuration for HBase
- The HBase database itself

The cluster.json file describes the cluster; it is saved when the cluster is created. When a user updates the cluster, a new version of the file is written.
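As a hedged sketch of that lifecycle (the real cluster.json schema is Hoya's own; every field name below is an assumption for illustration), creating and updating the descriptor amounts to a read-modify-write of one JSON file:

```python
import json

# Hypothetical cluster.json contents; the field names are illustrative only.
descriptor = {"name": "demo1", "workers": 3}

# Saved when the cluster is created...
text = json.dumps(descriptor, indent=2)

# ...and rewritten whenever the user updates the cluster.
updated = json.loads(text)
updated["workers"] = 5
text = json.dumps(updated, indent=2)

print(json.loads(text)["workers"])
# 5
```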

The two configuration directories, original and generated, contain the cluster configuration directory supplied to HBase. When a cluster is created, the --confdir option names the directory containing the configuration. This can be in any Hadoop-supported filesystem. Hoya makes a copy of this directory for re-use when rebuilding the cluster.

When a cluster is being built up, Hoya creates a new configuration directory, generated, which holds a patched version of the configuration. In Hoya 0.1, this primarily consists of patching the site configuration with filesystem details, including the location of the HBase data files. Hoya also sets the properties naming the ports the HBase master and region servers listen on to "0", instructing HBase to find a free port for each. We do this, rather than rely on hard-coded port numbers, to ensure that one HBase cluster does not conflict with another running on the same servers.
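A minimal sketch of that patching step, assuming a site configuration modelled as a dict; the HBase property names are the standard ones, but the helper itself is an illustration, not Hoya's actual code:

```python
# Sketch of the site-configuration patching described above. The property
# names (hbase.rootdir, hbase.master.port, ...) are standard HBase ones;
# the helper itself is illustrative, not Hoya's code.

def patch_site_config(site: dict, cluster_dir: str) -> dict:
    patched = dict(site)
    # Filesystem details: HBase data lives under the cluster's hbase/ subdirectory.
    patched["hbase.rootdir"] = f"{cluster_dir}/hbase"
    # Port "0" tells HBase to pick any free port, so clusters sharing
    # the same hosts cannot clash over hard-coded port numbers.
    for prop in ("hbase.master.port", "hbase.regionserver.port",
                 "hbase.master.info.port", "hbase.regionserver.info.port"):
        patched[prop] = "0"
    return patched

site = patch_site_config({}, "hdfs:///user/alice/.hoya/cluster/demo1")
print(site["hbase.rootdir"])
# hdfs:///user/alice/.hoya/cluster/demo1/hbase
```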

The generated configuration is passed to every HBase server that Hoya starts. Hoya registers everything in the directory as data for YARN to copy to whichever host is assigned the server; YARN then handles the details of creating the destination directory, copying the files, and cleaning up afterwards.

The HBase data itself lives in the hbase/ subdirectory in HDFS. This is not copied to the local servers, as it must be accessible by all services in the HBase cluster.

Once the cluster is started, the HBase services work directly with the data in the cluster’s hbase/ directory. The generated configuration files are reused whenever a worker node is created. Again YARN does the work of copying them to the destination servers.

The cluster configuration file is updated when the user flexes the cluster size, adding or removing nodes. This ensures that a recreated cluster is the same size it was when last running.

When the cluster stop command is issued – stop <clustername> – the cluster is shut down simply by killing all the HBase service processes. The data remains on HDFS until the cluster is needed again, along with the cluster and HBase configuration files.

Restarting the cluster is then a simple matter of re-reading the cluster.json file, building a new generated configuration directory, and starting the cluster as normal. The Hoya AM neither knows nor cares whether a cluster is newly created or being restarted: it just starts the HBase master, then requests YARN containers for the workers, starting region servers on them when they are allocated.
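The restart flow can be sketched as follows; every function here is a stand-in that merely records the sequence of steps, not Hoya's real API:

```python
# Stand-in restart flow: each step appends to `events` instead of doing real work.
events = []

def read_descriptor(path):
    events.append(f"read {path}")        # re-read cluster.json
    return {"workers": 2}                # illustrative descriptor contents

def build_generated_config(cluster_dir):
    events.append("rebuild generated configuration")

def start_hbase_master():
    events.append("start HBase master")

def start_region_server(i):
    events.append(f"container {i}: start region server")

def restart(cluster_dir):
    spec = read_descriptor(f"{cluster_dir}/cluster.json")
    build_generated_config(cluster_dir)
    # The AM neither knows nor cares whether this is a fresh create or a restart.
    start_hbase_master()
    for i in range(spec["workers"]):     # request YARN containers for workers
        start_region_server(i)

restart("/user/alice/.hoya/cluster/demo1")
print(len(events))
# 5
```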

A cluster is destroyed by deleting the cluster’s directory under the ~/.hoya/cluster path. This loses all data in the cluster, so is a drastic action. It should only be done if you are confident that the cluster and its data are never going to be needed again. As a safety check, you cannot destroy a running cluster – it must be stopped first. This is to reduce the risk of accidental data loss.

Because all this information is kept in a directory, you can play other tricks with it: copying it to make a complete snapshot of an HBase cluster, data and configuration alike; renaming the cluster by renaming its directory; or even copying it to a remote site. Be careful to stop the HBase cluster before manipulating its data: just as with classic HBase clusters, messing with live data is considered harmful.
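For example, the snapshot and rename tricks reduce to standard `hdfs dfs` operations on the cluster directory. The `-cp` and `-mv` commands are real HDFS shell commands; the base path and helper functions below are illustrative, and the cluster must be stopped first:

```python
# Building `hdfs dfs` command lines for the directory tricks above.
# -cp and -mv are real HDFS shell commands; the base path and helpers
# are illustrative. Only manipulate a cluster's directory while it is stopped.

BASE = "/user/alice/.hoya/cluster"   # example ${user.home}/.hoya/cluster

def snapshot_cmd(cluster: str, snapshot: str) -> list:
    """Copy the whole cluster directory: data and configuration together."""
    return ["hdfs", "dfs", "-cp", f"{BASE}/{cluster}", f"{BASE}/{snapshot}"]

def rename_cmd(old: str, new: str) -> list:
    """Renaming the directory effectively renames the cluster."""
    return ["hdfs", "dfs", "-mv", f"{BASE}/{old}", f"{BASE}/{new}"]

print(" ".join(snapshot_cmd("demo1", "demo1-backup")))
# hdfs dfs -cp /user/alice/.hoya/cluster/demo1 /user/alice/.hoya/cluster/demo1-backup
```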

To summarize: Hoya saves all the cluster-specific information into a well-known location under the user's home directory: the Hoya cluster configuration, the HBase cluster configuration, and the data itself. This enables Hoya to restart previously created clusters, a feature users invoke via the cluster stop and start operations. Being able to stop and start HBase clusters lets you create clusters for specific workflows, starting them when needed and stopping them when not.

Take a look at Hoya: HBase on YARN here, and find out more about YARN here.
