
Introduction to Apache HBase Concepts, Apache Phoenix and New Backup & Restore Utility in HBase

Lab 1: Introducing Apache HBase Concepts

Introduction

HBase is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS). It is an open-source project and is horizontally scalable. Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of unstructured data, while leveraging the fault tolerance provided by HDFS.

The HBase data model consists of tables, rows, column families, columns, cells and versions. A table is a logical collection of rows stored in separate partitions (regions). A row is one instance of data in a table and is identified by a rowkey. Data in a row is grouped together into Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together. Column Families form the basic unit of physical storage, so it is important to take proper care when designing the Column Families of a table. A Column is identified by a Column Qualifier, which consists of the Column Family name concatenated with the Column name using a colon. A Cell stores data and is essentially a unique combination of rowkey, Column Family and Column (Column Qualifier). The data stored in a Cell is versioned, and versions are identified by timestamps.
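
To make the versioning concrete, here is a small HBase shell illustration (the table and column names are made up for this example and are not part of the lab): a table is created whose column family keeps up to three versions, the same cell is written twice, and both versions are then read back.

hbase> create 'demo', {NAME => 'cf', VERSIONS => 3}
hbase> put 'demo', 'row1', 'cf:city', 'San Jose'
hbase> put 'demo', 'row1', 'cf:city', 'Santa Clara'
hbase> get 'demo', 'row1', {COLUMN => 'cf:city', VERSIONS => 3}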

For more information, refer to the HBase documentation here.

In this tutorial, we will walk you through some basic HBase shell commands and show how to use Apache Phoenix, which enables OLTP and operational analytics on Hadoop by combining the power of standard SQL and JDBC APIs with the flexibility of late-bound, schema-on-read capabilities from the NoSQL world, using HBase as its backing store.

Prerequisites

Outline

1. Start HBase

1.1 View the HBase Services page

In order to start or stop the HBase service, you must log into Ambari as an administrator. The default account (maria_dev) will not allow you to do this. Please follow these steps to set up a password for the admin account.
First SSH into the Hortonworks Sandbox with the command:

$>ssh root@127.0.0.1 -p 2222

sshTerminal

If you do not have an SSH client, you can also access the shell via http://localhost:4200/
Now run the following command to reset the password for the admin user:

$>ambari-admin-password-reset

admin_password_reset

Now navigate to Ambari at 127.0.0.1:8080 in your browser and enter your credentials.

From the Dashboard page of Ambari, click on HBase from the list of installed services.

hbaseServiceOnOff

1.2 Start HBase Service

From the HBase page, click on Service Actions -> Start

starthbaseService

Check the box and click on Confirm Start:

confirmhbaseStartIot

Optionally, check the Maintenance Mode box; Maintenance Mode suppresses alerts, warnings and status change indicators generated for the service.
Wait for HBase to start (it may take a few minutes to turn green).

hbaseStartedIot

2. Enter HBase Shell

HBase comes with an interactive shell from where you can communicate with HBase components and perform operations on them.

First SSH into the Hortonworks Sandbox with the command:

$>ssh root@127.0.0.1 -p 2222

Switch the user to hbase:

$>su hbase

switchTohbase

Type hbase shell and you will see the following screen:

enterhbaseShell

To exit the interactive shell, type exit or use <ctrl+c>. But wait, it is time to explore more features of the shell.
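
Before diving into the DDL and DML commands, you can try a few general shell commands to get oriented (optional; these are standard HBase shell commands):

hbase> status
hbase> version
hbase> whoami
hbase> help 'create'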

3. Data Definition Language Commands in HBase

These are the commands that operate on tables in HBase.

3.1 Create

The syntax to create a table in HBase is create '<table_name>','<column_family_name>'. Let's create a table called 'driver_dangerous_event' with a column family named events. Run the following command:

hbase> create 'driver_dangerous_event','events'

create_table

3.2 List

Let's check the table we've just created. Type the following command in the HBase shell:

hbase> list

list_table

4. Data Manipulation Commands in HBase

Let's import some data into the table. We'll use a sample dataset that tracks the driving records of a logistics company.

Open a new terminal and SSH into the Sandbox. Download the data.csv file and copy it into HDFS:

$>ssh root@127.0.0.1 -p 2222

$>curl -o ~/data.csv https://raw.githubusercontent.com/hortonworks/data-tutorials/d0468e45ad38b7405570e250a39cf998def5af0f/tutorials/hdp/hdp-2.5/introduction-to-apache-hbase-concepts-apache-phoenix-and-new-backup-restore-utility-in-hbase/assets/data.csv

$>hadoop fs -copyFromLocal ~/data.csv /tmp

copyFromLocal_data_csv

Now execute the ImportTsv statement as the hbase user, as follows:

$>su hbase

$>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=,  -Dimporttsv.columns="HBASE_ROW_KEY,events:driverId,events:driverName,events:eventTime,events:eventType,events:latitudeColumn,events:longitudeColumn,events:routeId,events:routeName,events:truckId" driver_dangerous_event hdfs://sandbox.hortonworks.com:/tmp/data.csv

importtsv_command

Now let's check whether the data was imported into the table driver_dangerous_event. Go back to the HBase shell.

4.1 Scan

The scan command is used to view the data in an HBase table. Type the following command:

hbase> scan 'driver_dangerous_event'

You will see all the data present in the table, with the row keys and the values of the different columns in the column family.

scan_command
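
On a larger table you usually do not want to dump every row. The scan command accepts optional arguments such as COLUMNS and LIMIT; for example, the following optional variation (not required for this lab) returns only the driverName column for the first two rows:

hbase> scan 'driver_dangerous_event', {COLUMNS => ['events:driverName'], LIMIT => 2}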

4.2 Put

Using the put command, you can insert rows into an HBase table. The syntax of the put command is as follows:
put '<table_name>','<row_key>','<column_family:column_name>','<value>'

Copy the following lines to put the data into the table:

put 'driver_dangerous_event','4','events:driverId','78'
put 'driver_dangerous_event','4','events:driverName','Carl'
put 'driver_dangerous_event','4','events:eventTime','2016-09-23 03:25:03.567'
put 'driver_dangerous_event','4','events:eventType','Normal'
put 'driver_dangerous_event','4','events:latitudeColumn','37.484938'
put 'driver_dangerous_event','4','events:longitudeColumn','-119.966284'
put 'driver_dangerous_event','4','events:routeId','845'
put 'driver_dangerous_event','4','events:routeName','Santa Clara to San Diego'
put 'driver_dangerous_event','4','events:truckId','637'

put_command

Now let's view the data using the scan command:

hbase>scan 'driver_dangerous_event'

scan_command1

You can also update an existing cell value using the put command. The syntax for updating is the same as for inserting a new value.

So let's update the routeName value for row key 4 from 'Santa Clara to San Diego' to 'Santa Clara to Los Angeles'. Type the following command in the HBase shell:

hbase>put 'driver_dangerous_event','4','events:routeName','Santa Clara to Los Angeles'

update_table

Now scan the table to see the updated data:

hbase>scan 'driver_dangerous_event'

scan_command2

4.3 Get

The get command is used to read data from an HBase table. It returns a single row of data at a time. The syntax for the get command is:
get '<table_name>','<row_key>'

Try typing get 'driver_dangerous_event','1' in the shell. You will see all the column families (in our case, there is only one column family) along with all the columns in the row.

get_command

You can also read a specific column with the get command. The syntax is as follows:
get '<table_name>','<row_key>',{COLUMN => 'column_family:column_name'}

Type the following statement to get the driverName column of the events column family from row 1:

hbase>get 'driver_dangerous_event','1',{COLUMN => 'events:driverName'}

get_command_column

If you want to view the data from two or more columns, list them with COLUMNS => [...]. Run the following command to get the driverName and routeId columns of the events column family from row key 1:

hbase>get 'driver_dangerous_event','1',{COLUMNS => ['events:driverName','events:routeId']}

get_command_two_columns

Summary

In this tutorial, we learned the basic concepts of Apache HBase and the different types of data definition and data manipulation commands available in the HBase shell. Check out Lab 2 of this tutorial, where we learn how to use Apache Phoenix with Apache HBase.

Appendix

ImportTsv Utility in HBase:

ImportTsv is a utility that loads data in TSV or CSV format into a specified HBase table. The column names of the TSV data must be specified using the -Dimporttsv.columns option. This option takes a comma-separated list of column names, where each column name is either a simple column family or a columnfamily:qualifier. The special column name HBASE_ROW_KEY designates the column that should be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data. In our case, events is a column family and driverId, driverName, etc. are columns.

The next argument is the name of the table into which you want the data imported.
The third argument specifies the input path of the CSV data in HDFS.
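
Putting these pieces together, the general shape of an ImportTsv invocation looks like the following (placeholders in angle brackets are to be filled in for your own data; -Dimporttsv.separator is only needed for CSV input, since the separator defaults to a tab):

$>hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,<column_family>:<column1>,<column_family>:<column2>" <table_name> <hdfs_input_path>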


Lab 2: Introducing Apache Phoenix Concepts

Introduction

Apache Phoenix is a SQL abstraction layer for interacting with HBase. Phoenix translates SQL into native HBase API calls and provides JDBC/ODBC and Python drivers.

For more information about Phoenix capabilities, see the Apache Phoenix website.

Prerequisites

Outline

1. Enable Phoenix with Ambari

There is no separate installation required for Phoenix. You can enable Phoenix with Ambari:

1. Go to Ambari and select Services tab > HBase > Configs tab.

config_HBase

2. Scroll down to the Phoenix SQL settings.

phoenix_tab

3. Click the Enable Phoenix slider button.

enable_phoenix

4. You have to change three more properties. Click on the Advanced tab and then scroll down to open the Advanced hbase-site section. Add org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor to the following three properties:

hbase.coprocessor.master.classes
hbase.coprocessor.region.classes
hbase.coprocessor.regionserver.classes

config_HBase1

5. Scroll up and click Save so that your configuration change takes effect.

save_phoenix_setting

A pop-up will appear asking you to describe the change you are making. Type Enabled Phoenix in it and click Save.

save_phoenix_setting_popup

After pressing Save, you will get a popup like this:

save_configuration_changes

6. Restart HBase service.

restart_HBase_service

2. Launch Phoenix Shell

To connect to Phoenix, you need to specify the ZooKeeper quorum; in the sandbox, it is localhost. To launch the Phoenix shell, execute the following commands:

$>cd /usr/hdp/current/phoenix-client/bin

$>./sqlline.py localhost

Your Phoenix shell will look like this:

enter_phoenix_shell

3. Create a Phoenix Table on an Existing HBase Table

You can create a Phoenix table or view on a pre-existing HBase table; there is no need to move or convert the data. Apache Phoenix supports table creation and versioned, incremental alterations through DDL commands. The table metadata is stored in an HBase table and versioned. You can create either a read-write table or a read-only view, with the condition that the binary representation of the row key and key values must match that of the Phoenix data types. The only addition made to the HBase table is the Phoenix coprocessors used for query processing. The Phoenix table is created with the same name as the HBase table.

NOTE: The DDL used to create the table is case sensitive, so if the HBase table name is lowercase, you have to put the name in double quotes. In HBase you do not model the possible KeyValues or the structure of the row key; that is information you specify in Phoenix, beyond the table and column family names.

Create a Phoenix table from the existing HBase table by running DDL like this:

create table "driver_dangerous_event" ("row" VARCHAR primary key,"events"."driverId" VARCHAR,"events"."driverName" VARCHAR,
"events"."eventTime" VARCHAR,"events"."eventType" VARCHAR,"events"."latitudeColumn" VARCHAR,
"events"."longitudeColumn" VARCHAR,"events"."routeId" VARCHAR,"events"."routeName" VARCHAR,
"events"."truckId" VARCHAR);

create_phoenix_table
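
If you only needed to query the HBase table from Phoenix and never write to it, you could instead have mapped it as a read-only view, as mentioned above. A minimal sketch mapping just a couple of the columns, shown for reference only (do not run it here, since the table created above already uses this name):

create view "driver_dangerous_event" ("row" VARCHAR primary key, "events"."driverId" VARCHAR, "events"."driverName" VARCHAR);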

You can view the HBase table data from this Phoenix table.

select * from "driver_dangerous_event";

select_data_phoenix

If you want to change the view from horizontal to vertical, type the following command in the shell and then try to view the data again:

!outputformat vertical

select * from "driver_dangerous_event";

select_data_phoenix1

If you do not like this view, you can change it back to horizontal view by running the following command:

!outputformat horizontal

With Phoenix, you can now query all of your existing HBase tables with SQL. You can point Business Intelligence tools, reporting tools and other SQL-based tools at HBase and query it as if it were just another SQL database.
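
For example, a standard SQL aggregation runs directly against the HBase-backed table (a simple illustration using one of the mapped columns):

select "eventType", count(*) from "driver_dangerous_event" group by "eventType";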

4. Inserting Data via Phoenix

You can insert data using the UPSERT command, which inserts the value if it is not present and updates it otherwise. The list of columns is optional; if it is not present, the values map to the columns in the order they are declared in the schema. Copy the UPSERT statement given below and then view the newly added row.

UPSERT INTO "driver_dangerous_event" values('5','23','Matt','2016-02-29 12:35:21.739','Abnormal','23.385908','-101.384927','249','San Tomas to San Mateo','814');

select * from "driver_dangerous_event";

upsert_data_phoenix

You will see a newly added row:

upsert_data_phoenix1
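
If you only want to set some of the columns, you can list them explicitly; any columns you leave out keep their current values (or stay empty for a new row). A small sketch, with an illustrative value of your choosing:

UPSERT INTO "driver_dangerous_event" ("row", "events"."eventType") values ('5','Normal');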

5. Setting up SQuirrel and Phoenix Integration

SQuirreL SQL Client is a graphical Java program that allows you to view the structure of a JDBC-compliant database, browse the data in tables, issue SQL commands, and so on. If you want a graphical user interface for Phoenix, you might give SQuirreL a try.

Go to this article on Hortonworks Community Connection to set up the SQuirreL client to work with Phoenix.

Summary

Congratulations! Let's summarize what we learned in this tutorial. We went through an introduction to Apache Phoenix and how to use it with Apache HBase, and we learned how to integrate Phoenix with SQuirreL. Go through Lab 3 of this tutorial, where we discuss another SQL tool, Apache Hive, and how to integrate it with Apache HBase.

Further Reading


Lab 3: Accessing Apache HBase Data with Apache Hive

Introduction

We can also use Hive to perform SQL queries on the data stored in HBase tables. We will use the storage handler mechanism to create HBase tables via Hive. The HBaseStorageHandler allows Hive DDL to manage table definitions in both the Hive metastore and the HBase catalog simultaneously and consistently.

Prerequisites

Outline

1. Setup HBase and Hive Integration

To set up the integration, some JAR files need to be added to the value of the hive.aux.jars.path property. Three JARs come from the lib directory of Hive:

zookeeper-*.jar
guava-*.jar
hive-hbase-handler-*.jar

And six JARs come from the lib directory of HBase:

hbase-client-*.jar
hbase-common-*.jar
hbase-protocol-*.jar
hbase-server-*.jar
hbase-shell-*.jar
hbase-thrift-*.jar

Let’s go to Ambari to add this property:

Click on Hive → Configs → Advanced

hive_config_page

Scroll down to find the Custom hive-site section and click on the Add Property button; the following pop-up will appear:

add_property

Type hive.aux.jars.path as the key. Check your HDP version and adjust the JAR names to match your version. For HDP 2.5.0.0-817, add the following JARs as the value:

file:///usr/hdp/2.5.0.0-817/hive/lib/zookeeper-3.4.6.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hive/lib/hive-hbase-handler-1.2.1000.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hive/lib/guava-14.0.1.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-client-1.1.2.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-common-1.1.2.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-protocol-1.1.2.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-server-1.1.2.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-shell-1.1.2.2.5.0.0-817.jar,
file:///usr/hdp/2.5.0.0-817/hbase/lib/hbase-thrift-1.1.2.2.5.0.0-817.jar

Your popup should look like this:

added_property

Click Add and then Save. You have to restart Hive and Oozie for your change to take effect.

restart_hive

2. Mapping Existing HBase tables to Hive

We need to use an external table to give Hive access to an existing HBase table with multiple columns and column families. External tables are used when you want your table to point to data that is managed elsewhere and left in place. With a normal internal (managed) table in Hive, data is stored in the default HDFS location, /apps/hive/warehouse.

Let's create a corresponding Hive table for the HBase table driver_dangerous_event.
Select the Hive View from the menu button next to the admin button and type the following DDL:

CREATE EXTERNAL TABLE hbase_table_driver_dangerous_event(key string, driverId  string, driverName string, eventTime string, eventType string, latitudeColumn string, longitudeColumn string, routeId string, routeName string, truckId string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,events:driverId,events:driverName,events:eventTime,events:eventType,events:latitudeColumn,events:longitudeColumn,events:routeId,events:routeName,events:truckId")
TBLPROPERTIES("hbase.table.name" = "driver_dangerous_event");

This statement registers the HBase table named driver_dangerous_event in the Hive metastore, accessible from Hive by the name hbase_table_driver_dangerous_event.
HBaseStorageHandler is used to register the HBase table with the Hive metastore.
hbase.columns.mapping is used to link the Hive column names to the HBase table's row key (:key) and columns. Your view should look like this:

create_external_table

Click on the green Execute button to run the query. Your table will be created. Now let's verify the contents of this table. Refresh the database explorer and you will see your new table; click on the menu button next to it to view its data.

view_table

Wait about 10 seconds and you will see the data of the table.

view_data
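
Once the mapping is in place, ordinary HiveQL works against the HBase-backed table. For example, a simple illustrative query over the mapped columns (the filter value matches the row we inserted with put in Lab 1):

SELECT driverName, routeName FROM hbase_table_driver_dangerous_event WHERE eventType = 'Normal';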

Summary

Congratulations! Let's summarize what we learned in this tutorial. We now know how to integrate Hive with HBase, and you can run SQL queries over HBase data using Hive. Check out Lab 4 of this tutorial series, where we discuss the new HBase Backup and Restore utility.


Lab 4: Apache HBase Backup and Restore (New feature introduced in HDP 2.5)

Introduction

The HBase backup and restore utility helps you back up table schemas and data and enables you to recover your environment should a failure occur. The utility also supports incremental backups, which means you don't have to take a full backup each time.

Prerequisites

Outline

1. Creating a Full Backup

The first step in running the backup-and-restore utilities is to capture the complete data set in a separate image from the source. The syntax for creating an HBase backup is as follows:

hbase backup create { {full | incremental} {backup_root_path} {[tables] | [-set backup_set_name]} } [[-silent] | [-w number_of_workers] | [-b bandwidth_per_worker]]

Arguments

  • full | incremental – full takes a full backup image. incremental creates an incremental backup that captures an image of the data changes since the full backup or the previous incremental backup.
  • backup_root_path – The root path where you want to store your backup image.
  • tables (optional) – Specifies the table or tables to back up. If no tables are specified, all tables are backed up.
  • -set backup_set_name (optional) – Invokes an existing backup set in the command.
  • -silent (optional) – Ensures that the progress of the backup is not displayed on the screen.
  • -w number_of_workers (optional) – Specifies the number of parallel workers to copy data for the backup.
  • -b bandwidth_per_worker (optional) – Specifies the bandwidth of each worker in MB per second.

Now create a full backup of the table driver_dangerous_event under the HDFS path hdfs://sandbox.hortonworks.com:8020/user/hbase/backup with 3 parallel workers. Run the following command from the command line:

$>hbase backup create full hdfs://sandbox.hortonworks.com:8020/user/hbase/backup driver_dangerous_event -w 3

create_full_backup_image

You can check whether the backup of your table was created in HDFS:

$>hadoop fs -ls /user/hbase/backup

view_full_backup

Another way to check whether the backup was taken is by running:

$>hbase backup history

backup_history

Note the backup_ID, which will be used while restoring the data.
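
The same syntax also covers incremental backups: once a full backup exists, a later incremental image that captures only the changes since the previous backup could be created along these lines (a sketch based on the syntax above, not a required step in this lab):

$>hbase backup create incremental hdfs://sandbox.hortonworks.com:8020/user/hbase/backup driver_dangerous_event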

2. Backup Sets

You can group tables into a backup set to reduce the amount of repetitive typing of table names. You can then use the -set argument to invoke the named backup set in either the hbase backup create or the hbase backup restore utility. The syntax to manage a backup set is:

hbase backup set {[add] | [remove] | [list] | [describe] | [delete]} backup_set_name tables

If you run the hbase backup set add command and specify a backup set name that does not yet exist on your system, a new set is created. If you run the command with the name of an existing backup set, the tables that you specify are added to the set.

Arguments

  • add – Adds tables to a backup set. Specify a backup_set_name value after this argument to create a backup set.
  • remove – Removes tables from the set. Specify the tables to remove in the tables argument.
  • list – Lists all backup sets.
  • describe – Displays a description of a backup set on the screen. This subcommand must precede a valid value for the backup_set_name value.
  • delete – Deletes a backup set. Enter the value for the backup_set_name option directly after the hbase backup set delete command.
  • backup_set_name (optional) – Used to assign or invoke a set name.
  • tables (optional) – List of tables to include in the backup set.

Now create a backup set called event that contains the table driver_dangerous_event:

$>hbase backup set add event driver_dangerous_event

create_backup_set

Let's check whether our set was added using list:

$>hbase backup set list

view_backup_set
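
As described above, the set name can now stand in for the list of table names when creating a backup. For example, a full backup of every table in the event set could be taken with a command like this (a sketch using the -set argument, not a required step in this lab):

$>hbase backup create full hdfs://sandbox.hortonworks.com:8020/user/hbase/backup -set event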

3. Restoring a Backup

The syntax for running a restore utility is as follows:

hbase restore {[backup_root_path] | [backup_ID] | [tables]} [[table_mapping] | [-overwrite]]

Arguments

  • backup_root_path – Specifies the parent location of the stored backup image.
  • backup_ID – The backup ID that uniquely identifies the backup image to be restored.
  • tables – Table or tables to be restored.
  • table_mapping (optional) – Directs the utility to restore data in the tables that are specified in the tables option. Each table must be mapped prior to running the command.
  • -overwrite (optional) – Overwrites an existing table if there is one with the same name in the target restore location.
  • -automatic (optional) – Restores both the backup image and all the dependencies following the correct order.

Let's drop a table so that the restore utility can be tested. To drop a table in HBase, you first have to disable it and then drop it.

Run the following commands from the hbase shell to drop the table:

hbase>disable 'driver_dangerous_event'

hbase>drop 'driver_dangerous_event'

drop_table

You also have to delete the mapping table from Phoenix to avoid any confusion.

Exit from the hbase shell and go to /usr/hdp/current/phoenix-client/bin and open the Phoenix shell using:

cd /usr/hdp/current/phoenix-client/bin
./sqlline.py localhost

Now drop that mapping table:

drop table "driver_dangerous_event";

drop_table_phoenix

Exit the Phoenix shell by typing !quit, and now drop the table from Hive as well. Go back to the Hive view and drop the table:

drop table hbase_table_driver_dangerous_event;

Click on the green Execute button; you will see something like this:

drop_table_hive

Now let’s restore the backup of this table which you created earlier in the tutorial:

NOTE: Use the same backup ID that you got when running hbase backup history; yours may differ from the one shown below.

$>hbase restore /user/hbase/backup backup_1466560117119 driver_dangerous_event -automatic

restore_command_result

You can view the result at the end of this command’s execution.

restore_command

Summary

Congratulations! Let's summarize what we learned in this tutorial. We went through the new Backup & Restore utility in HBase. We created a full backup of our HBase table, created a backup set and then restored the deleted table.