Round-trip Data Enrichment between Teradata and Hadoop

Oozie workflows with Sqoop and Hive

Hadoop can be a great complement to existing data warehouse platforms, such as Teradata, as it naturally helps to address key storage and processing challenges.

The purpose of this article is to detail some of the key integration points and to show how data can be easily exchanged for enrichment between the two platforms.

As a data integrator who is familiar with RDBMS platforms and new to the Hadoop platform, I was looking for a simple way (i.e. a “SQL way”) to exchange data with Teradata.  Fortunately, it was just a matter of identifying the tools and connecting the dots.

Using pre-built adapters (Teradata Connector for Hadoop, Hortonworks Teradata Connector), SQL-based protocols for data exchange (Sqoop and Hive) and a GUI interface for Apache Hadoop (HUE, Oozie), simple workflows can be created to deliver big value.

First, let’s review the key components:

  • Teradata Connector for Hadoop (TDCH). Teradata’s adapter for moving data to/from Hadoop, including data-type mapping, parallel-processing capabilities and native loading options.
  • Hortonworks Teradata Connector (HTC). A Sqoop wrapper for TDCH that implements standard data exchange syntax for Hadoop.
  • Apache Sqoop. A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Apache Hive. An Apache project that extends a logical SQL structure over HDFS and supports queries using a language called HiveQL.
  • Apache Oozie. A workflow scheduler system to manage Apache Hadoop jobs.

Although the parts are many, the majority of the work was performed within HUE, which provides XML and script generation as well as a graphical test environment and logging capabilities.

Step 1:  Setup

The framework that will co-ordinate the data movement and manipulation process is the Oozie workflow engine, so it is here that the configuration begins.

An Oozie workflow consists of an application (deployment) folder and configuration files (workflow.xml and job.properties) that provide the processing instructions and variable definitions for the workflow job.
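For reference, a minimal job.properties for this kind of workflow might look like the sketch below. The NameNode and ResourceManager addresses are placeholder values for a sandbox cluster and will vary by environment; HUE supplies equivalent settings automatically when the job is submitted from the web interface.

# job.properties - illustrative values only
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
queueName=default
# HDFS path to the application (deployment) folder (adjust to where td_oozie_demo was created)
oozie.wf.application.path=${nameNode}/user/hue/td_oozie_demo
# expose the shared Oozie libraries (Sqoop, Hive) to the workflow actions
oozie.use.system.libpath=true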

In HDFS, the application folder td_oozie_demo and its lib sub-directory were created under the hue user account.

# su - hue
# hadoop fs -mkdir td_oozie_demo
# hadoop fs -mkdir td_oozie_demo/lib

Supporting JAR files for each of the technologies we will leverage can be placed in the “lib” sub-directory (a sample copy command follows the list of JARs below).

Below is the list of supporting JAR files, with their source locations shown as comments:

# /usr/lib/sqoop/lib
/td_oozie_demo/lib/avro-1.5.3.jar
/td_oozie_demo/lib/avro-mapred-1.5.3.jar
/td_oozie_demo/lib/commons-io-1.4.jar
/td_oozie_demo/lib/snappy-java-1.0.3.2.jar
/td_oozie_demo/lib/sqoop-1.4.4.2.0.6.0-76.jar
# /usr/lib/hive/lib
/td_oozie_demo/lib/antlr-runtime-3.4.jar
/td_oozie_demo/lib/datanucleus-api-jdo-3.2.1.jar
/td_oozie_demo/lib/datanucleus-core-3.2.2.jar
/td_oozie_demo/lib/derby-10.4.2.0.jar
/td_oozie_demo/lib/hive-cli-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-common-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-exec-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-metastore-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-serde-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/hive-service-0.12.0.2.0.6.0-76.jar
/td_oozie_demo/lib/jackson-core-asl-1.7.3.jar
/td_oozie_demo/lib/jdo-api-3.0.1.jar
/td_oozie_demo/lib/jline-0.9.94.jar
/td_oozie_demo/lib/jopt-simple-3.2.jar
/td_oozie_demo/lib/libfb303-0.9.0.jar
/td_oozie_demo/lib/opencsv-2.3.jar
# https://downloads.teradata.com/download/connectivity/teradata-connector-for-hadoop-sqoop-integration-edition
/td_oozie_demo/lib/tdgssconfig.jar
/td_oozie_demo/lib/teradata-connector-1.1.1-hadoop200.jar
/td_oozie_demo/lib/terajdbc4.jar
# http://hortonworks.com/products/hdp-2/#add_ons
/td_oozie_demo/lib/hortonworks-teradata-connector-1.1.1.2.0.6.1-101.jar
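Assuming the JAR files above have been staged on the local file system (the /tmp/td_oozie_libs path below is only an example), they can be pushed into the lib sub-directory and verified as follows:

# hadoop fs -put /tmp/td_oozie_libs/*.jar td_oozie_demo/lib/
# hadoop fs -ls td_oozie_demo/lib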

One more key requirement is to copy the hive-site.xml file into the application folder (renamed here to hive-config.xml).  This file gives the workflow’s Hive actions the configuration they need to reach the HCatalog metadata store for Hive.

# hadoop fs -put /usr/lib/hive/conf/hive-site.xml td_oozie_demo/hive-config.xml

Once the initial setup is in place, the remaining steps can be performed in the HUE graphical interface.  Note that the workflow.xml file mentioned earlier will be auto-generated during “Step 3:  Process Definition”.

Step 2:  Configuration

In a default HDP 2.0 installation, the HUE interface can be reached from the following URL: http://<hostname or ip>:8000

From the HUE website, browse to the Oozie Editor and create a new Workflow:

[Screenshot: creating a new workflow in the HUE Oozie Editor]

Edit the Workflow properties to specify the application directory (td_oozie_demo) and the job configuration file (hive-config.xml).

[Screenshot: workflow properties in the HUE Oozie Editor]

Step 3:  Process Definition

The Oozie Editor supports several types of workflow “actions” that can be coordinated; for this example we will focus on two types: Hive and Sqoop.

The workflow will consist of the following process steps:

  1. Hive:  Create the target Hive table that will accept import from Teradata
  2. Sqoop:  Perform the import from Teradata to Hive
  3. Hive:  Generate an aggregated result set that enriches imported Teradata records with existing Hive/Hadoop data
  4. Sqoop:  Perform an export of the Hive result set to Teradata

Process step 1 will run the following SQL script, hive_script1.sql:

DROP TABLE IF EXISTS CLIENT;

CREATE TABLE CLIENT (
	C_CUSTKEY INT,
	C_NAME STRING,
	C_ADDRESS STRING,
	C_NATIONKEY INT,
	C_PHONE STRING,
	C_ACCTBAL DOUBLE,
	C_MKTSEGMENT STRING,
	C_COMMENT STRING
)
ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\t';

Using the workflow editor, add a “Hive” action and upload the hive_script1.sql file:

[Screenshot: Hive action for hive_script1.sql in the workflow editor]

Process step 2 will perform a Sqoop import from the Teradata table “CLIENT” into the Hive table “client” that was created in the previous step.

Using the workflow editor, add a “Sqoop” action and specify the IMPORT command:

import --connect jdbc:teradata://<HOSTNAME OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CLIENT --hive-table client

Note:  This example connects to the Teradata Studio Express 14 VM available from Teradata.

[Screenshot: Sqoop action for the Teradata import in the workflow editor]
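Before wiring the import into the workflow, the JDBC connection details can be sanity-checked from the command line. The sketch below uses sqoop eval with the plain Teradata JDBC driver (rather than the Teradata connection manager) and assumes terajdbc4.jar and tdgssconfig.jar are already on the Sqoop classpath (e.g. in /usr/lib/sqoop/lib); the host name is a placeholder:

# sqoop eval --connect jdbc:teradata://<HOSTNAME OR IP ADDRESS>/Database=retail --driver com.teradata.jdbc.TeraDriver --username dbc --password dbc --query "SELECT COUNT(*) FROM CLIENT"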

Process step 3 will run the following SQL script, hive_script2.sql, which generates an aggregated result set from the imported table (client) and the existing tables (contract, item):

DROP TABLE IF EXISTS CUST_BY_YEAR;

CREATE TABLE IF NOT EXISTS CUST_BY_YEAR(
	C_CUSTKEY INT,
	C_NAME STRING,
	C_YEAR INT,
	C_TOTAL_BY_YEAR DOUBLE
)
ROW FORMAT DELIMITED
	FIELDS TERMINATED BY '\t';
  	
INSERT OVERWRITE TABLE CUST_BY_YEAR
SELECT c.c_custkey, c.c_name,
	substr(o_orderdate,0,4) AS o_year,
	sum(l_extendedprice) AS o_year_total
FROM client c
JOIN contract o ON c.c_custkey = o.o_custkey
JOIN item l ON o.o_orderkey = l.l_orderkey
GROUP BY c.c_custkey, c.c_name, substr(o_orderdate,0,4);

Using the workflow editor, add a “Hive” action and upload the hive_script2.sql file:

[Screenshot: Hive action for hive_script2.sql in the workflow editor]

Process step 4 will perform a Sqoop export from the Hive table cust_by_year to the Teradata table HDP_IMPORT.CUST_BY_YEAR.

Using the workflow editor, add a “Sqoop” action and specify the EXPORT command:

export --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=HDP_IMPORT --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table CUST_BY_YEAR --export-dir /apps/hive/warehouse/cust_by_year --input-fields-terminated-by "\t"

Note:  The DDL for the target Teradata table CUST_BY_YEAR is:

CREATE SET TABLE HDP_IMPORT.CUST_BY_YEAR 
(
C_CUSTKEY INTEGER NOT NULL,
C_NAME VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
C_YEAR INTEGER NOT NULL,
C_YEARTOTAL DECIMAL(15,2) NOT NULL);

[Screenshot: Sqoop action for the Teradata export in the workflow editor]

Upon completion of the workflow definition, the following workflow.xml is auto-generated in the Oozie application folder:

	<workflow-app name="Teradata Oozie Demo" xmlns="uri:oozie:workflow:0.4">
		<global>
			<job-xml>/td_oozie_demo/hive-config.xml</job-xml>
		</global>
		<start to="create_hive_target"/>
			<action name="create_hive_target">
				<hive xmlns="uri:oozie:hive-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<script>/td_oozie_demo/hive_script1.sql</script>
				</hive>
				<ok to="td_to_hive"/>
				<error to="kill"/>
			</action>
			<action name="td_to_hive">
				<sqoop xmlns="uri:oozie:sqoop-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<command>import --connect jdbc:teradata://192.168.1.13/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CLIENT --hive-table client</command>
				</sqoop>
				<ok to="populate_hive_results"/>
				<error to="kill"/>
			</action>
			<action name="populate_hive_results">
				<hive xmlns="uri:oozie:hive-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<script>/td_oozie_demo/hive_script2.sql</script>
				</hive>
				<ok to="hive_to_td"/>
				<error to="kill"/>
			</action>
			<action name="hive_to_td">
				<sqoop xmlns="uri:oozie:sqoop-action:0.2">
					<job-tracker>${jobTracker}</job-tracker>
					<name-node>${nameNode}</name-node>
					<command>export --connect jdbc:teradata://192.168.1.13/Database=HDP_IMPORT --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table CUST_BY_YEAR --export-dir /apps/hive/warehouse/cust_by_year --input-fields-terminated-by &quot;\t&quot;</command>
				</sqoop>
				<ok to="end"/>
				<error to="kill"/>
			</action>
			<kill name="kill">
				<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
			</kill>
			<end name="end"/>
	</workflow-app>

Step 4:  Testing and Execution

Using the Oozie Workflow Manager, the newly created job can be submitted for execution.

[Screenshot: submitting the workflow from the Oozie Workflow Manager]
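As an alternative to HUE, the same job can be submitted and monitored from the command line with the Oozie client, assuming a job.properties file like the one sketched in Step 1 (port 11000 is the default Oozie server port and may differ on other clusters):

# oozie job -oozie http://<hostname or ip>:11000/oozie -config job.properties -run
# oozie job -oozie http://<hostname or ip>:11000/oozie -info <job-id>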

The workflow actions can be tracked and analyzed, with each action maintaining its own process log file.  The interface also provides a “Rerun” option that allows for unit testing should any action fail to complete.

[Screenshot: workflow action tracking and logs in HUE]
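Once the workflow completes successfully, the round trip can be spot-checked on both platforms with simple queries such as the following (illustrative only):

-- Hive: confirm the imported and aggregated rows
SELECT COUNT(*) FROM client;
SELECT * FROM cust_by_year LIMIT 10;

-- Teradata: confirm the exported result set
SELECT COUNT(*) FROM HDP_IMPORT.CUST_BY_YEAR;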

Final thoughts

In this example, records from Teradata are imported to Hadoop.  SQL-like processing occurs in a series of Oozie Workflow “actions” in the Hadoop platform.  A smaller, aggregated result is then returned to Teradata.

By leveraging Hadoop and its SQL integration points, it is possible to off-load storage and processing demands from a data warehouse to a Big Data store.

Integration between Teradata and the Hortonworks Data Platform (HDP) is not limited to the example in this article.  SQL-H integration for “on-demand” data exchange with Hadoop and Viewpoint/Ambari integration for monitoring and management of Hadoop are both powerful capabilities available within the Teradata suite of products.

This article’s focus was on the integration features within HDP, but these same tools (Oozie, Sqoop and Hive) can interact with almost any RDBMS that supports JDBC.

Additional Notes:

This integration demo was set up between two downloadable VMs:

  1. Hortonworks – HDP 2.0 Sandbox
  2. Teradata Studio Express v14

Data for the Hive tables “contract” and “item” was originally sourced from the Teradata Studio Express “retail” database using Sqoop.

sqoop import -libjars ${LIB_JARS} --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table ITEM --hive-table item

sqoop import -libjars ${LIB_JARS} --connect jdbc:teradata://<HOST OR IP ADDRESS>/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --hive-import --username dbc --password dbc --table CONTRACT --hive-table contract
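Note that ${LIB_JARS} is not defined by Sqoop itself; it simply stands for a comma-separated list of the Teradata connector JARs. One way it might be set (paths and versions are illustrative):

# export LIB_JARS=/usr/lib/sqoop/lib/teradata-connector-1.1.1-hadoop200.jar,/usr/lib/sqoop/lib/terajdbc4.jar,/usr/lib/sqoop/lib/tdgssconfig.jar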

Target Hive table DDL for the Teradata retail tables ITEM and CONTRACT:

CREATE TABLE IF NOT EXISTS ITEM (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS CONTRACT (o_orderkey int, o_custkey int, o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string, o_clerk string, o_shippriority int, o_comment string)
ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t';

See documentation from Hortonworks related to the Hortonworks Teradata Connector for more information on Sqoop integration with Teradata.
