
Cross Component Lineage with Apache Atlas across Apache Sqoop, Hive, Kafka & Storm


Introduction

Hortonworks introduced Apache Atlas as part of the Data Governance Initiative and has continued to deliver on the vision of an open source solution for a centralized metadata store, data classification, data lifecycle management and centralized security.
Atlas now offers, as a tech preview, cross-component lineage functionality, delivering a complete view of data movement across a number of analytic engines such as Apache Storm, Kafka, Falcon and Hive.
This tutorial walks through the steps for creating data in Apache Hive through Apache Sqoop and using Apache Kafka with Apache Storm.

Prerequisites

Outline

  1. Configure Hive to work with Atlas
  2. Start Kafka, Storm, HBase, Ambari Infra and Atlas
  3. Sqoop-Hive Lineage
  4. Kafka – Storm Lineage

1: Configure Hive to work with Atlas

Start by logging into Ambari as the raj_ops user. User name: raj_ops, password: raj_ops.

1.1: View the Services Page

ambari_dashboard_rajops

From the Dashboard page of Ambari, click on Hive in the list of installed services.
Then click on the Configs tab and search for atlas.hook.hive.synchronous in the filter text box.

search_hive_config

This property takes a boolean value and specifies whether the Atlas-Hive hook runs synchronously. By default it is false; change it to true so that lineage is captured for Hive operations.

save_hive_config

Next, search for hive.warehouse.subdir.inherit.perms in the filter text box. This property also takes a boolean value. By default it is set to true; change it to false to avoid an AccessControlException: HDP 2.5 and later include a patch for BUG-55664 that adds HIVE_WAREHOUSE_INHERIT_PERMS, which was not part of HDP 2.4. With the property set to false, the permissions of Hive table directories are derived from the dfs umask instead.

hive_warehouse_perms_config

Click Save after you make the changes. Enter Atlas-hive hook enabled in the prompt and proceed with saving the change. You now have to restart Hive: click Restart and then Restart All Affected.
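
Once Hive has restarted, you can optionally verify the values Ambari wrote out. This is only a sanity check, assuming the default config paths on the HDP 2.6 sandbox:

# Confirm the Atlas-Hive hook setting (Ambari writes it into the Hive conf directory)
grep -r 'atlas.hook.hive.synchronous' /etc/hive/conf/
# Confirm the warehouse permission setting in hive-site.xml (-A1 also prints the <value> line)
grep -A1 'hive.warehouse.subdir.inherit.perms' /etc/hive/conf/hive-site.xml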

2: Start Kafka, Storm, HBase, Ambari Infra and Atlas

From the Dashboard page of Ambari, click on Kafka from the list of installed services.

new_select_kafka

2.1: Start Kafka Service

From the Kafka page, click on Service Actions -> Start

start_kafka

Check the Maintenance Mode box and click on Confirm Start:

confirmation_kafka

Wait for Kafka to start (It may take a few minutes to turn green)

new_started_kafka

In the same way you started Kafka above, start the other required services, in this order (a command-line alternative is sketched after the list):

  1. Storm
  2. HBase
  3. Ambari Infra
  4. Atlas
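
If you prefer the command line, each of these services can also be started through the Ambari REST API. The call below is only a sketch: it assumes the sandbox cluster is named Sandbox, that Ambari listens on port 8080, and that raj_ops has operator rights. Repeat it with the service name Ambari uses (e.g. KAFKA, STORM, HBASE, AMBARI_INFRA, ATLAS).

# Start a service (here Storm) by setting its desired state to STARTED
curl -u raj_ops:raj_ops -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start STORM via REST"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  http://sandbox-hdp.hortonworks.com:8080/api/v1/clusters/Sandbox/services/STORM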

(The Atlas hooks for Sqoop and Storm themselves are enabled later, in sections 3.4 and 4.4.)

2.2: Stop Services

Stop the services that are not required in this tutorial, such as Spark, Oozie, Flume and Zeppelin, and turn on Maintenance Mode for each of them.
Your Ambari dashboard page should look like this:

new_ambari_dashboard_rajops

3: Sqoop-Hive Lineage

We will use a set of scripts to create a MySQL table and then import it into Hive using Sqoop.

3.1: Log into the Sandbox.

First, access the Sandbox Web Shell Client at sandbox-hdp.hortonworks.com:4200. The initial password for the root user is hadoop.

Alternatively, you can ssh into the sandbox from your terminal or Windows Ubuntu Shell: ssh root@localhost -p 2222.

The text you see on your screen should look similar to this:

sandbox login: root
root@sandbox.hortonworks.com's password:
Last login: Fri Jan  5 06:05:29 2018 from 10.0.2.2
[root@sandbox-hdp ~]#

3.2: Download & extract the demo script

Run the following commands to download and extract the scripts for this tutorial.

mkdir crosscomponent_demo
cd crosscomponent_demo
wget https://github.com/hortonworks/data-tutorials/raw/master/tutorials/hdp/cross-component-lineage-with-apache-atlas-across-apache-sqoop-hive-kafka-storm/assets/crosscomponent_scripts.zip
unzip crosscomponent_scripts.zip
cd crosscomponent_scripts/sqoop-demo

download_and_extract

download_and_extract2

3.3: Create a mysql table

Run the command below in your terminal to log into the MySQL shell, create a table called test_table_sqoop1, and insert two records:

cat 001-setup-mysql.sql | mysql -u root -p

NOTE: the default password for the MySQL root user is hadoop. Type it and press Enter when prompted for the password.

setup_mysql_script
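
For reference, the script logs into MySQL and issues plain DDL/DML. A minimal sketch of what a setup script like 001-setup-mysql.sql does is shown below; the database name and column layout here are assumptions, so check the actual file in crosscomponent_scripts/sqoop-demo for the exact schema.

# Hypothetical equivalent of 001-setup-mysql.sql (database and columns are assumed, not the actual script)
mysql -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE IF NOT EXISTS test_table_sqoop1 (id INT PRIMARY KEY, name VARCHAR(64));
INSERT INTO test_table_sqoop1 VALUES (1, 'record one'), (2, 'record two');
SQL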

3.4: Run the SQOOP Job

Before we run the Sqoop job, let's configure the Atlas Sqoop hook with the following commands:

cp /etc/atlas/conf/atlas-application.properties /etc/sqoop/conf
ln -s /usr/hdp/2.6.4.0-91/atlas/hook/sqoop/*.jar /usr/hdp/2.6.4.0-91/sqoop/lib/

  • cp copies the Atlas configuration properties into the Sqoop configuration directory
  • ln links the Atlas Sqoop-hook jars into the Sqoop library folder

If you want to read up on Sqoop Hook from the documentation, visit Sqoop Atlas Bridge.
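
As a quick sanity check, you can confirm that the properties file was copied and the hook jars were linked. The paths below match the HDP 2.6.4 commands above.

# Verify the copied Atlas properties and the symlinked Sqoop hook jars
ls -l /etc/sqoop/conf/atlas-application.properties
ls -l /usr/hdp/2.6.4.0-91/sqoop/lib/ | grep atlas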

Run the command below in your terminal. It is a Sqoop import command that transfers the data from the MySQL table test_table_sqoop1 to the Hive table test_hive_table1. The Hive table does not have to be pre-created; it is created on the fly.

sh 002-run-sqoop-import.sh

NOTE: the default password for the MySQL root user is hadoop. Type it and press Enter when prompted for the password.

Here is a screenshot of the results you should see on the screen when you run the above script.

sqoop_import_script

The script runs a MapReduce job, and at the end you can see your new Hive table created:

sqoop_import_script2
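
For context, the import you just ran (002-run-sqoop-import.sh) wraps a standard sqoop import with the Hive import options. The sketch below shows the general shape of such a command; the JDBC URL, database name and mapper count are assumptions, not the script's exact contents.

# Hypothetical shape of the import run by 002-run-sqoop-import.sh (connection details are assumed)
sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table test_table_sqoop1 \
  --hive-import \
  --hive-table test_hive_table1 \
  -m 1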

3.5: Create CTAS sql command

CTAS stands for Create Table As Select. We will create one more Hive table from the table imported by the Sqoop job above. The second table is named cur_hive_table1, and we will create it using the Beeline shell.
Run the command below in your terminal:

cat 003-ctas-hive.sql | beeline -u "jdbc:hive2://localhost:10000/default" -n hive -p hive -d org.apache.hive.jdbc.HiveDriver

ctas_script
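
For reference, the heart of 003-ctas-hive.sql is a single CTAS statement. A minimal sketch of the equivalent Beeline call is shown below; the SELECT here is an assumption, as the actual script may project or transform columns.

# Hypothetical equivalent of 003-ctas-hive.sql run through Beeline
beeline -u "jdbc:hive2://localhost:10000/default" -n hive -p hive <<'SQL'
CREATE TABLE cur_hive_table1 AS SELECT * FROM test_hive_table1;
SQL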

3.6: View ATLAS UI for the lineage

Open the Atlas UI at http://sandbox-hdp.hortonworks.com:21000. The credentials are:

User name – holger_gov
Password – holger_gov

atlas_login

Click on Search by Text and type cur_hive_table1

search_hive_table

You will see a lineage graph like the one below. You can hover over each node to see the operation performed:

hive_lineage
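
If you prefer to query Atlas programmatically, the same entity can be found through its REST API. This is a sketch assuming the Atlas 0.8 v2 basic-search endpoint shipped with HDP 2.6; the UI remains the easiest way to explore the lineage graph.

# Basic search for the Hive table created by the CTAS step (returns matching entities as JSON)
curl -u holger_gov:holger_gov \
  'http://sandbox-hdp.hortonworks.com:21000/api/atlas/v2/search/basic?typeName=hive_table&query=cur_hive_table1'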

4: Kafka – Storm Lineage

The following steps will show the lineage of data from the Kafka topic my-topic-01 to the Storm topology storm-demo-topology-01, which stores its output in the HDFS folder /user/storm/storm-hdfs-test.

4.1: Create a Kafka topic to be used in the demo

Run the following commands to create a new Kafka topic, my-topic-01:

cd ../storm-demo
sh 001-create_topic.sh

create_topic_script
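
For reference, 001-create_topic.sh is a thin wrapper around the Kafka topic tool. A sketch of the kind of command it runs is shown below; the ZooKeeper address and partition/replication settings are assumptions based on the single-node sandbox.

# Hypothetical equivalent of 001-create_topic.sh on the HDP 2.6 sandbox
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create \
  --zookeeper sandbox-hdp.hortonworks.com:2181 \
  --replication-factor 1 --partitions 1 \
  --topic my-topic-01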

4.2: Create a HDFS folder for output

Run the following command to create a new HDFS directory under /user/storm:

sh 002-create-hdfs-outdir.sh

create_hdfs_directory_script
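
For reference, 002-create-hdfs-outdir.sh simply prepares the output location in HDFS. A sketch of the equivalent commands is shown below; the exact ownership the script sets is an assumption.

# Hypothetical equivalent of 002-create-hdfs-outdir.sh
sudo -u hdfs hdfs dfs -mkdir -p /user/storm/storm-hdfs-test
sudo -u hdfs hdfs dfs -chown -R storm:hadoop /user/storm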

4.3: Download STORM job jar file (optional)

Source is available at https://github.com/yhemanth/storm-samples.
Run the following command:

sh 003-download-storm-sample.sh

Since the jar file is already downloaded in the VM, you will see the message below:

Storm Jar file is already download in /root/crosscomponent_demo/crosscomponent_scripts/storm-demo/lib folder
You can view the source for this at https://github.com/yhemanth/storm-samples

download_storm_script

4.4: Run the Storm Job

Before we deploy the Storm topology, we need to enable the Atlas hook in the Storm configs.

1. Navigate to Ambari UI, click on Storm, then Configs.

2. Search for storm.atlas.hook and check the box to the right of "Enable Atlas Hook." Save the configuration with a note such as enable storm atlas hook, then click Save.

enable_storm_atlas_hook

Run the following command:

sh 004-run-storm-job.sh

run_storm_script

run_storm_script2
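
For context, 004-run-storm-job.sh submits the topology with the standard storm jar command. The sketch below only illustrates the shape of that call; the jar file name and main class are assumptions (see the source linked in step 4.3 for the real ones).

# Hypothetical shape of the submission done by 004-run-storm-job.sh (jar and class names are assumed)
storm jar lib/storm-samples.jar com.example.KafkaToHdfsTopology storm-demo-topology-01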

4.5: View ATLAS UI for the lineage

Go back to the Atlas UI at http://localhost:21000/. This time, search for kafka_topic and click on my-topic-01.

search_kafka_topic

Scroll down and you will see a lineage of all the operations from Kafka to Storm.

kafka_storm_lineage

Summary

Apache Atlas is the only governance solution for Hadoop that has native hooks within multiple Hadoop components and delivers lineage across these components. With the new preview release, Atlas now supports lineage across data movement in Apache Sqoop, Hive, Kafka, Storm and Falcon.

Further Reading

Go through the following Hortonworks Community articles to learn more about Apache Atlas:

  1. Understanding Taxonomy in Apache Atlas
  2. Hive Data Lineage using Apache Atlas
