Hello World – Getting Started with HDP
In this tutorial we will be analyzing geolocation and truck data. We will import this data into HDFS and build derived tables in Hive. Then we will process the data using Pig and Hive. The processed data is then imported into Microsoft Excel, where it can be visualized.
Prerequisites
- Hortonworks Sandbox 2.3 (installed and running)
Goals of the Tutorial
The goal of this tutorial is to get you familiar with the basics of the following:
- Hadoop and HDP
- Ambari File User Views and HDFS
- Ambari Hive User Views and Apache Hive
- Ambari Pig User Views and Apache Pig
- Data Visualization with Excel
Concepts: Hadoop & HDP
In this section we will learn about Apache Hadoop and what makes it scale to large data sets. We will also talk about the various components of the Hadoop ecosystem that make Apache Hadoop enterprise ready, in the form of the Hortonworks Data Platform (HDP) distribution. The module discusses Apache Hadoop and its capabilities as a data platform, and how the core of Hadoop and its surrounding ecosystem of solutions provide the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture, a step on the journey toward delivering an enterprise ‘Data Lake’.
Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common, the libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS), a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN, a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
- Hadoop MapReduce, a programming model for large scale data processing.
Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles.
Hortonworks Data Platform (HDP)
Hortonworks Data Platform is a packaged Hadoop software distribution that aims to ease deployment and management of Hadoop clusters, compared with simply downloading the various Apache code bases and trying to run them together as a system. Architected, developed, and built completely in the open, Hortonworks Data Platform (HDP) provides an enterprise-ready data platform that enables organizations to adopt a Modern Data Architecture.
With YARN as its architectural center, it provides a data platform for multi-workload data processing across an array of processing methods – from batch through interactive to real-time – supported by key capabilities required of an enterprise data platform, spanning Governance, Security and Operations.
The Hortonworks Sandbox is a single node implementation of the Hortonworks Data Platform (HDP). It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business big data problems. The Sandbox tutorials will walk you through bringing some sample data into HDP and manipulating it using the tools built into HDP. The idea is to show you how you can get started and show you how to accomplish tasks in HDP. HDP is free to download and use in your enterprise and you can download it here: HDP on Sandbox
Lab 0: Set-up
Start the Sandbox VM and Open Ambari
Follow the Sandbox Install Guide to start the HDP Sandbox VM.
Once you have installed the Sandbox VM, it resolves to a host on your environment; the address varies depending upon the virtualization software you are using (VMware, VirtualBox, etc.). As a general rule of thumb, wait for the installation to complete, and the confirmation screen will tell you the host your Sandbox resolves to. For example:
In the case of VirtualBox, the host would be 127.0.0.1.
If you are using a private cluster or a cloud to run the Sandbox, find the host your Sandbox resolves to there.
Append the port number :8888 to your host address, open your browser, and access Sandbox Welcome page at http://host:8888/.
Navigate to the Ambari welcome page using the URL given on the Sandbox welcome page.
Both the username and password to login are admin.
If you want to look up the host address your Sandbox is running on, ssh into the Sandbox terminal after installation completes and follow these steps (an example session is sketched below):
- Log in using the username “root” and the password “hadoop”.
- Type ifconfig and look for the inet address under the eth interface.
- Take that inet address, append :8080, and open it in a browser. It will take you to the Ambari login page.
- This inet address is assigned per session and can therefore differ from session to session.
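For reference, a minimal example session for the steps above might look like the following; the host and inet address are placeholders and will differ on your machine:
ssh root@<sandbox-host> -p 2222   # password: hadoop (the host is 127.0.0.1 for VirtualBox)
ifconfig                          # look for the inet addr under the eth interface
# open http://<inet addr>:8080 in a browser to reach the Ambari login page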
The following user interfaces are also useful (reachable from the host address determined above):
- Sandbox welcome page (http://host:8888)
- Hive User View
- Pig User View
- Files User View
- SSH web client
Enter the Ambari Welcome URL and then you should see a similar screen:
There are 5 key capabilities to explore in the Ambari Welcome screen:
- “Operate Your Cluster” will take you to the Ambari Dashboard which is the primary UI for Hadoop Operators
- “Manage Users + Groups” allows you to add & remove Ambari users and groups
- “Clusters” allows you to grant permission to Ambari users and groups
- “Ambari User Views” lists the set of Ambari User Views that are part of the cluster
- “Deploy Views” provides administration for adding and removing Ambari User Views
Take a few minutes to quickly explore these 5 capabilities and become familiar with their features.
Enter the Ambari Dashboard URL and you should see a similar screen:
Briefly skim through the Ambari Dashboard links (circled above) by clicking on
- Metrics, Heatmap and Configuration
and then the
- Dashboard, Services, Hosts, Alerts, Admin and User Views icons (the User Views icon is represented by a 3×3 matrix) to become familiar with the Ambari resources available to you.
To learn more about Hadoop please explore the HDP Getting Started documentation.
If you have questions, feedback or need help getting your environment ready visit developer.hortonworks.com. Please also explore the HDP documentation. To ask a question check out the Hortonworks Forums.
Lab 1: HDFS – Loading Sensor Data into HDFS
In this section you will download the sensor data and load it into HDFS using the Ambari User Views. You will be introduced to the Ambari Files User View for managing files: you can perform tasks like creating directories, navigating the file system and uploading files to HDFS. In addition you’ll perform a few other file-related tasks. Once you have the basics, you will create two directories and then load two files into HDFS using the Ambari Files User View.
- HDFS backdrop
- Step 1.1: Download data – Geolocation.zip
- Step 1.2: Load Data into HDFS
- Suggested readings
A single physical machine eventually saturates its storage capacity as data grows, which creates the need to partition data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform HDP 2.2, HDFS was expanded to support heterogeneous storage media within the HDFS cluster.
Step 1.1: Download and Extract the Sensor Data Files
- You can download the sample sensor data contained in a compressed (.zip) folder here: Geolocation.zip
- Save the Geolocation.zip file to your computer, then extract the files. You should see a Geolocation folder that contains the following files:
- geolocation.csv – the geolocation data collected from the trucks. It contains records showing truck location, date, time, type of event, speed, etc.
- trucks.csv – data exported from a relational database. It shows info on truck model, driverid, truckid, and aggregated mileage.
Step 1.2: Load the Sensor Data into HDFS
- Go to the Ambari Dashboard and open the HDFS Files User View by clicking on the User Views icon and selecting the HDFS Files menu item.
- Starting from the top root of the HDFS file system, you will see all the files the logged-in user (admin in this case) has access to see:
- Click tmp, then click the new directory button to create the /tmp/admin directory, and then create the /tmp/admin/data directory.
- Now traverse to the /tmp/admin/data directory and upload the geolocation.csv and trucks.csv files into it (an equivalent set of command-line operations is sketched after these steps).
You can also perform the following operations on a file by right clicking on the file: Download, Move, Permissions, Rename and Delete.
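If you prefer the command line, the same directories and uploads can also be performed with the HDFS CLI from a terminal on the Sandbox. This is an optional sketch, assuming the two csv files have already been copied onto the Sandbox (for example with scp) and that your user has write access to /tmp:
hdfs dfs -mkdir -p /tmp/admin/data
hdfs dfs -put geolocation.csv trucks.csv /tmp/admin/data/
hdfs dfs -ls /tmp/admin/data      # verify both files are present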
Data manipulation with Hive
In this section of the tutorial you will be introduced to Apache Hive. In the earlier section we covered how to load data into HDFS, so you now have the geolocation and trucks files stored in HDFS as csv files. To use this data in Hive, we will show you how to create a table and how to move data into the Hive warehouse, from where it can be queried. We will analyze this data using SQL queries in the Hive User View and store it as ORC. We will also walk through Apache Tez and how a DAG is created when you specify Tez as the execution engine for Hive. Let’s start!
- Hive basics
- Step 2.1: Use Ambari Hive User Views
- Step 2.2: Define a Hive Table
- Step 2.3: Load Data into Hive Table
- Step 2.4: Define an ORC table in Hive
- Step 2.5: Review Hive Settings
- Step 2.6: Analyze Truck Data
- Suggested readings
Hive provides a SQL-like query language that enables analysts familiar with SQL to run queries on large volumes of data. Hive has three main functions: data summarization, query and analysis. Hive provides tools that enable easy data extraction, transformation and loading (ETL).
Step 2.1: Become Familiar with Ambari Hive User View
Apache Hive presents a relational view of data in HDFS and ensures that users need not worry about where or in what format their data is stored. Hive can display data from RCFile format, text files, ORC, JSON, Parquet, sequence files and many other formats in a tabular view. Through the use of SQL you can view your data as a table and create queries like you would in an RDBMS. To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called the Ambari Hive User View. Ambari Hive User View provides an interactive interface to Hive. We can create, edit, save and run queries, and have Hive evaluate them for us using a series of MapReduce jobs or Tez jobs. Let’s now open the Ambari Hive User View and get introduced to the environment: go to the Ambari User View icon and select Hive. The Ambari Hive User View looks like the following. Now let’s take a closer look at the SQL editing capabilities in the User View:
- There are four tabs to interact with SQL:
- Query: This is the interface shown above and the primary interface to write, edit and execute new SQL statements
- Saved Queries: You can save your favorite queries and quickly have access to them to rerun or edit.
- History: This allows you to look at past queries or currently running queries to view, edit and rerun. It also allows you to see all SQL queries you have authority to view. For example, if an analyst needs help with a query, the Hadoop operator can use the History feature to see the query that was sent from the reporting tool.
- UDFs: Allows you to define UDF interfaces and associated classes so you can access them from the SQL editor.
- Database Explorer: The Database Explorer helps you navigate your database objects. You can either search for a database object in the Search tables dialog box, or you can navigate through Database -> Table -> Columns in the navigation pane.
- The principal pane to write and edit SQL statements. This editor includes content assist via CTRL + Space to help you build queries. Content assist helps you with SQL syntax and table objects.
The command to autocomplete queries is CTRL-Space on all systems including Mac OS X.
- Once you have created your SQL statement you have 3 options:
- Execute: This runs the SQL statement.
- Explain: This provides you a visual plan, from the Hive optimizer, of how the SQL statement will be executed.
- Save as: Allows you to persist your queries into your list of saved queries.
- When the query is executed you can see the Logs or the actual query results.
- Logs: When the query is executed you can see the logs associated with the query execution. If your query fails, this is a good place to get additional information for troubleshooting.
- Results: You can view results in sets of 50 by default.
- There are four sliding views on the right hand side with the following capabilities, which are in context of the tab you are in:
- Query: This is the default operation, which allows you to write and edit SQL.
- Settings: This allows you to set properties globally or associated with an individual query.
- Visual Explain: This will generate an explain for the query. This will also show the progress of the query.
- TEZ: If you use Tez as the query execution engine then you can view the DAG associated with the query. This integrates the Tez User View so you can check for correctness and helps with performance tuning by visualizing the Tez jobs associated with a SQL query.
- Notifications: This is how to get feedback on query execution.
Take a few minutes to explore the various Hive User View features.
Step 2.2: Define a Hive Table
Now that you are familiar with the Hive User View, let’s create the initial staging tables for the geolocation and trucks data. In this section we will learn how to use the Ambari Hive User View to create four tables: geolocation_stage, trucks_stage, geolocation and trucks. First we are going to create two tables to stage the data in its original csv text format, and then we will create two more tables where we optimize the storage with ORC. Here is a visual representation of the data flow:
- Copy-and-paste the following table DDL into the empty Worksheet of the Query Editor to define a new table named geolocation_stage:
-- Create table geolocation for staging initial load
CREATE TABLE geolocation_stage (truckid string, driverid string, event string, latitude double, longitude double, city string, state string, velocity bigint, event_ind bigint, idling_ind bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
- Click the green Execute button to run the command. If successful, you should see the Succeeded status in the Query Process Results section:
- Create a new Worksheet by clicking the blue New Worksheet button:
- Notice the tab of your new Worksheet is labeled Worksheet (1). Double-click on this tab to rename the label to trucks_stage:
- Copy-and-paste the following table DDL into your trucks_stage worksheet to define a new table named trucks_stage:
-- Create table trucks for staging initial load
CREATE TABLE trucks_stage (driverid string, truckid string, model string, jun13_miles bigint, jun13_gas bigint, may13_miles bigint, may13_gas bigint, apr13_miles bigint, apr13_gas bigint, mar13_miles bigint, mar13_gas bigint, feb13_miles bigint, feb13_gas bigint, jan13_miles bigint, jan13_gas bigint, dec12_miles bigint, dec12_gas bigint, nov12_miles bigint, nov12_gas bigint, oct12_miles bigint, oct12_gas bigint, sep12_miles bigint, sep12_gas bigint, aug12_miles bigint, aug12_gas bigint, jul12_miles bigint, jul12_gas bigint, jun12_miles bigint, jun12_gas bigint, may12_miles bigint, may12_gas bigint, apr12_miles bigint, apr12_gas bigint, mar12_miles bigint, mar12_gas bigint, feb12_miles bigint, feb12_gas bigint, jan12_miles bigint, jan12_gas bigint, dec11_miles bigint, dec11_gas bigint, nov11_miles bigint, nov11_gas bigint, oct11_miles bigint, oct11_gas bigint, sep11_miles bigint, sep11_gas bigint, aug11_miles bigint, aug11_gas bigint, jul11_miles bigint, jul11_gas bigint, jun11_miles bigint, jun11_gas bigint, may11_miles bigint, may11_gas bigint, apr11_miles bigint, apr11_gas bigint, mar11_miles bigint, mar11_gas bigint, feb11_miles bigint, feb11_gas bigint, jan11_miles bigint, jan11_gas bigint, dec10_miles bigint, dec10_gas bigint, nov10_miles bigint, nov10_gas bigint, oct10_miles bigint, oct10_gas bigint, sep10_miles bigint, sep10_gas bigint, aug10_miles bigint, aug10_gas bigint, jul10_miles bigint, jul10_gas bigint, jun10_miles bigint, jun10_gas bigint, may10_miles bigint, may10_gas bigint, apr10_miles bigint, apr10_gas bigint, mar10_miles bigint, mar10_gas bigint, feb10_miles bigint, feb10_gas bigint, jan10_miles bigint, jan10_gas bigint, dec09_miles bigint, dec09_gas bigint, nov09_miles bigint, nov09_gas bigint, oct09_miles bigint, oct09_gas bigint, sep09_miles bigint, sep09_gas bigint, aug09_miles bigint, aug09_gas bigint, jul09_miles bigint, jul09_gas bigint, jun09_miles bigint, jun09_gas bigint, may09_miles bigint, may09_gas bigint, apr09_miles bigint, apr09_gas bigint, mar09_miles bigint, mar09_gas bigint, feb09_miles bigint, feb09_gas bigint, jan09_miles bigint, jan09_gas bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
- Execute the query and make sure it runs successfully. Let’s review some aspects of the CREATE TABLE statements issued above. If you have a SQL background this statement should seem very familiar, except for the last 3 lines after the column definitions:
- The ROW FORMAT clause specifies each row is terminated by the new line character.
- The FIELDS TERMINATED BY clause specifies that the fields associated with the table (in our case, the two csv files) are to be delimited by a comma.
- The STORED AS clause specifies that the table will be stored in the TEXTFILE format.
For details on these clauses consult the Apache Hive Language Manual.
- To verify the tables were defined successfully, click the refresh icon in the Database Explorer. Under Databases, click the default database to expand the list of tables; the new tables should appear:
- Click on the trucks_stage table name to view its schema.
- Click on the Load sample data icon to generate and execute a SELECT statement that queries the table for 100 rows. Notice your two new tables are currently empty.
You can have multiple SQL statements within each editor worksheet, but each statement needs to be separated by a semicolon “;”. If you have multiple statements within a worksheet but you only want to run one of them, just highlight the statement you want to run and then click the Execute button.
A few additional commands to explore tables:
show tables; – Lists the tables created in the database by looking up the list of tables from the metadata stored in HCatalog.
describe table_name; – Provides a list of columns for a particular table (e.g. describe geolocation_stage;)
show create table table_name; – Provides the DDL to recreate a table (e.g. show create table geolocation_stage;)
- By default, when you create a table in Hive, a directory with the same name gets created in the /apps/hive/warehouse folder in HDFS. Using the Ambari Files User View, navigate to the /apps/hive/warehouse folder. You should see both a geolocation_stage and a trucks_stage directory:
The definition of a Hive table and its associated metadata (i.e., the directory the data is stored in, the file format, what Hive properties are set, etc.) are stored in the Hive metastore, which on the Sandbox is a MySQL database.
Step 2.3: Load Data into a Hive Table
- Let’s load some data into your two Hive tables. In this tutorial we are going to show you two different ways of populating a Hive table with data from our CSV files. One way will involve moving our data file into the correct hive directory, while the other method will involve us executing a simple Hive query to load the data.
The first way to populate a table is to put a file into the directory associated with the table. Using the Ambari Files User View, click on the Move icon next to the file /tmp/admin/data/geolocation.csv. (Clicking on Move is similar to “cut” in cut-and-paste.)
- After clicking on the Move arrow your screen should look like the following: Notice two things have changed:
- The file name geolocation.csv is grayed out.
- The icons associated with the operations on the file are removed. This indicates that the file is in a special state and is ready to be moved.
- Now navigate to the destination path
/apps/hive/warehouse/geolocation_stage. You might notice that as you navigate through the directories that the file is pinned at the top. Once you get to the appropriate directory click on the Paste icon to move the file:
- Go back to the Ambari Hive View and click on the Load sample data icon next to the geolocation_stage table. Notice the table is no longer empty, and you should see the first 100 rows of the table:
- Now we’re going to show you the second way to load the data using a simple Hive query. Enter the following SQL command into an empty Worksheet in the Ambari Hive User View:
LOAD DATA INPATH '/tmp/admin/data/trucks.csv' OVERWRITE INTO TABLE trucks_stage;
This query tells Hive to take the data file at the path /tmp/admin/data/trucks.csv and move it into the trucks_stage table, which already has all of its columns defined.
- You should now see data in the trucks_stage table:
- From the Files view, navigate to the /tmp/admin/data folder. Notice the folder is empty! The LOAD DATA INPATH command moved the trucks.csv file from the /tmp/admin/data folder to the /apps/hive/warehouse/trucks_stage folder.
- Lastly, we need to make Hive skip the header row of each file that was loaded into the tables. To do this we just need to set a single table property on each table.
ALTER TABLE trucks_stage SET TBLPROPERTIES ("skip.header.line.count"="1");
ALTER TABLE geolocation_stage SET TBLPROPERTIES ("skip.header.line.count"="1");
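As a quick optional check, you can rerun a small query against one of the staging tables and confirm that the first row returned is data rather than the column names:
SELECT * FROM trucks_stage LIMIT 5;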
Now when querying these two tables, the header lines should no longer appear in the results!
Step 2.4: Define an ORC Table in Hive
Introducing Apache ORC
The Optimized Row Columnar (Apache ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. To use the ORC format, specify ORC as the file format when creating the table (CREATE TABLE ... STORED AS ORC). In this step, you will create two ORC tables (geolocation and trucks) that are created from the text data in your geolocation_stage and trucks_stage tables.
- From the Ambari Hive User View, execute the following table DDL to define two new tables named geolocation and trucks:
-- Create table geolocation as ORC from geolocation_stage table
CREATE TABLE geolocation STORED AS ORC AS SELECT * FROM geolocation_stage;
-- Create table trucks as ORC from trucks_stage table
CREATE TABLE trucks STORED AS ORC AS SELECT * FROM trucks_stage;
- Refresh the Database Explorer and verify you have tables named geolocation and trucks in the default database.
- View the contents of the geolocation table. Notice it contains the same rows as geolocation_stage.
- To verify geolocation is an ORC table, execute the following query:
describe formatted geolocation;
- Scroll down to the bottom of the Results tab and you will see a section labeled Storage Information. The output should look like:
If you want to try running some of these commands from the Hive shell, follow these steps from your terminal shell (e.g. PuTTY); a complete example session is sketched below:
- ssh into the Sandbox on port 2222 as root (the root password is hadoop)
- su hive
- Start the Hive shell, where you can enter commands and SQL
- Exit out of the Hive shell when you are done.
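Put together, a session might look roughly like this (a sketch; the sandbox host and the prompts will vary):
ssh root@<sandbox-host> -p 2222     # root password is hadoop
su hive
hive                                # starts the Hive shell
hive> show tables;
hive> describe formatted geolocation;
hive> quit;                         # exits the Hive shell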
Step 2.5: Review Hive Settings
- Open the Ambari Dashboard in another tab by right-clicking on the Ambari icon.
- Go to the Hive page, then select the Configs tab, then click on the Settings tab. Once you click on the Hive page you should see a page similar to the one above, showing:
- Hive Page
- Hive Configs Tab
- Hive Settings Tab
- Version History of Configuration
Scroll down to the Optimization Settings: In the above screenshot we can see:
- Tez is set as the optimization engine
- Cost Based Optimizer (CBO) is turned on
This shows the new HDP 2.3 Ambari Smart Configurations, which simplify setting configurations.
New in HDP 2.3
Hadoop is configured by a collection of XML files. In early versions of Hadoop, operators would need to edit XML to change settings. There was no default versioning. Early Ambari interfaces made it easier to change values by showing the settings page with dialog boxes for the various settings and allowing you to edit them. However, you needed to know what needed to go into the field and understand the range of values. Now with Smart Configurations you can toggle binary features and use the slider bars with settings that have ranges.
By default the key configurations are displayed on the first page. If the setting you are looking for is not on this page, you can find additional settings in the Advanced tab. For example, what if we wanted to improve SQL performance by using the new Hive vectorization features? Where would we find the setting, and how would we turn it on? You would need to do the following steps:
- Click on the Advanced tab and scroll to find the property
- Or, start typing the property name into the property search field; the list is filtered as you type.
As you can see from the green circle above, hive.vectorized.execution.enabled is already turned on.
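If you only want to experiment with this property for a single session rather than changing it cluster-wide through Ambari, one option is to set it from a Hive worksheet or the Hive shell. The property name below is the Hive setting discussed above; the sample query is purely illustrative:
SET hive.vectorized.execution.enabled = true;
SELECT truckid, count(*) FROM geolocation GROUP BY truckid;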
Step 2.6: Analyze the Trucks Data
Next we will be using Hive, Pig and Excel to analyze derived data from the geolocation and trucks tables. The business objective is to better understand the risk the company is under from driver fatigue, over-used trucks, and the impact of various trucking events on risk. In order to accomplish this we are going to apply a series of transformations to the source data, mostly through SQL, and use Pig to calculate risk. Later, we will use Microsoft Excel to generate a series of charts to better understand risk.
Let’s get started with the first transformation. We want to calculate the miles per gallon for each truck. We will start with our truck data table. We need to sum up all the miles and gas columns on a per-truck basis. Hive has a series of functions that can be used to reformat a table. The LATERAL VIEW keyword is how we invoke the table-generating stack function. The stack function allows us to restructure the data into 3 columns labeled rdate, miles and gas, with 54 rows (one for each month column pair). We pick truckid, driverid, rdate, miles and gas from our original table, add a calculated column for mpg (miles/gas), and then we will calculate the average mileage.
- Using the Ambari Hive User View, execute the query that creates the truck_mileage table from the existing trucks data. It is a CREATE TABLE AS SELECT that uses the LATERAL VIEW and stack pattern described above (sketched below):
-- Create table truck_mileage from existing trucking data
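Note that the full statement is not reproduced in this copy of the tutorial. The following is only a sketch of the pattern, assuming the trucks table was created as above; it is abbreviated to two of the 54 month column pairs, whereas the real query lists every <month>_miles / <month>_gas pair in the stack() call (with 54 as its first argument):
CREATE TABLE truck_mileage STORED AS ORC AS
SELECT truckid, driverid, rdate, miles, gas, miles / gas mpg
FROM trucks
LATERAL VIEW stack(2,
  'jun13', jun13_miles, jun13_gas,
  'may13', may13_miles, may13_gas
) dummyalias AS rdate, miles, gas;
Each stacked row carries the month label (rdate), the miles and the gas for that month, and the derived mpg column is simply miles divided by gas.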
- To view the data generated by the script, click the Load Sample Data icon in the Database Explorer next to truck_mileage. After clicking the next button once, you should see a table that lists each trip made by a truck and driver:
Use the Content Assist to build a query
- Create a new SQL Worksheet.
- Start typing in the SELECT SQL command, but only enter the first two letters:
- Press Ctrl+space to view the following content assist pop-up dialog window:
- Type in the following query, using Ctrl+space throughout your typing so that you can get an idea of what content assist can do and how it works:
SELECT truckid, avg(mpg) avgmpg FROM truck_mileage GROUP BY truckid;
- Click the Save as button to save the query as average mpg:
- Notice your query now shows up in the list of “Saved Queries”, which is one of the tabs at the top of the Hive User View.
- Execute the average mpg query and view its results.
- Now let’s explore the various explain features to better understand the execution of a query: Text Explain, Visual Explain and Tez Explain. Click on the Explain button:
- Verify this added the EXPLAIN command at the beginning of the query:
- Execute the query. The results should look like the following:
- Click on STAGE PLANS: to view its output, which displays the flow of the resulting Tez job:
- To see the Visual Explain click on the Visual Explain icon on the right tabs. This is a much more readable summary of the explain plan:
- If you click on the TEZ tab on the right-hand column, you can see DAG details associated with the query.
- However, you can also view the DAG by going to the Ambari Tez User View. Select the “Tez View: Cluster Instance” User View from the list of User Views. Select the first DAG as it represents the last job that was executed.
- There are six tabs at the top right; please take a few minutes to explore them. Then click on the Graphical View tab and hover over one of the nodes with your cursor to get more details on the processing in that node.
- Go back to the Hive User View and save the query.
- To persist these results into a table, we use a fairly common pattern in Hive called Create Table As Select (CTAS). Paste the following script into a new Worksheet, then click the Execute button:
-- Create table avg_mileage from existing truck_mileage data
CREATE TABLE avg_mileage
STORED AS ORC
AS
SELECT truckid, avg(mpg) avgmpg
FROM truck_mileage
GROUP BY truckid;
- To view the data generated by the script, click the Load sample data icon in the Database Explorer next to avg_mileage. You see the table is now a list of the average mpg for each truck.
Lab 3: Pig – Risk Factor
Use Pig to compute Driver Risk Factor
In this lab you will be introduced to Apache Pig. In the earlier lab you learned how to load data into HDFS and then manipulate it using Hive. We are using the truck sensor data to better understand the risk associated with every driver. This section will teach you how to compute that risk using Apache Pig.
- Pig basics
- Step 3.1: Define Table schema
- Step 3.2: Create Pig Script
- Step 3.3: Quick Recap
- Step 3.4: Execute Pig Script on Tez
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
Pig works with data from many sources, including structured and unstructured data, and can store the results in the Hadoop Distributed File System.
Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
Step 3.1: Define table schema
Now we have refined the truck data to get the average mpg for each truck. The next task is to compute the risk factor for each driver, which is the total miles driven divided by the number of abnormal events. We can get the event information from the geolocation table.
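For example (illustrative numbers only), a driver with 11,000 total miles and 5 abnormal events would have a risk factor of 11,000 / 5 = 2,200.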
If we look at the truck_mileage table, we have the driverid and the number of miles for each trip. To get the total miles for each driver, we can group those records by driverid and then sum the miles.
- We will start by creating a table named drivermileage, built from a query of the columns we want from truck_mileage. The following query groups the records by driverid and sums the miles in the SELECT statement. Execute this query in a new Worksheet:
-- Create table drivermileage from existing truck_mileage data
CREATE TABLE DriverMileage STORED AS ORC AS SELECT driverid, sum(miles) totmiles FROM truck_mileage GROUP BY driverid;
- View the data generated by the script by clicking the Load sample data icon in the Database Explorer next to drivermileage. The results should look like:
- Next, you will use Pig to compute the risk factor of each driver. Before we can run the Pig code, one of the requirements of the HCatStorer() class is that the target table must already exist in Hive. The Pig code expects the following structure for a table named riskfactor. Execute the following DDL command:
-- Create the riskfactor table to hold the Pig results
CREATE TABLE riskfactor (driverid string, events bigint, totmiles bigint, riskfactor float)
STORED AS ORC;
- Verify the riskfactor table was created successfully. It will be empty now, but you will populate it from a Pig script. You are now ready to compute the risk factor using Pig. Let’s take a look at Pig and how to execute Pig scripts from within Ambari.
Step 3.2: Create Pig Script
In this tutorial we create and run a Pig script. We will use the Ambari Pig User View. Let’s get started…
For more information on Pig scripting, see the Apache Pig documentation.
a. Log in to Ambari Pig User Views
To get to the Ambari Pig User View, click on the User Views icon at top right and select Pig:
This will bring up the Ambari Pig User View interface. Your Pig View does not have any scripts to display, so it will look like the following:
On the left is a list of your scripts, and on the right is a composition box for writing scripts. A special feature of the interface is the Pig helper at the bottom. The Pig helper will provide us with templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions. At the very bottom are status areas that will show the results of our script and log files.
The following screenshot shows and describes the various components and features of the Pig User View:
b. Create a New Script
Let’s enter a Pig script. Click the New Script button in the upper-right corner of the view:
Name the script riskfactor.pig, then click the Create button:
c. Load Data in Pig using HCatalog
We are going to use HCatalog to load data into Pig. HCatalog allows us to share schema across tools and users within our Hadoop environment. It also allows us to factor out schema and location information from our queries and scripts and centralize them in a common repository. Since it is in HCatalog we can use the HCatLoader() function. Pig makes it easy by allowing us to give the table a name or alias and not have to worry about allocating space and defining the structure. We just have to worry about how we are processing the table.
- We can use the Pig helper at the bottom of the screen to give us a template for the line. Click on Pig helper -> HCatalog->load template
- The entry %TABLE% is highlighted in red for us. Type the name of the table which is geolocation.
- Remember to add the a = before the template. This saves the results into a. Note the '=' has to have a space before and after it.
- Our completed line of code will look like:
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
Copy-and-paste the above Pig code into the riskfactor.pig window.
- You can use the command DUMP %VAR%; if you want to view the data inside. Just replace %VAR% with the variable you wish to view.
d. Filter your data set
The next step is to select a subset of the records, so that we just have the records of drivers for which the event is not normal. To do this in Pig we use the Filter operator. We tell Pig to filter our table and keep all records where event != 'normal' and store this in b. With this one simple statement, Pig will look at each record in the table and filter out all the ones that do not meet our criteria.
- We can use Pig Help again by clicking on Pig helper->Relational Operators->FILTER template
- We can replace %VAR% with “a” (hint: tab jumps you to the next field)
- Our %COND% is "event != 'normal';" (note: single quotes are needed around normal, and don't forget the trailing semicolon)
- The complete line of code will look like:
b = filter a by event != 'normal';
Copy-and-paste the above Pig code into the riskfactor.pig window.
e. Iterate your data set
Now that we have the right set of records, we can iterate through them. We use the "foreach" operator on the grouped data to iterate through all the records. We would also like to know how many times a driver has a non-normal event associated with him. To achieve this we add '1' to every row in the data set.
- Pig helper ->Relational Operators->FOREACH template will get us the code
- Our %DATA% is b and the second %NEW_DATA% is "driverid, event, (int) '1' as occurance;"
- The complete line of code will look like:
c = foreach b generate driverid, event, (int) '1' as occurance;
Copy-and-paste the above Pig code into the riskfactor.pig window:
f. Calculate the total non-normal events for each driver
The GROUP statement is important because it groups the records by one or more relations. In this case we would like to group by driver id and iterate over each row again to sum the non-normal events.
- Pig helper ->Relational Operators->GROUP %VAR% BY %VAR% template will get us the code
- The first %VAR% takes "c" and the second %VAR% takes "driverid;"
- Complete line of code will look like:
d = group c by driverid;
Copy-and-paste the above Pig code into the riskfactor.pig window.
- Next, use the FOREACH statement again to sum the occurrences for each driver.
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g. Load drivermileage table and perform a join operation
In this section we will load the drivermileage table into Pig using HCatalog and perform a join operation on driverid. The resulting data set will give us the total miles and total non-normal events for each driver.
- Load drivermileage using HCatLoader():
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
- Pig helper ->Relational Operators->JOIN %VAR% BY template will get us the code
- Replace %VAR% with 'e' and after BY put 'driverid, g by driverid;'
- Complete line of code will look like:
h = join e by driverid, g by driverid;
Copy-and-paste the above two Pig codes into the riskfactor.pig window.
h. Compute Driver Risk factor
In this section we will associate a driver risk factor with every driver. The driver risk factor is calculated by dividing the total miles travelled by the number of non-normal event occurrences.
- We will use Foreach statement again to compute driver risk factor for each driver.
- Use the following code and paste it into your Pig script.
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
- As a final step, store the data into a table using HCatalog:
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
Here is the final code and what it will look like once you paste it into the editor.
-- Geolocation has data stored in ORC format
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
Save the file riskfactor.pig by clicking the Save button in the left-hand column.
Step 3.3: Quick Recap
Before we execute the code, let’s review the code again:
- The line a= loads the geolocation table from HCatalog.
- The line b= keeps only the rows where the event is not 'normal', filtering out the normal events.
- Then we add a column called occurrence and assign it a value of 1.
- We then group the records by driverid and sum up the occurrences for each driver.
- At this point we need the miles driven by each driver, so we load the table we created using Hive.
- To get our final result, we join by the driverid the count of events in e with the mileage data in g.
- Now it is simple to calculate the risk factor by dividing the total miles driven by the number of events.
You need to configure the Pig Editor to use HCatalog so that the Pig script can load the proper libraries. In the Pig arguments text box, enter -useHCatalog and click the Add button:
Please note that the argument -useHCatalog is case sensitive.
The Arguments section of the Pig View should now look like the following:
Step 3.4: Execute Pig Script on Tez
- You are now ready to execute the script. Check the Execute on Tez checkbox and then hit the blue Execute button to submit the job. The Pig job will be submitted to the cluster. This will open a new tab showing the status of the running Pig job, with a progress bar at the top that shows the job status.
- Wait for the job to complete. The output of the job is displayed in the Results section. Your script does not output any result – it stores the result into a Hive table – so your Results section will be empty.
Click on the Logs twisty to see what happened when your script ran. This is where you will see any error messages. The log may scroll below the edge of your window so you may have to scroll down.
- Go back to the Ambari Hive User View and browse the data in the riskfactor table to verify that your Pig job successfully populated this table. Here is what it should look like:
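If you prefer to check from a SQL worksheet rather than the table browser, a simple illustrative query such as the following should return the computed rows:
SELECT * FROM riskfactor LIMIT 10;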
At this point we now have our truck miles per gallon table and our risk factor table. The next step is to pull this data into Excel to create the charts for the visualization step.