Loading and Querying Data with Hadoop

Summary

The Hortonworks Sandbox is a self-contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components, as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.

In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.

Prerequisites:

  • Hortonworks Sandbox 2.3 (installed and running)

Overview

To load data into the Hortonworks sandbox, you will:

  • Download sample data to your computer.
  • Upload the data files into the sandbox.
  • View and refine the data in the sandbox.

Step 1: Download the Sample Data

You can download a set of sample data contained in a compressed (.zip) archive here:

RefineDemoData.zip

Save the sample data .zip file to your computer, then extract the archive and decompress Omniture.0.tsv.gz, users.tsv.gz, and products.tsv.gz.

Note: The extracted data files should have a .tsv file extension.
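
If you are working in a terminal, the extraction can also be scripted. A minimal sketch, assuming a Unix-like shell and that the archive unpacks its contents into the current directory:

    # Extract the downloaded archive
    unzip RefineDemoData.zip
    # Decompress the three gzipped TSV files; gunzip strips the .gz suffix,
    # leaving Omniture.0.tsv, users.tsv, and products.tsv
    gunzip Omniture.0.tsv.gz users.tsv.gz products.tsv.gz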

Step 2: Upload the Data Files into the Sandbox

Select the HDFS Files view from the Off-canvas menu at the top. The HDFS Files view allows you to browse the Hortonworks Data Platform (HDP) file store, which is separate from the local file system.

We navigate to /tmp and create an admin folder.

Then we right-click on the admin folder and select Permissions.

We check the Write boxes, select Modify recursively, and press Save.

Verify that the permissions now allow writing to the folder.

Next, we navigate to /tmp/admin, click Upload, and browse to Omniture.0.tsv to upload it.

Repeat this procedure for the users.tsv and products.tsv files.
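
If you prefer the command line to the web UI, the same folder creation, permission change, and upload can be done with the HDFS shell. This is only a sketch: it assumes the three .tsv files have already been copied onto the sandbox VM (for example with scp) and that your user is allowed to run these commands.

    # Create the target directory in HDFS and open up its permissions
    # (this mirrors the Permissions step performed in the UI above)
    hadoop fs -mkdir -p /tmp/admin
    hadoop fs -chmod -R 777 /tmp/admin

    # Copy the extracted files from the local file system into HDFS
    hadoop fs -copyFromLocal Omniture.0.tsv users.tsv products.tsv /tmp/admin/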

Step 3: Create Hive tables

Let’s open the Hive View by clicking on the Hive button in the views menu, and create the users, products, and omniturelogs tables.

    create table users (swid STRING, birth_dt STRING, gender_cd CHAR(1))
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

    create table products (url STRING, category STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

    create table omniturelogs (
        col_1 STRING, col_2 STRING, col_3 STRING, col_4 STRING, col_5 STRING, col_6 STRING,
        col_7 STRING, col_8 STRING, col_9 STRING, col_10 STRING, col_11 STRING, col_12 STRING,
        col_13 STRING, col_14 STRING, col_15 STRING, col_16 STRING, col_17 STRING, col_18 STRING,
        col_19 STRING, col_20 STRING, col_21 STRING, col_22 STRING, col_23 STRING, col_24 STRING,
        col_25 STRING, col_26 STRING, col_27 STRING, col_28 STRING, col_29 STRING, col_30 STRING,
        col_31 STRING, col_32 STRING, col_33 STRING, col_34 STRING, col_35 STRING, col_36 STRING,
        col_37 STRING, col_38 STRING, col_39 STRING, col_40 STRING, col_41 STRING, col_42 STRING,
        col_43 STRING, col_44 STRING, col_45 STRING, col_46 STRING, col_47 STRING, col_48 STRING,
        col_49 STRING, col_50 STRING, col_51 STRING, col_52 STRING, col_53 STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

Step 4: Load data into new tables

To load the data into the tables, we execute the following statements.

    LOAD DATA INPATH '/tmp/admin/products.tsv' OVERWRITE INTO TABLE products;
    LOAD DATA INPATH '/tmp/admin/users.tsv' OVERWRITE INTO TABLE users;
    LOAD DATA INPATH '/tmp/admin/Omniture.0.tsv' OVERWRITE INTO TABLE omniturelogs;

To check that the data was loaded, click on the icon next to each table name in the Database Explorer; this executes a sample query against the table.
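
You can also run an equivalent check yourself from the Query Editor, for example (the row limit is arbitrary):

    select * from users limit 10;
    select count(*) from omniturelogs;

Note that the count query launches a MapReduce job, so it takes noticeably longer than the simple select.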



Step 5: View and Refine the Data in the Sandbox

In the previous section, we created sandbox tables from uploaded data files. Now let’s take a closer look at that data.

Here’s a summary of the data we’re working with:

omniturelogs – website logs containing information such as URL, timestamp, IP address, geocoded IP, and session ID.


users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.

products – CMS data that maps product categories to website URLs.

Now let’s use a Hive script to generate an “omniture” view that contains a subset of the data in the Omniture log table.

        CREATE VIEW omniture AS 
        SELECT col_2 ts, col_8 ip, col_13 url, col_14 swid, col_50 city, col_51 country, col_53 state 
        FROM omniturelogs 

Click Save as. On the “Saving item” pop-up, type “omniture” in the box, then click OK.

You can see your saved query now by clicking on the “Saved Queries” tab at the top.

Click Execute to run the script.

To view the data generated by the saved script, click on the icon next to the view’s name in the Database Explorer.
The query results will appear, and you can see that they include the data from the omniturelogs table specified in the query.
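
You can also query the view directly from the Query Editor, for example with a quick sample (the row limit is arbitrary):

    select ts, ip, url, city, country, state from omniture limit 10;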

Finally, we’ll create a script that joins the omniture website log data to the CRM data (registered users) and CMS data (products). Click Query Editor, then paste the following text in the Query box:

    create table webloganalytics as 
    select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, 
    o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), 
    from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365  AS INT) age, u.gender_cd
    from omniture o 
    inner join products p     
    on o.url = p.url 
    left outer join users u 
    on o.swid = concat('{', u.swid , '}')

Save this script as “webloganalytics” and execute it.

You can view the data generated by the script as described in the preceding steps.
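
For example, a quick sample of the joined table from the Query Editor (again, the row limit is arbitrary):

    select logdate, url, city, state, country, category, age, gender_cd
    from webloganalytics
    limit 10;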

Now that you have loaded data into the Hortonworks Data Platform, you can use Business Intelligence (BI) applications such as Microsoft Excel to access and analyze it.

Comments

Ted Kahn
|
December 26, 2013 at 12:55 pm
|

So, I went through this exercise, and it seems everything is pretty standard SQL and as such could be done with Oracle, for example. I suppose that is your main point: that while this was done with about 100,000 log entries, at some point even Oracle will not be able to process the data effectively. My thinking is now turning to how one manages this type of problem as an ongoing process. That is, data are coming in continually; one reasonable use case has these new logs updating dashboards in near realtime. But my understanding is that updating or even appending HDFS files is difficult, and in any case you would not want to reprocess existing data. Thus, a time period would need to be determined and this process run for each period with its outputs updating the dashboard. However, we are now back to a much smaller problem for which Oracle etc. might work just fine. Anyway, a train of thought from a novice. -ted

|
January 7, 2014 at 11:33 am
|

Click Query Editor, then paste the following text in the Query box:

create table webloganalytics as select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365 AS INT) age, u.gender_cd gender from omniture o inner join products p on o.url = p.url left outer join users u on o.swid = concat('{', u.swid , '}')

Respected Sir, up to here I have successfully completed the process, but when executing the “create webloganalytics” query I am getting an error.
Kindly help me in running the above query.
–TR RAO

S R
|
April 4, 2014 at 1:18 pm
|

Good article and clear explanation. It would be wonderful if you could also explain a little about how the same tutorial can be accomplished with a Windows single-node cluster (without using the Sandbox). For example, how can the TSV file be examined, default column values assigned, and the TSV header row skipped?

Thanks, Satya Raju

Rajesh Toleti
|
July 14, 2014 at 2:56 am
|

Excellent article. There are slight changes in the screens for Sandbox 2.1. It would have been good if you had mentioned that MapReduce jobs will run while executing queries in Beeswax.

Marcin Jimenez
|
July 28, 2014 at 1:49 pm
|

The query created in the last step is wrong if, as the tutorial earlier asked, you cleared the column names and did not define them for the products and users tables. Here is the correct script for that case, with only the omniture table’s columns defined:

create table webloganalytics as select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, o.country, p.col_2, CAST(datediff( from_unixtime( unix_timestamp() ), from_unixtime( unix_timestamp(u.col_2, 'dd-MMM-yy'))) / 365 AS INT) age, u.col_3 gender from omniture o inner join products p on o.url = p.col_1 left outer join users u on o.swid = concat('{', u.col_1 , '}')

swathi pai m
|
August 5, 2014 at 11:21 pm
|

I just have a question, sir:
are the data blocks encrypted while they are stored in HDFS?
If yes, which encryption method is used? If not, how is privacy managed?

regards,

    francis
    |
    September 11, 2014 at 5:14 am
    |

    The data is stored in HDFS in gzip format; if you want faster retrieval you can opt for the Snappy codec.

Peter
|
August 13, 2014 at 4:22 pm
|

When I execute the ‘omniture’ query I get nothing selected. Any tips on how to correct this? I copied the text exactly.

    Niall
    |
    January 29, 2015 at 4:42 am
    |

    Getting the same “No data available” message. The SELECT query works fine on its own and returns data as expected. However the “CREATE VIEW omniture AS” fails to populate the view. The log file is decidedly blank. Any ideas out there?

Sean Franks
|
August 17, 2014 at 12:13 pm
|

Great run-through and examples of scripts. I just want to emphasize to all that if you clear column headings in Products and Users, the webloganalytics HiveQL script will fail on field names. I made that mistake and had to recreate those files. Wasn’t going to take a chance and try to ALTER the tables after having already screwed up one thing.

Saurabh
|
October 12, 2014 at 11:30 am
|

@Peter: Please create a table instead of a view, and make sure the column names correspond to the values in the select query when you create the table or view. In my case the query was like this:
CREATE table omniture
AS SELECT
col_2 ts,
col_8 ip,
col_13 url,
col_14 swid,
col_50 city,
col_51 country,
col_53 state
from omniturelogs

Pushkar
|
October 12, 2014 at 7:43 pm
|

When I ran the last query to join the omniture, CMS & CRM data, I got the following error in Hive:
Error occurred executing hive query: OK FAILED: SemanticException [Error 10002]: Line 1:325 Invalid column reference 'url'

Gabriela
|
October 14, 2014 at 2:04 pm
|

How can I upload a file from my host (Windows) system into HDFS on the sandbox? I am looking for something like what the “Upload file” above does, but using the command line (hadoop fs…), not the web-based GUI.

    Mike Y
    |
    October 28, 2014 at 2:57 pm
    |

    You can transfer files from the host machine to the virtual machine using either the scp command in PuTTY or an scp-based client such as WinSCP. Once the files are transferred, you need to move them into HDFS; this can be done manually using hadoop fs -copyFromLocal [filename] [targetHDFSdirectory] (note that the target HDFS directory must exist; if it doesn’t, you need to create it before copying the files). Once this is done, the files will be accessible from the Hadoop ecosystem and can be used to create a table, etc.

    pradeep
    |
    November 2, 2014 at 10:47 pm
    |

    Download an HDFS explorer and configure your hosts file with the sandbox IP as below.
    In my case, browse to C:\Windows\System32\Drivers\etc and add:
    127.0.0.1 sandbox.hortonworks.com

    To edit the hosts file you need to open Notepad with admin privileges.

Prasanna
|
December 2, 2014 at 6:53 am
|

Excellent article and clearly explained. Any chance to add an ALTER TABLE option instead of re-creating the table?

We need to retrieve a limited number of records, like rownum < 10…

Regards,
Prasanna K

Ashish
|
December 4, 2014 at 7:20 am
|

Nice information! But the UI looks very old; the latest Hue release looks much better.

bhupendra
|
January 12, 2015 at 11:48 pm
|

It is mentioned in the tutorial that we can connect MS Excel to view the webloganalytics table. Could you please help with how we could connect to Excel and view this data?

That would be of great help! Thanks.

|
May 22, 2015 at 5:14 pm
|

This is an awesome tutorial. I am a beginner to Hadoop and have been struggling to visualize the workflow. This tutorial helped me get a clear view of things.

Anoop
|
September 2, 2015 at 7:43 pm
|

Really good tutorial for beginners; I enjoyed going through this.

wenr70
|
November 5, 2015 at 6:07 am
|

These tutorials are indeed helpful. I appreciate how you wrote them.

TJ
|
November 21, 2015 at 12:25 pm
|

I have run the following code on this page to create a Hive table:

create table users (swid STRING, birth_dt STRING, gender_cd CHAR(1))
ROW FORMAT DELIMITED
FIELDS TERMINATED by '\t'
stored as textfile
tblproperties ("skip.header.line.count"="1");

and am receiving the following two errors:

1.
HiveServer2 Process
Connection failed on host sandbox.hortonworks.com:100
2.
Hive Metastore Process
Metastore on sandbox.hortonworks.com failed (Execution o

Please advise

Thanks

    TJ
    |
    November 21, 2015 at 12:38 pm
    |

    I had to restart both Hive servers for it to work.

Mani
|
January 15, 2016 at 8:52 am
|

Awesome
