How to Load Data for Hadoop into the Hortonworks Sandbox

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download the sandbox to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0.

Summary

This tutorial describes how to load data into the Hortonworks sandbox.

The Hortonworks sandbox is a fully contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.

In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.

Prerequisites:

  • Hortonworks ODBC driver installed and configured
  • Hortonworks Sandbox v2.0 (installed and running)

Overview

To load data into the Hortonworks sandbox, you will:

  • Download sample data to your computer.
  • Upload the data files into the sandbox.
  • View and refine the data in the sandbox.

Step 1: Download the Sample Data

  • You can download a set of sample data contained in a compressed (.zip) folder here: RefineDemoData.zip

  • Save the sample data .zip file to your computer, then extract the files.

Note: The extracted data files should have a .tsv.gz file extension. If you download the files using a browser, the file extension may change to .tsv.gz.tsv. If this happens, change the file extensions back to .tsv.gz before uploading the files into the sandbox.

Step 2: Upload the Data Files into the Sandbox

  • In the Hortonworks Sandbox, click the HCat icon to open the HCatalog page, then click Create a new table from a file.
  • On the “Create a new table from a file” page, type “omniturelogs” in the Table Name box, then click Choose a file.
  • On the “Choose a file” pop-up, click Upload a file to display a file browser.
  • Use the file browser to select the Omniture.0.tsv.gz file, then click Open.
  • After you upload the Omniture.0.tsv.gz file, it appears in the list on the “Choose a file” pop-up. Click the file name to select it.
  • The “Create a new table from a file” page is redisplayed with Omniture.0.tsv.gz in the Input File box.
  • Clear the check box next to “Read column headers” to create the table using default column names (col_1, col_2, etc.).
  • Scroll down to the bottom of the page and click Create Table. A progress indicator will appear at the top of the page while the table is being created.
  • After the table is created, the “omniturelogs” table appears in the list on the HCatalog page.
  • Repeat this procedure for the users.tsv.gz file using the table name “users,” and for the products.tsv.gz using the table name “products.” When creating these tables, leave the “Read column headers” check box selected to create the tables using the first row as the column names.
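
Behind the scenes, the HCatalog wizard generates a Hive table definition over each uploaded file. For readers who prefer the Hive Query Editor, here is a minimal sketch of a roughly equivalent definition for the “users” table. It is illustrative only: the column names and types are assumptions inferred from the fields referenced later in this tutorial (swid, birth_dt, gender_cd), and header-row handling is left to the wizard.

    -- Hypothetical sketch only; skip this if you created the table with the HCatalog wizard.
    -- Column names/types are assumptions based on the fields used later in the join query.
    CREATE TABLE users (
      swid      STRING,
      birth_dt  STRING,
      gender_cd STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'   -- the sample files are tab-separated (.tsv)
    STORED AS TEXTFILE;
    -- On newer Hive versions, TBLPROPERTIES ("skip.header.line.count"="1") can skip the header row.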

Step 3: View and Refine the Data in the Sandbox

In the previous section, we created sandbox tables from uploaded data files. Now let’s take a closer look at that data.

Here’s a summary of the data we’re working with:

  • omniturelogs – website logs containing information such as URL, timestamp, IP address, geocoded IP, and session ID.
  • users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.
  • products – CMS data that maps product categories to website URLs.

Let’s start by looking at the raw omniture data. Click the File Browser icon in the toolbar at the top of the page, then select the Omniture.0.tsv.gz file.

  • The raw data file appears in the File Browser, and you can see that it contains information such as URL, timestamp, IP address, geocoded IP, and session ID.
  • Now let’s look at the “users” table using HCatalog. Click the HCat icon in the toolbar at the top of the page, then click Browse Data in the “users” row.
  • The “users” table appears, and you can see the SWID, birth date, and gender columns.
  • You can also use HCatalog to view the data in the “products” table, which maps product categories to website URLs.
  • Now let’s use a Hive script to generate an “omniture” view that contains a subset of the data in the Omniture log table. Click the Beeswax (Hive UI) icon in the toolbar at the top of the page to display the Query Editor, then paste the following text in the Query box:

    CREATE VIEW omniture AS SELECT col_2 ts, col_8 ip, col_13 url, col_14 swid, col_50 city, col_51 country, col_53 state from omniturelogs

  • Click Save as. On the “Choose a Name” pop-up, type “omniture” in the Name box, then click Save.
  • Click Execute to run the script.
  • To view the data generated by the saved script, click Tables in the menu at the top of the page, select the checkbox next to “omniture,” then click Browse Data.
  • The query results will appear, and you can see that the results include the columns from the omniturelogs table that were selected in the query.
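
To confirm that the view was created and returns data, you can also run a quick check from the Query Editor. This is only a sanity check, not a step in the tutorial; the LIMIT clause keeps the result set small:

SELECT ts, ip, url, city, country, state FROM omniture LIMIT 10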

Finally, we’ll create a script that joins the omniture website log data to the CRM data (registered users) and CMS data (products). Click Query Editor, then paste the following text in the Query box:

create table webloganalytics as select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365  AS INT) age, u.gender_cd gender from omniture o inner join products p on o.url = p.url left outer join users u on o.swid = concat('{', u.swid , '}')

  • Save this script as “webloganalytics” and execute the script. You can view the data generated by the script as described in the preceding steps.
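
Similarly, you can preview the joined table directly from the Query Editor before moving on. The column list below simply repeats the aliases defined in the webloganalytics query above:

SELECT logdate, url, ip, city, state, country, category, age, gender FROM webloganalytics LIMIT 10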

Now that you have loaded data into the Hortonworks Data Platform, you can use Business Intelligence (BI) applications such as Microsoft Excel to access and analyze the data.

Comments

Ted Kahn | December 26, 2013 at 12:55 pm

So, I went through this exercise, and it seems everything is pretty standard SQL and as such could be done with Oracle, for example. I suppose that is your main point: that while this was done with about 100,000 log entries, at some point even Oracle will not be able to process the data effectively. My thinking is now turning to how one manages this type of problem as an ongoing process. That is, data are coming in continually; one reasonable use case has these new logs updating dashboards in near real time. But my understanding is that updating or even appending HDFS files is difficult, and in any case you would not want to reprocess existing data. Thus, a time period would need to be determined and this process run for each period, with its outputs updating the dashboard. However, we are now back to a much smaller problem for which Oracle etc. might work just fine. Anyway, a train of thought from a novice. -ted

January 7, 2014 at 11:33 am

Click Query Editor, then paste the following text in the Query box:

create table webloganalytics as select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365 AS INT) age, u.gender_cd gender from omniture o inner join products p on o.url = p.url left outer join users u on o.swid = concat('{', u.swid , '}')

Respected Sir, up to here I successfully completed the process. But when executing the “create webloganalytics” query I am getting an error.
Kindly help me in running the above query.
– TR RAO

S R | April 4, 2014 at 1:18 pm

Good article and clear explanation. It would be wonderful if you could also explain a little bit about how the same tutorial can be accomplished with a Windows single-node cluster (without using the Sandbox). For example, how the TSV file can be examined, default column names assigned, and the TSV header row ignored.

Thanks, Satya Raju

Rajesh Toleti | July 14, 2014 at 2:56 am

Excellent article. There are slight changes on the screens for Sandbox 2.1. It would have been good if you mentioned that MapReduce jobs will run while executing queries in Beeswax.

Marcin Jimenez | July 28, 2014 at 1:49 pm

The query created in the last step is wrong, as we were asked to clear the column names earlier in the tutorial and did not define column names for the products and users tables. Here is the correct script, with only the omniture table using named columns:

create table webloganalytics as select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, o.country, p.col_2, CAST(datediff( from_unixtime( unix_timestamp() ), from_unixtime( unix_timestamp(u.col_2, 'dd-MMM-yy'))) / 365 AS INT) age, u.col_3 gender from omniture o inner join products p on o.url = p.col_1 left outer join users u on o.swid = concat('{', u.col_1 , '}')

swathi pai m | August 5, 2014 at 11:21 pm

I just have a question, sir: are the divided data blocks encrypted while they are stored in HDFS?
If yes, which encryption method is used? If not, how is privacy managed?

regards,

    francis | September 11, 2014 at 5:14 am

    The data is stored in HDFS in gzip format; if you want faster retrieval you can opt for the Snappy codec.

Peter | August 13, 2014 at 4:22 pm

When I execute the ‘omniture’ query I am getting nothing selected. Any tips on how to correct this? I copied the text exactly.

    Niall | January 29, 2015 at 4:42 am

    Getting the same “No data available” message. The SELECT query works fine on its own and returns data as expected. However, the “CREATE VIEW omniture AS” fails to populate the view. The log file is decidedly blank. Any ideas out there?

Sean Franks | August 17, 2014 at 12:13 pm

Great run-through and examples of scripts. I just want to emphasize to all that if you clear column headings in Products and Users, the webloganalytics HiveQL script will fail on field names. I made that mistake and had to recreate those files. Wasn’t going to take a chance and try to ALTER the tables after having already screwed up one thing.

Saurabh | October 12, 2014 at 11:30 am

@Peter: Please create a table instead of a view, and make sure the column names correspond to the values in the SELECT query when you create the table or view. In my case the query was like this:
CREATE table omniture
AS SELECT
col_2 ts,
col_8 ip,
col_13 url,
col_14 swid,
col_50 city,
col_51 country,
col_53 state
from omniturelogs

Pushkar | October 12, 2014 at 7:43 pm

When I ran the last query to join the omniture, CMS, and CRM data, I got the following error in Hive:
Error occurred executing hive query: OK FAILED: SemanticException [Error 10002]: Line 1:325 Invalid column reference ‘url’

Gabriela | October 14, 2014 at 2:04 pm

How can I upload a file from my host (Windows) system into HDFS on the sandbox? I am looking for something like what the “Upload a file” step above does, but using the command line (hadoop fs…), not the web-based GUI.

    Mike Y | October 28, 2014 at 2:57 pm

    You can transfer files from the host machine to the virtual machine using either the scp command in PuTTY or an scp-based client such as WinSCP. Once the files are transferred, you need to move them into HDFS; this can be done manually using hadoop fs -copyFromLocal [filename] [targetHDFSdirectory] (note that targetHDFSdirectory must exist; if it doesn’t, you need to create it before copying the files). Once this is done, the files will be accessible from the Hadoop ecosystem and can be used to create a table, etc.

    pradeep | November 2, 2014 at 10:47 pm

    Download HDFS Explorer and configure your hosts file with the sandbox IP as below. In my case, I browsed to C:\Windows\System32\Drivers\etc and added:
    127.0.0.1 sandbox.hortonworks.com

    To edit the hosts file you need to open Notepad with admin privileges.

Prasanna | December 2, 2014 at 6:53 am

Excellent article and clearly explained. Any chance to add an ALTER TABLE option instead of re-creating the table?

We need to retrieve a limited number of records, like rownum<10…

Regards,
Prasanna K

Ashish | December 4, 2014 at 7:20 am

Nice information! But the UI looks very old; I see a much better one in the latest Hue release.

bhupendra | January 12, 2015 at 11:48 pm

It is mentioned in the tutorial that we can connect MS Excel to view the webloganalytics table. Could you please explain how we could connect to Excel and view this data?

That would be of great help! Thanks.

May 22, 2015 at 5:14 pm

This is an awesome tutorial. I am a beginner to Hadoop and was struggling to visualize the workflow. This tutorial helped me get a clear view of the whole thing.
