
Visualize Website Clickstream Data

Introduction

How to Visualize and Analyze Website Clickstream Data

Your home page looks great. But how do you move customers on to bigger things—like submitting a form or completing a purchase? Get more granular with customer segmentation. Hadoop makes it easier to analyze, visualize and ultimately change how visitors behave on your website.

This demo shows how an online retailer can optimize buying paths to reduce bounce rate and improve conversion.

In this tutorial, learn how to:

  • Upload website clickstream log files into HDFS
  • Use Hive to build a relational view of the data
  • Query and refine the data with Hive
  • Import the data into Microsoft Excel with the ODBC connector
  • Visualize the data with Power View

This demo can be completed with the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download the sandbox to run this and other tutorials in the series.

The Hortonworks sandbox is a fully contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.

In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.

Clickstream Data

Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.

These website log files contain data elements such as a date and time stamp, the visitor’s IP address, the destination URLs of the pages visited, and a user ID that uniquely identifies the website visitor.

Potential Uses of Clickstream Data

One of the original uses of Hadoop at Yahoo was to store and process their massive volume of clickstream data. Now enterprises of all types can use Hadoop and the Hortonworks Data Platform (HDP) to refine and analyze clickstream data. They can then answer business questions such as:

  • What is the most efficient path for a site visitor to research a product, and then buy it?
  • What products do visitors tend to buy together, and what are they most likely to buy in the future?
  • Where should I spend resources on fixing or enhancing the user experience on my website?

In this tutorial, we will focus on the “path optimization” use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?

Prerequisites:

  • Hortonworks ODBC driver (64-bit) installed and configured
  • Hortonworks sample data files uploaded and refined as described in “Loading Data into the Hortonworks Sandbox”
  • If you haven’t loaded this data yet, please download and import it by following this tutorial: http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/
  • Microsoft Excel 2013 Professional Plus 64-bit
  • Windows 7 or later (optional, to run Microsoft Excel 2013 Professional Plus)
  • Note: this tutorial can still be completed with any version of Excel, but your visualizations will be limited to the built-in charts. You may wish to attempt it with another visualization tool that accepts data via an ODBC connection, such as Tableau or Lumira.

There are two options for setting up the Hortonworks Sandbox:

  1. Download & Install Hortonworks Sandbox on your local machine (recommended 8GB of dedicated RAM for the Virtual Machine)
  2. Deploy Hortonworks Sandbox on Microsoft Azure

Overview

To analyze and visualize website clickstream data, we will:

  • Use Hortonworks to view and refine website clickstream data.
  • Access the clickstream data with Excel.
  • Visualize the clickstream data using Excel Power View.




Lab 1: Loading Data into HDFS and Hive

Prerequisites:

  • Hortonworks Sandbox 2.4 (installed and running)

Overview

To load data into the Hortonworks sandbox, you will:

  • Download sample data to your computer.
  • Upload the data files into the sandbox.
  • View and refine the data in the sandbox.

Summary

This lab describes how to upload clickstream data files into the sandbox’s HDFS file store and load them into Hive tables for refinement.



Step 1: Download the Sample Data

You can download a set of sample data contained in a compressed (.zip) folder here:

RefineDemoData.zip

Save the sample data .zip file to your computer, then extract the files and unzip Omniture.0.tsv.gz, users.tsv.gz and products.tsv.gz.

Note: The extracted data files should have a .tsv file extension.

Step 2: Upload the Data Files into the Sandbox

First, log in to the Ambari interface at http://localhost:8080 with the username maria_dev and the password maria_dev.

Select the HDFS Files view from the off-canvas menu at the top. The HDFS Files view allows you to browse the Hortonworks Data Platform (HDP) file store. The HDP file system is separate from the local file system.

Navigate to /tmp and create a folder named maria.

Right-click on the maria folder and select Permissions.

Check the Write boxes, select Modify recursively, and press Save.

Verify that the new permissions have been applied.

Now navigate to /tmp/maria, click Upload, and browse to the Omniture.0.tsv file.

Repeat this procedure for the users.tsv and products.tsv files.
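
If you want to confirm the uploads without the Files view, Hive’s dfs passthrough can list the directory from the Hive shell on the sandbox (a minimal check, assuming the /tmp/maria path used above):

-- List the three uploaded .tsv files in HDFS
dfs -ls /tmp/maria/;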

Step 3: Create Hive tables

Let’s open the Hive View by clicking on the Hive button in the views menu, and create the tables users, products and omniturelogs.

create table users (swid STRING, birth_dt STRING, gender_cd CHAR(1))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

create table products (url STRING, category STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

create table omniturelogs (col_1 STRING,col_2 STRING,col_3 STRING,col_4 STRING,col_5 STRING,col_6 STRING,col_7 STRING,col_8 STRING,col_9 STRING,col_10 STRING,col_11 STRING,col_12 STRING,col_13 STRING,col_14 STRING,col_15 STRING,col_16 STRING,col_17 STRING,col_18 STRING,col_19 STRING,col_20 STRING,col_21 STRING,col_22 STRING,col_23 STRING,col_24 STRING,col_25 STRING,col_26 STRING,col_27 STRING,col_28 STRING,col_29 STRING,col_30 STRING,col_31 STRING,col_32 STRING,col_33 STRING,col_34 STRING,col_35 STRING,col_36 STRING,col_37 STRING,col_38 STRING,col_39 STRING,col_40 STRING,col_41 STRING,col_42 STRING,col_43 STRING,col_44 STRING,col_45 STRING,col_46 STRING,col_47 STRING,col_48 STRING,col_49 STRING,col_50 STRING,col_51 STRING,col_52 STRING,col_53 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

Step 4: Load data into new tables

To load the data into the tables, execute the following queries. Note that LOAD DATA INPATH moves the source files from /tmp/maria into Hive’s warehouse directory rather than copying them.

LOAD DATA INPATH '/tmp/maria/products.tsv' OVERWRITE INTO TABLE products;
LOAD DATA INPATH '/tmp/maria/users.tsv' OVERWRITE INTO TABLE users;
LOAD DATA INPATH '/tmp/maria/Omniture.0.tsv' OVERWRITE INTO TABLE omniturelogs;

To check that the data was loaded, click on the icon next to the table name; this executes a sample query.
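
You can also run a quick sanity check yourself with row counts (a minimal sketch; the exact counts depend on the sample data files):

-- Each table should return a non-zero row count after loading
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM products;
SELECT COUNT(*) FROM omniturelogs;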



Step 5: View and Refine the Data in the Sandbox

In the previous section, we created sandbox tables from uploaded data files. Now let’s take a closer look at that data.

Here’s a summary of the data we’re working with:

omniturelogs – website logs containing information such as URL, timestamp, IP address, geocoded IP, and session ID.


users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.

products – CMS data that maps product categories to website URLs.
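
Before refining the logs, it can help to sample a few raw rows from each table (a minimal sketch):

-- Peek at the raw data behind each table
SELECT * FROM omniturelogs LIMIT 5;
SELECT * FROM users LIMIT 5;
SELECT * FROM products LIMIT 5;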

Now let’s use a Hive script to generate an “omniture” view that contains a subset of the data in the Omniture log table.

CREATE VIEW omniture AS 
SELECT col_2 ts, col_8 ip, col_13 url, col_14 swid, col_50 city, col_51 country, col_53 state 
FROM omniturelogs 

Click Save as. On the “Saving item” pop-up, type “omniture” in the box, then click OK.

You can see your saved query now by clicking on the “Saved Queries” button at the top.

Click Execute to run the script.

To view the data generated by the saved script, click on the icon next to the view’s name in the Database Explorer.
The query results will appear, showing the omniturelogs columns that were selected in the view definition.
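
You can also preview the view directly with a short query (a minimal sketch, using the column aliases defined above):

-- Inspect a sample of the refined clickstream records
SELECT ts, ip, url, city, country, state FROM omniture LIMIT 10;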

Finally, we’ll create a script that joins the omniture website log data to the CRM data (registered users) and CMS data (products). Click Query Editor, then paste the following text in the Query box:

create table webloganalytics as 
select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, 
o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), 
from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365  AS INT) age, u.gender_cd
from omniture o 
inner join products p     
on o.url = p.url 
left outer join users u 
on o.swid = concat('{', u.swid , '}')

Save this script as “webloganalytics” and execute the script.

You can view the data generated by the script as described in the preceding steps.
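
For a quick spot-check of the joined result, a query like the following (against the webloganalytics table created above) shows the combined log, product, and user fields:

-- Sample the final analytics table
SELECT logdate, url, state, category, age, gender_cd FROM webloganalytics LIMIT 10;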

Now that you have loaded data into the Hortonworks Platform, you can use Business Intelligence (BI) applications such as Microsoft Excel to access and analyze the data.




Lab 2: Visualizing Clickstream Logs with Excel

In this section, you can choose to use Excel Professional Plus 2013 or Apache Zeppelin (built into the sandbox) to access the refined clickstream data.

Step 1: Connecting Microsoft Excel

  • In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.

  • On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.

  • After the connection to the sandbox is established, the Query Wizard appears. Select the webloganalytics table in the Available tables and columns box, then click the right arrow button to add the entire webloganalytics table to the query. Click Next to continue.

  • On the Filter Data screen, click Next to continue without filtering the data.

  • On the Sort Order screen, click Next to continue without setting a sort order.

  • Click Finish on the Query Wizard Finish screen to retrieve the query data from the sandbox and import it into Excel.

  • On the Import Data dialog box, click OK to accept the default settings and import the data as a table.

  • The imported query data appears in the Excel workbook.

Now that we have successfully imported Hortonworks Sandbox data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.


Step 2: Visualize the Website Clickstream Data Using Excel Power View

Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will:

  • Analyze the clickstream data by location
  • Filter the data by product category
  • Graph the website user data by age and gender
  • Pick a target customer segment
  • Identify a few web pages with the highest bounce rates

In the Excel workbook with the imported webloganalytics data, select Insert > Power View to open a new Power View report.

  • The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.

  • Let’s start by taking a look at the countries of origin of our website visitors. In the Power View Fields area, leave the country checkbox selected, and clear all of the other checkboxes. The data table will update to reflect the selections.

  • On the Design tab in the top menu, click Map.

  • The map view displays a global view of the data. Now let’s take a look at a count of IP addresses by state. First, drag the ip field into the SIZE box.

  • Drag country from the Power View Fields area into the Filters area, then select the usa checkbox.

  • Next, drag state into the LOCATIONS box. Remove the country field from the LOCATIONS box by clicking the down-arrow and then Remove Field.

  • Use the map controls to zoom in on the United States. Move the pointer over each state to display the IP count for that state.

  • Our dataset includes product data, so we can display the product categories viewed by website visitors in each state. To display product categories in the map by color, drag the category field into the COLOR box.

  • The map displays the product categories by color for each state. Move the pointer over each state to display detailed category information. We can see that the largest number of page hits in Florida were for clothing, followed by shoes.

  • Now let’s look at the clothing data by age and gender so we can optimize our content for these customers. Select Insert > Power View to open a new Power View report.

To set up the data, set the following fields and filters:

  • In the Power View Fields area, select ip and age. All of the other fields should be unselected.
  • Drag category from the Power View Fields area into the Filters area, then select the clothing checkbox.
  • Drag gender from the Power View Fields area into the Filters area, then select the M (male) checkbox.

After setting these fields and filters, select Column Chart > Clustered Column in the top menu.

  • To finish setting up the chart, drag age into the AXIS box. Also, remove ip from the AXIS box by clicking the down-arrow and then Remove Field. The chart shows that the majority of men shopping for clothing on our website are between the ages of 22 and 30. With this information, we can optimize our content for this market segment.

  • Let’s assume that our data includes information about website pages (URLs) with high bounce rates. A page is considered to have a high bounce rate if it is the last page a user visited before leaving the website. By filtering this URL data by our target age group, we can find out exactly which website pages we should optimize for this market segment. Select Insert > Power View to open a new Power View report.

To set up the data, set the following fields and filters:

  • Drag age from the Power View Fields area into the Filters area, then drag the sliders to set the age range from 22 to 30.
  • Drag gender from the Power View Fields area into the Filters area, then select the M (male) checkbox.
  • Drag country from the Power View Fields area into the Filters area, then select the usa checkbox.
  • In the Power View Fields area, select url. All of the other fields should be unselected.
  • In the Power View Fields area, move the pointer over url, click the down-arrow, and then select Add to Table as Count.

After setting these fields and filters, select Column Chart > Clustered Column in the top menu.

  • The chart shows that we should focus on optimizing four of our website pages for the market segment of men between the ages of 22 and 30. Now we can redesign these four pages and test the new designs based on our target demographic, thereby reducing the bounce rate and increasing customer retention and sales.

  • You can use the controls in the upper left corner of the chart to sort by Count of URL in ascending order.

Now that you have successfully analyzed and visualized Hortonworks Sandbox data with Microsoft Excel, you can see how Excel and other BI tools can be used with the Hortonworks platform to derive insights about customers from various data sources.

The data in the Hortonworks platform can be refreshed frequently and used for basket analysis, A/B testing, personalized product recommendations, and other sales optimization activities.


Analyzing Clickstream Data with Apache Zeppelin

If you don’t have access to Microsoft Excel Professional Plus, you can use Apache Zeppelin for your data visualization instead.

Open up Ambari and make sure Zeppelin is running; if not, start the service. Then use the dropdown menu to access the views and select Zeppelin.

Zeppelin View

Once the Zeppelin view is open, you can either create a new note and run the commands below, or import the notebook from this URL:

https://raw.githubusercontent.com/hortonworks/tutorials/hdp-2.4/data/zeppelin-notes/ClickstreamAnalytics.json

	%hive
	select state from webloganalytics

Zeppelin Chart 1

	%hive
	select age, gender_cd, category from webloganalytics where age is not NULL LIMIT 1000

Zeppelin Chart 2

	%hive
	select category from webloganalytics

Zeppelin Chart 3
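
To reproduce the target-segment URL analysis from the Excel lab in Zeppelin, a paragraph along these lines is a reasonable sketch (it assumes the webloganalytics table from Lab 1 and that country values are stored as 'usa', matching the Power View filter):

	%hive
	select url, count(*) as hits from webloganalytics where age between 22 and 30 and gender_cd = 'M' and country = 'usa' group by url order by hits desc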

Feedback

We are eager to hear your feedback on this tutorial. Please let us know what you think.


