How to Capitalize on Clickstream Data with Hadoop

In the last 60 seconds there were 1,300 new mobile users and 100,000 new tweets. While you contemplated what happens in an internet minute, Amazon brought in $83,000 worth of sales. What would be the impact if you could identify:

  • What is the most efficient path for a site visitor to research a product, and then buy it?
  • What products do visitors tend to buy together, and what are they most likely to buy in the future?
  • Where should I spend resources on fixing or enhancing the user experience on my website?

In the Hortonworks Sandbox, you can run a simulation of website Clickstream behavior to see where users are located and what they are doing on the website. This tutorial provides a dataset from a fictitious website that records the behavior of its visitors over a five-day period. The dataset has 4 million lines and is easily ingested into the single-node cluster of the Sandbox via HCatalog.


In this tutorial, you’ll also learn how to combine datasets. Once you have the Clickstream data in the Sandbox, you’ll combine it with the two other datasets provided: User data and Product data. Hive makes this combination straightforward.
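To make the idea of combining the three datasets concrete, here is a minimal sketch of the join. It uses Python's built-in sqlite3 module as a stand-in for Hive so that it runs anywhere; it is not the tutorial's actual query. The table names, column names, and sample rows are all invented for illustration, and the real Sandbox datasets may be laid out differently.

```python
import sqlite3

# Hypothetical schema and sample rows; the tutorial's real tables may differ.
def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clickstream (ts TEXT, user_id INTEGER, url TEXT);
        CREATE TABLE users (user_id INTEGER, state TEXT);
        CREATE TABLE products (url TEXT, category TEXT);
        INSERT INTO clickstream VALUES
            ('2014-01-01 10:00:00', 1, '/shoes/runner'),
            ('2014-01-01 10:01:00', 2, '/bags/tote'),
            ('2014-01-01 10:02:00', 1, '/bags/tote');
        INSERT INTO users VALUES (1, 'CA'), (2, 'NY');
        INSERT INTO products VALUES
            ('/shoes/runner', 'shoes'),
            ('/bags/tote', 'bags');
    """)
    return conn

# Enrich each click with the visitor's location and the product category.
JOIN_SQL = """
    SELECT c.ts, c.user_id, u.state, p.category
    FROM clickstream c
    JOIN users u ON c.user_id = u.user_id
    JOIN products p ON c.url = p.url
    ORDER BY c.ts
"""

def enriched_clicks(conn):
    """Return one row per click: (timestamp, user id, state, category)."""
    return conn.execute(JOIN_SQL).fetchall()

if __name__ == "__main__":
    for row in enriched_clicks(build_demo_db()):
        print(row)
```

The query is an ordinary equality join on a user key and a URL key. A Hive query over the real tables would likely take a similar shape, though the exact keys depend on how the datasets are defined.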


Once you have these combined datasets, you can use a visualization tool to see where the customers are and what products they are looking at. In this tutorial, we show you how to do this in Excel, but you could just as easily use Tableau, Alteryx, or an open source tool like BIRT.
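Before the combined data reaches a visualization tool, it usually helps to summarize it. The sketch below counts page views per location and product category and writes the result as CSV text, a format that Excel and similar tools can open. The field names and sample rows are hypothetical, and this is only one plausible way to prepare the data, not the tutorial's own procedure.

```python
import csv
import io
from collections import Counter

def count_views(rows):
    """Count page views per (state, category) pair."""
    return Counter((state, category) for state, category in rows)

def to_csv(counts):
    """Render the counts as CSV text with a header row, sorted by key."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["state", "category", "views"])
    for (state, category), views in sorted(counts.items()):
        writer.writerow([state, category, views])
    return buf.getvalue()

if __name__ == "__main__":
    # Invented sample of (state, category) pairs, one per page view.
    sample = [("CA", "shoes"), ("CA", "bags"), ("NY", "bags"), ("CA", "bags")]
    print(to_csv(count_views(sample)), end="")
```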


Once you’ve completed the tutorial, you can add your own datasets to see how your own customers move through your website and start capitalizing on each minute you have.

Don’t have the Sandbox? Download it here. Find more use cases for big data analytics here.



Cheryle Custer
January 6, 2014 at 9:12 am

You can log in as root to get to the command line interface to perform MapReduce instructions. See the Hortonworks Forums for assistance.

General set up:
Login info:
User name: root
Password: hadoop

