From Raw Data to Insight using HDP and Microsoft Business Intelligence

Overview

In this tutorial we will walk through the process of:

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model

Prerequisites:

  • A running HDP cluster; we will assume a local cluster like the Hortonworks Sandbox
  • Microsoft Excel with the Power Query, PowerPivot, and Power View add-ins
  • The ZIP file containing the NYSE stock ticker data used in this tutorial

The Data

The ZIP file you downloaded contains the NYSE stock ticker data in two folders (nyse_prices and nyse_dividends, the locations referenced by the Hive tables we create below). These folders contain numerous comma-separated value files representing end-of-day stock ticker and dividend data for all stock symbols listed on the NYSE over a period of 10 years.

Staging the data on HDFS

The first step is to stage the data in HDFS. If you are using the HDInsight Service or another cloud provider, you will have to stage the data on that provider's storage; for the Azure HDInsight Service, that is Azure Blob Storage.

For the rest of the tutorial we are going to assume we have a local cluster like the Hortonworks Sandbox to work with.

We will use Hue's File Browser to stage the data in this case. You could also use the hdfs command line, among other options, to achieve the same result.

[Screenshot: File Browser]

Let’s upload the zip file:

[Screenshot: Upload]

Select the zip file you downloaded:

[Screenshot: File Browser]

Once you upload the zip file, you should see a directory structure like below on HDFS:

[Screenshot: File Browser]

Note the paths of the directories; you will need them in the next step.

Creating a Hive schema on the raw data

We will use Hue's Beeswax UI to execute the DDL statements:

[Screenshot: Beeswax]

Use the DDL statement below to create the table 'price_data':

create external table price_data (
  stock_exchange string,
  symbol string,
  trade_date string,
  open float,
  high float,
  low float,
  close float,
  volume int,
  adj_close float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';
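Because this is an external table, Hive reads the CSV files in place at that HDFS location instead of copying them into its warehouse; dropping the table later would leave the raw files intact. As a quick sanity check of the schema you just created, you can run:

describe price_data;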

To verify that it succeeded, run a query:

select * from price_data where symbol = 'IBM';

You should see an output like this:

[Screenshot: Beeswax]

Then use the DDL statement below to create the table 'dividends_data':

create external table dividends_data (
  stock_exchange string,
  symbol string,
  trade_date string,
  dividend float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_dividends';

To verify that it succeeded, run a query:

select * from dividends_data where symbol = 'IBM';

Aggregating the stock and dividend data

Notice in the output of the query on the table price_data above that there are many records per stock symbol for every year.
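To see this concretely, you can count the rows per symbol and year (a quick check; it relies on the same year() parsing of the date string that the aggregation query below uses):

select symbol, year(trade_date), count(*)
from price_data
where symbol = 'IBM'
group by symbol, year(trade_date);

Each year should show one row per trading day, roughly 250 rows.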

We want to work with a table that summarizes the high, the low, the average closing price, and the total dividend for every year by stock symbol.

Here's the DDL statement to create the table 'yearly_aggregates':

create table yearly_aggregates (
  symbol string,
  year string,
  high float,
  low float,
  average_close float,
  total_dividends float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/stock_aggregates';

Now let’s populate the table with data using the following query:

insert overwrite table yearly_aggregates
select a.symbol, year(a.trade_date), max(a.high), min(a.low), avg(a.close), sum(b.dividend)
from price_data a
left outer join dividends_data b
  on (a.symbol = b.symbol and a.trade_date = b.trade_date)
group by a.symbol, year(a.trade_date);
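One side effect of the left outer join above: for symbol/year combinations with no dividend records, sum(b.dividend) is NULL, which Hive writes to the underlying text files as "\N". These are exactly the "\N" values we will clean up later in Power Query. If you would rather emit 0 instead, a variant of the same statement using coalesce (an optional tweak; the rest of the tutorial assumes the original query) would be:

insert overwrite table yearly_aggregates
select a.symbol, year(a.trade_date), max(a.high), min(a.low), avg(a.close),
  coalesce(sum(b.dividend), 0)
from price_data a
left outer join dividends_data b
  on (a.symbol = b.symbol and a.trade_date = b.trade_date)
group by a.symbol, year(a.trade_date);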

Let’s run a query to test the table:

select * from yearly_aggregates where symbol = 'IBM';

Notice in the output below that we now have just one record per year per symbol:

[Screenshot: Beeswax]
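You can also verify this in Hive itself: the following query, which looks for duplicate symbol/year pairs in the aggregate table, should return no rows:

select symbol, year, count(*)
from yearly_aggregates
group by symbol, year
having count(*) > 1;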

Consuming and shaping the resultset from the Hive query using Excel

We will use Excel for this step.

Open Excel with a blank workbook and select the Power Query tab on the ribbon:

[Screenshot: Excel]

Click on “From Other Sources” on the ribbon and select “From Hadoop File (HDFS)”:

[Screenshot: Excel]

Here you should enter the IP address or server name of your HDP cluster's NameNode:

[Screenshot: Excel]

If your cluster requires authentication, you will also need to enter your credentials. For the Sandbox, select the anonymous mode.
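Under the hood, Power Query reaches the cluster over WebHDFS, Hadoop's REST API, so the NameNode's HTTP endpoint must be reachable from your PC. One way to check connectivity (assuming the default NameNode HTTP port of 50070; adjust the host and port for your cluster) is to open a URL like the following in a browser and confirm you get back a JSON directory listing:

http://sandbox.hortonworks.com:50070/webhdfs/v1/user/hue/nyse?op=LISTSTATUS

If the Sandbox hostname does not resolve, see the hosts-file tips in the comments at the end of this tutorial.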

You will see a listing of all files on HDFS, which could be quite overwhelming:

[Screenshot: Excel]

Thankfully, Power Query has a nifty capability to filter the listing by clicking on the little triangle on the column name. In this case, filter the Folder Path column with the string 'aggregate':

[Screenshot: Excel]

Now click on the Binary link in the Content column:

[Screenshot: Excel]

Now you should see the comma-separated records all in one column, since Excel does not yet know how to parse them:

[Screenshot: Excel]

Right-click on the only column and select "Split Column" -> "By Delimiter":

[Screenshot: Excel]

Select “Comma” as the delimiter and hit “OK”:

[Screenshot: Excel]

Now you should see the data properly structured:

[Screenshot: Excel]

Let's rename the columns by right-clicking each column name and selecting "Rename", choosing names that make more sense:

[Screenshot: Excel]

Now scroll down and you will notice that some values in the "Dividend" column are "\N" (Hive's text representation of NULL, from years with no dividend records) instead of a number. Let's fix this by right-clicking the column and selecting "Replace Values":

[Screenshot: Excel]

Now the “Dividend” Column should be well-formed:

[Screenshot: Excel]

Let's change the data type of the High, Low, Average, and Dividend columns to Number by right-clicking each column and selecting "Change Type":

[Screenshot: Excel]

Once you are done changing the types, the number fields should appear italicized:

[Screenshot: Excel]

We are done shaping this data, so let's click "Done":

[Screenshot: Excel]

In the query settings on the right-hand pane you can disable the download of the data, since we will download the dataset only after we enhance it with additional attributes from Wikipedia.

[Screenshot: Excel]

Acquiring external data from the Internet using Search

Let's find out which stocks are in the S&P 500. Click on "Online Search" on the Power Query tab of the Excel worksheet:

[Screenshot: Excel]

By hovering over the search result list you can see previews of the datasets:

[Screenshot: Excel]

Click to bring in the dataset titled “S&P 500 Component stocks – list of S&P 500 companies”:

[Screenshot: Excel]

We have to shape this data before we use it. Click on the "Filter & Shape" button on the right-hand pane to bring up the filter dialog:

[Screenshot: Excel]

Remove the columns that are uninteresting for our analysis, such as "SEC filings", "GICS Sub Industry", "Date first added", and "key", by right-clicking on the column names and selecting "Remove Columns":

[Screenshot: Excel]

Click "Done":

[Screenshot: Excel]

Enhancing the resultset acquired from Hadoop with additional attributes using data from Wikipedia

In this section we want to limit the data to only the stocks in the S&P 500, and we also want additional attributes for each stock, such as the company name, industry, and location of the headquarters.

Go to the “Power Query” tab and select “Merge”:

[Screenshot: Excel]

Select the resultset you acquired from Hadoop as the first resultset, and the resultset you acquired from the Internet as the second. Also select "Symbol" and "Stock Symbol" as the columns to JOIN on, and click OK:

[Screenshot: Excel]

Now let's expand NewColumn into its component attributes by clicking on the little T-shaped icon beside the column name. Deselect "Ticker Symbol", as we already have that attribute:

[Screenshot: Excel]

You should see the new columns like below:

[Screenshot: Excel]

Rename the new columns to something like below:

[Screenshot: Excel]

Scroll down and you will notice that many rows have null values for the new columns.

[Screenshot: Excel]

These rows are null because those stock symbols are not part of the S&P 500, so we need to filter them out by right-clicking one of the null values and selecting "Text Filters":

[Screenshot: Excel]

Now the dataset is limited to just the symbols in the S&P 500:

[Screenshot: Excel]

Click “Done”:

[Screenshot: Excel]

To load it into the PowerPivot in-memory model, click on the "Load to data Model" link on the right-hand pane:

[Screenshot: Excel]

To confirm that it has been loaded into the PowerPivot model, click on the PowerPivot tab in Excel and select Manage:

[Screenshot: Excel]

You should see the following:

[Screenshot: Excel]

Close the PowerPivot Model.

Visualizing the combined model using Power View in Excel

Open the Insert tab of Excel and select Power View to insert a Power View sheet:

[Screenshot: Excel]

You will see a Power View canvas with the default model:

[Screenshot: Excel]

Deselect all the Power View fields except Dividend and Headquarters:

[Screenshot: Excel]

Then ensure the dividend is being averaged and not summed by clicking on the little triangle beside Dividend:

[Screenshot: Excel]

Now click on the Map button on the Power View ribbon and BOOM! You have an interactive map:

[Screenshot: Excel]

Resize and zoom the map control to your taste.

We just walked through an end-to-end data pipeline: we started with raw big data; cleaned, transformed, and aggregated it in the Hortonworks Data Platform (HDP); and then consumed it in Excel for modeling and visualization. The integration was so simple that we barely noticed it.

This was your favorite big data community (Apache Hadoop) at work. Hortonworks and others in the Apache community designed and implemented WebHDFS, an open REST API in Apache Hadoop. Microsoft used this API from Power Query for Excel to make the integration with the Microsoft Business Intelligence platform seamless.

Happy Hadooping!!!

@saptak

Comments

Bernd | October 14, 2014 at 8:53 am

Same for me: I added "192.168.56.101 sandbox.hortonworks.com" to the hosts file, but it doesn't work in Power Query. Something else I need to do?

stang | August 29, 2014 at 6:19 pm

Add the following line to the hosts file on your PC (C:\Windows\System32\drivers\etc):

127.0.0.1 sandbox.hortonworks.com

Then use "sandbox.hortonworks.com" as your Hadoop server name in Excel Power Query, and you will be fine.

Varun | July 15, 2014 at 9:13 am

I am getting a constant error of "cannot connect to remote server 127.0.0.1". I tried updating the hosts file in Windows 8.1 as well, with:

192.168.11.5 sandbox.hortonworks.com

