From Raw Data to Insight using HDP and Microsoft Business Intelligence
In this tutorial we will walk through the process of
- cleaning and aggregating 10 years of raw stock ticker data from NYSE
- enriching the data model by looking up additional attributes from Wikipedia
- creating an interactive visualization on the model
- A working HDP cluster
- The NYSE stockticker data
- Microsoft Excel 2013 with PowerPivot and Power View enabled on a Windows system
- Microsoft Power Query for Excel
The Zip file you downloaded above containing the NYSE stock ticker data is comprised of two folders below:
These folders contain numerous comma separated value files representing end of day stock ticker and dividend data of all stock symbols listed at NYSE for a period of 10 years.
Staging the data on HDFS
The first step is to stage the data in HDFS. In case you are using HDInsight Service or another cloud provider, you will have to stage the data on the respective cloud storage. For example in the case of Azure HDInsight Service it is Azure Blob Storage.
For the rest of the tutorial we are going to assume we have a local cluster like the Hortonworks Sandbox to work with.
We will use “File Browser” of Hue to stage the data in this case. You could also use hdfs command line among other ways to achieve the same.
Let’s upload the zip file:
Select the zip file you downloaded:
Once you upload the zip file, you should see a directory structure like below on HDFS:
Note the path of the directories, you will need to use them in the next step.
Creating a Hive schema on the raw data:
We will use the Beeswax UI of Hue to execute the DDL queries:
Use the DDL statement below to create the table ‘price_data':
create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/nyse_prices';
To test it succeeded run a query:
select * from price_data where symbol = 'IBM';
You should see an output like this:
Then use the DDL statement below to create the table ‘dividends_data':
create external table dividends_data (stock_exchange string, symbol string, trade_date string, dividend float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/nyse_dividends';
To test it succeeded run a query:
select * from dividends_data where symbol = 'IBM';
Aggregating the stocks and dividend data
Notice in the output of the query on the table price_data above that there are many records for a stock symbol for every year.
We want to work with a table which summarizes the high, low, average and the dividend for every year by stock symbol.
Here’s the DDL statement to create the table ‘yearly_aggregates':
create table yearly_aggregates (symbol string, year string, high float, low float, average_close float, total_dividends float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/stock_aggregates';
Now let’s populate the table with data using the following query:
insert overwrite table yearly_aggregates select a.symbol, year(a.trade_date), max(a.high), min(a.low), avg(a.close), sum(b.dividend) from price_data a left outer join dividends_data b on (a.symbol = b.symbol and a.trade_date = b.trade_date) group by a.symbol, year(a.trade_date);
Let’s run a query to test the table:
select * from yearly_aggregates where symbol = 'IBM';
Notice in the output below now we have just one record per year per symbol:
Consuming and Shaping resultset from Hive query using Excel
We will use Excel for this step.
Open Excel with a blank workbook and select the Power Query tab on the ribbon:
Click on “From Other Sources” on the ribbon and select “From Hadoop File (HDFS)”:
Here you should enter the IP address or server name of you HDP cluster namenode.
If your cluster needs authentication, you will also need to enter your credentials. For Sandbox, select the anonymous mode.
You will see a listing of all files on the HDFS, which could be quite overwhelming:
Thankfully Power Query has a nifty capability to filter the listing by clicking on the little triangle on the column name. In this case I filter the Folder path column with the string ‘aggregate':
Now click on the Binary link on the Content column:
Now you should see the comma separated records all in one column since excel does not yet know how to process it:
Right click on the only column and select “Split Column” -> “By Delimiter”:
Select “Comma” as the delimiter and hit “OK”:
Now you should see the data properly structured:
Let’s rename the column by right clicking the column names and selecting rename with name that make more sense:
Now scroll down and you will notice that some values in the “Dividend” column are “\N” instead of a number. Let’s fix this by right clicking and selecting “Replace Values”:
Now the “Dividend” Column should be well-formed:
Let’s change the data type of the columns High, Low, Average, Dividend to Number by right clicking the column and selecting “Change Type”:
Once you are done changing the type, the number fields should show italicized:
We are done with shaping this data, lets click “Done”:
In the query settings on the right hand pane you can disable the download of the data, since we will download the data set only after we enhance it with additional attributes from Wikipedia.
Acquiring external data from the Internet using Search
Let’s find out which are the stocks that are in S&P 500. Click on “Online Search” on the Power Query tab of the Excel worksheet:
By hovering over the search result list you can see previews of the datasets:
Click to bring in the dataset titled “S&P 500 Component stocks – list of S&P 500 companies”:
We have to shape this data before we use it. Click on “Filter & Shape” button on right hand pane to bring up the filter dialog:
Remove the uninteresting columns for our analysis like “SEC filings”, “GICS Sub Industry”, “Data first added” and “key” by right-clicking on column names and selecting “Remove Columns”:
Enhancing the resultset acquired from Hadoop with additional attributes using data from Wikipedia
In this section we want to limit the data to only the stocks in the S&P 500 and moreover we want additional attributes for the stocks like the Company Name, Industry and location of the Headquarters.
Go to the “Power Query” tab and select “Merge”:
Select the resultset you acquired from Hadoop as the first resultset and select the resultset you acquired from Internet as your second resultset. Also select the “Symbol” and “Stock Symbol” as the columns to do the JOIN on and click OK:
Now let’s expand the NewColumn to its component attributes by clicking on the little T shaped icon beside the column name. Deselect “Ticker Symbol” as we have that attribute already from before:
You should see the new columns like below:
Rename the new columns to something like below:
Scroll down and you will notice that many rows have null values for the new columns.
These rows are null because these stock symbols are not part of S&P 500. So, we need to filter these rows out by right clicking one of the null values and selecting “Text Filters”:
Now the dataset is limited to just the symbols in S&P 500:
To load it into the PowerPivot in-memory model, click on the “Load to data Model” link on the right hand pane:
To confirm that it has been loaded to the PowerPivot model, click on the PowerPivot tab on Excel and select Manage:
You should see the following:
Close the PowerPivot Model.
Visualizing the combined model using Power View in Excel
Open the Insert tab of Excel and select PowerView to insert a Power View slide:
You will see a Power View canvas with the default model:
Deselect all the PowerView fields except Dividend and Headquarters. :
Then ensure the dividend is being averaged and not summed by click on the little triangle beside Dividend
Now click on the Map button on the Power View ribbon and BOOM!, you got an interactive map:
Resize and Zoom the Map control to your taste.
As we walked through an end-to-end data pipeline where we started with the raw big data, cleaned, transformed and aggregated it in Hortonworks Data Platform or HDP, before consuming it in Excel for modeling and visualization, the integration was so simple that we barely noticed.
This was your favorite big data community (Apache Hadoop) at work. Hortonworks and others in the Apache community designed and implemented WebHDFS, an open REST API in Apache Hadoop. Microsoft used the API from Power Query for Excel to make the integration to Microsoft Business Intelligence platform seamless.