From Raw Data to Insight using HDP and Microsoft Business Intelligence


In this tutorial we will walk through the process of
  • cleaning and aggregating 10 years of raw stock ticker data from NYSE
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model


The Data

The Zip file you downloaded above containing the NYSE stock ticker data is comprised of two folders below: DataFiles These folders contain numerous comma separated value files representing end of day stock ticker and dividend data of all stock symbols listed at NYSE for a period of 10 years.

Staging the data on HDFS

The first step is to stage the data in HDFS. In case you are using HDInsight Service or another cloud provider, you will have to stage the data on the respective cloud storage. For example in the case of Azure HDInsight Service it is Azure Blob Storage. For the rest of the tutorial we are going to assume we have a local cluster like the Hortonworks Sandbox to work with. We will use "File Browser" of Hue to stage the data in this case. You could also use hdfs command line among other ways to achieve the same. FileBrowser Let's upload the zip file: Upload Select the zip file you downloaded: FileBrowser Once you upload the zip file, you should see a directory structure like below on HDFS: FileBrowser
Note the path of the directories, you will need to use them in the next step.

Creating a Hive schema on the raw data:

We will use the Beeswax UI of Hue to execute the DDL queries: Beeswax Use the DDL statement below to create the table 'price_data':
create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/nyse_prices';
To test it succeeded run a query:
select * from price_data where symbol = 'IBM';
You should see an output like this: Beeswax Then use the DDL statement below to create the table 'dividends_data':
create external table dividends_data (stock_exchange string, symbol string, trade_date string, dividend float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/nyse_dividends';
To test it succeeded run a query:
select * from dividends_data where symbol = 'IBM';

Aggregating the stocks and dividend data

Notice in the output of the query on the table price_data above that there are many records for a stock symbol for every year. We want to work with a table which summarizes the high, low, average and the dividend for every year by stock symbol. Here's the DDL statement to create the table 'yearly_aggregates':
create table yearly_aggregates (symbol string, year string, high float, low float, average_close float, total_dividends float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/stock_aggregates';
Now let's populate the table with data using the following query:
insert overwrite table yearly_aggregates select a.symbol, year(a.trade_date), max(a.high), min(a.low),  avg(a.close), sum(b.dividend) from price_data a left outer join dividends_data b on (a.symbol = b.symbol and a.trade_date = b.trade_date) group by a.symbol, year(a.trade_date);
Let's run a query to test the table:
select * from yearly_aggregates where symbol = 'IBM';
Notice in the output below now we have just one record per year per symbol: Beeswax

Consuming and Shaping resultset from Hive query using Excel

We will use Excel for this step. Open Excel with a blank workbook and select the Power Query tab on the ribbon: Excel Click on "From Other Sources" on the ribbon and select "From Hadoop File (HDFS)": Excel Here you should enter the IP address or server name of you HDP cluster namenode. Excel If your cluster needs authentication, you will also need to enter your credentials. For Sandbox, select the anonymous mode. You will see a listing of all files on the HDFS, which could be quite overwhelming: Excel Thankfully Power Query has a nifty capability to filter the listing by clicking on the little triangle on the column name. In this case I filter the Folder path column with the string 'aggregate': Excel Now click on the Binary link on the Content column: Excel Now you should see the comma separated records all in one column since excel does not yet know how to process it: Excel Right click on the only column and select "Split Column" -> "By Delimiter": Excel Select "Comma" as the delimiter and hit "OK": Excel Now you should see the data properly structured: Excel Let's rename the column by right clicking the column names and selecting rename with name that make more sense: Excel Now scroll down and you will notice that some values in the "Dividend" column are "\N" instead of a number. Let's fix this by right clicking and selecting "Replace Values": Excel Now the "Dividend" Column should be well-formed: Excel Let's change the data type of the columns High, Low, Average, Dividend to Number by right clicking the column and selecting "Change Type": Excel Once you are done changing the type, the number fields should show italicized: Excel We are done with shaping this data, lets click "Done": Excel In the query settings on the right hand pane you can disable the download of the data, since we will download the data set only after we enhance it with additional attributes from Wikipedia. Excel

Acquiring external data from the Internet using Search

Let's find out which are the stocks that are in S&P 500. Click on "Online Search" on the Power Query tab of the Excel worksheet: Excel By hovering over the search result list you can see previews of the datasets: Excel Click to bring in the dataset titled "S&P 500 Component stocks - list of S&P 500 companies": Excel We have to shape this data before we use it. Click on "Filter & Shape" button on right hand pane to bring up the filter dialog: Excel Remove the uninteresting columns for our analysis like "SEC filings", "GICS Sub Industry", "Data first added" and "key" by right-clicking on column names and selecting "Remove Columns": Excel Click "Done" Excel

Enhancing the resultset acquired from Hadoop with additional attributes using data from Wikipedia

In this section we want to limit the data to only the stocks in the S&P 500 and moreover we want additional attributes for the stocks like the Company Name, Industry and location of the Headquarters. Go to the "Power Query" tab and select "Merge": Excel Select the resultset you acquired from Hadoop as the first resultset and select the resultset you acquired from Internet as your second resultset. Also select the "Symbol" and "Stock Symbol" as the columns to do the JOIN on and click OK: Excel Now let's expand the NewColumn to its component attributes by clicking on the little T shaped icon beside the column name. Deselect "Ticker Symbol" as we have that attribute already from before: Excel You should see the new columns like below: Excel Rename the new columns to something like below: Excel Scroll down and you will notice that many rows have null values for the new columns. Excel These rows are null because these stock symbols are not part of S&P 500. So, we need to filter these rows out by right clicking one of the null values and selecting "Text Filters": Excel Now the dataset is limited to just the symbols in S&P 500: Excel Click "Done": Excel To load it into the PowerPivot in-memory model, click on the "Load to data Model" link on the right hand pane: Excel To confirm that it has been loaded to the PowerPivot model, click on the PowerPivot tab on Excel and select Manage: Excel You should see the following: Excel Close the PowerPivot Model.

Visualizing the combined model using Power View in Excel

Open the Insert tab of Excel and select PowerView to insert a Power View slide: Excel You will see a Power View canvas with the default model: Excel Deselect all the PowerView fields except Dividend and Headquarters. : Excel Then ensure the dividend is being averaged and not summed by click on the little triangle beside Dividend Excel Now click on the Map button on the Power View ribbon and BOOM!, you got an interactive map: Excel Resize and Zoom the Map control to your taste.
As we walked through an end-to-end data pipeline where we started with the raw big data, cleaned, transformed and aggregated it in Hortonworks Data Platform or HDP, before consuming it in Excel for modeling and visualization, the integration was so simple that we barely noticed. This was your favorite big data community (Apache Hadoop) at work. Hortonworks and others in the Apache community designed and implemented WebHDFS, an open REST API in Apache Hadoop. Microsoft used the API from Power Query for Excel to make the integration to Microsoft Business Intelligence platform seamless.
Happy Hadooping!!! @saptak


July 15, 2014 at 9:13 am

Getting an constant error of cannot connect to remote server
I tried updating hosts file in Windows 8.1. as well. as

August 29, 2014 at 6:19 pm

Add the following line in the hosts file on your PC (C:Windows\System32\drivers\etc):

Then use “” as your hadoop sever name in Excel Power Query, you will be fine.

October 14, 2014 at 8:53 am

Same for me: Added to hosts file, but doesnt work in PowerQuery. Something else I need to do?

david barrera
July 21, 2015 at 10:54 am

I have configured the file “hosts” adding the IP and name DNS “” after to do this configuration the Excel can connect successfully to HDFS.

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>