How To Analyze Machine and Sensor Data

Hadoop Tutorial: Analyzing Machine and Sensor Data

This tutorial is for HDP version 2.3 of the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download the Hortonworks Sandbox to run this and other tutorials in the series.


This tutorial describes how to refine data from heating, ventilation,
and air conditioning (HVAC) systems using the Hortonworks Data Platform,
and how to analyze the refined sensor data to maintain optimal building

Demo: Here is the video of Enable Predictive Analytics with
as a demo of what
you’ll be doing in this tutorial.

Sensor Data

A sensor is a device that measures a physical quantity and transforms it
into a digital signal. Sensors are always on, capturing data at a low
cost, and powering the “Internet of Things.”

Potential Uses of Sensor Data

Sensors can be used to collect data from many sources, such as:

  • To monitor machines or infrastructure such as ventilation equipment,
    bridges, energy meters, or airplane engines. This data can be used
    for predictive analytics, to repair or replace these items before
    they break.
  • To monitor natural phenomena such as meteorological patterns,
    underground pressure during oil extraction, or patient vital
    statistics during recovery from a medical procedure.

In this tutorial, we will focus on sensor data from building operations.
Specifically, we will refine and analyze the data from Heating,
Ventilation, Air Conditioning (HVAC) systems in 20 large buildings
around the world.



  • In this tutorial, the Hortonworks Sandbox is installed on an Oracle
    VirtualBox virtual machine (VM) – your screens may be different.
  • If you plan on using the Microsoft Excel for the analysis and reporting section install the ODBC driver that matches the version of Excel you are
    using (32-bit or 64-bit).
  • If choosing to use Excelm you will use the Power View feature in Microsoft
    Excel 2013 to visualize the sensor data. Power View is currently
    only available in Microsoft Office Professional Plus and Microsoft
    Office 365 Professional Plus.
  • Note, other versions of Excel will work, but the visualizations will
    be limited to charts. You can connect to any other visualization
    tool you like.
  • If not using Excel, you will be able to use Apache Zeppelin to analyze and report on the data from this tutorial.


To refine and analyze HVAC sensor data, we will:

  • Download and extract the sensor data files.
  • Load the sensor data into the Hortonworks Sandbox.
  • Run two Hive scripts to refine the sensor data.
  • Access the refined sensor data with Microsoft Excel or Apache Zeppelin.
  • Visualize the sensor data using Excel Power View or Apache Zeppelin.

Step 1: Download and Extract the Sensor Data Files

  • You can download the sample sensor data contained in a compressed
    (.zip) folder here:

  • Save the file to your computer, then extract the
    files. You should see a SensorFiles folder that contains the
    following files:

  • HVAC.csv – contains the targeted building temperatures, along with
    the actual (measured) building temperatures. The building
    temperature data was obtained using Apache Flume. Flume can be used
    as a log aggregator, collecting log data from many diverse sources
    and moving it to a centralized data store. In this case, Flume was
    used to capture the sensor log data, which we can now load into the
    Hadoop Distributed File System (HFDS). For more details on Flume,
    refer to Tutorial 13: Refining and Visualizing Sentiment Data

  • building.csv – contains the “building” database table. Apache Sqoop
    can be used to transfer this type of data from a structured database
    into HFDS.

Step 2: Load the Sensor Data into the Hortonworks Sandbox

  • Navigate to the ambari login by going to the web address http://localhost:8080
  • Login with the username admin and password admin.

Once logged in to the sandbox, navigate to the icon at the top right of the header and view the dropdown menu.

  • Select HDFS Files

  • Note the view you are greeted with once you open the HDFS Files Ambari View.

  • You can view the path that you’re currently in
  • You can create new directories and upload files using the button on the top.
  • You can also download whole directories as .zip files, move directories, and delete them as well. These options are highlighted in blue for each folder or file in a directory.

  • Navigate to the /tmp directory.

  • Create the directory data by using New Directory at the top of the page.
  • Once inside the directory, click Upload then Browse. Navigate to the HVAC.csv file that is part of the .zip file that was downloaded earlier.
  • Click Upload to upload the CSV file to HDFS.
  • Follow the same procedure to upload the building.csv file.

  • After uploading both files, the page should look similar to the following image:

  • Now user the menu on the header again to access the dropdown menu containing the link to the Ambari Hive View

  • Here we’re going to define two tables that will contain data from our HVAC.csv and building.csv files.

  • Copy and paste the following query into the worksheet and click Execute to create the table hvac_stage.
create table hvac_stage (
recordDate string,
Time string,
TargetTemp int,
ActualTemp int,
System int,
SystemAge int,
BuildingID int) 

  • After the previous query has executed, Execute this next next query to create the table buildings_stage.
create table buildings_stage
(BuildingID int,
 BuildingMgr string,
 BuildingAge string,
 HVACproduct string,
 Country string) 

  • After both queries have succeeded you should see hvac_stage and buildings_stage in the Database Explorer on the left hand side of the screen.
  • Now use the following query to load our tables with data from our data files that we previously uploaded to HDFS.

  • Repeat the previous step with the following query to load the table buildings_stage with data.
LOAD DATA INPATH '/tmp/data/building.csv' OVERWRITE INTO TABLE buildings_stage;

  • Now refresh the Database Explorer. You should see all of the tables appear on the left hand side after clicking default.
  • If you click the small square icon to the right hand side of any table it will immediately create a query that gives you a small sample of data from the table you selected.
  • Do this for the table hvac_stage and buildings_stage. Make sure both tables have all columns populated and that it is possible to execute and see the results of both queries.

  • Now that we have both tables loaded in, we want to get better performance in Hive, so we’re going to create new tables that utilize the highly efficient ORC file format. This will allow for faster queries when our datasets are much much larger.

  • Execute the following query to create a new table hvac that is stored as an ORC file.

  • Repeat the previous step, except this time we will make a table for buildings.
CREATE TABLE buildings STORED AS ORC AS SELECT * FROM buildings_stage;

Step 3: Run Two Hive Scripts to Refine the Sensor Data

We will now use two Hive scripts to refine the sensor data. We hope to
accomplish three goals with this data:

  • Reduce heating and cooling expenses.
  • Keep indoor temperatures in a comfortable range between 65-70
  • Identify which HVAC products are reliable, and replace unreliable
    equipment with those models.

  • First, we will identify whether the actual temperature was more than
    five degrees different from the target temperature.

  • Create a new worksheet in the Hive view and paste the following Hive query into your window.

CREATE TABLE hvac_temperatures as 
select *, targettemp - actualtemp as temp_diff, 
IF((targettemp - actualtemp) > 5, 'COLD', 
IF((targettemp - actualtemp) < -5, 'HOT', 'NORMAL')) 
AS temprange, 
IF((targettemp - actualtemp) > 5, '1', 
IF((targettemp - actualtemp) < -5, '1', 0)) 
AS extremetemp from hvac;
  • This query creates a new table hvac_temperatures and copies data from the hvac table

  • After you paste the query use Execute to create the new table.

  • Then

  • On the Query Results page, use the slider to scroll to the right. You
    will notice that two new attributes appear in the hvac_temperatures table.

    The data in the temprange column indicates whether the actual
    temperature was:

    • NORMAL within 5 degrees of the target temperature.
    • COLD more than five degrees colder than the target
    • HOT more than 5 degrees warmer than the target

    If the temperature is outside of the normal range, extremetemp is
    assigned a value of 1; otherwise its value is 0.

  • Next we will combine the hvac and hvac_temperatures data sets.

Create a new worksheet in the hive view and use the following query to create a new table hvac_building that contains data from the hvac_temperatures table and the buildings table.

create table if not exists hvac_building 
as select h.*,, b.hvacproduct, b.buildingage, b.buildingmgr 
from buildings b join hvac_temperatures h on b.buildingid = h.buildingid;
  • Use Execute to run the query that will produce the table with the intended data.

  • After you’ve successfully executed the query, use the database explorer to load a sample of the data from the new hvac_building table.

Now that we’ve constructued the data into a useful format, we can use different reporting tools to analyze the results.

Data Reporting

In this tutorial you can choose to report with

Step 4a: Access the Refined Sensor Data with Microsoft Excel

In this section, we will use Microsoft Excel Professional Plus 2013 to
access the refined sentiment data.

  • In Windows, open a new Excel workbook, then select Data > From
    Other Sources > From Microsoft Query

  • On the Choose Data Source pop-up, select the Hortonworks ODBC data
    source you installed previously, then click OK.

    The Hortonworks ODBC driver enables you to access Hortonworks data
    with Excel and other Business Intelligence (BI) applications that
    support ODBC.

  • After the connection to the Sandbox is established, the Query Wizard
    appears. Select the “hvac_building” table in the Available tables
    and columns box, then click the right arrow button to add the entire
    “hvac_building” table to the query. Click Next to continue.

  • On the Filter Data screen, click Next to continue without
    filtering the data.

  • On the Sort Order screen, click Next to continue without setting
    a sort order.

  • Click Finish on the Query Wizard Finish screen to retrieve the
    query data from the Sandbox and import it into Excel.

  • On the Import Data dialog box, click OK to accept the default
    settings and import the data as a table.

  • The imported query data appears in the Excel workbook.

Now that we have successfully imported the refined sensor data into
Microsoft Excel, we can use the Excel Power View feature to analyze and
visualize the data.

Step 5: Visualize the Sensor Data Using Excel Power View

We will begin the data visualization by mapping the buildings that are
most frequently outside of the optimal temperature range.

  • In the Excel worksheet with the imported “hvac_building” table,
    select Insert > Power View to open a new Power View report.

  • The Power View Fields area appears on the right side of the window,
    with the data table displayed on the left. Drag the handles or click
    the Pop Out icon to maximize the size of the data table.

  • In the Power View Fields area, select the checkboxes next to the
    country and extremetemp fields, and clear all of the other
    checkboxes. You may need to scroll down to see all of the check

  • In the FIELDS box, click the down-arrow at the right of the
    extremetemp field, then select Count (Not Blank).

  • Click Map on the Design tab in the top menu.

  • The map view displays a global view of the data. We can see that the
    office in Finland had 814 sensor readings where the temperature was
    more than five degrees higher or lower than the target temperature.
    In contrast, the German office is doing a better job maintaining
    ideal office temperatures, with only 363 readings outside of the
    ideal range.

  • Hot offices can lead to employee complaints and reduced
    productivity. Let’s see which offices run hot.

    In the Power View Fields area, clear the extremetemp checkbox
    and select the temprange checkbox. Click the down-arrow at the
    right of the temprange field, then select Add as Size.

  • Drag temprange from the Power View Fields area to the Filters
    box, then select the HOT checkbox. We can see that the buildings
    in Finland and France run hot most often.

  • Cold offices cause elevated energy expenditures and employee

    In the Filters box, clear the HOT checkbox and select the
    COLD checkbox. We can see that the buildings in Finland and
    Indonesia run cold most often.

  • Our data set includes information about the performance of five
    brands of HVAC equipment, distributed across many types of buildings
    in a wide variety of climates. We can use this data to assess the
    relative reliability of the different HVAC models.

  • Open a new Excel worksheet, then select **Data > From Other Sources

    From Microsoft Query** to access the hvac_building table. Follow
    the same procedure as before to import the data, but this time only
    select the “hvacproduct” and “extremetemp” columns.

  • In the Excel worksheet with the imported “hvacproduct” and
    “extremetemp” columns, select Insert > Power View to open a new
    Power View report.

  • Click the Pop Out icon to maximize the size of the data table. In
    the FIELDS box, click the down-arrow at the right of the extremetemp
    field, then select Count (Not Blank).

  • Select Column Chart > Stacked Columnin the top menu.

  • Click the down-arrow next to sort by hvacproduct in the upper
    left corner of the chart area, then select Count of extremetemp.

  • We can see that the GG1919 model seems to regulate temperature most
    reliably, whereas the FN39TG failed to maintain the appropriate
    temperature range 9% more frequently than the GG1919.

We’ve shown how the Hortonworks Data Platform (HDP) can store and
analyze sensor data. With real-time access to massive amounts of
temperature and other types of data on HDP, your facilities department
can initiate data-driven strategies to reduce energy expenditures and
improve employee comfort.

Step 4b: Access the Refined Sensor Data with Apache Zeppelin

Apache Zeppelin makes data reporting easy on Hadoop. It has direct connections to Apache Spark and Hive in your cluster and allows you to create visualizations and analyze your data on the fly.

To start you’re going to need to open up the Apache Zeppelin view in Ambari.

Start by navigating back to the Ambari Dashboard at http://localhost:8080

  • Use the dropdown menu to open the Zeppelin View.

  • From here we’re going to need to create a new Zeppelin Notebook.

  • Notebooks in Zeppelin is how we differentiate reports from one another.
  • Hove over Notebook. Use the dropdown menu and Create a new note.

  • Name the note HVAC Analysis Report and then Create Note.

  • Head back to the Zeppelin homepage.

  • Use the Notebook dropdown menu to open the new notebook HVAC Analysis Report.

  • Zeppelin integrates with Hadoop by using things called interpreters.

  • In this tutorial we’ll be working with the Hive interpreter to run Hive queries in Zeppelin, then visualize the results from our Hive queries directly in Zeppelin.
  • To specify the Hive interpreter for this note, we need to put %hive at the top of the note. Everything afterwards will be interpreted as a Hive query.

  • Type the following query into the note, then run it by clicking the Run arrow or by using the shortcut Shift+Enter.


select country, extremetemp, temprange from hvac_building

  • After running the previous query we can view a chart of the data by clicking the chart button located just under the query.

  • Click settings to open up more advanced settings for creating the chart. Here you can experiment with different values and columns to create different types of charts.

  • Arrange the fields according to the following image.

  • Drag the field temprange into the groups box.
  • Click SUM on extremetemp and change it to COUNT.
  • Make sure that country is the only field under Keys.

  • Awesome! You’ve just created your first chart using Apache Zeppelin.

  • From this chart we can see which countries have the most extreme temperature and how many NORMAL events there are compared to HOT and COLD.
  • From this data it could be possible to figure out which buildings might need HVAC upgrades, and which do not.

  • Let’s try creating one more note to visualize which types of HVAC systems result in the least amount of extremetemp readings.

  • Paste the following query into the blank Zeppelin note following the chart we made previously.

select hvacproduct, extremetemp from hvac_building
  • Now use Shift+Enter to run the note.

  • Arrange the fields according to the following image so we can recreate the chart below.

  • Make sure that hvacproduct is in the Keys box.
  • Make sure that extremetemp is in the Values box and that it is set to COUNT.

  • Now we can see which HVAC units result in the most extremetemp readings. Thus we can make a more informed decision when purchasing new HVAC systems.

Apache Zeppelin gives you the power to connect right to your Hadoop cluster to quickly obtain results from the data inside of Hadoop without having to export data to any other sources.

It’s also important to note that Zeppelin contains many, many interpreters that can be utilized to obtain data in a variety of ways.

One of the default interpreters included with Zeppelin is for Apache Spark. With the popularity of Apache Spark rising, you can simply write Spark scripts to execute directly on Apache Zeppelin to obtain results from your data in a matter of seconds.


We are eager to hear your feedback on this tutorial. Please let us know
what you think. Click here to take


April 6, 2014 at 5:39 am

Valuable information. Lucky me I found your site unintentionally,
and I am shocked why this twist of fate did not happened earlier!
I bookmarked it.

May 14, 2014 at 8:42 am

I simply could not depart your site prior to suggesting that I extremely loved the usual information
a person provide on your visitors? Is going to be again incessantly in order to
check up on new posts

July 21, 2014 at 3:57 am

where can i get tutorials on how to use a sensor for capturing data and what is the cost of such a sensor for the video you posted . Thank you.

July 21, 2014 at 3:59 am

Where can i get tutorials on how to use a sensor for capturing data and what is the cost of such a sensor for the video you posted . Thank you.Please respond asap.

August 9, 2014 at 11:35 pm

Guys, your tutorials are great! wonderful. Even though there are some errors but with the help of other websites I learnt lot of stuff. Keep it up!! Post more tutorials focusing on the real world scenarios. Infact I start suggesting your website for my students.

Dharmesh Purohit
November 14, 2014 at 3:33 pm

All the real world use cases are simply amazing the only question I have , Is there away to get the sample data for each of these PoC to practice out by my own?

If yes, how can I get access to them?

December 3, 2014 at 9:51 am

Getting Error when connecting from Excel.
(35) Error from Hive: error code:’40000′ error message:’Error while compiling statement: FAILED: ParseExecption line 2:17 cannot recognize input near ‘.”hvac_building”hvac_building’ in table source’.

Did anyone else faced this same issue ?

July 19, 2015 at 1:44 pm

Keep posting nice and clear explanation. Appreciable work.
Thank you.

August 26, 2015 at 11:17 am

Horton odbc hive Dsn is not showing up in data source of excel 2013 from external sources. I followed complete tutorial but no success. Please help me out.

September 22, 2015 at 5:58 am

Good to see a collection of relevant information. I truly hate Blogs with all advertisement, but your blog is something which i can describe a fantastic blog made to provide Info. Fine with your permission allow me to grab your feed to keep up to date with forthcoming post. Thanks a million and please continue the rewarding work.

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre lang="" line="" escaped="" cssfile="">