April 06, 2017

Fast Analytics in the Cloud with Hive LLAP

Apache Hadoop has always been associated with storing and processing vast amounts of data. But did you know it’s also an awesome engine to power interactive data exploration and visualization? With the development of Apache Hive LLAP (a recent innovation included in the Hortonworks Data Platform), you can use Hadoop with Business Intelligence tools (like Tableau) to analyze and visualize large data sets. Plus, being able to do this in a cloud environment means you can get started quickly while working seamlessly and directly with data available in cloud storage systems (like Amazon S3).

In this blog, let’s take a look at how we can use Hortonworks Data Cloud and the Hive LLAP Analytics cluster configuration to model, interact with, and visualize data available in Amazon S3. Everything you’ll need is covered in the sections below.

First Things First: Hortonworks Data Cloud

For those who aren’t familiar: Hortonworks Data Cloud is a platform for analyzing and processing data in the cloud. “HDCloud” lets businesses achieve insights more quickly and flexibly than ever before, with far less operational effort. You can choose from a set of prescriptive cluster configurations to start modeling and analyzing your data sets in minutes. When you are done with your analysis, you can give the resources back to the cloud, reducing your costs.

To get started, you need to launch the HDCloud “cloud controller” into your AWS environment. You can follow the instructions here:

Once the controller is ready, log in and create a cluster for “Analytics with Hive LLAP”. The general instructions can be found at the link below, but be sure to use the following parameters and options when you create your cluster:

HDP Version: HDP 2.6 (Cloud)
Cluster Configuration Type: EDW-Analytics: Apache Hive 2 LLAP, Apache Zeppelin 0.7.0
Master Instance Type: m4.xlarge
Worker Instance Type: m3.2xlarge
Worker Instance Count: 3

You Have a Cluster…Now What?

For our scenario, we have some “airports” data available in Amazon S3. Before we can query it, we need to define a Hive schema over it. To do that, we can use the Hive View that comes pre-installed in the cluster you just created: open it from the Ambari dropdown menu on the cluster details page.

From a Worksheet in the Hive View, copy the content from this DDL and execute:

Let’s briefly highlight this particular part of the DDL:

LOCATION 's3a://hw-sampledata/airline_orc/airline_ontime.db/airports'

The LOCATION clause points the schema at data that already lives in Amazon S3. Because this is a Hive external table, the actual table data stays in S3 while only the schema is defined in Hive. External tables are a good fit when the data is already present in Amazon S3: once registered, the data can be queried repeatedly, and it is not deleted even if the table definition is accidentally dropped. Plus, the data is stored in Apache ORC, a high-performance columnar storage format that is great for Hadoop workloads.
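The full DDL isn’t reproduced here, but a minimal sketch of what an external ORC table over that S3 location looks like is below (the column list is illustrative, not the actual schema from the linked DDL):

```sql
-- Sketch of an external table over ORC data in S3.
-- The column names here are illustrative assumptions; the real
-- definitions come from the DDL linked above.
CREATE EXTERNAL TABLE airports (
  iata    STRING,
  airport STRING,
  city    STRING,
  country STRING
)
STORED AS ORC
LOCATION 's3a://hw-sampledata/airline_orc/airline_ontime.db/airports';
```

Dropping this table removes only the metastore entry; the ORC files in S3 are left untouched, which is exactly the external-table behavior described above.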

Next: even though the DDL creates the schema definition in Hive, we still need to populate partition-related information. In a new Hive View Worksheet, run the following (be sure to switch to the “hwxdemo” database that was created with the DDL). Now you have some data to play with.
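The exact Worksheet statement isn’t reproduced above; a common way to register partitions for an external table on HDP is MSCK REPAIR TABLE, sketched here (the assumption that “flights” is the partitioned table is ours — adjust to match the DDL):

```sql
-- Switch to the database created by the DDL.
USE hwxdemo;

-- MSCK REPAIR TABLE scans the table's S3 location and adds any
-- partition directories it finds to the Hive metastore.
MSCK REPAIR TABLE flights;
```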


You Have Data…Again, Now What?

Ok…well…you have data. It’s time to point Tableau at Hive LLAP and connect. We have some instructions here on how to do that:

Time to Model and Visualize

With Tableau connected to Hive LLAP, select the “hwxdemo” schema and start creating a data model from the following four tables: flights, airlines, planes, and airports.

For example, you can create a data model by loading the tables in order, with the associations described below:

Load Order | Table    | Inner Join Condition
1          | flights  | n/a
2          | airlines | Uniquecarrier = Code
3          | planes   | Tailnum = Tailnum (planes); rename "planes" stays as-is
4          | airports | Origin = Iata; rename "airports" to "airports_origin"
5          | airports | Dest = Iata (airports); rename "airports" to "airports_dest"
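In Hive terms, the model above corresponds roughly to the following join. This is an illustrative sketch built from the join conditions in the table, not the exact query Tableau generates:

```sql
-- Flights inner-joined to airlines, planes, and two aliases of
-- airports (one for the origin, one for the destination).
SELECT COUNT(*)
FROM flights  f
JOIN airlines al ON f.uniquecarrier = al.code
JOIN planes   p  ON f.tailnum = p.tailnum
JOIN airports ao ON f.origin = ao.iata   -- "airports_origin"
JOIN airports ad ON f.dest   = ad.iata;  -- "airports_dest"
```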

The final table associations should look like this:

Once you have the data model built, go to town and start visualizing airline growth over the years, or the airports with the most delays.
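For a flavor of what an “airports with the most delays” view computes under the hood, it boils down to an aggregation like this (“depdelay” is an assumed column name in the flights table, not confirmed by the post):

```sql
-- Top 10 origin airports by average departure delay (sketch).
SELECT origin,
       AVG(depdelay) AS avg_dep_delay
FROM flights
GROUP BY origin
ORDER BY avg_dep_delay DESC
LIMIT 10;
```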

Where to go from here?

If you are ready to get started with Hortonworks Data Cloud for AWS, head to the AWS Marketplace listing for a 5-day free trial. For more information, please refer to the following links.

Product Webpage
Product Documentation
“Get Started with HDCloud” Webinar



sharmila says:

How can I analyze JSON data using Hive? Please help. I installed HDCloud with an EC2 m4.2xlarge for my POC work. I SFTP’d one sample JSON file onto the EC2 instance, but I’m not sure how to move next.

Will Xu says:

You should try asking your question again in the Hortonworks Community Connection forum with more specifics, such as where your JSON file is located.
