Refine and Visualize Server Log Data

Lab 1: Server Logs Analysis using Apache NiFi and Hive


Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop takes server-log analysis to the next level by speeding and improving security forensics and providing a low cost platform to show compliance.

This demo shows how an enterprise security breach analysis and response might be performed.

In this tutorial, learn how to:

  • Stream server logs into Hadoop with Hortonworks Dataflow powered by Apache NiFi
  • Use Hive to build a relational view of the data
  • Use Elastic Search for high-level visualization
  • Use Oozie to automatically update a firewall
  • Visualize the data with Apache Zeppelin


  • Understand server log data and become familiar with its use cases, Hortonworks DataFlow, and Apache NiFi
  • Download and configure the script that generates our server log data
  • Install, configure, and start Hortonworks DataFlow
  • Generate the server log data
  • Import the server log data into Excel
  • Visualize the server log data using Excel Power View and Apache Zeppelin

This Hadoop tutorial can be performed with the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorial presented here is for Sandbox HDP 2.5.




Server Log Data

Server logs are computer-generated log files that capture network and server operations data. They are useful for managing network operations, especially for security and regulatory compliance.

Potential Uses of Server Log Data

IT organizations use server log analysis to answer questions about:

  • Security – For example, if we suspect a security breach, how can we use server log data to identify and repair the vulnerability?
  • Compliance – Large organizations are bound by regulations such as HIPAA and Sarbanes-Oxley. How can IT administrators prepare for system audits?

In this tutorial, we will focus on a network security use case. Specifically, we will look at how Apache Hadoop can help the administrator of a large enterprise network diagnose and respond to a distributed denial-of-service attack.

What Is Hortonworks Dataflow and Apache NiFi?

Apache NiFi is a secure, integrated platform for real-time data collection, simple event processing, and data transport and delivery from source to storage. It is useful for moving distributed data to and from your Hadoop cluster. NiFi offers substantial distributed processing capability, which helps reduce processing cost, deliver real-time insights from many different data sources across many large systems, and aggregate that data into one place or many.

NiFi lets users get the most value from their data. Specifically NiFi allows users to:

  • Stream data from multiple sources
  • Collect high volumes of data in real time
  • Guarantee delivery of data
  • Scale horizontally across many machines

How NiFi Works. NiFi’s high-level architecture is focused on delivering a streamlined interface that is easy to use and easy to set up. A little terminology is integral to understanding how NiFi works.

  • Processor: Processors are what make data move in NiFi. They can generate data, run commands, move data, convert data, and much more. NiFi’s architecture and feature set are designed to be extended through these processors, which sit at the very core of NiFi’s functionality.
  • Processing Group: When data flows get complex, it can be useful to group parts that perform a certain function together. NiFi abstracts this concept into processing groups.
  • FlowFile: A FlowFile represents a single piece of data in NiFi. It is made of two parts: attributes and content. Attributes are key-value pairs that give the data context. Three attributes are present on all FlowFiles: uuid, filename, and path.
  • Connections and Relationships: NiFi lets users drag and drop connections between processors to control how data flows. Each connection is assigned one or more relationships for the FlowFiles it carries (such as successful processing, or a failure to process).

A FlowFile can originate from a processor in NiFi. Processors can also receive FlowFiles and transmit them to many other processors. These processors can then deposit the data in the FlowFile into various places, depending on the function of the processor.
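As a conceptual sketch only (not NiFi's actual API), a FlowFile can be modeled as content plus a key-value attribute map carrying the three always-present attributes:

```python
import uuid

# Conceptual model of a NiFi FlowFile: content plus key-value attributes.
# This is an illustration, not NiFi's Java API.
class FlowFile:
    def __init__(self, content, filename, path):
        self.content = content
        self.attributes = {
            "uuid": str(uuid.uuid4()),   # the 3 attributes present on all FlowFiles
            "filename": filename,
            "path": path,
        }

log_event = FlowFile(b"GET /index.html 200", filename="server.log", path="/tmp/server-logs/")
```

Processors would then read or modify `log_event.attributes` to decide how to route the data.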


  • In this tutorial, the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM).
  • Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
  • In this tutorial, we will use the Power View feature in Excel 2013 to visualize the server log data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
  • We’re going to install Hortonworks DataFlow (HDF) on the Sandbox, so you’ll need to download the latest HDF release.

Step 1 – Configure and Install Hortonworks DataFlow

1.1 – Install NiFi

NiFi will be installed into the Ambari Stack of the Hortonworks Sandbox VirtualBox image because it will be used to activate the server log simulator and transport data to HDFS.

1. If you do not have NiFi installed on your sandbox, refer to Step 2: Download and Install NiFi on Hortonworks Sandbox (Option 1) from Tutorial 0: Download, Install, and Start NiFi of
Learning the Ropes of Apache NiFi for step-by-step instructions.

1.2 – Start NiFi

1. To activate the NiFi service, refer to Step 4: Start NiFi on Sandbox from Tutorial 0: Download, Install, and Start NiFi of
Learning the Ropes of Apache NiFi for step-by-step instructions.

Once you enter the NiFi HTML Interface at http://localhost:9090/nifi, you should see a canvas as below:


Step 2 – Import the Flow

We’re going to import a pre-made data flow from a template, ServerLogGeneratorSecurity.xml, which you can download.

Use the NiFi interface to upload the flow, and then drag it onto your workspace.

Upload NiFi Template

Once you’ve uploaded the template into NiFi you can instantiate it by dragging the template icon template_icon onto the screen. It will ask you to select your template’s name and the flow will appear as in the image below.

Instantiate NiFi Template

Step 3 – Generate the Server Log Data

Now that you’ve imported the data flow and everything is set up, simply click the Run button at the top of the screen. (Make sure you haven’t selected a specific processor, or else only that one processor will start.)


If you receive an error message stating “LzoCodec not found” or “LzopCodec not found”, similar to the image below, take the following steps.


Log in to the Ambari Dashboard at http://localhost:8080 as the raj_ops user. The login credentials are raj_ops/raj_ops. Once inside Ambari, click HDFS in the left-hand sidebar, then click the Configs tab. Next to the Settings tab, click the Advanced tab to open the advanced configs. Navigate to the Advanced core-site section; we need to change the value of the io.compression.codecs property.


When you find LzoCodec and LzopCodec among the values of the io.compression.codecs property, the entry will end like this:

...,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

Remove com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec from the config values.

You should see the updated content in the io.compression.codecs value field, as below:

Save the configuration change. Ambari will notify us that we need to restart HDFS. Scroll to the top of the HDFS Configs page; you should see an orange Restart button. Select Restart All Affected.


Now that we are done with the configuration changes, navigate back to the dataflow at http://localhost:9090/nifi and rerun the DataFlow. The errors should now be resolved.


Now that everything is running, we can check the places in HDFS where the data is being deposited. Since we will be verifying that the NiFi DataFlow performs as expected, log out of the raj_ops user, then log in to the Ambari Dashboard at http://localhost:8080 as the maria_dev user. The login credentials are maria_dev/maria_dev.

3.1 Verify NiFi Populates HDFS with Data Files

Open up the Files View, and then navigate to /tmp/server-logs/. Files should start appearing a few seconds after you start the flow. You can click on them to view the content.

Explore Output

  • Next we will create a Hive table from the log files.

3.2 Format the Data with Hive

Open the Ambari UI and head to the views dropdown list. Select Hive, then paste the following query:

CREATE TABLE firewall_logs (
  time STRING,
  ip STRING,
  country STRING,
  status INT
)
CLUSTERED BY (time) INTO 25 BUCKETS
STORED AS ORC
LOCATION '/tmp/server-logs';

Here we define the table in Hive using the Apache ORC file format.


Note: if the query doesn’t run successfully due to a permissions error, you might need to update the permissions on the directory. Run the following command over SSH on the Sandbox:

sudo -u hdfs hadoop fs -chmod -R 777 /tmp

When the table has been created, you should be able to query the data using a query like:

SELECT time, ip, country, cast(status as BOOLEAN) FROM firewall_logs LIMIT 100;


Note: In the table above, 1 is for true, 0 is for false.
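The effect of the CAST can be sketched in plain Python with hypothetical rows: like Hive, 0 maps to false (an unauthorized connection) and non-zero maps to true.

```python
# Hypothetical firewall_logs rows: (time, ip, country, status)
rows = [
    ("2016-01-01 10:00:00", "", "USA", 1),
    ("2016-01-01 10:00:01", "", "IND", 0),
]

# CAST(status AS BOOLEAN): 0 -> False (unauthorized), non-zero -> True (authorized)
cast_rows = [(t, ip, country, bool(status)) for t, ip, country, status in rows]
```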

Step 4 – Visualize Server Log Data with Apache Zeppelin

Why is data visualization important? It helps you analyze network data and determine effective responses to network issues.

Go to your browser and open the Zeppelin UI.

You should be greeted by the following screen where you can choose to view notes, or create a new one.


4.1 Import notebook NiFiServerLogs (Option 1)

You can choose to import the note used in this tutorial. Click Import note; an Import new note window will appear. Select Add from URL. The window will transition to a new window with two fields: Import As and URL. Copy and paste the following URL into the URL field:

Then click the Import note button.

4.2 Create notebook NiFiServerLogs (Option 2)

If you prefer to create the notebook yourself, follow the steps below. Click Create new note on the Zeppelin welcome page, enter NiFiServerLogs as the notebook name, and create the notebook. Then click the new notebook in the list of notebooks to open it. Our notebook will have 3 sections, in which we will analyze the data for a denial-of-service attack:

  • review network traffic by country
  • zoom in one particular country
  • generate list of attacking IP addresses

Find Network Traffic Per Country

Let’s filter the data in our Hive table so it only shows country and status. Copy and paste this query into Zeppelin:

SELECT country, cast(status as BOOLEAN) FROM firewall_logs

This generates a table in Zeppelin that associates each country with whether the network connection status was a success or not. Let’s visualize this table as a bar graph: select the bar graph option. Now let’s set up the Advanced settings: Keys, Groups, Values. You should have two fields: country and status. Drag the country field into the Keys box, then drag the status field into the Groups box. To sum up the status values for true and false, drag another status field into the Values box. Now each country is split by true or false, and when you hover over a piece of the bar graph, you should see the summed count for status false or true.


Blue represents successful, authorized network connections
Light Blue represents connections from unauthorized sources
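What the bar graph computes can be sketched in plain Python with hypothetical query results: a count for each (country, status) pair.

```python
from collections import Counter

# Hypothetical (country, status) rows as returned by the Hive query above
rows = [("USA", True), ("USA", False), ("USA", True), ("IND", False), ("IND", False)]

# Keys=country, Groups=status, Values=count(status): count each (country, status) pair
chart_data = Counter(rows)
```

Each bar segment in Zeppelin corresponds to one of these counts.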

Zoom into One Particular Country

Let’s assume the recent denial-of-service attacks originated in India. We can further filter the data by selecting the country with value IND. Copy and paste this query into the second available text space:

SELECT country, cast(status as BOOLEAN) FROM firewall_logs WHERE country = 'IND'

We will use Zeppelin’s data visualization features to zoom in on the status errors, i.e. unauthorized network connections, for India. Under the query there are 6 visualization choices; choose the second from the left, a bar graph.

Remember, we want to know the total number of unauthorized network connections, so drag a status square into the Values box. Since there are multiple values for status, “true” or “false”, drag another status square into the Groups box. The key associated with both status values is “IND”, so drag the country square into the Keys box. Refer to the image below:


Blue represents successful, authorized network connections
Light Blue represents connections from unauthorized sources

Generate List of Attacking IP Addresses

Let’s generate a record of unauthorized IP addresses. We will query the table for the time, ip, country, and status fields. Copy and paste the following query into Zeppelin:

SELECT time, ip, country, cast(status as BOOLEAN) FROM firewall_logs WHERE status = '0'

This generates a list of IP address attacks per country. The ‘0’ signifies connections from unauthorized sources, so we can update the network firewall to deny requests from those attacking IP addresses.

Let’s visualize the IP Address Attacks in a Pie Chart.

Drag the country field into the Keys box and the ip field into the Values box. Make sure you select the count option for ip.


Our visualization counts the number of unauthorized IP addresses per country.
Each piece of the pie represents a country, and when you hover over a slice, you can see the total count of unauthorized IP addresses.


We’ve shown how the Hortonworks Data Platform can help system administrators capture, store, and analyze server log data. With real-time access to massive amounts of data on the Hortonworks Data Platform, we were able to block unauthorized access and restore VPN access to authorized users.

With log data flowing continuously into the Hortonworks Data Platform “data lake,” we can protect the company network from similar attacks in the future. The data can be refreshed frequently and accessed to respond to security threats, or to prepare for compliance audits.

Further Reading

Lab 2: Server Logs Analysis using Apache NiFi, Spark and Zeppelin


In the previous lab, you learned how to use NiFi, Hive and Zeppelin to analyze server logs. In this lab, you will explore Apache Spark to analyze the server logs. Using Spark, we can get more value out of log data coming from web servers, files, and even user-generated data.

In this tutorial, learn how to:

  • Stream server logs into Hadoop and prepare the data with Hortonworks DataFlow powered by Apache NiFi
  • Use PySpark to clean the data
  • Visualize the data with Apache Zeppelin
  • Use Tableau to visualize the geolocation data on a world map


  • Hortonworks Sandbox (installed and running)
  • Hortonworks DataFlow – Download here
  • If you’d like to use Tableau to explore the data, download a free trial version
  • Server log tutorial files (included in this tutorial)
  • Lab 1 of this tutorial



Download the dataset here.

The dataset we are going to use in this lab is NASA-HTTP. It contains HTTP requests to the NASA Kennedy Space Center WWW server in Florida.
The logs are an ASCII file with one line per request, with the following columns:

  1. host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
  2. timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
  3. request given in quotes.
  4. HTTP reply code.
  5. bytes in the reply.

Step 1 – Configure and Install Hortonworks DataFlow

1.1 – Install NiFi

NiFi needs to be installed into the Ambari Stack of the Hortonworks Sandbox VirtualBox image because it will be used to activate the server log simulator and transport data to HDFS.
If you do not have NiFi installed on your sandbox, refer to Step 2: Download and Install NiFi on Hortonworks Sandbox (Option 1) from Tutorial 0: Download, Install, and Start NiFi of Analyze Traffic Patterns using Apache Nifi for step-by-step instructions.

1.2 – Start NiFi

To activate the NiFi service, refer to Step 4: Start NiFi on Sandbox from Tutorial 0: Download, Install, and Start NiFi of Analyze Traffic Patterns using Apache Nifi for step-by-step instructions.
Once you enter the NiFi HTML Interface, you should see a canvas as below:


Step 2 – Download Input Data

First, you’re going to need to log in to your Sandbox via SSH. If you’re using VirtualBox, you can log in with the command:

ssh root@ -p 2222

The password for the first login is: hadoop


Now we have to create the directories from which NiFi will pick up the data. There will be two different directories: one for the sample input dataset and one for the GeoLite dataset that NiFi uses to provide geolocation attributes.

mkdir -p /tmp/nifi/input
mkdir /tmp/nifi/GeoFile
chmod 777 -R /tmp/nifi


Download the files using the wget utility:

cd /tmp/nifi/input


cd /tmp/nifi/GeoFile


Step 3 – Import the Workflow

We’re going to import a data flow from a template, WebServerLogs.xml, which you can download.
Use the NiFi interface to upload the flow, and then drag it onto your workspace.


Once you’ve uploaded the template into NiFi you can instantiate it by dragging the template icon template_icon onto the screen. It will ask you to select your template’s name and the flow will appear as in the image below.


Let’s take some time to explain the role of each processor here:

1. GetFile Processor – It creates Flow Files from files in a directory. We specify the input path of the file here.

2. SplitText Processor – It splits a text file into smaller text files limited by a maximum number of lines or total size of fragment. In our case, the value of the line split count is 1, which means each split file will have 1 record.

3. ExtractText Processor – It evaluates regular expressions against the content of a Flow File, and the result is placed in an attribute that can be used by other processors in the workflow. If there is no match, the Flow File is sent to the ‘unmatched’ relationship, which is a simple way of filtering out different logs. The regular expressions used to extract the different attributes are as follows:

  • IP – (\b(?:\d{1,3}\.){3}\d{1,3}\b) – Captures anything with 4 numbers separated by dots.
    Example –

  • Time – \[(.*?)\] – Captures everything between square brackets.
    Example – 01/Aug/1995:02:56:28 -0400

  • Request_Type – “(.*?)” – Captures everything between double quotes
    Example – GET /images/NASA-logosmall.gif HTTP/1.0

  • Response_Code – HTTP\/\d\.\d” (\d{3}) – Captures the 3-digit number that follows the HTTP version in the text of the Flow File.
    Example – 200

4. RouteOnAttribute Processor – It routes Flow Files based on their attributes. Here, Flow Files are routed onward if the IP attribute starts with 1 or 2 and the Time and Request_Type attributes are not empty.

5. GeoEnrichIP Processor – This processor takes the IP attribute generated by the previous processor and looks it up in a geo-database (‘mmdb’). It adds geo information such as latitude, longitude, city, country and isocode to the Flow File’s attributes, giving them names like IP.geo.latitude,,, and IP.geo.isocode.

6. RouteOnAttribute Processor – The GeoEnrichIP processor may not be able to find all geo details for an IP, so this processor checks whether the attribute exists. It routes onward only those Flow Files that have a value for the city attribute.

7. ReplaceText Processor – It updates the content of a Flow File by evaluating a regular expression against it and replacing the matching section of the content with an alternate value. In this case, we replace the content with this format:

${IP}|${Time}|${Request_Type}|${Response_Code}|${}|${}|${IP.geo.isocode}|${IP.geo.latitude}|${IP.geo.longitude}

The ${...} expression extracts the value of the named attribute, and the pipe character (|) is used as the delimiter between the attributes.

8. MergeContent Processor – It merges a group of Flow Files into a single Flow File. The minimum number of entries to make a single file is 20 in our case; that means the processor merges 20 Flow Files into 1.

9. UpdateAttribute Processor – It updates the Flow File’s attributes using NiFi’s Attribute Expression Language. We give each FlowFile a unique name using the timestamp. The expression used here is:


10. PutHDFS Processor – It writes Flow File to the HDFS directory. The output HDFS directory used here is /tmp/nifioutput.

Step 4 – Run the Workflow

Make sure you have removed the LzoCodec and LzopCodec entries from the list of compression codecs; refer to this step in Lab 1 to do that.
Now hold SHIFT and select the entire workflow, then locate the Start button in the Operate box and click it.


You can see your workflow running.

Keep the workflow running for next 10 minutes so that we have enough data to ingest into Spark.

Step 5 – Verify NiFi Output Data in HDFS

Next, log out of the raj_ops user in Ambari and log back in using the credentials maria_dev/maria_dev. Click the 9-square menu icon and select Files View:


After about 30 seconds, you will start seeing output files arriving in the /tmp/nifioutput directory, so navigate to that directory. Click on one of the rows and then Open.


You should see content like that shown below. You can easily locate its attributes: IP address, Time, Request Type, Response Code, City, Country, Isocode, Latitude and Longitude.


Wait about 10 minutes so that more files like this are stored in the folder. In the meantime, you can open Ambari in a new tab and start Spark.

Step 6 – Turn OFF Maintenance Mode and Open Zeppelin UI

We will be using Spark version 1.6.2 in this tutorial. Go to Ambari dashboard, and follow the steps as mentioned below:


Next, click on Zeppelin Notebook and then Quick Links -> Zeppelin UI to open the Zeppelin UI on the browser:


You will see a Zeppelin UI with a bunch of already loaded notebooks. Click on Create New Note and give the name Web Server Log Analysis using Spark:


Before moving ahead, do not forget to stop the workflow in NiFi. Go back to the NiFi UI, select all components, and click the Stop button in the Operate box:


Step 7 – Logs Analysis using Spark and Zeppelin

7.1 – Loading External Library

As you explore Zeppelin you will probably want to use one or more external libraries.
We are going to use the %dep interpreter to load the library. Copy and paste the following lines into your Zeppelin notebook:


Then click the Play button next to Ready. Alternatively, press Shift+Enter.


7.2 Load the DataFrame from HDFS directory

Next, let us create a dataframe in Spark using PySpark, loading the data from the HDFS directory /tmp/nifioutput. The show() function displays the content of the dataframe:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([
# Represents a field in a StructType
  StructField("IP",             StringType()),
  StructField("Time",           StringType()),
  StructField("Request_Type",   StringType()),
  StructField("Response_Code",  StringType()),
  StructField("City",           StringType()),
  StructField("Country",        StringType()),
  StructField("Isocode",        StringType()),
  StructField("Latitude",       DoubleType()),
  StructField("Longitude",      DoubleType())
])

logs_df =\
                    .format("com.databricks.spark.csv")\
                    .option("header", "false")\
                    .option("delimiter", "|")\
                    .load("/tmp/nifioutput", schema = schema)                                        # Displays the content of the dataframe
logs_df.count()                                       # Displays the number of rows in the dataframe

In my case, there are 700 rows in the dataframe. Your count might differ.

7.3 Parse the Timestamp


from pyspark.sql.functions import udf

months = {
  'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
}

def parse_timestamp(time):
    """ This function takes a Time string parameter of the logs_df dataframe and
    returns a string suitable for passing to CAST('timestamp') in the format YYYY-MM-DD hh:mm:ss """
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
        int(time[7:11]),      # year
        months[time[3:6]],    # month
        int(time[0:2]),       # day
        int(time[12:14]),     # hour
        int(time[15:17]),     # minute
        int(time[18:20]))     # second

udf_parse_timestamp = udf(parse_timestamp)

parsed_df ='*',
                udf_parse_timestamp(logs_df['Time'])
                .alias('Timestamp')).drop('Time')   # Assigns the Timestamp name to the new column and drops the old Time column
parsed_df.cache()                                   # Stores the dataframe in cache for future use                                    # Displays the results
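As a quick sanity check of the fixed-position slicing the UDF relies on, assuming Time values in the format shown earlier:

```python
time = "01/Aug/1995:02:56:28 -0400"   # a hypothetical Time attribute value

# The same slices the UDF uses: year, month name, day, hour, minute, second
pieces = (time[7:11], time[3:6], time[0:2], time[12:14], time[15:17], time[18:20])
```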


7.4 Data Cleaning in Request_Type Column

Let us try to do some data cleaning in the Request_Type column of the parsed_df dataframe. Run the following line to find out what Request_Type looks like right now:



We will remove GET from the beginning and HTTP/1.0 from the end, using regexp_extract to extract the group identified by a Java regex from the Request_Type string column.

from pyspark.sql.functions import split, regexp_extract
path_df ='*',
                          regexp_extract('Request_Type', r'^.*\w+\s+([^\s]+)\s+HTTP.*', 1)
                          .alias('Request_Path'))                      # Extracts the resource path into a new Request_Path column
path_df.cache()                                                        # Caches the dataframe                                                          # Displays the results
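The same extraction can be checked in plain Python with the re module, using a sample request string from earlier in the tutorial:

```python
import re

request_type = "GET /images/NASA-logosmall.gif HTTP/1.0"

# Group 1 captures the resource path between the method and the HTTP version
request_path = re.match(r'^.*\w+\s+([^\s]+)\s+HTTP.*', request_type).group(1)
```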


7.5 Analysis of Most Frequent Hosts

Next, we want to know which hosts have hit the server the most times:

most_frequent_hosts = parsed_df.groupBy("IP").count()           # Groups the dataframe by the IP column and then counts                                        # Displays the results
most_frequent_hosts.registerTempTable("most_frequent_hosts")    # Registers most_frequent_hosts as a temporary table


Let us view the data from the temporary table using the %sql interpreter:

SELECT * FROM most_frequent_hosts ORDER BY count DESC LIMIT 20


You can also view this data in the form of charts: click the button which shows a pie chart. You will see something like this:


You can hover over the circle to find the actual count for each host. Zeppelin also provides other charts, such as bar, area, and scatter charts.

7.6 Analysis of Response Code

Next, we want to know how many times each response code occurred in the dataframe. We also store the result in a temporary table called status_count.

status_count = path_df.groupBy('Response_Code').count()      # Groups the dataframe by the Response_Code column and then counts                                           # Displays the results
status_count.registerTempTable("status_count")               # Registers status_count as a temporary table


Let us view the data from the temporary table:

SELECT * FROM status_count ORDER BY Response_Code


Click the bar chart button:


Next, let us analyze only those records where the Response_Code is 200.

success_logs_df ='*').filter(path_df['Response_Code'] == 200)       # Creates a dataframe where the Response Code is 200
success_logs_df.cache()                                                               # Caches the dataframe                                                                # Displays the results


So now we have all the records where the Response_Code is 200. We will count the number of successful hits per hour.

from pyspark.sql.functions import hour
success_logs_by_hours_df ='Timestamp')).alias('Hour')).groupBy('Hour').count()     # Extracts the hour and counts hits per hour
success_logs_by_hours_df.registerTempTable("success_logs_by_hours_df")                                       # Registers the temporary table                                                                        # Displays the results


Let us view the same data in the form of a table:

SELECT * FROM success_logs_by_hours_df ORDER BY Hour
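The per-hour grouping can be sketched in plain Python with hypothetical Timestamp values:

```python
from collections import Counter

# Hypothetical Timestamp values in the "YYYY-MM-DD hh:mm:ss" format produced earlier
timestamps = ["1995-08-01 02:56:28", "1995-08-01 02:59:03", "1995-08-01 13:14:15"]

# Slice out the hour field, as hour('Timestamp') does, and count hits per hour
hits_per_hour = Counter(int(t[11:13]) for t in timestamps)
```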


7.7 Data Cleansing in Request_Path

Next, let us find out the extension of the resource that was requested from or given to the server. We have to extract the extension from the Request_Path column. As before, we will make use of a regular expression:

from pyspark.sql.functions import split, regexp_extract
extension_df ='Request_Path', '(\\.[^.]+)$', 1).alias('Extension'))  # Extracts the extension from Request_Path                                                                      # Displays the results


As you can see, some data cleansing is required. Let’s start by removing the dot from the extension. We will use regexp_replace for this operation: it looks for the dot character (.) in the records and replaces it with a blank character ('').

from pyspark.sql.functions import split, regexp_replace
extension_df ='Extension', '\.', '').alias('Extension'))  # Replaces the dot character with a blank character                                                                            # Displays the results


There is still some cleansing required: as you can observe, there are some blank rows. We are going to replace those blank values with the value None.

from pyspark.sql.functions import *
extension_df = extension_df.replace('', 'None', 'Extension')   # Replaces blank values with the value 'None' in Extension
extension_df.cache()                                           # Caches the dataframe                                            # Displays the results


7.8 Analysis of Type of Extensions

Now that we have the extensions available, we can find the number of occurrences of each extension in our data set. We will use the same approach: group by the column, then count the records in each group.

from pyspark.sql.functions import *
extension_df_count = extension_df.groupBy('Extension').count()                  # Groups the dataframe by Extension and then counts the rows                                                       # Displays the results
extension_df_count.registerTempTable('extension_df_count')                      # Registers the temporary table


SELECT * FROM extension_df_count ORDER BY count DESC


EXERCISE FOR YOU – If you look closely, the gif extension appears in both uppercase and lowercase. Try replacing one of them with the other to make the data more consistent.

7.9 Analysis of Network Traffic per Location

Let us create a temporary table for the dataframe path_df so that we can visualize which locations the hits are coming from.

path_df.registerTempTable('path_df')    # Registers path_df as a temporary table

Next, create a Hive table from this dataframe so that we can use it for visualization with external tools like Tableau, Microsoft Excel, etc.

Run the following query in the new paragraph of the notebook:



EXERCISE FOR YOU – Verify whether this table is created in Hive or not.

Next, run the query given below to analyze the network traffic per country:

SELECT country,count(country) AS count from path_df GROUP BY country ORDER BY count


We can also find the network traffic for cities within just the United States. Run the following and switch over to Pie Chart:

SELECT city, count(city) AS count from path_df where country='United States' GROUP BY city ORDER BY count



In this tutorial, we learned how to use Apache NiFi for data preparation and how to convert raw server logs into readable form. Next, we learned how Apache Spark can be used for further data cleansing. We also used Apache Zeppelin interpreters, which allow any language or data-processing backend to be plugged into Zeppelin. Currently, Apache Zeppelin supports many interpreters, such as Apache Spark, Python, JDBC, Markdown and Shell. We used pyspark and sql in this tutorial.

Further Reading

  1. Regular Expression Wiki
  2. Apache NiFi User Guide
  3. PySpark SQL Module
  4. NiFi Server Logs Geo-Enrichment and Routing
  5. Spark RDDs vs Dataframes vs SparkSQL
  6. Hive on Tez vs PySpark for Web Logs Parsing