Refine and Visualize Server Log Data

Lab 1: Server Logs Analysis using Apache Nifi and Hive

Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop takes server-log analysis to the next level by speeding and improving security forensics and providing a low cost platform to show compliance.

This tutorial demonstrates how an enterprise security breach analysis and response might be performed.

In this tutorial, learn how to:

  • Stream server logs into Hadoop with Hortonworks Dataflow powered by Apache NiFi
  • Use Hive to build a relational view of the data
  • Use Elasticsearch for high-level visualization
  • Use Oozie to automatically update a firewall
  • Visualize the data with Apache Zeppelin

This Hadoop tutorial can be performed with the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorial presented here is for Sandbox HDP 2.5.

Prerequisites:

Outline

Overview

  • Understand server log data and become familiar with server log data use cases, Hortonworks DataFlow, and Apache NiFi
  • Download and configure the script which will generate our server log data
  • Install, configure, and start Hortonworks DataFlow
  • Generate the server log data.
  • Import the server log data into Excel.
  • Visualize the server log data using Excel Power View and Apache Zeppelin.

Background

Server Log Data

Server logs are computer-generated log files that capture network and server operations data. They are useful for managing network operations, especially for security and regulatory compliance.

Potential Uses of Server Log Data

IT organizations use server log analysis to answer questions about:

  • Security – For example, if we suspect a security breach, how can we use server log data to identify and repair the vulnerability?
  • Compliance – Large organizations are bound by regulations such as HIPAA and Sarbanes-Oxley. How can IT administrators prepare for system audits?

In this tutorial, we will focus on a network security use case. Specifically, we will look at how Apache Hadoop can help the administrator of a large enterprise network diagnose and respond to a distributed denial-of-service attack.

What Is Hortonworks Dataflow and Apache NiFi?

Apache NiFi is a secure, integrated platform for real-time data collection, simple event processing, and transport and delivery from source to storage. It is useful for moving distributed data to and from your Hadoop cluster. NiFi's distributed processing capabilities help reduce processing cost, deliver real-time insight from many different data sources across many large systems, and aggregate that data into a single place or into many different destinations.

NiFi lets users get the most value from their data. Specifically, NiFi allows users to:

  • Stream data from multiple sources
  • Collect high volumes of data in real time
  • Guarantee delivery of data
  • Scale horizontally across many machines

How NiFi Works. NiFi's high-level architecture is focused on delivering a streamlined interface that is easy to use and easy to set up. A little bit of terminology is integral to understanding how NiFi works.

  • Processor: Processors are what make data move in NiFi. They can generate data, run commands, move data, convert data, and much more. NiFi's architecture and feature set are designed to be extended through these processors, which are at the very core of NiFi's functionality.
  • Process Group: When data flows get very complex, it can be useful to group the parts that perform a particular function together. NiFi abstracts this concept as process groups.
  • FlowFile: A FlowFile in NiFi represents a single piece of data. It is made of two parts: attributes and content. Attributes are key-value pairs that give the data context. Three attributes are present on every FlowFile: uuid, filename, and path.
  • Connections and Relationships: NiFi lets users drag and drop connections between processors, which control how the data flows. Each connection is assigned to one or more relationships for the FlowFiles it carries (such as successful processing, or a failure to process).

A FlowFile can originate from a processor in NiFi. Processors can also receive FlowFiles and transmit them to many other processors, which can then deposit the data in the FlowFile into various places depending on each processor's function.
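To make the FlowFile model concrete, here is a minimal conceptual sketch in Python. This is not NiFi's API, just an illustration of the attributes-plus-content structure described above; the filename and path values are hypothetical.

import uuid

# Conceptual illustration only: a FlowFile pairs a content payload with key-value attributes.
# Real FlowFiles are created and managed by the NiFi framework, not by user code like this.
flowfile = {
    "attributes": {
        "uuid": str(uuid.uuid4()),     # every FlowFile carries a unique identifier
        "filename": "logsample.txt",   # hypothetical example value
        "path": "/tmp/nifi/input/",    # hypothetical example value
    },
    "content": b'130.104.51.7 - - [01/Aug/1995:02:56:28 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" ...',
}

# Processors read and write attributes to give the content context as it moves through the flow.
print(flowfile["attributes"]["uuid"], "-", len(flowfile["content"]), "bytes of content")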

Notes:

  • In this tutorial, the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM).
  • Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
  • In this tutorial, we will use the Power View feature in Excel 2013 to visualize the server log data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
  • We’re going to install Hortonworks DataFlow (HDF) on the Sandbox, so you’ll need to download the latest HDF release

Step 1 – Configure and Install Hortonworks DataFlow

1.1 – Install NiFi

NiFi will be installed into the Ambari stack of the Hortonworks Sandbox VirtualBox image because it will be used to run the server log simulator and transport data to HDFS.

1. If you do not have NiFi installed on your sandbox, refer to Step 2: Download and Install NiFi on Hortonworks Sandbox (Option 1) from Tutorial 0: Download, Install, and Start NiFi of
Learning the Ropes of Apache NiFi for step-by-step instructions.

1.2 – Start NiFi

1. To activate the NiFi service, refer to Step 4: Start NiFi on Sandbox from Tutorial 0: Download, Install, and Start NiFi of
Learning the Ropes of Apache NiFi for step-by-step instructions.

Once you enter the NiFi HTML Interface at http://localhost:9090/nifi, you should see a canvas as below:

nifi-html-interface


Step 2 – Import the Flow

We're going to import a pre-made data flow from a template, which you can download here: ServerLogGeneratorSecurity.xml.

Use the NiFi interface to upload the flow, and then drag it onto your workspace.

Upload NiFi Template

Once you’ve uploaded the template into NiFi you can instantiate it by dragging the template icon template_icon onto the screen. It will ask you to select your template’s name and the flow will appear as in the image below.

Instantiate NiFi Template


Step 3 – Generate the Server Log Data

Now that you've imported the data flow and everything is set up, simply click the Run button at the top of the screen. (Make sure you haven't selected a specific processor, or else only that processor will start.)

start-flow

If you receive an error message stating “LzoCodec not found” and “LzopCodec not found” similar to the image below,

error_puthdfs_troubleshoot

Log in to the Ambari Dashboard at http://localhost:8080 as the raj_ops user (credentials raj_ops/raj_ops). Once inside Ambari, click HDFS in the left-hand sidebar, then click the Configs tab. Next to the Settings tab, click the Advanced tab to open the advanced configs. Scroll down to the Advanced core-site section; we need to change the value of the io.compression.codecs property.

advanced_core_site_change_config

When you find LzoCodec and LzopCodec within the io.compression.codecs property, the value will look like this:

org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

Remove com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec from the config values.

You should see the updated content in the io.compression.codecs value field as below:

org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec

Save the configuration change. Ambari will notify us that we need to restart HDFS: scroll to the top of the HDFS Configs page, find the orange Restart button, and select Restart All Affected.

restart_all_affected

Now that the configuration change is done, navigate back to the dataflow at http://localhost:9090/nifi and rerun it; the errors should be resolved.

start-flow

Now that everything is running, we can check the places where the data is being deposited in HDFS. Since we will be verifying that the NiFi DataFlow performs as expected, log out of the raj_ops user, then log in to the Ambari Dashboard at http://localhost:8080 as the maria_dev user. The login credentials are maria_dev/maria_dev.

3.1 Verify NiFi Populates HDFS with Data Files

Open up the Files View, and then navigate to /tmp/server-logs/. Files should start appearing a few seconds after you start the flow. You can click on them to view the content.

Explore Output

  • Next we will create a Hive table from the log files.

3.2 Format the Data with Hive

Open the Ambari UI and head to the views dropdown list. Select Hive and then paste the following query.

CREATE TABLE FIREWALL_LOGS(
  time STRING,
  ip STRING,
  country STRING,
  status INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tmp/server-logs';

Next, we will define an ORC table in Hive (using the Apache ORC file format) and populate it from the table we just created:

CREATE TABLE FIREWALL STORED AS ORC AS SELECT * FROM FIREWALL_LOGS;

Note: if the query doesn't run successfully due to a permissions error, you might need to update the permissions on the directory. Run the following command over SSH on the Sandbox:

sudo -u hdfs hadoop fs -chmod -R 777 /tmp

Once the table has been created, you should be able to query it with a query like:

SELECT time, ip, country, cast(status as BOOLEAN) FROM firewall_logs LIMIT 100;

test-query-results

Note: In the table above, 1 is for true, 0 is for false.


Step 4 – Visualize Server Log Data with Apache Zeppelin

Why is data visualization important?
It helps you analyze network data and determine effective responses to network issues.

Go to your browser and navigate to sandbox.hortonworks.com:9995 to open the Zeppelin UI.

You should be greeted by the following screen where you can choose to view notes, or create a new one.

zeppelin_create_note

4.1 Import notebook NiFiServerLogs (Option 1)

You can choose to import the note from this tutorial. Click Import note; an Import new note window will appear. Select Add from URL. The window will show two fields: Import As and URL. Copy and paste the following URL into the URL field:

https://raw.githubusercontent.com/hortonworks/tutorials/hdp-2.5/data/zeppelin-notes/NiFiServerLogs.json

Then click the Import note button.

4.2 Create notebook NiFiServerLogs (Option 2)

If you prefer to go through the steps to create the notebook yourself, follow the steps below. Click Create new note on the Zeppelin welcome page, enter NiFiServerLogs as the notebook name, and create the notebook. Then click the new notebook in the list of notebooks to open it. Our notebook will have three sections in which we analyze the data for a denial-of-service attack:

  • review network traffic by country
  • zoom in one particular country
  • generate list of attacking IP addresses

Find Network Traffic Per Country

Let's filter the data in our Hive table so that it only shows country and status. Copy and paste the query into Zeppelin:

%jdbc(hive)
SELECT country, cast(status as BOOLEAN) FROM firewall_logs

This generates a table in Zeppelin that shows each country along with whether the network connection status was a success or not. Let's visualize this table as a bar graph: select the option that looks like a bar graph. Now let's set up the Advanced settings: Keys, Groups, and Values. You should have two fields: country and status. Drag the country field into the Keys box, then drag the status field into the Groups box. To sum up the status values for true and false, drag another status field into the Values box. Now each country has a true or false association, and when you hover over a piece of the bar graph, you should see the summed value for status: false or true.

network_traffic_per_country

Legend:
Blue represents successful, authorized network connections
Light Blue represents connections from unauthorized sources

Zoom into One Particular Country

Let's assume the recent denial-of-service attacks originated in India. We will further filter the data by selecting the country with value IND. Copy and paste the query into the next available paragraph:

%jdbc(hive)
SELECT country, cast(status as BOOLEAN) FROM firewall_logs WHERE country = 'IND'

We will use Zeppelin's data visualization features to zoom in on the status errors, i.e. the unauthorized network connections, for India. Under the query there are six visualization choices; choose the second from the left, a bar graph.

Remember, we want to know the total number of unauthorized network connections, so drag the status field into the Values box. Since there are multiple values for status, true or false, drag another status field into the Groups box. The key associated with both values is IND, so drag the country field into the Keys box. Refer to the image below:

network_traffic_by_one_country

Legend:
Blue represents successful, authorized network connections
Light Blue represents connections from unauthorized sources

Generate List of Attacking IP Addresses

Let's generate a record of unauthorized IP addresses. We will query the table for time, ip, country, and status. Copy and paste the following query into Zeppelin:

%jdbc(hive)
SELECT time, ip, country, cast(status as BOOLEAN) FROM firewall_logs WHERE status = '0'

This generates a list of IP address attacks per country. A status of 0 signifies a connection from an unauthorized source, so we can update the network firewall to deny requests from those attacking IP addresses.
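To illustrate how such a result could feed an automated firewall update (the Lab 1 outline mentions using Oozie for this step), here is a minimal, hypothetical Python sketch that turns a list of attacking IPs into iptables deny rules. The IP values, rule format, and execution policy are assumptions for illustration, not part of the tutorial.

# Hypothetical sketch: build iptables DROP rules for unauthorized source IPs
# (placeholder addresses from the documentation range, not real query output).
attacking_ips = ["198.51.100.7", "198.51.100.23"]

def deny_rules(ips):
    """Return one iptables command per attacking IP address."""
    return ["iptables -A INPUT -s {0} -j DROP".format(ip) for ip in ips]

for rule in deny_rules(attacking_ips):
    print(rule)   # in practice these would be applied by an automation tool such as Oozie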

Let’s visualize the IP Address Attacks in a Pie Chart.

Drag the country field into the Keys box and the ip field into the Values box. Make sure the count aggregation is selected for ip.

generate_list_unauthorized_ip_addresses

Our visualization counts the number of unauthorized IP Addresses per country.
As you can see, each slice of the pie represents a country, and when you hover over a slice, you can see the total count of unauthorized IP addresses for that country.

Summary

We've shown how the Hortonworks Data Platform can help system administrators capture, store, and analyze server log data. With real-time access to massive amounts of data on the Hortonworks Data Platform, we were able to block unauthorized access and restore VPN access to authorized users.

With log data flowing continuously into the Hortonworks Data Platform “data lake,” we can protect the company network from similar attacks in the future. The data can be refreshed frequently and accessed to respond to security threats, or to prepare for compliance audits.

Further Reading

Lab 2: Server Logs Analysis using Apache Nifi, Spark and Zeppelin

Introduction

In the previous lab, you learned how to use NiFi, Hive, and Zeppelin to analyze server logs. In this lab, you will explore Apache Spark to analyze the server logs. Using Spark, we can get more value out of log data that comes from web servers, files, and even user-generated sources.

In this tutorial, learn how to:

  • Stream server logs into Hadoop and prepare the data with Hortonworks Dataflow powered by Apache NiFi
  • Use PySpark to clean the data
  • Visualize the data with Apache Zeppelin
  • Use Tableau to visualize the geolocation data on a world map

Prerequisites:

  • Hortonworks Sandbox (installed and running)
  • Hortonworks DataFlow – Download here
  • If you’d like to use Tableau to explore the data, download a free trial version
  • Server log tutorial files (included in this tutorial)
  • Lab 1 of this tutorial

Outline

Dataset

Download the dataset here.

The dataset we are going to use in this lab is NASA-HTTP. It contains HTTP requests to the NASA Kennedy Space Center WWW server in Florida.
The logs are an ASCII file with one line per request, with the following columns (a minimal parsing sketch follows the list):

1. host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
2. timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
3. request given in quotes.
4. HTTP reply code.
5. bytes in the reply.
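As a quick illustration of this record format, here is a minimal Python sketch that parses one line of this shape with a regular expression. The example line and the regex are assumptions that mirror the column description above; NiFi does its own extraction later in this lab.

import re

# One example line in the format described above (illustrative only).
line = 'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0'

# host, timestamp in brackets, quoted request, HTTP reply code, bytes in the reply
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
match = re.match(pattern, line)
if match:
    host, timestamp, request, status, size = match.groups()
    print(host, timestamp, request, status, size)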

Step 1 – Configure and Install Hortonworks DataFlow

1.1 – Install NiFi

NiFi needs to be installed into the Ambari stack of the Hortonworks Sandbox VirtualBox image because it will be used to run the server log simulator and transport data to HDFS.
If you do not have NiFi installed on your sandbox, refer to Step 2: Download and Install NiFi on Hortonworks Sandbox (Option 1) from Tutorial 0: Download, Install, and Start NiFi of Analyze Traffic Patterns using Apache Nifi for step-by-step instructions.

1.2 – Start NiFi

To activate the NiFi service, refer to Step 4: Start NiFi on Sandbox from Tutorial 0: Download, Install, and Start NiFi of Analyze Traffic Patterns using Apache Nifi for step-by-step instructions.
Once you enter the NiFi HTML Interface at http://sandbox.hortonworks.com:9090/nifi, you should see a canvas as below:

nifi_html_interface

Step 2 – Download Input Data

First, you need to log in to your Sandbox via SSH. If you're using VirtualBox, you can log in with the command:

ssh root@127.0.0.1 -p 2222

The default password for the first login is: hadoop

sshTerminal

Now we have to create the directories from which NiFi will pick up the data. There are two: one for the sample input dataset and one for the GeoLite database that NiFi uses to provide geolocation attributes.

mkdir -p /tmp/nifi/input
mkdir /tmp/nifi/GeoFile
chmod -R 777 /tmp/nifi

mkdir

Download files using wget utility:

cd /tmp/nifi/input
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp-2.5/tutorials/hortonworks/analyze-server-log-data/logsample.txt

download_logsample

cd /tmp/nifi/GeoFile
wget https://github.com/hortonworks/tutorials/blob/hdp-2.5/tutorials/hortonworks/analyze-server-log-data/GeoLite2-City.mmdb?raw=true
mv GeoLite2-City.mmdb?raw=true GeoLite2-City.mmdb

download_geofile

Step 3 – Import the Workflow

We're going to import a data flow from a template, which you can download here: WebServerLogs.xml.
Use the NiFi interface to upload the flow, and then drag it onto your workspace.

upload_template

Once you’ve uploaded the template into NiFi you can instantiate it by dragging the template icon template_icon onto the screen. It will ask you to select your template’s name and the flow will appear as in the image below.

nifi_workflow

Let's take some time to explain the role of each processor:

1. GetFile Processor – It creates FlowFiles from files in a directory. We specify the input path of the files here.

2. SplitText Processor – It splits a text file into smaller text files limited by a maximum number of lines or total fragment size. In our case, the line split count is 1, which means each split file contains one record.

3. ExtractText Processor – It evaluates regular expressions against the content of a FlowFile, and the results are placed in attributes that can be used by other processors in the workflow. If there is no match, the FlowFile is routed to the 'unmatched' relationship, which is a simple way of filtering out different logs. The regular expressions used to extract the different attributes are as follows (a small regex test sketch appears after this list):

  • IP – (\b(?:\d{1,3}\.){3}\d{1,3}\b) – Captures anything that has four numbers separated by dots.
    Example – 130.104.51.7
  • Time – \[(.*?)\] – Captures everything between square brackets.
    Example – 01/Aug/1995:02:56:28 -0400
  • Request_Type – "(.*?)" – Captures everything between double quotes.
    Example – GET /images/NASA-logosmall.gif HTTP/1.0
  • Response_Code – HTTP\/\d\.\d" (\d{3}) – Captures the three-digit number that follows the HTTP version string (e.g. HTTP/1.0") in the FlowFile text.
    Example – 200

4. RouteOnAttribute Processor – It routes FlowFiles based on their attributes. Here we route FlowFiles only if the IP attribute starts with 1 or 2 and the Time and Request_Type attributes are not empty.

5. GeoEnrichIP Processor – This processor takes the IP attribute generated by the previous processor and looks it up in a geo-database ('mmdb'). It adds geo information such as latitude, longitude, city, country, and ISO code to the FlowFile's attributes, naming them IP.geo.latitude, IP.geo.longitude, IP.geo.city, IP.geo.country, and IP.geo.country.isocode.

6. RouteOnAttribute Processor – The GeoEnrichIP processor might not find all geo details for every IP, so this processor checks whether IP.geo.city exists. It routes onward only those FlowFiles that have a value for the city attribute.

7. ReplaceText Processor – It updates the content of a FlowFile by evaluating a regular expression against it and replacing the matching section of the content with an alternate value. In this case, we replace the content with this format:

${IP}|${Time}|${Request_Type}|${Response_Code}|${IP.geo.city}|${IP.geo.country}|${IP.geo.country.isocode}|${IP.geo.latitude}|${IP.geo.longitude}

The $ character extracts the value of an attribute, and | is used as a delimiter between these attributes.

8. MergeContent Processor – It merges a group of FlowFiles into a single FlowFile. The minimum number of entries to make a single file is 20 in our case, which means the processor merges 20 FlowFiles into 1.

9. UpdateAttribute Processor – It updates the FlowFile's attributes using NiFi's Attribute Expression Language. We give each FlowFile a unique name using the timestamp. The expression used here is:

logsample-${now():format("HHmmssSSS")}-.txt

10. PutHDFS Processor – It writes FlowFiles to an HDFS directory. The output HDFS directory used here is /tmp/nifioutput.
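To sanity-check the extraction patterns from the ExtractText processor above, here is a minimal Python sketch that applies equivalent regexes to one example record. It only approximates what the processor does (NiFi evaluates Java regexes against FlowFile content), and the trailing byte count in the record is a placeholder value.

import re

# Example record assembled from the examples above; the final byte count (786) is a placeholder.
record = '130.104.51.7 - - [01/Aug/1995:02:56:28 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786'

patterns = {
    "IP":            r'(\b(?:\d{1,3}\.){3}\d{1,3}\b)',
    "Time":          r'\[(.*?)\]',
    "Request_Type":  r'"(.*?)"',
    "Response_Code": r'HTTP\/\d\.\d" (\d{3})',
}

for name, pattern in patterns.items():
    match = re.search(pattern, record)
    print(name, "->", match.group(1) if match else "unmatched")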

Step 4 – Run the Workflow

Make sure you have removed LzoCodec and LzopCodec from the list of compression codecs; refer to this step in Lab 1 if you haven't.
Now hold SHIFT and select the entire workflow, then locate the Start button in the Operate box and click it.

start_all

You can see your workflow running.

Keep the workflow running for the next 10 minutes so that we have enough data to ingest into Spark.

Step 5 – Verify NiFi Output Data in HDFS

Next, log out of the raj_ops user in Ambari and log back in using the credentials maria_dev/maria_dev. Click the nine-square menu and select Files View:

select_files_view

After about 30 seconds, you will start to see output files arriving in the /tmp/nifioutput directory, so navigate there. Click on one of the rows and then Open.

open_file

You should see content like the example below. You can easily locate its attributes: IP address, Time, Request Type, Response Code, City, Country, Isocode, Latitude, and Longitude.

file_preview

Wait about 10 minutes for more files like this to accumulate in the folder. In the meantime, you can reopen Ambari in a new tab and start Spark.

Step 6 – Turn OFF Maintenance Mode and Open Zeppelin UI

We will be using Spark version 1.6.2 in this tutorial. Go to the Ambari dashboard and follow the steps shown below:

turn_off_maintenance_mode

Next, click on Zeppelin Notebook and then Quick Links -> Zeppelin UI to open the Zeppelin UI on the browser:

quick_links_zeppelin

You will see the Zeppelin UI with a number of preloaded notebooks. Click Create New Note and name the new note Web Server Log Analysis using Spark:

create_new_note

Before moving ahead, do not forget to stop the workflow in NiFi. Go back to the NiFi UI, select all components, and click the Stop button in the Operate box:

stop_all

Step 7 – Logs Analysis using Spark and Zeppelin

7.1 – Loading External Library

As you explore Zeppelin, you will probably want to use one or more external libraries.
We are going to use the %dep interpreter to import the library. Copy and paste the following lines into your Zeppelin notebook:

%dep
z.reset()
z.load("com.databricks:spark-csv_2.11:1.4.0")

Then click the Play button next to Ready. Alternatively, you can press Shift+Enter.

dep

7.2 Load the DataFrame from HDFS directory

Next, let us create a dataframe in Spark using PySpark. We use sqlContext.read.format() to load the dataframe from the HDFS directory /tmp/nifioutput; the show() function displays the contents of the dataframe:

%pyspark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([       
# Represents a field in a StructType                    
  StructField("IP",             StringType()),  
  StructField("Time",           StringType()),
  StructField("Request_Type",   StringType()),
  StructField("Response_Code",  StringType()),
  StructField("City",           StringType()),
  StructField("Country",        StringType()),
  StructField("Isocode",        StringType()),
  StructField("Latitude",       DoubleType()),
  StructField("Longitude",      DoubleType())
])

logs_df = sqlContext.read\
                    .format("com.databricks.spark.csv")\
                    .schema(schema)\
                    .option("header", "false")\
                    .option("delimiter", "|")\
                    .load("/tmp/nifioutput")
logs_df.show(truncate=False)

load_dataframe

In my case, there are 700 rows in the dataframe. Your count might differ.
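If you want to check the row count in your own run, a simple count() in a new paragraph will do (a sketch; your number depends on how long the NiFi flow was left running):

%pyspark
print(logs_df.count())    # number of rows loaded from /tmp/nifioutput; varies per run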

7.3 Parse the Timestamp

%pyspark

from pyspark.sql.functions import udf

months = {
  'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8,  'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12
}

def parse_timestamp(time):
    """ This function takes a Time string parameter of logs_df dataframe
    Returns a string suitable for passing to CAST('timestamp') in the format YYYY-MM-DD hh:mm:ss
    """
    return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
      int(time[7:11]),
      months[time[3:6]],
      int(time[0:2]),
      int(time[12:14]),
      int(time[15:17]),
      int(time[18:20])
    )

udf_parse_timestamp = udf(parse_timestamp)

parsed_df = logs_df.select('*',             
                udf_parse_timestamp(logs_df['Time'])  
                .cast('timestamp')              
                .alias('Timestamp')).drop('Time')   # Assigning the Timestamp name to the new column and dropping the old Time column                       
parsed_df.cache()                                   # Stores the dataframe in cache for the future use
parsed_df.show()                                    # Displays the results

parse_timestamp

7.4 Data Cleaning in Request_Type Column

Let us do some data cleaning on the Request_Type column of the parsed_df dataframe. Run the following line to see what Request_Type looks like right now:

%pyspark
parsed_df.select('Request_type').show(truncate=False)

show_request_type

We will remove GET from the beginning and HTTP/1.0 from the end. We use regexp_extract to pull out the path, identified by a Java regex capture group, from the Request_Type string column.

%pyspark
from pyspark.sql.functions import split, regexp_extract
path_df = parsed_df.select('*',regexp_extract('Request_Type', r'^.*\w+\s+([^\s]+)\s+HTTP.*', 1)
                   .alias('Request_Path')).drop('Request_Type')                                                                       
path_df.cache()                                                        # Cache the dataframe
path_df.show(truncate=False)                                           # Displays the results

show_pathdf

7.5 Analysis of Most Frequent Hosts

Next, we want to know which hosts have hit the server the most times:

%pyspark
most_frequent_hosts = parsed_df.groupBy("IP").count()           # Groups the dataframe by IP column and then counting
most_frequent_hosts.show()		                                # Displays the results
most_frequent_hosts.registerTempTable("most_frequent_hosts")    # Registering most_frequent_hosts as a temporary table

show_most_frequent_hosts

Let us view the data from the temporary table using the %sql interpreter:

%sql
SELECT * FROM most_frequent_hosts ORDER BY count DESC LIMIT 20

select_most_frequent_hosts

You can also view this data in the form of charts, click the button which shows pie chart. You will see something like this:

piechart_most_frequent_hosts

You can hover over the circle to find the actual count for each host. Zeppelin also provides other charts, such as bar, area, and scatter charts.

7.6 Analysis of Response Code

Next, we want to know how many times each response code occurs in the dataframe. We also store the result in a temporary table called status_count.

%pyspark
status_count = path_df.groupBy('Response_Code').count()      # Groups the dataframe by Response_Code column and then counting
status_count.show()                                          # Displays the results
status_count.registerTempTable("status_count")               # Registering status_count variable as a temporary table

show_statuscount

Let us view the data from the temporary table:

%sql
SELECT * FROM status_count ORDER BY Response_Code

select_status_count

Click the bar chart button:

barchart_status_count

Next, let us analyze only those records where the Response_Code is 200.

%pyspark
success_logs_df = parsed_df.select('*').filter(parsed_df['Response_Code'] == 200)     # Creating dataframe where Response Code is 200
success_logs_df.cache()                                                               # Cache the dataframe
success_logs_df.show()                                                                # Displays the results

show_success_logs_df

So now we have all the records where the Response_Code is 200. We will count the number of successful hits per hour.

%pyspark
from pyspark.sql.functions import hour
success_logs_by_hours_df = success_logs_df.select(hour('Timestamp').alias('Hour')).groupBy('Hour').count()      # Extracting the Hour
success_logs_by_hours_df.show()                                                                                 # Displays the results
success_logs_by_hours_df.registerTempTable("success_logs_by_hours_df")                                           

show_success_logs_by_hours_df

Let us view the same data in the form of a table:

%sql
SELECT * FROM success_logs_by_hours_df ORDER BY Hour

select_success_logs_by_hour_df

7.7 Data Cleansing in Request_Path

Next, let us find the extension of each resource requested from the server. We have to extract the extension from the Request_Path column. As before, we will use a regular expression:

%pyspark
from pyspark.sql.functions import split, regexp_extract
extension_df = path_df.select(regexp_extract('Request_Path','(\\.[^.]+)$',1).alias('Extension'))
extension_df.show(truncate=False)

show_extension_df

As you can see, some data cleansing is required. Let's start by removing the dot from the extension. We will use regexp_replace for this operation: it looks for the dot character (.) in the records and replaces it with an empty string ('').

%pyspark
from pyspark.sql.functions import split, regexp_replace
extension_df = extension_df.select(regexp_replace('Extension','\.','').alias('Extension'))  # Replace the dot character with the blank character
extension_df.show(truncate=False)                                                           # Displays the results

show_extension_df1

There is still some cleansing required. As you can see, there are some blank rows; we are going to replace those blank values with the value None.

%pyspark
from pyspark.sql.functions import *
extension_df = extension_df.replace('','None','Extension').alias('Extension')   # Replaces the blank value with the value 'None' in Extension
extension_df.cache()
extension_df.show(truncate=False)                                               # Shows the results

show_extension_df2

7.8 Analysis of Type of Extensions

Now that we have the extensions available, we can find out how many of each extension appear in our dataset. We will use the same approach of grouping by the column and then counting the records in each group.

%pyspark
from pyspark.sql.functions import *
extension_df_count = extension_df.groupBy('Extension').count()                  # Groups the dataframe by Extension and then count the rows
extension_df_count.show()                                                       # Displays the results
extension_df_count.registerTempTable('extension_df_count')                      # Registers the temporary table

show_extension_df_count

%sql
SELECT * FROM extension_df_count ORDER BY count DESC

select_extension_df_count

EXERCISE FOR YOU – If you look closely, the gif extension appears in both uppercase and lowercase. Try to normalize one of them to make the data more accurate.
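One possible approach for this exercise (a sketch, not the only solution) is to lowercase the whole column with PySpark's lower() function before grouping:

%pyspark
from pyspark.sql.functions import lower
# Normalize case so 'GIF' and 'gif' are counted as one extension
extension_df_normalized = extension_df.select(lower(extension_df['Extension']).alias('Extension'))
extension_df_normalized.groupBy('Extension').count().show()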

7.9 Analysis of Network Traffic per Location

Let us register a temporary table for the dataframe path_df so that we can visualize which locations the hits are coming from.

%pyspark
path_df.registerTempTable("path_df")

register_path_df

Next, create a Hive table from this dataframe so that we can use it for visualization with external tools like Tableau or Microsoft Excel.

Run the following query in the new paragraph of the notebook:

%sql
CREATE TABLE path_df AS SELECT * FROM path_df

create_table_path_df

EXERCISE FOR YOU – Verify whether this table was created in Hive.
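One quick way to check from the notebook (a sketch; sqlContext here is the Zeppelin-provided HiveContext) is to list the tables visible to Spark:

%pyspark
# path_df should appear in this list if the CREATE TABLE statement succeeded
print(sqlContext.tableNames())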

Next, run the query given below to analyze the network traffic per each country:

%sql
SELECT country,count(country) AS count from path_df GROUP BY country ORDER BY count

select_country

We can also find the network traffic for cities within just the United States; run the following and switch over to the pie chart:

%sql
SELECT city, count(city) AS count from path_df where country='United States' GROUP BY city ORDER BY count

select_city

Summary

In this tutorial, we learned how to use Apache NiFi for data preparation and how to convert raw server logs into a readable form. Next, we learned how Apache Spark can be used for further data cleansing. We also used Apache Zeppelin interpreters, which allow any language or data-processing backend to be plugged in. Currently Apache Zeppelin supports many interpreters, such as Apache Spark, Python, JDBC, Markdown, and Shell. We used pyspark and sql in this tutorial.

Further Reading

1. Regular Expression Wiki
2. Apache Nifi User Guide
3. PySpark SQL Module

HCC Articles

1. Nifi Server Logs Geo-Enrichment and Routing
2. Spark RDDs vs Dataframes vs SparkSQL
3. Hive on Tez vs PySpark for Web Logs Parsing