How to Analyze Server Logs with Cascading and Apache Hadoop on Hortonworks Data Platform

Taps, Pipes, Sources, Sinks, Flows == Big Data-Driven Apps

This is the second in a series of blogs exploring how to write data-driven applications in Java using the Cascading SDK. The series covers:

  1. WordCount
  2. Log Parsing

Historically, programming languages and software frameworks have evolved in a singular direction, with a singular purpose: to achieve simplicity, hide complexity, improve developer productivity, and make coding easier. And in the process, foster innovation to the degree we have seen today—and benefited from.

Is anyone among you “young” enough to admit to writing code in microcode and assembly language?

Yours truly wrote his first lines of “Hello World” in assembly language on the VAX and PDP-11, the same computer hardware (and software) that facilitated the genesis of “C” and “UNIX” at Bell Labs. Indisputably, we have come a long way from microcode to assembly, and from C to Java, which has facilitated writing high-level abstraction frameworks and enabled innovative technologies, such as J2EE web services frameworks and the Apache Hadoop and MapReduce computing frameworks, to mention a few.

Add to that long list the Cascading Java SDK for writing data-driven applications for Apache Hadoop running on the Hortonworks Data Platform (HDP). What's more, Cascading 3.0 will support Apache Tez, the high-performance parallel execution engine that serves as an alternative to MapReduce.

Data Flow and Data Pipeline

In the previous blog, I explored the genesis of the Java Cascading Framework. I argued that at the core of any data-driven application there exists a pattern of data transformation that mimics aspects of Extract, Transform, and Load (ETL) operations.

[Figure: the common Extract, Transform, and Load (ETL) pattern behind data-driven applications]

I showed how the Cascading framework embraces those common ETL operations by providing high-level logical building blocks as Java composite classes, allowing a developer to write data-driven apps without resorting to the MapReduce Java API or needing to understand the complexity of the underlying Apache Hadoop infrastructure.
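To make that concrete, here is a minimal sketch of what those building blocks look like in code: a flow that simply copies lines from a source tap to a sink tap. The snippet is illustrative only; the HDFS paths are placeholders, and it assumes the same Cascading 2.x Hadoop-mode classes used in the full example later in this post.

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CopyFlow {
    public static void main(String[] args) {
        // Source and sink taps; the paths are placeholders
        Tap inTap  = new Hfs(new TextLine(), "input/path");
        Tap outTap = new Hfs(new TextLine(), "output/path", SinkMode.REPLACE);
        // A pass-through pipe: no operations, it just moves tuples from source to sink
        Pipe copyPipe = new Pipe("copy");
        // Wire the pipe, source, and sink into a flow definition and run it
        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap)
                .setName("copy");
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}

Every Cascading application, including the log parser below, follows this same shape: define taps, assemble pipes, wire them into a FlowDef, and hand the definition to a flow connector.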

Parsing Logs with MapReduce

In this blog we examine a common use case: reading, parsing, transforming, sorting, storing, and extracting data value from a large server log. The value extracted is the list of the top ten inbound IP addresses. For this example, we’ve curtailed the server log to 160 MB. In reality, this could be weeks’ worth of server logs, with gigabytes of data.

[Figure: the log-parsing data flow, from reading and parsing each line to counting, sorting, and storing the top inbound IP addresses]

Keeping the above flow in mind, we can write a very simple Java MapReduce program without writing to the Java MapReduce API and without knowledge of the underlying Apache Hadoop complexity. For example, below is the complete source listing of the above transformation, in only a few dozen lines of code: that’s simple!

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.filter.Limit;
import cascading.operation.regex.RegexParser;
import cascading.pipe.*;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import java.util.Properties;

public class Main {
    public static void main(String[] args) {

        // Input and output paths from the command line
        String inputPath  = args[0];
        String outputPath = args[1];
        // Source and sink taps
        Tap inTap  = new Hfs(new TextLine(), inputPath);
        Tap outTap = new Hfs(new TextDelimited(true, "\t"), outputPath, SinkMode.REPLACE);
        // Parse each line of input and break it into five fields
        RegexParser parser = new RegexParser(new Fields("ip", "time", "request", "response", "size"),
                "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$", new int[]{1, 2, 3, 4, 5});
        // Create a pipe that processes one line at a time
        Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);
        // Group the stream within the pipe by the field "ip"
        processPipe = new GroupBy(processPipe, new Fields("ip"));
        // Aggregate each "ip" group using Cascading's built-in Count function
        processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);
        // After aggregating a count for each "ip", sort by the counts in descending order
        Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);
        // Keep only the top ten entries
        sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));
        // Tie the pipes together into a flow definition, with input and output taps
        FlowDef flowDef = FlowDef.flowDef()
                .addSource(processPipe, inTap)
                .addTailSink(sortedCountByIpPipe, outTap)
                .setName("DataProcessing");
        Properties properties = AppProps.appProps()
                .setName("DataProcessing")
                .buildProperties();
        Flow parsedLogFlow = new HadoopFlowConnector(properties).connect(flowDef);
        // Finally, execute the flow
        parsedLogFlow.complete();
    }
}

Code Walk

First, we create the taps (sources and sinks) using the Hfs() constructor and specify how each line should be split into five fields using RegexParser(). Second, we create a series of Pipes() to process each line, group the stream by IP with GroupBy(), count each group, sort the counts, and limit the result to the top ten IP addresses. Third, we connect the pipes, input taps, and output taps into a FlowDef and build a Flow with HadoopFlowConnector(). And finally, we execute the flow with Flow.complete().
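To see exactly what that RegexParser pattern pulls out of a line, the small stand-alone snippet below applies the same regular expression with plain java.util.regex. The log line in it is fabricated purely for illustration; only the pattern comes from the flow above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineDemo {
    public static void main(String[] args) {
        // The same pattern passed to Cascading's RegexParser in the flow above
        Pattern pattern = Pattern.compile(
            "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$");
        // A made-up Apache access-log entry, for illustration only
        String line = "10.0.0.1 - - [14/May/2014:18:05:38 -0700] "
                + "\"GET /index.html HTTP/1.1\" 200 2326";
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            System.out.println("ip       = " + m.group(1));   // 10.0.0.1
            System.out.println("time     = " + m.group(2));   // 14/May/2014:18:05:38 -0700
            System.out.println("request  = " + m.group(3));   // GET /index.html HTTP/1.1
            System.out.println("response = " + m.group(4));   // 200
            System.out.println("size     = " + m.group(5));   // 2326
        }
    }
}

Those five captured groups become the tuple fields "ip", "time", "request", "response", and "size" that flow through the pipes.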

So in four easy programming steps we created a MapReduce program without a single reference to the MapReduce API. Instead, we used only the high-level logical constructs and classes in the Cascading SDK.

Compiling the jar file and running it on the HDP 2.1 Sandbox produced the following output. Note the number of MapReduce jobs and the final submission to YARN.

[Screenshot: console output of the run on the HDP 2.1 Sandbox, showing the MapReduce jobs and the submission to YARN]
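Because the sink scheme is TextDelimited(true, "\t"), the flow also writes its result as a tab-separated file with a header row naming the grouped field and its count. The rows below are invented purely to illustrate the shape of that file, not actual results from the run:

ip	IPcount
10.0.0.1	978
10.0.0.42	611
10.0.0.7	540
...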

What Next?

To get your feet wet and whet your appetite, you can try out the Cascading Impatient Series.

For hands-on experience with the HDP 2.1 Sandbox and the above example, which is derived from Concurrent Inc.’s Developer Training (courtesy of Alexis Roos, @alexisroos), you can follow this tutorial.


