Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

Sign up for the Developers Newsletter

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.


Get Started


Ready to Get Started?

Download sandbox

How can we help you?

closeClose button
November 20, 2012
prev slideNext slide

Big Data Security Part Two: Introduction to PacketPig


Packetpig is the tool behind Packetloop. In Part One of the Introduction to Packetpig I discussed the background and motivation behind the Packetpig project and problems Big Data Security Analytics can solve. In this post I want to focus on the code and teach you how to use our building blocks to start writing your own jobs.

The ‘building blocks’ are the Packetpig custom loaders that allow you to access specific information in packet captures. There are a number of them but two I will focus in this post are;

  • Packetloader() allows you to access protocol information (Layer-3 and Layer-4) from packet captures.
  • SnortLoader() inspects traffic using Snort Intrusion Detection software.

Calculating Bandwidth and Binning Time

The Packetloader() provides access to IP, TCP and UDP headers for each packet in the capture. A great example of it’s use is the ‘binning.pig‘ script. This script allows you to calculate the bandwidth used by TCP and UDP packets as well as total bandwidth at any period you define. You might want to calculate these totals every minute, hour, day, week or month to produce a graph.

Firstly run the binning script using the following command.

./ -x local -r data/web.pcap -f pig/examples/binning.pig

Then open up output/binning/part-r-00000 in a text editor to see the output.

Now let’s walk through the script. Firstly let’s include all the jar’s required for Packetpig and binning.pig to run;

%DEFAULT includepath pig/include.pig
RUN $includepath;
Then the amount of time you want to bin your values into. In this case I want to output the values every minute (60 seconds) but I could easily change this to an hour (3600 seconds) by commenting and uncommenting the following lines;
%DEFAULT time 60
--%DEFAULT time 3600
Then I load the data out of the packet captures into quite a large schema using the Packetloader();
packets = load '$pcap' using com.packetloop.packetpig.loaders.pcap.packet.PacketLoader() AS (

This is a very rich data model and through leveraging the timestamp (ts), size of the IP packet (ip_total_length), and size of the TCP (tcp_len) and UDP (udp_len) we can calculate total and respective bandwidths at any interval.  The beauty of pig is that I could easily hone in on specific hosts by grouping on the Source IP, Destination IP and Destination Port – but let’s keep things simple in this post.

The ip_proto field allows be to filter all packets based on protocol. TCP is IP protocol 6 and UDP is IP protocol 17.

tcp = FILTER packets BY ip_proto == 6;
udp = FILTER packets BY ip_proto == 17;

Once filtered we can bin each packet into a time period and then project a summary of the data with the size of all TCP packets in that time period (bin) summed.

tcp_grouped = GROUP tcp BY (ts / $time * $time);
tcp_summary = FOREACH tcp_grouped GENERATE group, SUM(tcp.tcp_len) AS tcp_len;

And then the same for UDP.

udp_grouped = GROUP udp BY (ts / $time * $time);
udp_summary = FOREACH udp_grouped GENERATE group, SUM(udp.udp_len) AS udp_len;

To get calculate total bandwidth of all IP packets we bin all packets using the same time period and then sum ip_total_length.

bw_grouped = GROUP packets BY (ts / $time * $time);
bw_summary = FOREACH bw_grouped GENERATE group, SUM(packets.ip_total_length) AS bw;
The output we were looking for is basically comma separated values for timestamp, tcp bandwidth, udp bandwidth and total bandwidth. This is produced by a final join and projection.
joined = JOIN tcp_summary BY group, udp_summary BY group, bw_summary BY group;
summary = FOREACH joined GENERATE tcp_summary::group, tcp_len, udp_len, bw;

It may seem a little cryptic but basically the JOIN statement is joining using the group that all the summaries share which is the time period. If you ILLUSTRATE the joined variable you will see the data is there but not in the format we are looking for.

| joined | tcp_summary::group:int | tcp_summary::tcp_len:long | udp_summary::group:int | udp_summary::udp_len:long | bw_summary::group:int | bw_summary::bw:long |
| | 1322644980 | 2080 | 1322644980 | 81 | 1322644980 | 2305 |

However the summary projection generates the output the way we want it and we store that in a CSV format using PigStorage(‘,’).

STORE summary INTO '$output/binning' USING PigStorage(',');

Threat Detection

The SnortLoader() can be used to replay all conversations through Snort IDS and output attacks that it finds. The SnortLoader() can also take a snort.conf as a parameter so you can scan packet captures with specific Snort versions.

Run the basic snort.pig script to get an idea of the output.

./ -x local -r data/web.pcap -f pig/examples/snort.pig

Now let’s run through the snort.pig script. Again we include all the jar’s we need for Packetpig.

%DEFAULT includepath pig/include.pig
RUN $includepath;

The script is constructed so that you can pass parameters to either scan all traffic for attacks or zero in on specific source and destination IP addresses. By leaving most of these null we inspect all traffic. Also note we are again binning time every 60 seconds. Lastly the Packetpig includes a number of versions of Snort. The default snort.conf we include ensures you use the latest one.

%DEFAULT time 60
%DEFAULT src null
%DEFAULT dst null
%DEFAULT sport null
%DEFAULT dport null
%DEFAULT snortconfig 'lib/snort/etc/snort.conf'

The SnortLoader() receives the snortconfig paramter and starts inspection the packet capture for attacks and provides them back to you in defined schema.

snort_alerts =
  LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig')
  AS (

Using this schema you can access the timestamp (ts), Snort Signature ID (sig), Severity/Priority (priority), Description of the attack (message) and the Source (src), Source Port (sport), Destination (dst) and Destination port (dport) of the attack.

If you ran the script and opened up output/snort/part-m-00000 you will see a number of attacks matching the schema output of the SnortLoader(). One thing to note is Snort using Priority 1 for the highest severity, Priority 2 for the next highest etc.

1322645240 120_3 3 (http_inspect) NO CONTENT-LENGTH OR TRANSFER-ENCODING IN HTTP RESPONSE TCP 80 34299
1322645387 139_1 2 (spp_sdf) SDF Combination Alert DIVERT 0 0
1322645603 120_3 3 (http_inspect) NO CONTENT-LENGTH OR TRANSFER-ENCODING IN HTTP RESPONSE TCP 80 41791
1322645907 120_3 3 (http_inspect) NO CONTENT-LENGTH OR TRANSFER-ENCODING IN HTTP RESPONSE TCP 80 54222
1322645689 120_3 3 (http_inspect) NO CONTENT-LENGTH OR TRANSFER-ENCODING IN HTTP RESPONSE TCP 80 42514
1322645739 138_5 2 SENSITIVE-DATA Email Addresses TCP 80 42514

The snort.pig script is our most basic example but hopefully you are already thinking about what you could filter on (e.g. Severity) as well as re projecting the data you access out of SnortLoader() to find the top ten attackers and top ten victims.

In my next post I will show you how to find Zero Day attacks in past network packet captures.



Antonio Barbuzzi says:

I’m wondering how does InputSplits are generated, since pcap format is not splittable. Is there only one single mapper?

Michael Baker says:

Hi Antonio. Currently we don’t split the packet captures but try and keep them at relatively consistent sizes by rolling them every 100Mb or so. Each pcap then gets allocated to a mapper.

Antonio Barbuzzi says:

Very smart solution, I prefer it over

pn says:
Your comment is awaiting moderation.

“Currently we don’t split the packet captures but try and keep them at relatively consistent sizes by rolling them every 100Mb or so…” But I already have my pcap files which I want to analyze, and some of them are over 1G in size. Will packetpig’s approach be still effective for me? How do I handle my large pcaps?

Also, packetpig does support creation of standard 5-tuple flows from pcaps, right?

James Solderitsch says:

Tried this on a fresh install of packetpig into OS X 10.8.3. The script runs but instead of an output folder with csv results, I see bunch of ascii text written to the terminal. What did I do wrong?

James Solderitsch says:

In the copy of binning.pig that I obtained via a git clone of the current packetpig source, the STORE command at the end of the script was commented out. This was why I did not get the summary format as expected. Don’t know why the source for this script changed since this original blog was written.

Pete says:

Have you considered integrating this with Qosient’s Argus? I’ve used Argus to turn packet captures into conversations in much the way you describe. The benefits I see is Argus can be a distributed collector with filtering, and the native file format is more compact than pcap, while retaining the ability to do deep packet inspection, plus splitting, merging, etc of the files is all supported. I was looking for a Big Data integration with Argus when I found this.

BTW is there a part 3?


pn says:
Your comment is awaiting moderation.

Packetpig can create traditional 5-tuple flows out of pcaps?
Also, I already have many large pcaps (~GBs). Since you say that packetpig does not currently split pcaps, can it efficiently handle such large pcaps?

youness says:

please any tutorial on how to installit on centos distribution ? thank you

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums