Category Archives: Hadoop Security


The Fastest Path to Innovation: Community Driven Open Source

 

Last week, we outlined our approach for delivering an enterprise-viable Apache Hadoop distribution in the open.  Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and work diligently to progress existing open source projects and incubate new projects to meet those needs.

In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software Foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications.  These efforts focus on enterprise requirements that are essential to enable broad adoption across the Hadoop ecosystem.

  • The Stinger initiative aims to dramatically speed up Apache Hive in support of interactive query use cases.
  • The Knox Gateway addresses the need for a single point of authentication and secure access for Apache Hadoop services in a cluster.
  • The Tez framework provides an alternative next-generation runtime built on Hadoop YARN that significantly improves latency and throughput of Hadoop applications.

We feel these efforts are strong examples of our commitment to driving innovation from within the open source community, and as stated in our approach blog, we do this by:

  • identifying and articulating the enterprise requirements within the community,
  • taking an active role in addressing those requirements within the community, and
  • applying enterprise rigor to the build, test and release process to ensure that the open source projects, as well as the larger product distribution we provide, are enterprise grade and interoperable with other elements in the enterprise.

Since it takes a community to build enterprise-class platforms like Hadoop, if you have interest in helping with Knox, Tez, or Stinger, we encourage you to work with us and the others in the Apache community!

Securing Hadoop with Knox Gateway

 

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos.  Around 2008, Owen O’Malley and a team of committers led this first foray into security and today, Kerberos is still the primary way to secure a Hadoop cluster.

Fast-forward to today… Widespread adoption of Hadoop is upon us.  The enterprise has placed requirements on the platform to not only provide perimeter security, but to also integrate with all types of authentication mechanisms. Oh yeah, and all the while, be easy to manage and to integrate with the rest of the secured corporate infrastructure. Kerberos can still be a great provider of the core security technology but with all the touch-points that a user will have with Hadoop, something more is needed.

The time has come for Knox.

The only path to security in Hadoop is the community


The Knox Gateway aims to provide perimeter security that will integrate easily into existing security infrastructure.  Delivering this key component of the Apache Hadoop ecosystem is a critical community project.  Security is not an afterthought.  It needs to be woven into the very fabric of Hadoop in order to be effective. Being a part of the community will allow Knox to accomplish just that.

Already the community has rallied around the project and the vote has been positive thus far.  Tomorrow we should see community approval of a new incubation project in the Apache Software Foundation for Knox, a security layer for the Hadoop ecosystem.  The initial mentor list contains resources from Hortonworks, Microsoft and NASA among others.

What comprises the Knox Gateway?

The Knox Gateway (“Gateway” or “Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.  It has a few key functions:

  • Provide perimeter security to make Hadoop security setup easier
  • Support authentication and token verification security scenarios
  • Deliver users a single cluster end-point that aggregates capabilities for data and jobs
  • Enable integration with enterprise and cloud identity management environments
  • Manage security across multiple clusters and multiple versions of Hadoop

Knox will be able to provide a security layer for multiple clusters and multiple versions of Hadoop simultaneously and will deliver a simple, intuitive management interface.  Playing nice with others is always a security imperative, so Knox will integrate with the existing frameworks for Active Directory/LDAP, and it will allow extensions for custom authentication mechanisms.
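
To make the idea of a single authenticated entry point in front of several clusters concrete, here is a minimal Python sketch. It is not Knox code and does not reflect any Knox API: the `Gateway` class, the `fake_ldap` check, the cluster name and the internal hostname are all hypothetical, chosen only to show pluggable authentication followed by routing.

```python
from typing import Callable, Dict, List, Optional

# An authenticator takes (username, password) and returns True on success.
Authenticator = Callable[[str, str], bool]


class Gateway:
    """Toy perimeter gateway: authenticate once, then route to a cluster."""

    def __init__(self) -> None:
        self.authenticators: List[Authenticator] = []
        self.clusters: Dict[str, str] = {}

    def add_authenticator(self, auth: Authenticator) -> None:
        # Extension point for custom authentication mechanisms.
        self.authenticators.append(auth)

    def add_cluster(self, name: str, internal_url: str) -> None:
        # One gateway can front several clusters.
        self.clusters[name] = internal_url

    def route(self, user: str, password: str,
              cluster: str, path: str) -> Optional[str]:
        """Return the internal URL to forward to, or None if access is denied."""
        if not any(auth(user, password) for auth in self.authenticators):
            return None
        base = self.clusters.get(cluster)
        return None if base is None else f"{base}{path}"


def fake_ldap(user: str, password: str) -> bool:
    # Hypothetical stand-in for a directory lookup.
    return (user, password) == ("alice", "secret")


gw = Gateway()
gw.add_authenticator(fake_ldap)
gw.add_cluster("prod", "http://namenode.internal:50070")  # hostname is invented

print(gw.route("alice", "secret", "prod", "/webhdfs/v1/tmp"))
print(gw.route("mallory", "guess", "prod", "/webhdfs/v1/tmp"))
```

In this sketch the client only ever sees the gateway; the internal cluster address is resolved after authentication succeeds, which is the perimeter property described above.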

Availability

The short term plan for the Knox team is to deliver a solid, working release in late March so that early adopters can begin to evaluate and provide valuable feedback.  This critical step will ensure that the gateway fits nicely into customers’ infrastructure and makes Hadoop easier to use… and more secure.

Big Data Security Part Three: PacketPig Finding Zero Day Attacks

Introduction

This is part three of a Big Data Security blog series. You can read the previous two posts here: Part One / Part Two.

When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop. Packetpig was modified to support Amazon’s Elastic Map Reduce (EMR) so that we could process a 600GB set of full packet captures. All that we needed was a canonical Zero Day attack to analyse. We were in luck!

In August 2012 a vulnerability in Oracle JRE 1.7 created huge publicity when it was disclosed that a number of Zero Day attacks had been reported to Oracle in April but had still not been addressed in late August 2012. To make matters worse, Oracle’s scheduled patch for JRE was months away (October 16). This position subsequently changed and a number of out-of-band patches for JRE were released for what became known as CVE-2012-4681 on the 30th of August.

The vulnerability exposed around 1 Billion systems to exploitation and the exploit was 100% effective on Windows, Mac OSX and Linux. A number of security researchers were already seeing the exploit in the wild as it was incorporated into exploit packs for the delivery of malware.

What is a Zero Day?

Put simply it’s any vulnerability that can be exploited without an available mitigation. The mitigation most people measure Zero Days by is a patch from the software vendor (in this case Oracle).

If we look at the timeline of this exploit you can see how long it was a Zero Day for:

  • The Bug was introduced to JRE on July 28th 2011.
  • It was Disclosed to the public on April 2nd 2012.
  • The Exploit was available in the Metasploit Framework on August 26th 2012, with other PoCs publicly available around the same time.
  • Detection was available via Snort IDS/IPS on August 28th 2012.
  • Lastly a Patch was available from Oracle on 30th August 2012.

If you compare the date the Bug was introduced and the date of the Patch the Zero Day time is 399 days. Comparing the date of Disclosure with the Patch date is still a staggering 150 days. To put this in perspective, a software bug that affects around 1 Billion devices was able to be exploited for well over a year and certainly was being seen in the wild. Whether you take the view that the Zero Day period is around 150 days (from disclosure)  or over a year (from introduction) both are extremely scary.
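
The two figures can be checked with a few lines of Python. This is just the date arithmetic from the timeline above; the constant and function names are mine.

```python
from datetime import date

# Dates taken from the timeline above.
BUG_INTRODUCED = date(2011, 7, 28)
DISCLOSED = date(2012, 4, 2)
PATCHED = date(2012, 8, 30)


def exposure_days(start: date, end: date) -> int:
    """Number of days between two timeline events."""
    return (end - start).days


print(exposure_days(BUG_INTRODUCED, PATCHED))  # 399
print(exposure_days(DISCLOSED, PATCHED))       # 150
```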

So how can you tell whether you were exploited using this JRE bug in the last 6 months or year? How can you prove your network or important systems haven’t been exploited using this vulnerability?

Finding Zero Day attacks

Packetpig provides you with the ability to search vast amounts of network packet captures for Zero Day attacks. To demonstrate this I executed the Metasploit Exploit for the JRE bug against a Windows XP workstation and recorded the packet capture. I then went and hid this 500KB capture amongst 600GB of Full Packet Captures from a system we monitor on the Internet. Every packet is captured to an S3 bucket so we can quickly scan the S3 bucket for Zero Days using Amazon’s Elastic Map Reduce.

So for the purpose of this demonstration as soon as the Snort Signatures were updated on the 28th of August I downloaded them. This allowed me to scan the 600GB of packet captures with the old signatures (in this case 2905) and then again with the new signatures (in this case 2931).

Let’s run through the Packetpig job ‘snort_comparison.pig’ to see how this was done. The key to understanding the job is that we use the Packetpig SnortLoader() to scan the network packet captures with the old signatures and again with the new signatures. Anything in the old signature scan is removed from the new signature scan, leaving only the Zero Day attacks.

In the same way as our last post, we set up a number of variables using an include.pig file. After that we define old_snort_conf and new_snort_conf:

%DEFAULT includepath pig/include.pig
RUN $includepath;
 
%DEFAULT time 60
 
-- for local mode: uncomment the next line and comment the one after that
--%DEFAULT old_snort_conf 'lib/snort-2905/etc/snort.conf'
%DEFAULT old_snort_conf '/mnt/var/lib/snort-2905/etc/snort.conf'
 
-- for local mode: uncomment the next line and comment the one after that
--%DEFAULT new_snort_conf 'lib/snort-2931/etc/snort.conf'
%DEFAULT new_snort_conf '/mnt/var/lib/snort-2931/etc/snort.conf'

The SnortLoader() is used with the old snort.conf and the new snort.conf to scan the packet captures:

snort_old_alerts =
    LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$old_snort_conf')
    AS (
        ts:long,
        sig:chararray,
        priority:int,
        message:chararray,
        proto:chararray,
        src:chararray,
        sport:int,
        dst:chararray,
        dport:int
);
 
snort_new_alerts =
    LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$new_snort_conf')
    AS (
        ts:long,
        sig:chararray,
        priority:int,
        message:chararray,
        proto:chararray,
        src:chararray,
        sport:int,
        dst:chararray,
        dport:int
);

Next we group (COGROUP) the old and the new Snort scans by signature and keep only the signatures that produced no alerts in the old scan:

snort_joined = COGROUP snort_old_alerts BY sig, snort_new_alerts BY sig;
new_only_filtered = FILTER snort_joined BY (COUNT(snort_old_alerts) == 0);
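
The same group-then-filter logic can be sketched in plain Python. This is only an illustration of what the two Pig statements do, not Packetpig code; the alert dictionaries, signature IDs and addresses below are invented sample data.

```python
from collections import defaultdict


def cogroup_by_sig(old_alerts, new_alerts):
    """Group both alert lists by signature, like COGROUP ... BY sig."""
    groups = defaultdict(lambda: ([], []))
    for alert in old_alerts:
        groups[alert["sig"]][0].append(alert)
    for alert in new_alerts:
        groups[alert["sig"]][1].append(alert)
    return groups


def new_only(groups):
    """Keep groups with no old-scan alerts, like COUNT(snort_old_alerts) == 0."""
    return {sig: new for sig, (old, new) in groups.items() if len(old) == 0}


# Hypothetical sample data: signature IDs and addresses are invented.
old_scan = [{"sig": "1_100_1", "src": "10.0.0.5"}]
new_scan = [
    {"sig": "1_100_1", "src": "10.0.0.5"},
    {"sig": "1_200_1", "src": "10.0.0.9"},
]

result = new_only(cogroup_by_sig(old_scan, new_scan))
print(sorted(result))  # ['1_200_1']
```

Note that the comparison is keyed on the signature, not on individual alerts: a signature that already fired under the old rules is dropped entirely, even if it fires on additional traffic under the new rules.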

Lastly we re-project the data and then store it. The snort_comparison_new/part-r-00000 file is a verbose version of snort_comparison_summary/part-r-00000.

new_only_flattened = FOREACH new_only_filtered GENERATE FLATTEN(snort_new_alerts);
new_only_summary = FOREACH new_only_filtered GENERATE group, COUNT(snort_new_alerts);
 
STORE new_only_flattened INTO '$output/snort_comparison_new';
STORE new_only_summary INTO '$output/snort_comparison_summary';
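
Because the STORE statements have no USING clause, the output should be written with Pig's default tab-delimited storage. Under that assumption, a summary file can be read back with a short Python helper. The function name and the sample rows here are hypothetical, not real results.

```python
import csv
import io


def read_summary(lines):
    """Parse 'signature<TAB>count' rows into a dict of signature -> count."""
    reader = csv.reader(lines, delimiter="\t")
    return {sig: int(count) for sig, count in reader}


# Invented sample content standing in for snort_comparison_summary/part-r-00000.
sample = io.StringIO("1_200_1\t3\n1_300_2\t1\n")
summary = read_summary(sample)

# List the noisiest new signatures first.
for sig, count in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(sig, count)
```

For a real run, an open file handle on the downloaded part file can be passed in place of the `io.StringIO` sample.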

To demonstrate this in practice I test the job on a small number of packet captures on my local development laptop. Watch the video to see how to do it.

Next I take it to the cloud and use 80 x m2.4xlarge instances to process 600GB of full packet captures to find the Oracle JRE 1.7 attack. The 80 nodes spin up, install all the Packetpig software (bootstrap) and then go to work crunching the network packet captures. Check out the video to see the full process.

Fine-Tune Your Apache Hadoop Security Settings

Apache Hadoop is equipped with a robust and scalable security infrastructure. It is being used at some of the biggest cluster installations in the world, where hundreds of terabytes of sensitive and critical data are processed every day.

Owen O’Malley provided a nice overview of Apache Hadoop security in his blog Motivations for Apache Hadoop Security. Devaraj Das also covered some of the core pieces of Apache Hadoop’s security architecture in his blog The Role of Delegation Tokens in Apache Hadoop Security.

The intent of this blog is to cover some of the features of the Apache Hadoop security infrastructure that will help cluster administrators fine-tune the security settings of their clusters.


The Role of Delegation Tokens in Apache Hadoop Security

Delegation tokens play a critical part in Apache Hadoop security, and understanding their design and use is important for comprehending Hadoop’s security model.


Authentication in Apache Hadoop

Apache Hadoop provides strong authentication for HDFS data. All HDFS accesses must be authenticated:

1. Access from users logged in on cluster gateways
2. Access from any other service or daemon (e.g. HCatalog server)
3. Access from MapReduce tasks


Motivations for Apache Hadoop Security

Overview

As the former technical lead for the Yahoo! team that added security to Apache Hadoop, I thought I would provide a brief history.

The motivation for adding security to Apache Hadoop actually had little to do with traditional notions of security in defending against hackers since all large Hadoop clusters are behind corporate firewalls that only allow employees access. Instead, the motivation was simply that security would allow us to use Hadoop more effectively to pool resources between disjointed groups. Larger clusters are much cheaper to operate and require fewer copies of duplicated data.
