March 29, 2017

Detecting Hackers and Impersonators with Machine Learning

The 2014 Yahoo email hack is a good illustration of how a big data security analytics platform such as Apache Metron can make it easier to detect, investigate, assess, and remediate threats in your environment. In this article I will describe how to set up and configure Apache Metron to detect the recent cyber attack on Yahoo, as described in United States v. Dokuchaev et al., and contrast the Apache Metron approach, philosophy, and methodology with those of contemporary point tool solutions.

What Did The Recent Cyber Attack on Yahoo Look Like?

Before venturing into implementation details, let’s quickly recap the incident. According to the indictment, the incident began at some point in 2014 and the attackers’ presence on Yahoo networks lasted through December of 2016, somewhere between two and three years. The attack was targeted, allegedly sponsored by a Russian intelligence agency, and started with a spear phishing email to a “semi-privileged” Yahoo employee. We can assume that the email contained either a link or a malware attachment that the target clicked, that a back door was installed on the infected machine, and that the attackers were able to steal that user’s credentials (and perhaps also the service accounts on that machine). The attackers then downloaded additional tools to maintain and conceal unauthorized access and proceeded to engage in reconnaissance, which can mean looking through that host’s logs, passively listening to network traffic, actively scanning the network, dumping memory, and much more. This phase lasted for roughly 6+ months, from the beginning of 2014 until December of the same year.

At some point the attackers were able to locate Yahoo’s User Database (UDB) and the Account Management Tool (AMT). They then misused the AMT to mint authentication cookies for Yahoo email accounts, making it appear that they had previously obtained a valid login to a user’s email and did not need to authenticate again. Furthermore, a backup copy of the AMT and UDB was made and exfiltrated by the attackers over a 2-month period using unencrypted FTP, allowing them to mint cookies on demand outside of Yahoo’s network. Using these methods the attackers gained access to thousands of Yahoo users’ emails and mined them for credit card numbers, personal information, additional account information, and much more. They used this information for personal gain as well as for sale to the Russian intelligence agency.

The hackers engaged in a complex five-phase attack with multiple dynamic actions per phase:

Phase 1: Phishing Email → Back door installed → Stolen Credentials

Phase 2: Conceal unauthorized access → Reconnaissance for 6-12 months

Phase 3: Locate and infiltrate targets (user database & account management tool) → hack emails with compromised authentication information

Phase 4: Copy user database and account management tool → exfiltrate over unsecured network

Phase 5: Mine emails for credit card numbers, personal information and more


Machine Learning To Mitigate A Yahoo-style attack

We can use Apache Metron to detect a Yahoo-style attack, learn from its generalized behavior, and apply what we learned, in the form of ML (machine learning) models, statistical profiles, and triage rules, back into the system to detect similar behaviors in real time. This big data analytics approach to cyber security is fundamentally different from the deterministic rules-based approaches offered by SIEMs today. It is more proactive and adaptable to complex dynamic behaviors: we mine our cyber data lake for behavioral insights and “normal” operating behavior for users and assets, and then encode this behavior as features in our statistical baselines and machine learning models.

Security data lake is a key asset to a machine learning strategy

The security data lake, and specifically the quality and volume of data in this data lake, are key assets that make this type of strategy possible.  We want rich contextual data from a variety of point tools such as network probes, deep packet inspection tools, application logs, identity stores, host-based sensors, IDS devices, and even HR databases and physical access logs.  The more sources of data we have, the more perspective and insight we can mine across different data sets that we can later encode into our detection models.

Profiles enable comparisons between normal and abnormal behavior

Now let us examine how this applies specifically to the Yahoo attack. First, the attackers sent a spear phishing email that was evidently missed by one or more email scanning tools and delivered to a Yahoo employee. With Apache Metron it is possible to layer additional analytics on top of those already done by the email scanner by profiling the email server logs. Using Apache Metron’s profiler it is possible to compute the probability of a user X getting an email from a source Y at a time T with a specific combination of email headers. In Apache Metron this is called a profile. We can then set up a dynamic alert rule to alert us when outliers in this profile are detected.
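As a rough illustration of what such a profile computes, the sketch below estimates from historical counts how likely it is for a recipient to get mail from a given sender at a given hour, and flags rare combinations. It is plain Python, not Metron's Profiler API. The class `EmailProfile`, its methods, the example addresses, and the 1% threshold are all assumptions made for the example, and header combinations are left out for brevity.

```python
from collections import Counter


class EmailProfile:
    """Toy profile (hypothetical, not a Metron API): relative frequency of
    (sender, hour) combinations per recipient."""

    def __init__(self):
        self.combo_counts = Counter()      # (recipient, sender, hour) -> messages seen
        self.recipient_totals = Counter()  # recipient -> messages seen

    def observe(self, recipient, sender, hour):
        """Record one delivered message from the mail server log."""
        self.combo_counts[(recipient, sender, hour)] += 1
        self.recipient_totals[recipient] += 1

    def probability(self, recipient, sender, hour):
        """Empirical probability that this recipient's mail comes from
        `sender` at `hour`, based on what has been observed so far."""
        total = self.recipient_totals[recipient]
        if total == 0:
            return 0.0  # no history yet for this recipient
        return self.combo_counts[(recipient, sender, hour)] / total

    def is_outlier(self, recipient, sender, hour, threshold=0.01):
        """Flag combinations rarer than `threshold` (an assumed cut-off)."""
        return self.probability(recipient, sender, hour) < threshold


# Build a small history: Alice normally hears from Bob at 09:00.
profile = EmailProfile()
for _ in range(200):
    profile.observe("alice@example.com", "bob@example.com", 9)

print(profile.is_outlier("alice@example.com", "bob@example.com", 9))
print(profile.is_outlier("alice@example.com", "unknown@example.net", 3))
```

With this history the first check comes back `False` and the second `True`. A production profile would be computed over sliding time windows and far richer features, but the idea of comparing a new event against an observed baseline is the same.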

Complete and Contextual Data Enables Detection of Impersonators

Once the spear phishing attack succeeds, we are dealing with the account of a legitimate user that has been hijacked by the attackers. This is where Metron’s sheer scale, its ability to perform in-line enrichment, and its ability to derive user and entity profiles in real time really shine. Apache Metron can identify and analyze user behavior from a variety of feeds, create statistical baselines of this behavior, and then, based on the set of profiles configured within the system, build a risk-based view of anomalous behaviors exhibited by an entity. Hence, if a user starts connecting to assets they don’t generally connect to, at times when they are generally not active, from a location they are generally not associated with, or using tools or mechanisms they do not normally use, the risk-based score will be amplified accordingly. The more feeds and context we have for user behavior, the better our profiles are. Hence, building up the data lake is the key component in this strategy.
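The way several deviations amplify one score can be sketched as a set of rules that each contribute points when they fire. This is an illustrative Python sketch, not Metron's threat triage configuration. The `TriageRule` type, the baseline values, the asset names, and the point values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TriageRule:
    """One anomaly check and the score it contributes (all hypothetical)."""
    name: str
    applies: Callable[[Dict], bool]
    score: int


def risk_score(event: Dict, rules: List[TriageRule]) -> int:
    """Sum the scores of every rule that fires, so the total grows with
    each additional deviation from the user's baseline."""
    return sum(rule.score for rule in rules if rule.applies(event))


# Assumed baseline for one user, as a set of profiles might summarize it.
baseline = {
    "assets": {"mail01", "wiki"},
    "active_hours": range(8, 19),  # 08:00 through 18:59
    "countries": {"US"},
}

rules = [
    TriageRule("unusual asset", lambda e: e["asset"] not in baseline["assets"], 30),
    TriageRule("off hours", lambda e: e["hour"] not in baseline["active_hours"], 20),
    TriageRule("unusual location", lambda e: e["country"] not in baseline["countries"], 40),
]

normal = {"asset": "wiki", "hour": 10, "country": "US"}
suspicious = {"asset": "udb-admin", "hour": 3, "country": "RU"}

print(risk_score(normal, rules))      # no rule fires
print(risk_score(suspicious, rules))  # all three rules fire
```

Summing is only one possible aggregation; taking the maximum or a weighted mean are equally reasonable choices depending on how alerts are triaged.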

Detection of hacking tools is enabled by Hadoop-based platforms such as Apache Metron

After the attackers performed network reconnaissance, they brought in tooling to help them maintain a foothold or further exploit Yahoo’s assets. For the most part these tools are widely known to security analysts, and their presence can be easily checked for using Apache Metron, Hadoop, and Spark. Filenames, hashes, and other signatures associated with these tools can be checked for in real time using Apache Metron’s threat intelligence module. We can also check for their presence by running periodic batch queries against our cybersecurity data lake. In both cases we can do so on a massive scale, a feature that is unique to Hadoop-based platforms like Apache Metron.
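At its core, checking for known tools amounts to a set lookup of observed file hashes and names against an indicator list. The Python sketch below shows that lookup on a handful of in-memory events. It does not use Metron's threat intelligence module, and the event fields, host names, file names, and indicator values are placeholders invented for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def match_indicators(file_events, bad_hashes, bad_names):
    """Return (event, reason) pairs for events that match a known indicator.
    Hash matches take precedence over filename matches."""
    hits = []
    for event in file_events:
        if event["sha256"] in bad_hashes:
            hits.append((event, "hash"))
        elif event["filename"].lower() in bad_names:
            hits.append((event, "filename"))
    return hits


# Placeholder indicator data; a real feed would supply these values.
tool_bytes = b"placeholder bytes standing in for a known tool"
bad_hashes = {sha256_hex(tool_bytes)}
bad_names = {"logcleaner.exe"}

# Hypothetical host-sensor events.
events = [
    {"host": "ws-17", "filename": "report.docx", "sha256": sha256_hex(b"harmless")},
    {"host": "ws-17", "filename": "svc.exe", "sha256": sha256_hex(tool_bytes)},
    {"host": "db-02", "filename": "LogCleaner.exe", "sha256": sha256_hex(b"other")},
]

for event, reason in match_indicators(events, bad_hashes, bad_names):
    print(event["host"], event["filename"], reason)
```

The same lookup works in streaming mode (one event at a time) or as a batch job over historical data, since set membership is cheap regardless of how the events arrive.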

Volume and cost effectiveness of big data is only way to access breadth and depth of data necessary to investigate impact of an attack

Lastly, let’s take a look at data exfiltration.  The process itself lasted for around 2 months, while the attacker’s presence on Yahoo’s network lasted for about 2 years.  Hence, in order to properly investigate this incident we need to build a large enough data lake to be able to reach back and examine logs and network metadata that are over a year old.  We also need raw packet capture capability as metadata and logs alone rarely contain enough information to properly assess the impact of the breach. Apache Metron provides a set of parsers to easily process and enrich data to build the cyber data lake and also provides probes to stream PCAP for high-fidelity forensics. Apache Metron then provides a capability to correlate PCAP to logs and metadata, making investigations of Yahoo-style attacks and assessing their impacts possible.
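One simple way to think about correlating packet capture with logs is to join the two on the connection 5-tuple within a time window. The sketch below does this in plain Python on in-memory records. It is not Metron's PCAP query interface, and the record layout, the `FlowKey` type, the IP addresses, and the 60-second window are assumptions chosen for the example.

```python
from typing import Dict, List, NamedTuple


class FlowKey(NamedTuple):
    """Connection 5-tuple used as the join key (hypothetical record layout)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def flow_key(record: Dict) -> FlowKey:
    return FlowKey(record["src_ip"], record["src_port"],
                   record["dst_ip"], record["dst_port"], record["protocol"])


def packets_for_log(log: Dict, packets: List[Dict],
                    window_seconds: float = 60.0) -> List[Dict]:
    """Packets on the same flow as `log` whose timestamp lies within
    `window_seconds` of the log entry. Only one direction is matched."""
    key = flow_key(log)
    return [
        p for p in packets
        if flow_key(p) == key and abs(p["ts"] - log["ts"]) <= window_seconds
    ]


# A made-up FTP control-channel log entry and three captured packets.
ftp_log = {"ts": 1000.0, "src_ip": "10.0.0.5", "src_port": 50321,
           "dst_ip": "203.0.113.9", "dst_port": 21, "protocol": "tcp"}
packets = [
    {**ftp_log, "ts": 1002.5, "payload": b"STOR backup.tar"},
    {**ftp_log, "ts": 5000.0, "payload": b"outside the window"},
    {**ftp_log, "src_ip": "10.0.0.6", "ts": 1001.0, "payload": b"different flow"},
]

matched = packets_for_log(ftp_log, packets)
print(len(matched), matched[0]["payload"])
```

Only the first packet matches here: the second falls outside the time window and the third belongs to a different source address. A fuller version would also match the reverse direction of the flow.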

Learn More

To learn more, join us for the live webinar on April 27th, “Combating Phishing Attacks: How Big Data Helps Detect Impersonators,” which covers how big data cybersecurity solutions can reduce the exposure time of a phishing attack.

