Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.


Sign up for the Developers Newsletter

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.


Get Started


Ready to Get Started?

Download sandbox

How can we help you?

* I understand I can unsubscribe at any time. I also acknowledge the additional information found in Hortonworks Privacy Policy.
closeClose button
March 04, 2015
prev slideNext slide

Using PageRank to Detect Anomalies and Fraud in Healthcare

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.


PageRank[1]is the poster-child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led do increased interest and research in the field of graph algorithms, resulting in innovative extensions such as personalized PageRank [2]. Furthermore, social networking companies such as Facebook, LinkedIn and Twitter, continue to invest in graph algorithms for various purposes, most notably identifying “people you may know.”

With the massive adoption of Apache Hadoop as part of a modern data architecture at companies in nearly every industry, there is increased interest in applying advanced data science algorithms to large datasets enabled by the Hadoop platform, resulting in tangible business value and a competitive business edge. However, to-date most of the focus from enterprise data science teams has been directed toward supervised learning techniques such as regression modeling or random forest as well as clustering techniques such as K-means, whereas graph algorithms are relatively new and not as widely used.

In this three-part blog post, we will explore a variant of PageRank called “personalized PageRank,” and demonstrate how to apply this graph algorithm for the purpose of fraud detection on a real-world healthcare payment dataset.

The Business Use-case

It is estimated that in the US approximately [3-10% of total health care spending – $70B to $260B – is lost each year to healthcare fraud. In Europe the problem is of similar scale, estimated at 5-6% of national healthcare budgets, or $30B-$100B each year] [3]

With these staggering numbers, the importance of identifying fraud and abuse in healthcare cannot be under-stated, so healthcare providers invest significant resources in building systems to automatically identify fraud, waste and abuse to minimize its (negative) impact on their business results.

Existing Solutions

Today, healthcare providers regularly collect and store granular payment data and can utilize this data to improve their ability to identify fraud and abuse. Many existing fraud detection solutions are based on a set of business rules that detect suspicious or anomalous payment events. For example, a rule might look for payments for a given medical procedure where the amount is significantly above the average payment for that procedure.

Although these rules are effective for some known cases, sophisticated fraudsters can often circumvent them. In addition, the rules often do not automatically discover new patterns of fraud and abuse, and are difficult to maintain and update over time.

More recently, however, the healthcare providers have been applying advanced machine-learning-based anomaly detection techniques to identify fraud. For example, a healthcare provider might utilize a supervised learning model that uses various features of the current transaction as well as historical transaction stream to detect fraud or an anomalous transaction. In a typical setting, the system will identify an anomalous transaction, raise a flag, and route the transaction to a human, who will review it as a fraud or valid transaction.

In this context, our approach in using Personalized PageRank is not replacing such systems, but instead is generating additional features to feed into such models, which are expected to improve the overall detection accuracy of the system.

The Dataset – Real World Healthcare Data

In 2014, CMS (the United States Center for Medicare and Medicaid Services) released into the public domain a dataset known as “Medicare Part-B.”

The Medicare-B dataset includes a set of records documenting a wealth of information about transactions between over 880,000 medical providers and CMS, totaling over $77B for the year 2012.

For the purpose of our analysis we only extract 4 fields from each record:

  • Provider NPI: In the US healthcare system, medical providers (doctors, nurses, etc) are associated with an NPI [4] (national Provider ID) – a unique and publicly available identifier for each medical provider.
  • Specialty: a text string representing the medical specialty of the medical provider, such as pediatrics, internal medicine, OB/Gyn or radiology.
  • Procedure code: each medical procedure or service is associated with a “HCPCS code” (CMS procedure code), also known as a “CPT code.”
  • Count: the field bene_day_srvc_cnt represents the number of distinct Medicare beneficiary-per-day services listed, representing a normalized count of the number of medical procedures billed by the provider.

Note that the dataset includes other fields that could be used for “count” (e.g. “amount submitted” or “amount paid”), but for simplicity we just use bene_day_srvc_cnt.

Here’s an example of the data we extract for a record in the dataset:

1003000530, Internal Medicine, 81002, 16

Here “1003000530” is the NPI, or the identifier of the provider, “Internal medicine” is the specialty, 81002 is the procedure code, and a count of 16.

Page Rank – A Quick Introduction

PageRank is a link-analysis algorithm first used by Google to rank its search engine results. It considers the web as a graph, where websites are nodes and hyperlinks are edges, and computes an “importance” score for each node.

The PageRank algorithm computes the probability that a web surfer randomly surfing the web will end up at a particular web page. The web surfer starts from a random web page and follows a sequence of steps. In each step, the surfer may take 1 of 2 possible actions: (1) choose one of the hyperlinks on the page and follow it (2) “teleport” to a randomly selected web page.

The resulting PageRank scores for each node indicate the importance of web pages (the higher a PageRank score is, the more likely a random surfer will visit that page), and are computed as follows:

Screen Shot 2015-03-03 at 4.23.10 PM

pi represents the i-th page, L(pj) represents the number of outbound links on page p¬j, and N is the total number of pages (nodes in the graph).

Figure 1 below shows an example of PageRank scores for a small example graph of medical providers (doctors and nurses). Note that all names have been fictionalized to protect privacy:

Figure 1: example provider graph with PageRank scores (provider names are fictitious)
Figure 1: example provider graph with PageRank scores (provider names are fictitious)

This graph is a similarity graph, where nodes in the graph are (fictitious) medical providers, and an edge between any two nodes indicates that the two providers are similar to each other in terms of medical procedures. We color-coded the nodes by specialty — Doctors Ng, Jones and Miller are Dermatologists, Dr. Cheng is a Surgeon, and Doctors Ortega, Das, Seo and Padian are all internists.

We interpret the probabilities on each node as the likelihood of a random walker on the graph to visit that node. So for example, the likelihood of our random walker to visit the node representing Dr. Cheng is 13.3%, and for the node representing Dr. Padian is 17.6%. Note that all probabilities (PageRank scores) sum up to 100%.

Personalized Page Rank and identifying anomalies

Personalized-PageRank (also known as topic-sensitive-PageRank) is a variant of PageRank in which the algorithm is modified to compute PageRank scores biased towards a specific “topic of interest”, defined by a pre-selected group S of vertices (called “source vertices”).

Figure 2 below shows an example of Personalized-PageRank scores for our small provider graph, personalized towards providers with the Dermatology specialty. Note again,that all names have been fictionalized to protect privacy:

Figure 2: example provider graph with Personalized-PageRank scores (provider names are fictitious)
Figure 2: example provider graph with Personalized-PageRank scores (provider names are fictitious)

With personalized PageRank, our “random surfer” can only start from a random page that’s part of a designated set of nodes (in our case the Dermatology doctors – Ng, Miller and Jones), and can only “teleport” into a node included in that set of nodes.

As you can see, with personalized PageRank, the scores for the Dermatologists are higher than any other doctor in the graph. Interestingly, Dr. Cheng is a surgeon but due to the edges between him and doctors Miller and Ng his personalized PageRank score is rather high (15.4%). We use this phenomenon to identify potential anomalies – providers with a high score but with the “wrong” specialty.

Identifying Anomalies – our approach

So how do we apply Personalized PageRank to identify anomalies in healthcare transaction records?

Our approach is as follows the following steps:

Step 1

Compute similarity between providers, based on shared procedure (CPT) codes. We expect any two providers with a shared specialty to have relatively similar CPT codes. We then generate an undirected graph where providers (NPI codes) are nodes, and an edge between two providers A and B exists if these two providers are “similar” enough to each other, in terms of CPT patterns.

Computing the similarity between each two pairs of providers (all-pairs similarity) is a rather intensive task. Our dataset includes ~880K providers, which translates into nearly 774 Billion similarity computations.

To implement this analysis, we use Apache Hadoop and Apache Pig to scale the similarity computation across multiple nodes( in part 2, we’ll describe the implementation details).

It is worth mentioning, however, that some techniques such as Locality Sensitive Hashing and twitter’s DIMSUM enable the computation of similarity in an approximate manner while further reducing the amount of computation. Our code does not demonstrate these techniques, but they could certainly be applied to this problem.

Step 2

We loop over all specialties. For each specialty (e.g., Dermatology) we set our “source vertices” set to include all medical providers with that specific specialty. We then apply a “Personalized PageRank” algorithm to our undirected graph using this set of source vertices, and the resulting score is the specialty-specific personalized-PageRank score. We use PRspecialty(node) to denote the specialty-specific PageRank score for a given node.

We identify anomalous providers as those nodes whose PRspecialty(node) score is high but whose specialty is not the one used for the PageRank calculation. Going back to our simple example graph above, if we compute personalized-PageRank for the Dermatology group, Dr. Cheng is in Surgery, but has a relatively high PageRank score, so therefore will be identified as an anomalous provider.

To implement this PageRank variant we use SociaLite—a large-scale graph analysis framework with high-level query language. We will describe SocialLite in more detail in part 3 of this blog post series, and show how we use it to implement the personalized PageRank algorithm and identify anomalous providers.

Our Results

As a result of our pre-processing step we ended up with a rather large graph, comprised of ~446K nodes and ~673M edges.

Now let’s take a look at two example anomalies that out algorithm has discovered:

Our first example is from the Ophthalmology specialty, a cluster of 38715 medical providers. Running personalized-PageRank on this group, we see an internal medicine provider that is anomalous due to the following list of procedures billed to Medicare:

Internal eye photography
Cmptr ophth img optic nerve
Echo exam of eye thickness
Revise eyelashes
Ophthalmic biometry
Eye exam new patient
Eye exam established pat
After cataract laser surgery
Eye exam & treatment
Eye exam with photos
Cataract surg w/iol 1 stage
Visual field examination(s)

Clearly a good find, as these procedures seem relevant for an eye specialist (Ophthalmologist) but not for internal medicine.

Our second example is from the Plastic Surgery specialty, a cluster of 1228 medical providers. Running personalized-PageRank on this group, we see an Otolaryngologist that is anomalous, with the following list of procedures billed to Medicare:

Skin tissue rearrangement (multiple variants)
Cmptr ophth img optic nerve
Biopsy skin lesion

Again, we see a set of procedures that belong to plastic surgery, but performed here by an Otolaryngologist.

As mentioned above, in practice the resulting PageRank scores (one score per specialty) for each provider are not used in isolation, but instead integrated into an existing anomaly detection algorithm as additional features.


In this blog post we described our approach in generating a similarity graph between providers and using Personalized PageRank to identify anomalies in healthcare payment data.

In the 2nd part of this blog post series, we will demonstrate in detail how to use Apache Hadoop and Apache PIG to compute the similarity graph at scale.

And finally, in the 3rd part of this blog post series, we will demonstrate how to use SocialLite to compute personalized PageRank on the graph and identify anomalies.

All the code associated with this blog post series is available on Github.



Fadi Maali says:

Nice post! I am a big fan of using graph measures and algorithms to enrich the signal gleaned from the data. However, I found the way the graph is built (based on similarity) very strange. It feels like forcing a graph structure over the data and consequently I can’t get my head around the meaning that the personalised pagerank scores represent. Over the web, PageRank works because a link from an important page is deemed as more important as a result of the social process people use to include links in their web pages. I am finding it hard to find an analogy to the example presented.

shilpi says:

If a internal medicine bill for eye procedures, why is it an anomaly ??How does it helps the business in identifying such anomalies. Does he get paid more for procedures done outside of his speciality??
Examples with some background objective will help understand the concept better.

Jules S. Damji says:

As mentioned in the post, the idea is to create scores that reflect suspected anomaly relative to the specialty. Ideally in a real-world setting you can plug these scores into an existing overpayment or fraud model (as additional features, one per specialty) to improve its overall performance.

rouf says:

thanks for its news, may be useful for everyone.

Pradeep Athmakuri says:

How will the graph behave when the provider is a multi-specialty and has multiple NPI(s). This is very common scenario in the Healthcare world. Also, does this apply for non-human providers as well.

Tom Fawcett says:

Maybe I’m missing something, but it appears that all you’re doing here is finding specialists who perform a lot of procedures apparently outside their specialty. If that’s the case, what’s the point of doing personalized PageRank and generating a social graph at all? I don’t see that it’s necessary, or reduces the number of false positives, or makes the process more efficient.

Nitin Yadav says:

Agree with Tom Fawcett here. Using Personalized Pagerank here seems like an overkill. In fact, creating the graph in such a way is an overkill if all you find out is (as Tom says) the set of specialists who perform a lot of procedures outside their specialty.

I would recommend a different approach:
[1] Consider each specialty as a community in the graph.
[2] Find out the “strength of membership” of vertices for each community it is connected with. A simple measure is (number of edges with community)/(degree of the vertex). For each vertex you should get a vector with size equal to the number of communities the vertex is connected with. Normalize this vector for each vertex.
[3] Mark the least “stable” vertices as anomalous nodes. Stability is calculated as the maximum value in the “strength of membership” vector. Because of the normalization of the vector in the previous step, the stability of each vertex would be between 0 and 1. The vertices which strongly belong to their community would have the “stability” as 1.

Liza says:

What a brilliant approach to detecting healthcare fraud. I believe we can add some more procedures that is included in our healthcare. It’s not very easy to detect a fraud in healthcare but with the use of this algorithm, it will probably make it simpler if not perfect.

Steve says:

Your ‘github’ code doesn’t work anymore. It is missing some of the datafile, and setup won’t function properly. Any help?

antoine moreau says:

Hello, I have written an article about detecting fraud with personalized page rank:

Among other things, I talk about your great work !

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums