PageRankis the poster-child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led do increased interest and research in the field of graph algorithms, resulting in innovative extensions such as personalized PageRank . Furthermore, social networking companies such as Facebook, LinkedIn and Twitter, continue to invest in graph algorithms for various purposes, most notably identifying “people you may know.”
With the massive adoption of Apache Hadoop as part of a modern data architecture at companies in nearly every industry, there is increased interest in applying advanced data science algorithms to large datasets enabled by the Hadoop platform, resulting in tangible business value and a competitive business edge. However, to-date most of the focus from enterprise data science teams has been directed toward supervised learning techniques such as regression modeling or random forest as well as clustering techniques such as K-means, whereas graph algorithms are relatively new and not as widely used.
In this three-part blog post, we will explore a variant of PageRank called “personalized PageRank,” and demonstrate how to apply this graph algorithm for the purpose of fraud detection on a real-world healthcare payment dataset.
It is estimated that in the US approximately [3-10% of total health care spending – $70B to $260B – is lost each year to healthcare fraud. In Europe the problem is of similar scale, estimated at 5-6% of national healthcare budgets, or $30B-$100B each year] 
With these staggering numbers, the importance of identifying fraud and abuse in healthcare cannot be under-stated, so healthcare providers invest significant resources in building systems to automatically identify fraud, waste and abuse to minimize its (negative) impact on their business results.
Today, healthcare providers regularly collect and store granular payment data and can utilize this data to improve their ability to identify fraud and abuse. Many existing fraud detection solutions are based on a set of business rules that detect suspicious or anomalous payment events. For example, a rule might look for payments for a given medical procedure where the amount is significantly above the average payment for that procedure.
Although these rules are effective for some known cases, sophisticated fraudsters can often circumvent them. In addition, the rules often do not automatically discover new patterns of fraud and abuse, and are difficult to maintain and update over time.
More recently, however, the healthcare providers have been applying advanced machine-learning-based anomaly detection techniques to identify fraud. For example, a healthcare provider might utilize a supervised learning model that uses various features of the current transaction as well as historical transaction stream to detect fraud or an anomalous transaction. In a typical setting, the system will identify an anomalous transaction, raise a flag, and route the transaction to a human, who will review it as a fraud or valid transaction.
In this context, our approach in using Personalized PageRank is not replacing such systems, but instead is generating additional features to feed into such models, which are expected to improve the overall detection accuracy of the system.
In 2014, CMS (the United States Center for Medicare and Medicaid Services) released into the public domain a dataset known as “Medicare Part-B.”
The Medicare-B dataset includes a set of records documenting a wealth of information about transactions between over 880,000 medical providers and CMS, totaling over $77B for the year 2012.
For the purpose of our analysis we only extract 4 fields from each record:
Note that the dataset includes other fields that could be used for “count” (e.g. “amount submitted” or “amount paid”), but for simplicity we just use bene_day_srvc_cnt.
Here’s an example of the data we extract for a record in the dataset:
1003000530, Internal Medicine, 81002, 16
Here “1003000530” is the NPI, or the identifier of the provider, “Internal medicine” is the specialty, 81002 is the procedure code, and a count of 16.
PageRank is a link-analysis algorithm first used by Google to rank its search engine results. It considers the web as a graph, where websites are nodes and hyperlinks are edges, and computes an “importance” score for each node.
The PageRank algorithm computes the probability that a web surfer randomly surfing the web will end up at a particular web page. The web surfer starts from a random web page and follows a sequence of steps. In each step, the surfer may take 1 of 2 possible actions: (1) choose one of the hyperlinks on the page and follow it (2) “teleport” to a randomly selected web page.
The resulting PageRank scores for each node indicate the importance of web pages (the higher a PageRank score is, the more likely a random surfer will visit that page), and are computed as follows:
pi represents the i-th page, L(pj) represents the number of outbound links on page p¬j, and N is the total number of pages (nodes in the graph).
Figure 1 below shows an example of PageRank scores for a small example graph of medical providers (doctors and nurses). Note that all names have been fictionalized to protect privacy:
This graph is a similarity graph, where nodes in the graph are (fictitious) medical providers, and an edge between any two nodes indicates that the two providers are similar to each other in terms of medical procedures. We color-coded the nodes by specialty — Doctors Ng, Jones and Miller are Dermatologists, Dr. Cheng is a Surgeon, and Doctors Ortega, Das, Seo and Padian are all internists.
We interpret the probabilities on each node as the likelihood of a random walker on the graph to visit that node. So for example, the likelihood of our random walker to visit the node representing Dr. Cheng is 13.3%, and for the node representing Dr. Padian is 17.6%. Note that all probabilities (PageRank scores) sum up to 100%.
Personalized-PageRank (also known as topic-sensitive-PageRank) is a variant of PageRank in which the algorithm is modified to compute PageRank scores biased towards a specific “topic of interest”, defined by a pre-selected group S of vertices (called “source vertices”).
Figure 2 below shows an example of Personalized-PageRank scores for our small provider graph, personalized towards providers with the Dermatology specialty. Note again,that all names have been fictionalized to protect privacy:
With personalized PageRank, our “random surfer” can only start from a random page that’s part of a designated set of nodes (in our case the Dermatology doctors – Ng, Miller and Jones), and can only “teleport” into a node included in that set of nodes.
As you can see, with personalized PageRank, the scores for the Dermatologists are higher than any other doctor in the graph. Interestingly, Dr. Cheng is a surgeon but due to the edges between him and doctors Miller and Ng his personalized PageRank score is rather high (15.4%). We use this phenomenon to identify potential anomalies – providers with a high score but with the “wrong” specialty.
So how do we apply Personalized PageRank to identify anomalies in healthcare transaction records?
Our approach is as follows the following steps:
Compute similarity between providers, based on shared procedure (CPT) codes. We expect any two providers with a shared specialty to have relatively similar CPT codes. We then generate an undirected graph where providers (NPI codes) are nodes, and an edge between two providers A and B exists if these two providers are “similar” enough to each other, in terms of CPT patterns.
Computing the similarity between each two pairs of providers (all-pairs similarity) is a rather intensive task. Our dataset includes ~880K providers, which translates into nearly 774 Billion similarity computations.
It is worth mentioning, however, that some techniques such as Locality Sensitive Hashing and twitter’s DIMSUM enable the computation of similarity in an approximate manner while further reducing the amount of computation. Our code does not demonstrate these techniques, but they could certainly be applied to this problem.
We loop over all specialties. For each specialty (e.g., Dermatology) we set our “source vertices” set to include all medical providers with that specific specialty. We then apply a “Personalized PageRank” algorithm to our undirected graph using this set of source vertices, and the resulting score is the specialty-specific personalized-PageRank score. We use PRspecialty(node) to denote the specialty-specific PageRank score for a given node.
We identify anomalous providers as those nodes whose PRspecialty(node) score is high but whose specialty is not the one used for the PageRank calculation. Going back to our simple example graph above, if we compute personalized-PageRank for the Dermatology group, Dr. Cheng is in Surgery, but has a relatively high PageRank score, so therefore will be identified as an anomalous provider.
To implement this PageRank variant we use SociaLite—a large-scale graph analysis framework with high-level query language. We will describe SocialLite in more detail in part 3 of this blog post series, and show how we use it to implement the personalized PageRank algorithm and identify anomalous providers.
As a result of our pre-processing step we ended up with a rather large graph, comprised of ~446K nodes and ~673M edges.
Now let’s take a look at two example anomalies that out algorithm has discovered:
Our first example is from the Ophthalmology specialty, a cluster of 38715 medical providers. Running personalized-PageRank on this group, we see an internal medicine provider that is anomalous due to the following list of procedures billed to Medicare:
|Internal eye photography|
|Cmptr ophth img optic nerve|
|Echo exam of eye thickness|
|Eye exam new patient|
|Eye exam established pat|
|After cataract laser surgery|
|Eye exam & treatment|
|Eye exam with photos|
|Cataract surg w/iol 1 stage|
|Visual field examination(s)|
Clearly a good find, as these procedures seem relevant for an eye specialist (Ophthalmologist) but not for internal medicine.
Our second example is from the Plastic Surgery specialty, a cluster of 1228 medical providers. Running personalized-PageRank on this group, we see an Otolaryngologist that is anomalous, with the following list of procedures billed to Medicare:
|Skin tissue rearrangement (multiple variants)|
|Cmptr ophth img optic nerve|
|Biopsy skin lesion|
Again, we see a set of procedures that belong to plastic surgery, but performed here by an Otolaryngologist.
As mentioned above, in practice the resulting PageRank scores (one score per specialty) for each provider are not used in isolation, but instead integrated into an existing anomaly detection algorithm as additional features.
In this blog post we described our approach in generating a similarity graph between providers and using Personalized PageRank to identify anomalies in healthcare payment data.
In the 2nd part of this blog post series, we will demonstrate in detail how to use Apache Hadoop and Apache PIG to compute the similarity graph at scale.
And finally, in the 3rd part of this blog post series, we will demonstrate how to use SocialLite to compute personalized PageRank on the graph and identify anomalies.
All the code associated with this blog post series is available on Github.