This is the first part of a series written by Charles Boicey from the UC Irvine Medical Center. The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.
The Hadoop strategy at UC Irvine Medical Center started with a single observation in early 2011. While using Twitter, Facebook, LinkedIn and Yahoo, we came to the conclusion that healthcare data, although domain specific, is structurally not much different from a tweet, Facebook posting or LinkedIn profile, and that the environment powering these applications should be able to do the same with healthcare data.
Healthcare data shares many of the qualities of the data found in the large web properties. Each has a seemingly infinite volume of data to ingest, and it comes in all types and formats: structured, unstructured, video and audio. We also noticed the importance of the near-zero latency with which data was not only ingested but also rendered back to users. Intelligence was also apparent, in that algorithms were employed to make suggestions such as “people you may know”.
We started to draw parallels between the challenges we were having and the typical characteristics of Big Data: volume, velocity and variety.
Our first project was to build an environment capable of ingesting Continuity of Care Documents (CCD) via a JSON pipeline, storing them in MongoDB and then rendering them via a web user interface that had search capabilities. From that initial success, project Saritor was launched.
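To make the first step of that pipeline concrete, here is a minimal sketch of turning a CCD into a JSON document. It is an illustration under stated assumptions, not the code we ran: the sample XML is a heavily simplified, hypothetical fragment (a real CCD is an HL7 CDA document with many more sections), and the function name `ccd_to_document` and the output field names are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, hypothetical CCD-like fragment with made-up patient data.
# Only the namespace URI is taken from HL7 CDA; a real CCD is far richer.
SAMPLE_CCD = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Continuity of Care Document</title>
  <recordTarget>
    <patientRole>
      <id extension="12345"/>
      <patient>
        <name><given>Jane</given><family>Doe</family></name>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""

# Prefix-to-URI map used by ElementTree to resolve namespaced paths.
NS = {"hl7": "urn:hl7-org:v3"}


def ccd_to_document(xml_text):
    """Flatten a few header fields of the sample XML into a plain dict.

    The output field names are assumptions chosen for this sketch.
    """
    root = ET.fromstring(xml_text)
    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    name = role.find("hl7:patient/hl7:name", NS)
    return {
        "title": root.findtext("hl7:title", default="", namespaces=NS),
        "patient_id": role.find("hl7:id", NS).get("extension"),
        "given_name": name.findtext("hl7:given", default="", namespaces=NS),
        "family_name": name.findtext("hl7:family", default="", namespaces=NS),
    }


document = ccd_to_document(SAMPLE_CCD)
json_text = json.dumps(document, sort_keys=True)  # JSON form of the record
print(json_text)
```

A dictionary like `document` is the shape a document store expects. With a running MongoDB instance and the `pymongo` driver, it could be stored with `collection.insert_one(document)`; that step is left out here so the sketch runs with the standard library alone.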
Saritor is the Roman god of cultivation, in this case the cultivation of healthcare data for the purpose of rapidly progressing through the continuum from data to information, to knowledge, to wisdom. We saw this project as a vehicle for demonstrating the value of Applied Clinical Informatics and promoting the translational effects of rapidly moving from “code side to bedside”.
Why Saritor? The Electronic Medical Record (EMR) cannot handle complex operations such as anomaly detection, machine learning, building complex algorithms or pattern set recognition, and the Enterprise Data Warehouse (EDW) supports quality, operations, clinicians and researchers. Like many organizations with data warehouses, we run ETL processes at night to minimize the load on the production systems. We have some real-time interfaces with the data warehouse, but not all data is ingested in real time. As a result, our data suffers from a latency of up to 24 hours in many cases, making this environment suboptimal. An adjunctive environment is needed to fill in the gaps.
Why Apache Hadoop?
Hadoop has a very attractive scale-to-cost ratio because A) it is open source and B) the server requirements are minimal, and virtual machines are an option. We currently deploy eight nodes, which is a far cry from the multiple 4000+ node clusters that Yahoo employs, but our small environment is providing us big value.
Hadoop is uniquely capable of storing a wide range of healthcare environment data, no matter the type or amount of structure. For us, this includes:
Any electronically generated data in a healthcare environment can be ingested and stored in Hadoop and, most importantly, on commodity hardware.
But wait, that’s not all. The Hadoop ecosystem is modular, and within those modules lies the functionality to build algorithms for surveillance, detection and notification of conditions such as sepsis, or for the prediction of potential 30-day readmissions. Other use cases we are working on include monitoring “Sink Time”, that is, how much time caregivers spend washing their hands; patient throughput, with the ability to capture actual hand-off times; patient scorecards pushed to the patient portal; and the ability to discover the unknown unknowns in our data.
Hadoop has also answered the problem of legacy data. UC Irvine Healthcare, like many healthcare organizations, has a legacy system whose data clinicians and researchers needed to access. Data conversion from the legacy system to the new EMR or data warehouse was not feasible. Our legacy system, like others, has the ability to print the patient record to text. For UCI that meant 1.2 million patients and over 3 million records. Those records are now in Saritor and are searchable. Solving this use case was our first deliverable with a demonstrable ROI.
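As a rough illustration of what “searchable” means for records printed to text, the sketch below builds a small in-memory inverted index and answers keyword queries against it. This is a generic technique shown under assumptions, not a description of how Saritor indexes records: the sample records, their identifiers and the function names are all hypothetical, and a cluster-scale system would distribute the same idea across nodes.

```python
import re
from collections import defaultdict

# Hypothetical sample records standing in for print-to-text output.
# These are invented strings, not real patient data.
RECORDS = {
    "rec-1": "Patient admitted with chest pain. Discharged after observation.",
    "rec-2": "Follow-up visit for chest X-ray. No acute findings.",
    "rec-3": "Patient reports knee pain after a fall.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return sorted ids of records containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # index.get avoids creating empty entries for unknown tokens.
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


index = build_index(RECORDS)
print(search(index, "chest pain"))
```

Requiring every query token to match keeps the example short; ranking, phrase matching and access control are deliberately out of scope.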
We believe that Hadoop is the right environment for developing an analytic ecosystem to aid in the delivery of quality care at the lowest possible cost, and an environment that enables clinical researchers to examine healthcare data in its entirety.
Next time we’ll dive deeper into the Saritor Hadoop ecosystem, ongoing and future development, and collaborations with our partners.