March 09, 2015

Journey to a Health Care Data Lake: Hadoop at Mercy

Paul Boal, Director of Data Management & Analytics at Mercy, is our guest blogger. He shares his thoughts and insights about Apache Hadoop, Hortonworks Data Platform and Mercy’s journey to the Data Lake.

Technology at Mercy

Mercy has long been committed to using technology to improve medical outcomes for patients. We were among the first health care organizations in the U.S. to have a comprehensive, integrated electronic health record (EHR) providing real-time, paperless access to patient information.

We use an EHR from Epic Systems. Every patient activity is entered into the Epic database, including both clinical and financial interactions. All reporting and analysis against the Epic database is done via an associated Oracle-based data warehouse called Clarity. At Mercy’s size, Clarity poses the usual challenges associated with data warehouses: it is expensive to scale, it requires a rigid data schema, and it is slow for some queries.

To overcome these challenges, Mercy has partnered with Hortonworks to create the Mercy Data Library, a Hadoop-based data lake running on Hortonworks Data Platform (HDP). The Data Library will contain volumes of batch data extracts from relational systems like Clarity and Lawson as well as real-time data directly from Epic. We will soon integrate other data sources, including social media and weather information for specialized projects.

The strength of Hadoop as a data platform is its ability to ingest and combine data sets from all these sources and formats. Combining these data sets in a common platform lets us ask questions that we couldn’t ask previously, and ask them at an increasingly large scale. Because storage on the platform is inexpensive, we can keep information that we might otherwise have discarded at a higher storage cost.

Intensive Data for Intensive Care

To understand the advanced analytic applications that we plan at Mercy using HDP, take our patient vitals project as an example. Today, when a patient is in the ICU, the monitoring devices send a record of the patient’s vitals to the EHR once every second. Periodically, a nurse in the ICU will review the patient’s vitals in the EHR and select one set of readings as a “good reading.” All of the other data is erased from the system. There are very good reasons for this practice when Epic is part of our data architecture:

  • First, each ICU patient generates a lot of data. The practice of selecting a particular reading reduces the scope of the data collected by almost three orders of magnitude (15 minutes multiplied by 60 readings per minute = 900 readings, which are then reduced to 1).
  • Second, patient data is inherently noisy. Monitors fall off. Patients take off their sensors. Patients go to the bathroom. We need the ability to save a data point that is indicative of the 15-minute period.
  • Third, the data point selected usually does a good job of summarizing the patient’s state over a 15-minute interval. Many conditions can be detected from changes at this time scale.
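The arithmetic in the first bullet can be sketched in a few lines of Python. This is only an illustration, not Mercy's actual process: the function names are hypothetical, the heart-rate values are synthetic, and the median is an assumed stand-in for the nurse's judgment in picking a "good reading."

```python
import math
from statistics import median

READINGS_PER_MINUTE = 60  # one reading per second, as described above
WINDOW_MINUTES = 15       # the review interval used in the example above


def readings_per_window(per_minute=READINGS_PER_MINUTE, minutes=WINDOW_MINUTES):
    """Number of raw readings that fall inside one review window."""
    return per_minute * minutes


def summarize_windows(readings, window_size):
    """Collapse per-second readings into one value per window.

    The median is a hypothetical stand-in for the nurse's selection.
    """
    return [
        median(readings[start:start + window_size])
        for start in range(0, len(readings), window_size)
    ]


window = readings_per_window()

# Synthetic heart-rate values covering two 15-minute windows.
heart_rate = [70 + (i % 5) for i in range(2 * window)]
kept = summarize_windows(heart_rate, window)

# Orders of magnitude by which each window shrinks (900 readings down to 1).
reduction = math.log10(window)

print(f"{len(heart_rate)} raw readings -> {len(kept)} retained")
print(f"reduction: about {reduction:.2f} orders of magnitude")
```

Each window of 900 readings collapses to a single value, which is where the "almost three orders of magnitude" figure comes from.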

However, the frequency of readings captured in Clarity doesn’t allow analysis of some questions. What if a researcher were interested in determining which medicines bring down fever fastest? The readings that are recorded in Epic are not fine-grained enough for the researcher to determine the efficacy of the medicine over seconds or a few minutes.

Also, the noisiness of the vital readings may give the clinical staff a valuable indication about how much the patient is moving around within those fifteen minutes. There may also be a correlation between movement and heart rate, breathing, or pain. But without detailed readings, these correlations may remain hidden behind the coarseness of the data that we were collecting with Epic.
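As a rough illustration of why the detailed readings matter here, the sketch below computes the spread of readings inside each 15-minute window. It is a minimal example under stated assumptions: the data is synthetic, the function name is hypothetical, and using the standard deviation as a proxy for patient movement is an assumption made for illustration, not a validated clinical measure.

```python
from statistics import pstdev

WINDOW_SIZE = 900  # 15 minutes of one-per-second readings


def window_variability(readings, window_size=WINDOW_SIZE):
    """Population standard deviation of the readings in each window."""
    return [
        pstdev(readings[start:start + window_size])
        for start in range(0, len(readings), window_size)
    ]


# Synthetic data: one window with a perfectly steady signal, and one
# window that swings 8 units above and below the same average.
calm = [72.0] * WINDOW_SIZE
restless = [72.0 + (8.0 if i % 2 else -8.0) for i in range(WINDOW_SIZE)]

spread = window_variability(calm + restless)
print(spread)
```

Both windows have the same average, so a single retained reading per window could look identical for the two. The difference only shows up when the per-second readings are kept.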

Finally, the Clarity reporting database is updated only once per night with the previous day’s Epic data. For data analysis to have an immediate impact on patients under our care, the data used for decision making has to be nearly real-time.

With our Hadoop-based Data Library we hope to more closely approach a real-time data-on-demand model for researchers and clinicians. We currently use a combination of Apache Sqoop, Storm and HBase for more granular updates. These apply updates every hour, and we expect to shorten this to only two or three minutes in the future.

Lessons Learned

One important thing that we’ve learned is to not neglect the knowledge already in the existing Clarity data model. Instead, we try to leverage that knowledge when we replicate the data into Hadoop. We wrap the Oracle data with additional metadata, allowing us to introduce functionality and features not available via Clarity into our analysis and reports.

In addition, we use Apache Hive extensively at Mercy. Hive has allowed us to capitalize on our familiarity with SQL and the scalability of our Hadoop data lake.

While open source is not necessarily a priority for Mercy, we have benefited significantly from the rate of innovation in the open-source Hadoop ecosystem. We are also grateful to have a partner in Hortonworks, whose attitudes on effectively servicing both their customers and the community have created a strong customer relationship for Mercy.

About Mercy

Mercy is the fifth largest Catholic health care system in the U.S. and serves millions annually. Mercy includes 35 acute care hospitals, four heart hospitals, two children’s hospitals, three rehab hospitals and two orthopedic hospitals, nearly 700 clinic and outpatient facilities, 40,000 co-workers and more than 2,000 Mercy Clinic physicians in Arkansas, Kansas, Missouri and Oklahoma. Mercy also has outreach ministries in Louisiana, Mississippi and Texas. For specific information about Mercy’s commercial technology services, visit



Will Byron says:

Great innovation! We are working on a similar project capturing alarm data from devices. Interested in what modules on epic are used to capture the device information in clarity? Also what front end tools are you using to ask questions to hive tables? Thanks

Will Byron says:

Not sure if my previous post went through. We were interested through what module the device data got into clarity. What tables in clarity have that data. Also what tools are used to analyze the hadoop data in the Hive tables. Thanks.

Steven Kenneth says:

So generally, the vitals of a patient in the ICU are updated every 15 minutes, is that right?
