June 04, 2015

HDP for Manufacturing Yield Optimization in Pharma

This is a guest blog post from Jerry Megaro, Merck’s Director of Innovation and Manufacturing Analytics. Jerry established the practice of Data Excellence and Data Sciences within the Merck Manufacturing Division and now leads initiatives to transform Merck Manufacturing into a data-driven organization that enhances the company’s performance across the supply chain.

Hortonworks’ experience working with top pharma manufacturers indicates an exciting opportunity to improve manufacturing performance by proactively managing process variability. Vaccine production is a great example to consider, since it involves the use of live, genetically engineered molecules, as well as a highly technical manufacturing process with many steps.

As a result, manufacturers need to monitor hundreds of upstream and downstream parameters to ensure the quality and purity of the ingredients and vaccines being produced. Two batches of a particular vaccine, produced using an identical manufacturing process, can exhibit significant yield variances. This unexplained variability negatively impacts manufacturing yield and overall business performance.

The potential benefits for manufacturers are tremendous. IDC Manufacturing Insights estimates that manufacturers, on average, still sacrifice between 200 and 400 basis points in margin to adverse quality. IDC also estimates that world-class quality can translate to 20 to 30 percent more revenue from loyal customers over the lifetime of their relationship with the company, as well as improve conquest sales (taking customers from competitors) by as much as 25 percent. All of this adds up to a potential pre-tax margin improvement (assuming a manufacturer with $10B in revenue and 20% margins) of upwards of 40%.

At Merck, we generate a huge amount of data in our manufacturing operations. But despite the huge volumes, we have wrestled in the past with several key challenges that barred us from making use of all the data to completely understand all aspects of our manufacturing processes, which can in turn improve production performance.

None of these challenges is specific to pharmaceuticals or to manufacturing (my colleagues in other industries face them as well).

Challenge #1: Data Silos

The first challenge had to do with data silos. Large datasets couldn’t help us improve our yields if they were siloed across many disparate systems and data repositories, which made them extremely difficult to combine in one place for a single view of our manufacturing operations.

Merck has established many highly tuned, specialized systems to gather data. Each system gives us a different view into the manufacturing process.

We gather real-time shop floor data in time series. As we make a batch, we capture data from machine sensors to monitor values like temperature trends, humidity levels, flow rates, pressure, and agitator speeds.

We retain maintenance and calibration records on our equipment. For example, a specialized instrument like a mass spectrometer measures a concentration of off-gas. Instrument sensor data could be useful for both real-time decisions and also for historical cause-effect analysis over many batches and many years.

Throughout the various stages of our manufacturing process, we capture many quality measures on each batch. Understanding quality data is particularly important, because just one batch lost to quality issues could cost the company one million dollars and could jeopardize supply of our medicines.

Other systems manipulate and control the manufacturing facility to maintain the conditions required for the sensitive biological processes that Merck manufacturing must control very precisely. These complex processes generate huge volumes of data in a variety of data types and formats, which become siloed across and ultimately trapped in disparate manufacturing, quality and maintenance systems, each with their own separate file systems.

Today the Hadoop ecosystem lets us bring all of that diverse data together into one environment. I have long regarded this single view of data as the “Holy Grail of manufacturing process optimization”. We now have this aggregate view, which complements our existing underlying systems, and maintains the fine-grain detail of the raw data required for end-to-end process visibility and optimization.

Challenge #2: Limited Data Retention

Some manufacturing questions are asked on a batch-by-batch level, but others need to be asked over a large number of batches that span years. Two economic factors in existing data technologies obstructed our desire to retain all operational data for long-term analysis.

It is expensive to store each unit of data in the storage technologies that we’ve been using for years. Ours is a highly measured, intensely scrutinized and regulated process. We already know which data are most valuable to retain, and so it makes sense to pay a higher average cost to store that data—but there is far more data whose value is uncertain. Moreover, low-value data today may grow in value as our understanding of our processes changes. The point is this: the cost of storage constrained the amount of additional data that we would like to capture or retain electronically after a batch was produced and shipped. We may retain some paper records by buying more cabinets, but those can be very time-consuming to search.

Another cost driver was the need to transform multiple sources of raw data into a structure and format required by our existing storage platforms. We call this constraint “schema on write”. This approach requires that you know the questions to be asked of the data prior to putting the data into the correct “schema.”

Hadoop has a “schema on read” architecture that allows us to capture, store and bring together a huge variety of data in one shared environment, without first having to go through a costly and labor intensive process to create a schema for this data in advance. The questions that we will want to ask in the future will then determine the data we seek. And with Hadoop, we have far more data to explore, in a shared environment where we can join it in various ways to rapidly answer new questions that will help us improve our processes.

Because of schema on read we can now create new data sets that never existed in their natural habitats, and we can keep those as long as we need them.
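As a rough illustration of the difference, here is a minimal Python sketch of the schema-on-read idea: raw records are stored exactly as they arrive, and the fields a particular question needs are projected out only when that question is asked. The record layout and field names below are hypothetical, not Merck’s actual data model.

```python
import json

# Raw records land in storage as-is -- no upfront schema is imposed.
# Field names (batch_id, sensor, value, ts) are purely illustrative.
raw_records = [
    '{"batch_id": "B001", "sensor": "temp_c", "value": 37.2, "ts": "2015-06-01T08:00:00"}',
    '{"batch_id": "B001", "ph": 6.8, "ts": "2015-06-01T08:00:05"}',
    '{"batch_id": "B002", "sensor": "temp_c", "value": 36.9, "ts": "2015-06-02T09:00:00"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the fields this
    question needs, tolerating records that lack them."""
    for line in lines:
        rec = json.loads(line)
        if all(f in rec for f in fields):
            yield {f: rec[f] for f in fields}

# Today's question needs (batch_id, value); tomorrow's may need
# different fields from the same untouched raw records.
temps = list(read_with_schema(raw_records, ["batch_id", "value"]))
```

The raw data is never reshaped to fit one anticipated question, which is exactly what lets new questions be asked of old data later.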

Challenge #3: The High Cost of Testing Hypotheses in the Real World

Another advantage of having all of the data together is that we can investigate some of our intuitions without having to run an experiment in a physical environment.

For example, we had a belief that we were diluting ingredients by washing them with a water chase at a certain stage in the process. It would have been too risky and costly to try to test that hypothesis on the shop floor. With Hadoop we were able to collect all of the historical data to see if that hypothesis was true, without having to conduct the experiment in the plant. (The data showed that there was no effect.)
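A virtual experiment of this kind can be as simple as comparing historical yields across the two conditions. The numbers below are invented purely to illustrate the shape of the analysis, not Merck’s actual data.

```python
from statistics import mean

# Illustrative yields (arbitrary units) for two historical groups
# of batches: those processed with the water chase and those without.
yields_with_chase = [92.1, 90.8, 91.5, 92.4, 91.0]
yields_without    = [91.7, 92.0, 90.9, 91.8, 91.3]

diff = mean(yields_with_chase) - mean(yields_without)
# A difference near zero (relative to batch-to-batch spread) is
# consistent with "no dilution effect" -- the conclusion the
# historical data supported in this case.
print(round(diff, 2))
```

In practice a proper analysis would also account for the variance within each group before drawing a conclusion.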

Solution: Yield Optimization with Hortonworks Data Platform

We can only answer important questions about our highly variable, biological manufacturing process if we have enough data across that entire process. Now with Hadoop, we can ask the important questions, identify systemic patterns in the data and take advantage of those patterns by improving our processes. This valuable capability has enabled us to identify variables that have the greatest impact on product yields.

Here are some of the questions asked and answered with the help of Hadoop.

How can we predict how a piece of equipment will perform?

We mine the data from our equipment maintenance system. With more data on more instruments spanning years back in time, we can establish performance profiles for individual machines and their critical components from previously unseen patterns. These profiles could then be used to monitor streaming sensor data in real time, to proactively detect emerging problems and respond appropriately. This can substantially improve the overall productivity of operations and avoid unnecessary interruptions.
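One simple way to turn historical readings into a performance profile is a baseline-and-threshold check. The sketch below, with made-up sensor values, flags a live reading that strays too far from a machine’s historical norm; real profiles would of course be richer than a single mean and standard deviation.

```python
from statistics import mean, stdev

def build_profile(history):
    """Baseline performance profile from historical readings."""
    return {"mean": mean(history), "std": stdev(history)}

def deviates(profile, reading, k=3.0):
    """Flag a live reading more than k standard deviations
    from the machine's historical baseline."""
    return abs(reading - profile["mean"]) > k * profile["std"]

# Illustrative history for one machine's vibration sensor.
history = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.51]
profile = build_profile(history)

print(deviates(profile, 0.51))  # within the baseline: False
print(deviates(profile, 0.95))  # well outside it: True
```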

How can we enhance the yield of a particular protein in our fermentation process?

In manufacturing, you always want to control something. We have a lot of control levers to do that. We manage heat. We change the agitation rate. We change the rate at which we add ingredients.

Fermentation is one of the biological processes that we control. We cultivate yeast cells. During the biomass growth phase, we want the cells to generate more cells but without gene expression.

In the next phase we do want to transition to gene expression. This means shifting the cells’ energy from growing new cells to making the protein. We need data to manage and control that transition.

In a biological environment this phase can be quite variable and it depends on external factors. For example, salinity is an important condition to monitor. Oxygen may dissolve from gas to liquid at different rates because of agitation. The biology is constantly changing.

What we’ve discovered is that there’s often a proxy or measure that gives you a good indication of everything that’s going on in that batch, such as a respiratory quotient. We need to understand that and use it as a feedback mechanism.
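For instance, the respiratory quotient is simply the ratio of CO2 evolved to O2 consumed by the culture. A minimal sketch, with illustrative rates:

```python
def respiratory_quotient(co2_evolution_rate, o2_uptake_rate):
    """RQ = CO2 produced / O2 consumed (mol per mol).
    For pure carbohydrate oxidation RQ is about 1.0; a sustained
    shift in RQ can signal a metabolic change in the batch that
    is worth reacting to."""
    if o2_uptake_rate <= 0:
        raise ValueError("O2 uptake rate must be positive")
    return co2_evolution_rate / o2_uptake_rate

# Illustrative off-gas measurements for one sampling interval.
rq = respiratory_quotient(co2_evolution_rate=0.95, o2_uptake_rate=1.00)
```

Tracking a single summary number like this over time is what makes it usable as a feedback signal.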

Feedback control theory is an interdisciplinary branch of engineering and mathematics that deals with the behavior of dynamic systems and how their behavior is modified by feedback. Say you’re sitting in your house and it’s a hot day out and it’s getting warm inside. The feedback is the inside temperature, and you prefer 68 degrees.

The day is getting hotter, so you turn on your air conditioning. Now it’s 75 in the house and based on that feedback the AC knows to control the temperature downward to 68. There are many algorithms that are used for that feedback-control mechanism.
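The thermostat story maps directly onto the simplest of those algorithms, a proportional controller: the corrective action is proportional to the error between the setpoint and the measurement. A toy sketch with invented gain and temperatures:

```python
def proportional_control(setpoint, measured, gain=0.5):
    """One step of a proportional controller: the corrective
    action is proportional to the error (setpoint - measured)."""
    error = setpoint - measured
    return gain * error  # e.g. degrees of cooling/heating to apply

# The thermostat example: it's 75 inside, we prefer 68.
temperature = 75.0
for _ in range(20):
    temperature += proportional_control(68.0, temperature)
# After repeated feedback steps the temperature converges on 68.
```

Industrial controllers typically add integral and derivative terms (PID) to remove steady-state error and damp overshoot, but the feedback idea is the same.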

The same concept is useful in pharmaceutical manufacturing. Having access to all the production data allows us to perform a virtual experiment to determine the best operating regions of our process. This enhances our ability to understand how we may increase the yield of the proteins of interest.

Hadoop has also helped us with another process impediment: the speed of our analysis. For example, if a batch needed to be investigated for whatever reason, it could take us months to gather all the data that may exist in a combination of both paper and electronic formats and then aggregate it to understand what caused the issue of interest.

Now we are working towards having “curated data on tap” across the end-to-end process. Data lineage is a top priority. We know where the data came from. It will still take time to analyze it, but it might take a week to find answers or refine our analysis, rather than many months.

Looking Ahead

Now we’re working on building up our ability to analyze streaming data in real time. Ideally, we would want algorithms to analyze real-time data, match that to a profile of a “golden batch” that we have in history and then alert us if any batch begins to deviate from that ideal profile. We’re not there yet, but that’s where we’d like to be.
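Conceptually, the golden-batch check reduces to comparing live readings against the historical profile point by point and alerting on the first excursion. A toy sketch with invented values and a fixed tolerance (real implementations would use step-specific tolerance bands and streaming infrastructure):

```python
def check_against_golden(golden_profile, live_values, tolerance=1.5):
    """Compare a running batch, point by point, to a historical
    'golden batch' profile and report the first deviation."""
    for step, (golden, live) in enumerate(zip(golden_profile, live_values)):
        if abs(live - golden) > tolerance:
            return step  # alert: deviation detected at this step
    return None  # batch is tracking the golden profile

# Illustrative temperature trajectory (degrees C) over five steps.
golden = [30.0, 32.0, 34.5, 36.0, 37.0]
live   = [30.2, 31.8, 34.4, 38.1, 37.2]
print(check_against_golden(golden, live))  # deviates at step 3
```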

This is one of the reasons that our partnership with Hortonworks is so important. They’ve already worked with many customers across different industries to implement streaming analytics, so their guidance is valuable as we plan to extend our use of their platform to further optimize our yields. We trust their advice and count on their Hadoop expertise.

Jerry Megaro
Director Innovation & Manufacturing Analytics,


Bob Parker
Group VP Manufacturing Insights,

Grant Bodley
GM Global Manufacturing,


