October 13, 2016

Why are we treating Data like a Picasso?


Provenance, Lineage & Chain of Custody

The models of Provenance, Lineage and Chain of Custody are used in fine art to determine when a piece was created, the sequence of locations where it was held, how it was touched along the way, and who has owned it since creation, all with the purpose of authenticating the piece. What does this have to do with boring data?

It turns out many decisions which affect our daily lives are made using a single final result – or score – which is derived from many other pieces of data. What if one of those pieces of data was wrong or stale? This could lead to “Bad Data”, and the consequences can range from the inconvenient to the catastrophic. We must understand the data components used to calculate a final number to ensure the result is valid and current; this is why we need to adopt the models of Data Provenance, Data Lineage and Data Chain of Custody, and make them an intrinsic part of any data-driven decision.

Let me start with a few examples:

  • In September 2008, a report flashed across trading screens saying United Airlines had filed for bankruptcy, provoking investor panic and sending UAL stock plummeting more than 75%. The (undated) article, which appeared on the list of top news stories from Google News, turned out to be six years old, concerning the 2002 bankruptcy of UAL, United Airlines’ parent company.
  • In May 1999, five US JDAM guided bombs hit the Chinese embassy in Belgrade during the NATO bombing. George Tenet, director of the CIA, attributed this to three basic failures: first, they had the wrong coordinates; second, none of the military databases used to validate the targets contained the correct information; third, nowhere in the target review process was either of the two mistakes detected.
  • Bad data has been widely accepted as a major factor in the Financial Crisis of 2008. In “How Wall Street Lied to Its Computers” (New York Times Bits Blog, September 18, 2008), Saul Hansell writes: “The people who ran the financial firms chose to program their risk-management systems with overly optimistic assumptions and to feed them oversimplified data.” Financial regulators seem to agree and have passed a series of regulations related to data as well (see The Tortoise and the Hare in Wall Street).

Estimates of the cost of “Bad Data” range from TDWI’s (The Data Warehousing Institute) figure of $611 billion each year for U.S. firms to IBM’s $3.1 trillion per year. Either figure is simply staggering, not to mention the individual lives affected.

Data Governance

The causes of Bad Data typically fall into these categories:

  1. Bad Source: Data sourced from the wrong place, or entered incorrectly
  2. Undocumented Alteration: Data which is altered along the way and not documented
  3. Wrong Use: Data modified for a specific purpose which does not fit other uses
  4. Stale: Data that is outdated
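To make the categories concrete, here is a minimal sketch of automated checks for two of them, Bad Source (1) and Stale (4). The approved-source list, the freshness window, and the record field names are all illustrative assumptions, not part of any standard:

```python
from datetime import datetime, timedelta, timezone

APPROVED_SOURCES = {"trades_db", "risk_feed"}  # assumption: a vetted source list
MAX_AGE = timedelta(days=30)                   # assumption: freshness policy

def validate(record, now=None):
    """Flag two Bad Data categories: Bad Source (1) and Stale (4)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if record["source"] not in APPROVED_SOURCES:
        problems.append("bad source")
    if now - record["as_of"] > MAX_AGE:
        problems.append("stale")
    return problems

# A record with a 2008 timestamp is long past the freshness window:
stale = {"source": "trades_db",
         "as_of": datetime(2008, 9, 1, tzinfo=timezone.utc)}
print(validate(stale))  # → ['stale']
```

Categories 2 and 3 (undocumented alteration, wrong use) cannot be caught by record-level checks like these; they require the audit trail and lineage tracking discussed next.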

The right solution needs to address all these issues under the umbrella of Data Governance, and it must provide a full audit trail to record and verify all events that could change every piece of data going into a meaningful calculation. It must enable enterprises to have the proper tracking and monitoring of data via Data Provenance, Data Lineage, and Data Chain of Custody.


Data Provenance refers to the “origin” and “source” of data – where a piece of data came from and the process by which it came to be in its present state.

Data Lineage is the process of tracing and recording the origins of data and its movement between databases or systems; it tracks the data life cycle from its origin to its destination over time, and what happens as it goes through diverse processes.

Chain of custody refers to the indelible record that captures the original data, who may have accessed or modified it during its lifetime, records how the data changed, and where and when there was a transfer of possession.
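The “indelible record” part of chain of custody can be made tamper-evident by hash-chaining the log entries, so that rewriting any past entry invalidates everything after it. The sketch below assumes a simple in-memory log; the field names (actor, action, payload) are illustrative, not a standard schema:

```python
import hashlib, json

def append_entry(log, actor, action, payload):
    """Append a custody entry that embeds the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"actor": actor, "action": action,
            "payload": payload, "prev": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return log

def verify(log):
    """Recompute every hash; any tampering breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, "ingest", "RECEIVE", {"price": 12.5})
append_entry(log, "etl", "MODIFY", {"price": 12.75})
print(verify(log))                # → True for an untouched log
log[0]["payload"]["price"] = 99   # a silent alteration...
print(verify(log))                # → False: the tampering is detected
```

This is the same design idea that makes a custody record trustworthy: possession transfers and modifications are recorded in a structure that cannot be quietly rewritten.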

Data provenance needs to allow the user to see how a piece of data flowed through the system, replay it at any stage in the flow, store what happened to the data before and after key stages, thereby simplifying data flows that are often large, complex directed graphs involving transformations, forks, and joins.

Apache NiFi

A great solution to this problem came from an unexpected source, the National Security Agency (NSA). The NSA could not find a commercial solution which had at its core the data governance, security and audit capabilities they needed to move massive amounts of data securely from a multitude of sources, so they decided to implement it in-house more than ten years ago. The project is called Apache NiFi, and it was submitted to the Apache Software Foundation in November of 2014 as part of the NSA Technology Transfer Program, making it an open source software project.


Apache NiFi was implemented to solve two basic problems: first, to move massive amounts of data, from many sources and varieties, securely and effectively; and second, to have Data Governance built directly into the system to trace the data from beginning to end.

Every piece of data that flows through Apache NiFi is recorded for chain of custody, lineage and data provenance analysis.


Once a piece of data is chosen, its lineage can be inspected further. The picture shows how a piece of data was received, forked and routed between systems.

By further inspecting this flow, one can gain provenance information on how that piece of data was handled and processed along the way. This means full knowledge of where the data came from, who modified it along the way, and how each reported number is calculated.
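Tracing a piece of data back through a received/forked/routed flow amounts to walking a lineage graph backwards. The event types below (RECEIVE, FORK, ROUTE) mirror those NiFi records in its provenance repository, but the in-memory event format and the traversal are an illustrative sketch, not NiFi's actual API:

```python
# Hypothetical provenance events: each event consumes and produces flowfiles.
EVENTS = [
    {"id": "e1", "type": "RECEIVE", "in": [],     "out": ["f1"]},
    {"id": "e2", "type": "FORK",    "in": ["f1"], "out": ["f2", "f3"]},
    {"id": "e3", "type": "ROUTE",   "in": ["f2"], "out": ["f4"]},
]

def ancestry(flowfile_id, events=EVENTS):
    """Walk backwards from a flowfile to the events that produced it."""
    produced_by = {out: ev for ev in events for out in ev["out"]}
    trail, frontier, seen = [], [flowfile_id], set()
    while frontier:
        ff = frontier.pop()
        ev = produced_by.get(ff)
        if ev and ev["id"] not in seen:
            seen.add(ev["id"])
            trail.append(ev)
            frontier.extend(ev["in"])   # keep walking toward the origin
    return [ev["type"] for ev in reversed(trail)]

print(ancestry("f4"))  # → ['RECEIVE', 'FORK', 'ROUTE']
```

The result reads like the lineage view in the NiFi UI: the data was received, forked, then routed before reaching its destination.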


Data is only reliable when the sources and processes used to create a result set are traceable, reproducible, and visible to those responsible for the results. This requires an infrastructure designed from the ground up to implement proper Data Governance, track data through all transformations from source to end result, and guarantee that the provenance records cannot be rewritten.

To quote Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves.”



