April 06, 2018

The Emergence of Data Marketplaces

When we talk about the exponential growth of data in today’s digital world, the word “exponential” seems like an understatement. In 2017 alone, more data was generated than in the past 5000 years combined, and this will rise tenfold in less than a decade. A significant contributor to that data growth will be IoT. A 2017 IDC report predicts that by 2025, more than a quarter of the data created across the world will be real-time in nature, and real-time IoT data will make up more than 95% of this.

With this type of data growth, enterprises face a few challenges:

  • Data sprawl – Data is spread across the on-premises infrastructure as well as across multiple clouds. New data sources pop up daily, feeding new data into the enterprise. Ingesting all the different types of data across all these locations is quite challenging.
  • Dark Data – When an organization collects data faster than it can process or analyze it, the organization starts to accumulate dark data. Dark data is the data residing in your data stores that is not being used in any way to extract valuable business, customer, and operational insights. In 2016, the world produced 16 ZB (yup, 16 million Petabytes) of data, but only 1% was analyzed.
  • Data Decay – With the fast pace at which businesses are moving and technologies are evolving, data retains its value only for a certain period of time within a certain context. The insights that you derive from real-time data streams are termed perishable insights, because an insight is valuable only within a certain time window, after which the opportunity to use that actionable intelligence is lost.

In essence, access to the correct set of data at the correct time and in the correct context is essential for taking corrective action within the opportunity window. But even though an enterprise may be unable to process all the data it possesses or receives, it is still pertinent for the enterprise to look at relevant data sets outside the organization as well. With mainstream use cases for AI, ML and Data Science on the rise, enterprises need to feed such algorithms larger sample sets to get more accurate predictive models for decision making. Look at the fatal incident with the autonomous Uber car in the news recently. It shows that our ML models are still far from perfect enough to be trusted with mission-critical or life-saving activities. The more such real-life data is made available to these algorithms, the better they learn and execute.

Enterprises are willing to pay money to gain access to such external data sets. Every digital enterprise today is sitting on a gold mine: its own data. I am not talking about customer data or PII (Personally Identifiable Information). Usage metrics, consumer behavior, demographic information, machine test data and many other such data sets can, with a little bit of anonymization, be monetized by enterprises to their own benefit.
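As a rough illustration of what “a little bit of anonymization” might look like before a usage record is offered for sale, here is a minimal Python sketch. The field names, the sample record and the secret key are all hypothetical. The sketch drops direct identifiers and replaces a device ID with a keyed hash, which is strictly pseudonymization; it does not by itself protect against re-identification through the remaining fields.

```python
import hashlib
import hmac

# Hypothetical field names; a real schema would differ.
PII_FIELDS = {"name", "email", "phone"}
ID_FIELDS = {"device_id"}


def anonymize(record, secret_key):
    """Drop direct identifiers and replace IDs with keyed hashes."""
    cleaned = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            continue  # remove direct identifiers entirely
        if key in ID_FIELDS:
            # Keyed hash (HMAC-SHA256): the same ID always maps to the same
            # pseudonym, but it cannot be recomputed without the secret key.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[key] = digest.hexdigest()[:16]
        else:
            cleaned[key] = value  # usage metrics pass through unchanged
    return cleaned


# Illustrative record, not real data.
record = {
    "name": "A. Driver",
    "email": "a@example.com",
    "device_id": "CAR-1234",
    "avg_speed_kmh": 62.5,
}
shared = anonymize(record, secret_key=b"rotate-me")
```

Because the pseudonym is stable, a data consumer can still group readings by device without ever seeing the real identifier.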

On a related note, the Things economy is paving the way for an Ecosystem economy. IoT is only as good as the data it generates and processes. Every connected device is made up of several connected OEM components internally. A connected car, for example, is made up of several connected OEM components from various manufacturers. The sensor data from one set of devices within that connected ecosystem can be valuable for another set of devices within the same ecosystem. It is on this premise that data marketplaces are gaining more relevance and prominence today.

A data marketplace is a platform where data providers (those that own or license the data) can offer their data sets for a price. Data consumers can purchase or subscribe to such data sets and use them for their research, modeling or analysis. The marketplace itself may be hosted by a third party providing a secure platform on which such data exchanges can happen smoothly. Data is made available through a data catalog. Data can be sold in chunks or segments. Data can also be made available as Data-as-a-Service.

There may be other service providers on such platforms, such as data aggregators that collect and combine specific data sets from multiple providers to create a logical dataset aggregation. This is made possible by metadata-driven models that allow for data tagging within the catalog. So, when looking for “tire wear-and-tear data”, data related to the tag “Tires” can come from multiple data sets across multiple data providers. The marketplace may also have data quality service providers that offer to cleanse data sets for a price. Some marketplaces even crowdsource data quality services.
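To make the tagging idea concrete, here is a minimal Python sketch of a tag-indexed catalog and an aggregator that pulls together listings for one tag across providers. It is only an illustration under assumed names: the classes, fields, providers, data sets and prices are all made up and do not describe any particular marketplace product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetListing:
    """One catalog entry; all field names here are illustrative."""
    provider: str
    name: str
    price_usd: float
    tags: frozenset


class DataCatalog:
    """In-memory catalog that indexes listings by metadata tag."""

    def __init__(self):
        self._by_tag = {}

    def publish(self, listing):
        # Index the listing under each of its tags (case-insensitive).
        for tag in listing.tags:
            self._by_tag.setdefault(tag.lower(), []).append(listing)

    def find(self, tag):
        # Return every listing carrying the tag, regardless of provider.
        return list(self._by_tag.get(tag.lower(), []))


def aggregate(catalog, tag):
    """Logical aggregation: data set names for a tag, grouped by provider."""
    grouped = {}
    for listing in catalog.find(tag):
        grouped.setdefault(listing.provider, []).append(listing.name)
    return grouped


# Hypothetical providers and data sets.
catalog = DataCatalog()
catalog.publish(DatasetListing("TireCo", "tread-depth-2017", 500.0, frozenset({"Tires", "Wear"})))
catalog.publish(DatasetListing("AutoOEM", "wheel-sensor-logs", 1200.0, frozenset({"Tires", "Telemetry"})))
catalog.publish(DatasetListing("AutoOEM", "engine-temps", 800.0, frozenset({"Engine"})))

tire_data = aggregate(catalog, "Tires")
```

Here `tire_data` groups the two “Tires” listings under their respective providers, while the engine data set is left out. The aggregation is purely logical: nothing is copied, the aggregator only records which listings belong together.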

Now, you may ask, who would want to purchase such data? Companies looking for new business opportunities, new markets, competitive threats, etc. are ideal data consumers. Companies planning to use such data sets for their product enhancements are ideal marketplace customers. And companies using such information to offer enhanced or premium services to their customers are ideal for this model too. An example would be insurance companies using connected car usage metrics from automotive vendors to create more accurate and personalized insurance quotes for their customers in various regions.

If all this sounds so ideal, why aren’t we seeing more data marketplaces yet? There are a few challenges to the adoption of this model. I will go into more detail in my next post on the challenges and how emerging technologies like Blockchain can enable marketplaces.


Shankaran Sitarama says:

Awesome blog Dinesh. The term Data Marketplace has been around for a while now, but I guess it is gaining more maturity and getting ready from the data side of things (the things described in this blog of yours like Data Quality svcs, Data aggregators, etc.). On the marketplace side, the serious potential comes from Blockchain technologies and that is what is super exciting, at least for me for the past 2 yrs now. So, looking forward to your next blog dealing more with those aspects.

I think, from the data side of things tho, one other important impediment, if you will, in my opinion is the “data obfuscation” aspect, which enables enterprises to trust and safely let out that data without the fear of letting out some trade secrets. This is not just the confidentiality aspect of the data content, but the patterns and the other things that can be derived out of it. There is a fear that using Data Analytics, the insights that can be gained could perhaps lead to some secrets being exposed. I think that intelligent masking of data (obfuscation) in a way that it is still usable but perhaps does not allow detailed insights into the business processes / secrets of a company is extremely important and I see that as a major barrier to a mad rush of such marketplaces. More maturity in this aspect is to be seen.

On the blockchain and marketplace side of things, it is definitely less mature.

My 2 cents on this.

Dinesh Chandrasekhar says:

Thanks for your comment, Shankar. Very valuable insights. While I do agree with you about an inherent fear of data privacy or competitive intrusion in this model, it is an ecosystem whose members should evolve and learn to work with each other on a certain degree of trust. We do have data tokenization vendors that can help with data obfuscation for broader consumption. Moreover, this is where blockchains bring in the “trust and identity” model to ensure that you sell data to the ones you can trust in your ecosystem. Partners stand to benefit from each other’s data and grow mutually. I can foresee an ecosystem-driven blockchain model where the providers and consumers are very close-knit in the initial phases of adopting such a model. Later on, as technologies evolve and we have more accountable trust models, the closed ecosystems will start to open up. I also encourage you to look at IOTA and Ocean Protocol, two blockchain initiatives that are already delivering more than prototypes in this area.

Amar Padmanabha says:

Speaking of Blockchain + data marketplaces, you might want to take a look at the Enigma protocol, which brings data marketplaces with support for privacy & scalability to the Ethereum public network. The Enigma data marketplace went live on the public Ethereum network recently. “The on-chain logic for the marketplace is coded in smart contracts deployed on the Ethereum network, and is operated directly with Enigma tokens (ENG). Presently all data sets are provided off-chain by several providers, and storage is managed independently from the logic embedded in the smart contract.”
A wide range of applications from various industries which require data marketplaces with privacy can leverage Enigma.

Dinesh Chandrasekhar says:

Thanks Amar. Ocean Protocol, which I mentioned above, has a similar model to Enigma. Having data on-chain is definitely not the way to scale, at least the way Blockchains are today. With the next evolution of Blockchains, we might start seeing other models. But, until then, it makes perfect sense to keep the datasets off-chain but leverage the Blockchain for tracking, security, payments and governance to a fair degree.
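The off-chain/on-chain split described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: `ToyLedger` is a hypothetical append-only, hash-chained list standing in for a blockchain, not the API of Enigma, Ocean Protocol or Ethereum, and the data set, buyer and price are made up. The data itself stays in an ordinary store; only its SHA-256 fingerprint and the sale details go on the ledger.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """SHA-256 digest that identifies a data set without revealing it."""
    return hashlib.sha256(data).hexdigest()


class ToyLedger:
    """Append-only, hash-chained list standing in for a blockchain."""

    def __init__(self):
        self.entries = []

    def record(self, dataset_id, data_hash, buyer, price):
        # Each entry commits to the previous one, so history cannot be
        # rewritten without changing every later entry hash.
        prev = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        body = {
            "dataset_id": dataset_id,
            "data_hash": data_hash,
            "buyer": buyer,
            "price": price,
            "prev": prev,
        }
        encoded = json.dumps(body, sort_keys=True).encode("utf-8")
        entry_hash = hashlib.sha256(encoded).hexdigest()
        self.entries.append({**body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self, dataset_id, data):
        # True if the delivered bytes match a fingerprint on the ledger.
        data_hash = fingerprint(data)
        return any(
            e["dataset_id"] == dataset_id and e["data_hash"] == data_hash
            for e in self.entries
        )


# The data set lives off-chain; here a dict plays the role of that storage.
off_chain_store = {"tire-wear-v1": b"sensor,reading\nfl,0.82\nfr,0.79\n"}

ledger = ToyLedger()
ledger.record(
    "tire-wear-v1",
    fingerprint(off_chain_store["tire-wear-v1"]),
    buyer="InsureCo",
    price=250,
)
```

A buyer who later receives the file can call `ledger.verify("tire-wear-v1", received_bytes)` to check that it is the data set that was actually sold, without the data ever having been placed on the ledger.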
