October 20, 2015

Is a Lake Big Enough to House Your Ocean of Data?

Contrary to popular belief, Hadoop was not the elephant-in-the-china-shop that marauded and disrupted the data center. The real culprit is data itself and how it has exploded in volume. The past two or three years have seen a rise in the number of successful enterprise Hadoop projects that tackle this explosion of big data. These large volumes of data, the emergence of Hadoop, and the need to store all the siloed data in one place have prompted the phenomenon known among enterprises as the Data Lake.

Is the Data Lake an effective catchment for all of the enterprise data?

Yes and no. Data lakes are a good home for current, inter-related data, but they do not address the need for an enterprise-wide data management system:

  • Because the data lake holds raw data of many types, business users cannot get controlled access to risk-free, secure, governed, and curated data with the semantic consistency of an enterprise data warehouse.
  • Enterprise data today is heterogeneous and locked in disparate sources, and the data from these systems often conflict with one another.
  • A data lake is agnostic to the type of data it receives; without governance, descriptive metadata, and a mechanism to maintain that metadata, a lake with too much data can easily turn into a data swamp.
  • Hadoop and related technologies are still nascent even among early adopters, who are mostly conversant with SQL for data discovery and must be trained in Pig and MapReduce for data access. This slows down time-to-value for enterprises.
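One common guard against the data-swamp problem above is to require descriptive metadata at ingest time. As a minimal sketch (the catalog, dataset names, and fields are all hypothetical, not part of any product mentioned in this post), a registry can refuse any dataset that arrives without an owner and a schema, and then support simple data discovery:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset landed in the lake."""
    name: str
    owner: str
    schema: dict          # column name -> type
    ingested: date
    description: str = ""

class LakeCatalog:
    """Registry every ingest passes through, so raw files never
    land without an owner and a schema."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        if not entry.owner or not entry.schema:
            raise ValueError("datasets without an owner or schema are rejected")
        self._entries[entry.name] = entry

    def find(self, column: str):
        """Data discovery: which registered datasets expose this column?"""
        return [e.name for e in self._entries.values() if column in e.schema]

catalog = LakeCatalog()
catalog.register(DatasetEntry(
    name="supplier_feed",
    owner="supply-chain-team",
    schema={"supplier_id": "string", "spend": "decimal"},
    ingested=date(2015, 10, 1),
))
print(catalog.find("supplier_id"))  # -> ['supplier_feed']
```

Real catalogs (Hive Metastore, Apache Atlas) add lineage and access control on top, but the principle is the same: no metadata, no landing.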

Big Data Virtualization to harness the power of the Data Lake

VHA is the largest member-owned healthcare company in the US, delivering industry-leading supply chain management and clinical improvement services to its members. The company had its product, supplier, member, and other data spread across multiple siloed sources.

The value of consolidating this disparate data into a data lake was not lost on VHA, and the company adopted the Hortonworks Data Platform to let business users discover related data and provide services to their members. Because of its previous success with data virtualization on the Denodo Platform, VHA decided to use data virtualization so that business users could discover data with familiar SQL, abstracting away direct access to Hadoop. With the Denodo Platform, users can combine the several types of data that float in a data lake and offer them all as one integrated data set to the consuming application.

Big data technologies such as Pig and MapReduce let users process data in a Hadoop cluster, and SQL-on-Hadoop engines such as Impala add SQL access, but each tool brings its own learning curve and training burden. The Denodo Platform instead offers a data abstraction layer over Hadoop, NoSQL, and traditional enterprise repositories, and allows the creation of virtual, canonical business views of data to address a broad spectrum of use cases, including big data analytics and agile BI solutions.
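The Denodo Platform itself is proprietary, but the idea of a virtual, canonical business view can be sketched in a few lines. In this hypothetical example (all table names, column names, and data are illustrative), two in-memory SQLite databases stand in for a data-lake source and a relational source, and a single function presents their join as one integrated result, hiding where each piece physically lives:

```python
import sqlite3

# Stand-in for data landed in the lake: raw member activity.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE member_events (member_id TEXT, spend REAL)")
lake.executemany("INSERT INTO member_events VALUES (?, ?)",
                 [("m1", 120.0), ("m2", 75.5)])

# Stand-in for a traditional relational repository: member master data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE members (member_id TEXT, name TEXT)")
warehouse.executemany("INSERT INTO members VALUES (?, ?)",
                      [("m1", "Mercy Health"), ("m2", "St. Luke's")])

def member_spend_view():
    """Canonical business view: joins both sources and returns one
    integrated data set; consumers never touch the sources directly."""
    names = dict(warehouse.execute("SELECT member_id, name FROM members"))
    rows = lake.execute(
        "SELECT member_id, spend FROM member_events ORDER BY member_id")
    return [(names.get(mid, "?"), spend) for mid, spend in rows]

print(member_spend_view())  # -> [('Mercy Health', 120.0), ("St. Luke's", 75.5)]
```

A real virtualization layer does this federation at query time, with pushdown optimization and caching, but the consumer-facing contract is the same: one view, many sources.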

Are you still learning about the Data Lake? Wondering how it can help your organization manage and leverage massive amounts of data? Learn from VHA experts who detail their use of Hadoop and Data Virtualization in this webinar titled “Hadoop and Data Virtualization – A Case Study by VHA”. Watch it below:

