February 20, 2015

ClearStory Data and Hortonworks Partnership Enables Fast-Cycle, Business-Ready Analytics on Hadoop

In this guest blog, Kumar Srivastava, senior director of product management at ClearStory Data, shares his thoughts on ClearStory’s integration with Hortonworks Data Platform (HDP).

We are excited to announce ClearStory Data’s integration with Hortonworks Data Platform (HDP) during Strata + Hadoop World 2015. This partnership with Hortonworks is significant because it brings ClearStory’s business-ready, fast-cycle, scalable analysis to Hadoop data lakes, and specifically to HDP.

ClearStory’s integration, which includes a data inference and data streaming framework, provides faster access to data in HDP via Apache Hive, fast blending of additional data sources with data in HDP for holistic insights, and a business-ready user application. The application brings the power of data stored in Hadoop directly to business users, who need to see insights directly, collaborate in real time on analysis and take data-driven actions.

Hadoop is moving mainstream: companies are running fast-cycle analytics on HDP to address critical business use cases and are growing their HDP clusters towards a modern data lake architecture. As data volumes in HDP grow and more data sources are added and converged into it, ClearStory’s business-user-ready analytics scales with them. ClearStory’s integrated data processing and data blending platform, together with its intuitive user application, lets users easily access data from HDP for fast discovery of insights and speeds up visual, interactive analytics on more data. Furthermore, the partnership brings ClearStory’s metadata management and data governance capabilities to HDP.

This tighter integration through Hive, together with the partnership with ClearStory Data, is welcome news for our customers, who increasingly use HDP for their data lake initiatives. The scalability of Hadoop makes it an obvious choice for enterprises that want a central repository for all data sources, be it corporate data, partner data, externally syndicated data or public data. Applications and systems can easily send this data into a data lake without the upfront cost of working out how the data will be used. As a result, data lakes tend to be very rich in variety and hold the potential for extremely high-value insights.

The value that a business user, analyst, developer or application can derive from a data lake is directly proportional to how easy it is to search for, discover, harmonize and analyze the data itself. Yet it’s vital that appropriate compliance and security guidelines are met. For example, one key question that needs to be answered is: Can the right user search, discover and use the data in the authorized manner? This poses a data, user and analytics governance problem for enterprises. With ClearStory and HDP, users can now search for and discover data that would otherwise be challenging to inspect (a prerequisite to actual usage) or that lacks any lineage and provenance.
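The governance question above (can the right user search, discover and use the data in the authorized manner?) can be pictured as a grant lookup that also records every decision for auditing. The following is only a minimal sketch in Python; the `GRANTS` table, the user names and the `is_allowed` helper are hypothetical and do not represent ClearStory’s or HDP’s actual interfaces.

```python
from enum import Enum


class Action(Enum):
    SEARCH = "search"
    DISCOVER = "discover"
    USE = "use"


# Hypothetical grants: (user, data set) -> set of allowed actions.
GRANTS = {
    ("ana", "sales_all"): {Action.SEARCH, Action.DISCOVER, Action.USE},
    ("raj", "sales_all"): {Action.SEARCH},
}

# Every decision is appended here so it can be audited later.
audit_log = []


def is_allowed(user: str, dataset: str, action: Action) -> bool:
    """Check a grant and record the decision for later auditing."""
    allowed = action in GRANTS.get((user, dataset), set())
    audit_log.append((user, dataset, action.value, allowed))
    return allowed


print(is_allowed("ana", "sales_all", Action.USE))
print(is_allowed("raj", "sales_all", Action.USE))
```

Keeping the audit trail inside the check itself means a denied request is logged just like a granted one, which is the kind of record an auditing or compliance review would need.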

Moreover, there are a number of things that a user needs to be aware of when viewing data from a data lake. If the user cannot understand how the data was created, the value that can be derived from the data set is seriously hampered and the ROI of the data lake diminished. The context that users need to know about a data set includes:

  • Whether the data set represents only a sample of the entire data set
  • Whether the data set is a filtered view of an originating data set
  • Whether the data is derived and aggregated from a raw data set
  • Whether there are duplicates of the same data set used for other purposes
  • How data can be combined and harmonized with other disparate data for a broader insight
  • How data analysis can be securely shared with other authorized users
  • How the data refreshes, the update frequency and the implication to the user view

Without the right visibility and understanding about data sets in a data lake, users can apply this data in analytics in a way that violates key assumptions around how the data was created and published. This inadvertently leads to incorrect insights and broken analytics.
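One way to picture the context listed above is as a small metadata record attached to each data set, from which caveats can be shown to the analyst before the data is used. This is a sketch under stated assumptions: the `DatasetContext` class and all of its field names are hypothetical and are not taken from ClearStory’s or HDP’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetContext:
    """Hypothetical metadata record for one data set in a data lake."""
    name: str
    is_sample: bool = False                # only a sample of the full data set?
    filtered_from: Optional[str] = None    # originating data set, if a filtered view
    derived_from: Optional[str] = None     # raw data set, if derived or aggregated
    duplicates: List[str] = field(default_factory=list)  # known copies elsewhere
    refresh_hours: Optional[float] = None  # update frequency, if known

    def warnings(self) -> List[str]:
        """List the caveats an analyst should see before using the data."""
        notes = []
        if self.is_sample:
            notes.append("sample only")
        if self.filtered_from:
            notes.append(f"filtered view of {self.filtered_from}")
        if self.derived_from:
            notes.append(f"derived from {self.derived_from}")
        if self.duplicates:
            notes.append(f"{len(self.duplicates)} duplicate(s) exist")
        if self.refresh_hours is None:
            notes.append("refresh schedule unknown")
        return notes


ctx = DatasetContext(name="sales_q1", is_sample=True, filtered_from="sales_all")
print(ctx.warnings())
```

Surfacing these caveats next to the data set is one way to keep an analysis from silently treating a sample or a filtered view as the complete data.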

ClearStory’s data profiling and data inference capability lets it inspect and understand the data in depth: not only the dimensions within a data set, but also the structure of the data, the spectrum of values across time, location and other categories, and the relationships that exist within that structure or across data sets, such as parent-child or sibling relationships. The end result is a rich set of “metadata”, including lineage information, that describes the data, how it’s used in an analysis, how it’s blended and harmonized with other data sets in HDP and how it’s managed and updated.
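To make the idea of data profiling and inference concrete, here is a deliberately simple Python sketch that guesses a coarse type for each column and counts its distinct values. It illustrates the general technique only; the `infer_type` and `profile` helpers and the sample rows are hypothetical, and ClearStory’s actual inference is not described in enough detail here to reproduce.

```python
from collections import Counter
from datetime import datetime


def infer_type(value: str) -> str:
    """Guess a coarse type for one raw string value."""
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def profile(rows):
    """Return, per column, the dominant inferred type and distinct-value count."""
    result = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        types = Counter(infer_type(v) for v in values)
        result[column] = {
            "type": types.most_common(1)[0][0],
            "distinct": len(set(values)),
        }
    return result


# Hypothetical sample rows, as they might arrive as raw strings.
rows = [
    {"region": "west", "order_date": "2015-02-18", "units": "12"},
    {"region": "east", "order_date": "2015-02-19", "units": "7"},
    {"region": "west", "order_date": "2015-02-20", "units": "30"},
]
print(profile(rows))
```

A real profiler would go further, for example by detecting value ranges over time and location or relationships between data sets, but the per-column summary shows the kind of metadata that profiling produces.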

This metadata, together with ClearStory’s data governance and user governance, is the key to data lake architectures. It solves the problems of data search and discovery and of security and usage governance, which are core requirements as data scales in the data lake architecture. With HDP integration and ClearStory’s analytics solution and user application, companies can now govern and scale their analytics while giving business users across the enterprise faster access to insights.

We are teaming up with Hortonworks to enable scalable, governed analytics with a business-oriented user application for companies worldwide. Now companies can quickly identify and uncover business-oriented insights through automated, intelligent blending and data harmonization that solves the data discovery problem and reduces the time and effort spent in data wrangling. ClearStory also embeds a Spark-based in-memory processing engine for fast-cycle, scalable business-oriented analysis.

In summary, this combined solution delivers key data management capabilities to data lake architectures, including: 1) deep metadata management, 2) data governance, 3) data lineage for auditing and compliance, and 4) granular user governance capabilities, so that, collectively, the data in HDP and data lakes can be securely accessed, used for analysis, explored, audited and managed. These capabilities are key to the modern data lake architecture.

ClearStory Data’s partnership and integration with Hortonworks increases the combined business value of data stored in Hadoop, accelerates the realization of the value of Hadoop-based Data Lakes, and delivers customers a readily usable data lake architecture and deployment blueprint complete with business-accessible data analytics and exploration, user governance, data cataloging, metadata and data governance.

Learn More

For more information about ClearStory Data and Hortonworks
