Water, water everywhere, Nor any drop to drink
These lines from “The Rime of the Ancient Mariner” by Samuel Taylor Coleridge also accurately describe companies that are trying to transform themselves into data-driven organizations. They have astronomical volumes of raw data at their disposal, but how do they find the proverbial needle in the haystack when no map shows them where to look for promising information, or even what is available? Armed with a deluge of data, how can they navigate these uncharted waters and find their island of treasure without a clear map to guide them?
Knowledge of where curated data comes from and which transformations have been applied to it is often arcane and esoteric within large organizations, with IT, security and operational staff holding the secret keys. On top of this tribal knowledge about data provenance, the volume and velocity of data ingested into big data lakes make data governance and data management quite challenging.
As big data projects mature from prototypes to production solutions, there is a critical need for a business catalog that enables data engineers to find the data they are looking for among millions of data entities and make it available to data scientists and business analysts. The ability to classify data effectively can greatly shorten the data-to-insight cycle. By certain estimates, data scientists currently spend 50% to 80% of their time just searching for relevant datasets before they can extract value from the data.
Organizations need data governance capability to understand their information and to answer such questions about corporate data as:
What data do we have and what do we know about it?
Where was this data sourced from and how is it being used?
Does this data adhere to corporate policies and conform to national regulations?
Common Data Governance Challenges in Big Data
With exploding data volumes in data lakes, business terms such as customer, product and location become more fluid and imprecise, with many competing definitions attached to each. Without a comprehensive and flexible business catalog, it becomes very difficult for a business user to identify cleansed and curated data, which leads to a loss of integrity and lowers the business’s confidence in its data.
The majority of commercial solutions used to classify data, such as MDM / RDM tools, are limited and rigid in functionality: they manage metadata only at the application level and require data to be managed exclusively through a single path for end-to-end tracking. Although these tools are rich in industry-specific regulations and compliance models, they provide poor platform-level visibility into Hadoop. Most commercial MDM tools do not support IoT Hadoop workflows, nor can they drive dynamic security policies based on data classification and metadata.
Inaccurate or Duplicate Data
Considering how much businesses depend on accurate data for business intelligence and improved decision making, inaccurate classification and duplication of data continue to hinder a number of companies. According to a study conducted by Experian, U.S. companies believe that, on average, 25 percent of their data is inaccurate, and common data errors adversely affect 91 percent of organizations. The concern is that if a significant portion of corporate data is inaccurate and companies are unable to identify it, then they are making important business decisions based on erroneous data.
Business Taxonomy (Catalog) Definition
Big Data democratizes information access and makes it easier to share information across the enterprise. However, unplanned growth can result in ‘data swamps’ with content that is not tagged or cataloged adequately. Business taxonomies can provide the missing link in closing this gap. From the Greek ‘taxis’, meaning ‘order’ and ‘arrangement’, taxonomies use a hierarchy of terms to classify and arrange concepts or physical/logical objects, making them the ideal vehicle to capture the structure of the entire domain of an enterprise’s content.
Consistent classification and tagging across the enterprise using taxonomies supports system/platform interoperability and value generation from structured and unstructured data sources by mapping them to a common shared vocabulary. This authoritative reference taxonomy improves both data confidence and time to insight.
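As a rough illustration of the idea, a taxonomy can be sketched as a tree of terms, with physical assets from different systems assigned to the same shared term. Everything in this sketch is an assumption made for the example (the `Term` class, the dotted-path convention, and the asset and term names); it is not the data model of any particular catalog product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One node in a business taxonomy (hypothetical structure)."""
    name: str
    children: dict[str, Term] = field(default_factory=dict)

    def add(self, path: str) -> Term:
        """Create (or reuse) the terms along a dotted path; return the last one."""
        node = self
        for part in path.split("."):
            node = node.children.setdefault(part, Term(part))
        return node

    def paths(self, prefix: str = "") -> list[str]:
        """List every term below this node as a fully qualified dotted path."""
        found = []
        for child in self.children.values():
            full = f"{prefix}.{child.name}" if prefix else child.name
            found.append(full)
            found.extend(child.paths(full))
        return found


# Illustrative vocabulary; the term names are made up for this example.
catalog = Term("Catalog")
catalog.add("Customer.Contact.Email")
catalog.add("Customer.Contact.Phone")
catalog.add("Product.Pricing")

# Map physical columns from different systems to one shared term.
assignments = {
    "hive.crm.customers.email_addr": "Customer.Contact.Email",
    "hive.web.signups.e_mail": "Customer.Contact.Email",
    "hive.sales.items.list_price": "Product.Pricing",
}


def assets_under(term_path: str) -> list[str]:
    """Return assets assigned to a term or to any of its descendants."""
    return sorted(
        asset
        for asset, term in assignments.items()
        if term == term_path or term.startswith(term_path + ".")
    )
```

Calling `assets_under("Customer")` would return both e-mail columns even though they live in different tables under different names, which is the interoperability benefit of a shared vocabulary.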
Requirements for a Big Data Business Catalog
Purpose-Built Platform Solution
In order to make sense of big data and provide users with the ability to find the right information, enterprises need a data governance solution that is designed for Hadoop and operates at the platform level, so that it consistently classifies data across all the engines used by the organization to move and analyze data.
A purpose-built platform solution can serve as the single source of metadata truth in Hadoop by automatically tracking multi-user, multi-application activity in Hadoop components through native connectors. Data governance solutions that operate at the application level, by contrast, require a single proprietary solution path, which ends up proliferating data silos.
Faster Data Discovery
The business catalog enables data officers and stewards to search for data and metadata quickly and in a number of different ways to reduce time to value. This includes the ability to search by:
Asset Type: Search for a Hive table, Storm Topology or any connected component
Tags: Search for all columns or tables that have a specific tag such as PII
Business Language: Search using business terms aligned with compliance standards and policies
The combination of these search capabilities empowers data stewards to construct a model of their organization and how it conducts business. This includes the ability to model a business by combining logical and physical data entities to develop a more complete understanding.
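The three search dimensions above can be sketched as optional filters over a list of catalog entries. This is a minimal in-memory illustration only; the `Asset` record, its field names, and the sample entries are hypothetical and do not reflect a real catalog's search API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """A catalog entry (hypothetical shape for this example)."""
    name: str
    asset_type: str
    tags: frozenset = field(default_factory=frozenset)
    business_term: str = ""


# Made-up sample entries.
ASSETS = [
    Asset("customers", "hive_table", frozenset({"PII"}), "Customer"),
    Asset("customers.ssn", "hive_column", frozenset({"PII", "SENSITIVE"}), "Customer"),
    Asset("clickstream", "storm_topology", frozenset(), "Web Activity"),
]


def search(assets, asset_type=None, tag=None, business_term=None):
    """Return names of assets that match every filter that was supplied."""
    results = []
    for asset in assets:
        if asset_type is not None and asset.asset_type != asset_type:
            continue
        if tag is not None and tag not in asset.tags:
            continue
        if business_term is not None and asset.business_term != business_term:
            continue
        results.append(asset.name)
    return results
```

For example, `search(ASSETS, tag="PII")` lists every entry tagged PII, while adding `asset_type="hive_column"` narrows the result to columns only.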
Classification-based Dynamic Protection
An effective business catalog cannot be passive or simply forensic. Consistent data classification must drive access policies that can withstand the test of audit and compliance. Specifically, metadata and taxonomy can be used to institute centralized, dynamic access policies that are evaluated at run time and proactively prevent violations from occurring.
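One way to picture classification-driven protection is a policy table keyed by tag rather than by individual table or column, checked at the moment of access. The sketch below is an assumption-laden toy: the `TAG_POLICIES` table, the role names, and the rule that unlisted tags impose no restriction are all invented for illustration and are not the behavior of any specific security product.

```python
# Hypothetical policy table: tag -> roles allowed to read data carrying that tag.
TAG_POLICIES = {
    "PII": {"compliance_officer", "data_steward"},
    "FINANCE": {"finance_analyst", "compliance_officer"},
}


def can_read(user_roles, asset_tags):
    """Allow access only if the user satisfies the policy of every tag on the asset."""
    roles = set(user_roles)
    for tag in asset_tags:
        allowed = TAG_POLICIES.get(tag)
        if allowed is None:
            # Assumption for this sketch: tags without a policy do not restrict access.
            continue
        if not roles & allowed:
            return False
    return True
```

Because the decision depends only on tags, tagging a new column as `PII` in the catalog would immediately restrict it for a `finance_analyst`, with no per-column policy to write.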
Agile and Adaptable: Native Connectors Keep Information Current
Apache Hadoop exists within a broader ecosystem of enterprise analytical packages. This includes ETL tools, ERP and CRM systems, enterprise data warehouses, data marts and others. Modern workloads flow from various traditional analytical sources into Hadoop and then often back out again. A big data business catalog must be adaptable enough to streamline compliance efforts by allowing companies to import existing metadata structures from other sources via REST-based APIs to leverage legacy investments, or to pre-load a taxonomy-rule combination for a specific industry or line of business.
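Importing an existing metadata structure over a REST-based API might look roughly like the following sketch, which converts legacy term records into a JSON payload and prepares (but does not send) an HTTP POST. The endpoint URL, the payload layout, and the legacy row fields are all hypothetical; a real catalog defines its own API contract.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the import API of the catalog in use.
CATALOG_URL = "http://catalog.example.com/api/terms"


def build_import_request(legacy_rows, url=CATALOG_URL):
    """Turn legacy term records into a JSON POST request (not yet sent)."""
    terms = [
        {
            "name": row["term"],
            "parent": row.get("parent"),  # None for top-level terms
            "description": row.get("desc", ""),
        }
        for row in legacy_rows
    ]
    body = json.dumps({"terms": terms}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up rows standing in for an export from a legacy metadata tool.
legacy = [
    {"term": "Customer", "desc": "A party that buys goods or services"},
    {"term": "Email", "parent": "Customer"},
]
request = build_import_request(legacy)
# Sending would be: urllib.request.urlopen(request), against a real endpoint.
```

Keeping the payload construction separate from the network call makes it easy to validate the converted taxonomy before anything is loaded into the catalog.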
High Confidence data in Hadoop for regulated verticals
Many organizations operate in a tangle of compliance requirements. Sensitive data must be protected both by a governance regimen and by specific technologies. Organizing data to align not just with business language but also with the specific terms and concepts used in industry standards, such as the BASEL I & II Accords, makes it easier to apply the correct safeguards accurately across the entire analytic workflow. Rules for data domicile, payment information, retention, and personally identifiable information vary from industry to industry and by geography. A proper business taxonomy facilitates an agile governance regimen, which is a prerequisite both for any analytic platform and for extracting insight from data. A business catalog gives decision makers the confidence to make data-driven business decisions quickly.
To learn more about the business catalog, attend the session What the #$* is a business catalog and why you need it! at Hadoop Summit San Jose on June 28, 2016 at 11:30AM.