If data is the new bacon, data stewardship supplies its nutrition label!
This is the second part of a two-part blog introducing Data Steward Studio (DSS). It provides a detailed walkthrough of the capabilities of DSS.
With GDPR having taken effect in May 2018 and California having enacted the California Consumer Privacy Act of 2018 (CCPA), which grants California residents a broad range of rights over their personal information (PI) similar to those under GDPR, businesses need comprehensive solutions to understand how personal data flows through their systems and processes. For example, they need to provide chain of custody information, inventory and classify data assets, secure access to personal data, and monitor usage of such data. Maintaining a comprehensive data inventory, managing the trust and veracity of data, and proving that appropriate operational controls and safeguards are in place for processing sensitive data have become paramount in the increasingly complex hybrid and multi-cloud enterprise data universe.
In April, we unveiled Data Steward Studio (DSS) at DataWorks Summit in Berlin. DSS addresses several key data management challenges that enterprises face today and that are especially relevant to hybrid data management under these new regulations. DSS has been generally available to Hortonworks customers since May 2018 and is the second service to become generally available on the DPS platform.
In this blog we will walk you through the key features of DSS that give businesses a comprehensive view of the data in their hybrid data lake environments. DSS empowers enterprises to precisely identify and evaluate the trust levels of their data, to collaborate securely, and to democratize data across the enterprise with confidence, so that they can derive value from the data in their data lakes, whether those data lakes are located in on-premises data centers, in the cloud, or across multiple cloud provider environments.
Data stewards can create Asset Collections by filtering and selecting data assets in their data lakes based on metadata. They can use contextual attributes such as name, description, owner, and data lake, or system attributes such as version, the date on which the asset was created or modified, and the person who created or modified it. Business users and data stewards can also search for assets using these attributes or free text, view a personalized dashboard, and update or delete asset collections.
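To make the filtering idea concrete, here is a minimal sketch of selecting assets by attribute and free text. It is not the DSS API: the `DataAsset` class, the `select_assets` function, and the sample catalog are all hypothetical, in-memory stand-ins used only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataAsset:
    # Hypothetical stand-in for an asset's metadata; not a DSS type.
    name: str
    owner: str
    data_lake: str
    created: date
    tags: frozenset = frozenset()


def select_assets(assets, owner=None, data_lake=None, created_after=None, text=None):
    """Return the assets that match every filter that was supplied."""
    selected = []
    for asset in assets:
        if owner is not None and asset.owner != owner:
            continue
        if data_lake is not None and asset.data_lake != data_lake:
            continue
        if created_after is not None and asset.created <= created_after:
            continue
        # Free-text search: case-insensitive substring match on name or tags.
        if text is not None:
            haystack = " ".join([asset.name, *sorted(asset.tags)]).lower()
            if text.lower() not in haystack:
                continue
        selected.append(asset)
    return selected


# Hypothetical sample catalog.
catalog = [
    DataAsset("customers", "alice", "lake-eu", date(2018, 3, 1), frozenset({"pii"})),
    DataAsset("orders", "bob", "lake-eu", date(2018, 5, 20)),
    DataAsset("clickstream", "alice", "lake-us", date(2018, 6, 2)),
]

# An "asset collection" here is simply a named list of selected assets.
eu_collection = {"name": "EU assets", "assets": select_assets(catalog, data_lake="lake-eu")}
pii_names = [a.name for a in select_assets(catalog, text="PII")]
```

Filters combine with a logical AND, so passing both `owner` and `created_after` narrows the result to assets that satisfy both conditions.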
Overview: Provides summary metadata properties such as the number of rows, columns, sensitive columns, and partitions, along with the owner, tags, and profilers. Lineage shows the chain of custody for the data from relevant metadata repositories, including both upstream paths (lineage) into and downstream paths (impact) out of a given asset. Usage and monitoring metadata are shown separately in the overview, including widgets that display the top 10 users of the data asset, the access types (the actions performed and operation types), and the trend of data access over time. System classifications generated by profilers (for example, sensitive data type classifications for particular columns) and other managed classifications (for example, business classifications applied via Apache Atlas tags) are also shown, along with technical metadata and operational summaries of profiler execution.
Schema: Displays the structure of the data asset for structured data such as Hive tables, using relevant metadata repositories such as Atlas. You can also view the shape, or distribution characteristics, of the columnar data within a schema, based on the Hive column profiler.
Policy: Shows the authorization policies defined for data assets. These policies may be defined and enforced using Apache Ranger. The view includes both resource-based (physical asset) and classification-based policies.
Audit: Shows the most recent access audits from Apache Ranger as well as summarized views of audits by type, user, and time window, based on profiling of the audit data.
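The usage widgets and audit summaries described above boil down to simple aggregations over access events. The sketch below illustrates that idea only: the event tuples and the `summarize_audits` function are hypothetical, and it does not read from Apache Ranger or reflect how the DSS profilers are implemented.

```python
from collections import Counter
from datetime import datetime

# Hypothetical audit events as (user, access_type, timestamp) tuples.
audit_events = [
    ("alice", "select", datetime(2018, 7, 1, 9, 30)),
    ("bob", "select", datetime(2018, 7, 1, 11, 5)),
    ("alice", "update", datetime(2018, 7, 2, 14, 0)),
    ("alice", "select", datetime(2018, 7, 2, 16, 45)),
]


def summarize_audits(events, top_n=10):
    """Aggregate audit events by user, access type, and calendar day."""
    by_user = Counter(user for user, _, _ in events)
    by_type = Counter(access for _, access, _ in events)
    by_day = Counter(ts.date().isoformat() for _, _, ts in events)
    return {
        # Most active users first, capped at top_n (10 mirrors the widget).
        "top_users": by_user.most_common(top_n),
        "by_access_type": dict(by_type),
        # Sorted by day so the result can be plotted as a trend.
        "by_day": dict(sorted(by_day.items())),
    }


summary = summarize_audits(audit_events)
```

Here the time window is a calendar day; a coarser or finer window would only change the key used for `by_day`.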
With DSS, data stewards can collaborate and share their insights with other users in the enterprise regarding various asset collections.
Data stewards can rate asset collections and view the average rating of an Asset Collection. This helps other data stewards and business users find Asset Collections with a trusted rating to use in their analysis. Data stewards can also contribute their knowledge and insights to an asset collection by adding comments. Other users can then respond to earlier comments or add their own comments about each asset collection. Users of Data Steward Studio can also favorite and bookmark asset collections for easy access.
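As a rough illustration of how an average rating can guide users toward trusted collections, consider the sketch below. The 1-to-5 scale, the one-rating-per-user rule, and the `average_rating` and `trusted_collections` helpers are assumptions made for this example, not documented DSS behavior.

```python
from statistics import mean


def average_rating(ratings):
    """Return the mean of per-user ratings, or None if nobody has rated yet."""
    return round(mean(ratings.values()), 2) if ratings else None


def trusted_collections(ratings_by_collection, min_rating):
    """Return names of collections whose average rating meets the threshold."""
    trusted = []
    for name, ratings in ratings_by_collection.items():
        avg = average_rating(ratings)
        # Unrated collections have no average and are never treated as trusted.
        if avg is not None and avg >= min_rating:
            trusted.append(name)
    return sorted(trusted)


# Hypothetical ratings on an assumed 1-5 scale, keyed by user so each user counts once.
ratings_by_collection = {
    "EU assets": {"alice": 5, "bob": 4, "carol": 4},
    "Clickstream": {"alice": 2},
    "Unrated": {},
}

eu_average = average_rating(ratings_by_collection["EU assets"])
trusted = trusted_collections(ratings_by_collection, min_rating=4.0)
```

Keying ratings by user means a user who changes their mind overwrites their earlier rating instead of being counted twice.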
DSS also provides comprehensive dashboards that show, at a glance, a summary of the data in a particular data lake or asset collection. For example, you can see how data is growing over time in terms of the number of tables, how much of the content within the tables has been profiled and deemed sensitive, and which tables in a data lake are accessed most often.
Similar dashboards are also available for every Asset Collection, giving users a complete picture of the collection's usage and contents and helping them collaborate effectively with others across the enterprise.
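The dashboard figures mentioned above can be thought of as roll-ups over per-table profile data. The following sketch shows one way such metrics might be computed; the record fields and the `dashboard_summary` function are hypothetical and do not describe how DSS computes its dashboards.

```python
# Hypothetical per-table profile records; field names are illustrative only.
tables = [
    {"name": "customers", "columns": 12, "sensitive_columns": 4, "profiled": True, "accesses": 120},
    {"name": "orders", "columns": 8, "sensitive_columns": 0, "profiled": True, "accesses": 340},
    {"name": "clickstream", "columns": 20, "sensitive_columns": 0, "profiled": False, "accesses": 75},
]


def dashboard_summary(tables, top_n=2):
    """Roll per-table records up into at-a-glance dashboard metrics."""
    profiled = [t for t in tables if t["profiled"]]
    profiled_columns = sum(t["columns"] for t in profiled)
    sensitive_columns = sum(t["sensitive_columns"] for t in profiled)
    # Rank tables by access count, most accessed first.
    top = sorted(tables, key=lambda t: t["accesses"], reverse=True)[:top_n]
    return {
        "table_count": len(tables),
        # Share of tables that a profiler has examined.
        "profiled_fraction": len(profiled) / len(tables) if tables else 0.0,
        # Share of profiled columns that were flagged as sensitive.
        "sensitive_column_fraction": (
            sensitive_columns / profiled_columns if profiled_columns else 0.0
        ),
        "top_accessed": [t["name"] for t in top],
    }


lake_summary = dashboard_summary(tables)
```

Running the same function over only the tables in one asset collection would give the per-collection view.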
In summary, DSS enables enterprises to contextualize knowledge about data located across hybrid data lake platforms, take meaningful actions or generate actionable insights about their business operations, and reduce the lag between insight discovery and value creation.
See DSS in action in the keynote demo at DataWorks Summit, Berlin, and in the conference breakout session on Security & Governance at DataWorks Summit, San Jose.
To learn more visit https://hortonworks.com/products/data-services/data-steward-studio/.