What is a Data Catalog?
A data catalog is a repository that holds definitions for information assets in an enterprise. It is used to support effective data management by documenting the most critical information. A data catalog helps data analysts and data scientists use data effectively to answer business questions. The data catalog contains technical metadata for enterprise data assets and acts as the single source of truth in the effort to effectively manage data assets.
How do Data Catalogs support Data Governance?
Data governance is the set of processes and technologies that are used to ensure effective management and utilization of data. Analysts and Data Stewards in an organization use data cataloging tools both to enforce corporate governance policies and to promote the correct usage of data. Typically, metadata is extracted from databases, ETL processes, and some BI Tools and is consolidated in the Data Governance tool. This information is then enriched with additional governance information to support the established enterprise metadata management strategy.
Data Catalogs support data governance through the following functionality:
- Data Quality – Data profiling is performed to alert analysts when data pipelines appear to have issues with data quality. For example, if today’s data load contains only half the number of rows that is typical for a transaction table, the analyst or data engineer should investigate the issue before reports are distributed with incomplete data. Data catalogs also allow analysts to flag qualitative issues with a dataset and track these issues through resolution.
- Certification – The Data Governance tool identifies which datasets and visualizations have been certified and tracks ownership and certification changes over time. Certification details may be extracted from the underlying BI Tool metadata for reports or certification can be performed directly within the data catalog.
- Usage Stats – In some tools, usage statistics are collected from the underlying BI Tools and presented to the analyst in the Data Governance tool. These stats identify the level of engagement of business users with each reporting asset and are used by Analysts to determine which reporting assets are gaining traction within the user base and which content is underperforming.
- Data Classification – Effective governance requires that datasets and reports be classified based on data sensitivity, the presence of PII, and other key metadata. Some data catalogs automate the classification process by using machine learning algorithms to flag PII and other sensitive data. Data classification metadata is required to inform the proper usage of data so that it meets regulatory compliance requirements such as GDPR, as well as data privacy requirements. A data catalog typically provides the ability to extend the basic metadata collected from source systems with this required data classification metadata.
- Data Lineage – Before working with a dataset or a report, analysts must first understand the source of the underlying data. Data Lineage diagrams provide a visual map of the sources for a given dashboard or dataset. They establish the full data preparation journey for the data integration behind the visualization. A detailed data lineage diagram establishes the necessary context for an analyst who is trying to determine whether an existing BI asset has the correct information to help answer a specific business question.
- Business Glossary – If an organization does not have a consistent set of definitions for key enterprise metrics and business terms, invariably over time different analysts will use a different set of rules to measure the same metric. This inconsistency presents the business with a conflicting set of numbers and leads to a lack of trust in the data. Business analysts maintain the approved set of definitions for all key metrics along with established ownership of these definitions as part of the business glossary.
- Life Cycle Management – All BI Assets, whether they are tables, dashboards, or reports, must be managed through their lifecycle. Before a new dashboard is published to users, it must undergo a process to certify that it uses data and transformation rules that are consistent with established metric definitions. Over time, as business rules and data sources change, tables and reports that were previously considered a “gold standard” can become obsolete and must be updated or retired. An effective data governance tool provides a mechanism to manage the lifecycle of all key BI Assets.
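As a rough illustration of the row-count check described under Data Quality, the sketch below compares today's load against the average of recent loads. The function name `row_count_alert`, the `min_ratio` threshold, and the sample numbers are all hypothetical; an actual data catalog would apply its own profiling rules.

```python
from statistics import mean


def row_count_alert(history, today, min_ratio=0.5):
    """Return True when today's row count is at or below
    min_ratio times the average of the recent loads.

    `history` is a list of row counts from earlier loads of the
    same table; all names and thresholds here are illustrative.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    typical = mean(history)
    return today <= typical * min_ratio


# Made-up row counts for a transaction table over four earlier loads.
recent_loads = [10_000, 10_400, 9_800, 10_100]

print(row_count_alert(recent_loads, 5_000))  # about half the usual volume
print(row_count_alert(recent_loads, 9_900))  # within the usual range
```

A check like this would typically run after each load so that the analyst or data engineer is notified before reports are distributed.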
Why is a Data Catalog not enough?
The critical functions provided by data catalogs are invaluable to data analysts and data scientists as they make decisions about which existing BI assets to use in an analysis. However, these tools are inadequate in addressing the full governance needs of the organization because they fail to support the needs of all data consumers in the enterprise. A typical business user will not make use of a data catalog tool as part of their day-to-day work and will therefore not benefit from the wealth of information that it contains. As a result, many organizations struggle to achieve business value from the substantial ongoing investment required to maintain governance data in these tools.
How does a BI Portal extend a Data Catalog?
A Business Intelligence Portal allows organizations to fully leverage their investment in a data catalog. By fully integrating information captured in the catalog and making it available to all users in the enterprise, a fully governed self-service analytics environment is achieved:
- By accessing lineage information, any user can understand the source of information in a report or dashboard. The user can see whether the reports draw from the data warehouse, a data lake, or some other big data environment and can be assured that the right data is driving any decisions that are made with the data.
- By integrating data catalog metadata with results that are returned via BI Portal search, a Google-like search experience is provided in which data discovery becomes familiar and engaging.
- Displaying the Business Glossary terms along with the relevant visualizations provides users with the business context they need to properly interpret data. For example, with this approach, a business user instantly knows which enterprise KPIs are associated with a visualization and can immediately access the relevant definition and ownership information.
- The process of curation of enterprise assets into a BI portal ensures that only relevant and accurate content is presented to users. The data science and data analytics teams curate content to ensure that reporting is certified, has correct lineage definitions, and has accurate data classification.
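To make the lineage lookup in the first point above more concrete, here is a minimal sketch that walks a lineage graph from a dashboard back to its root sources. The `LINEAGE` mapping, the asset names, and the `upstream_sources` function are hypothetical; real catalogs and portals expose lineage through their own interfaces.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is built from.
LINEAGE = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["warehouse.orders", "warehouse.customers"],
    "warehouse.orders": ["lake.raw_orders"],
}


def upstream_sources(asset, lineage):
    """Return the sorted root sources that feed `asset`, i.e. the
    upstream assets that have no recorded upstream of their own."""
    seen = set()
    roots = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents and current != asset:
            roots.add(current)
        queue.extend(parents)
    return sorted(roots)


print(upstream_sources("sales_dashboard", LINEAGE))
```

With the sample graph, the dashboard traces back to one data lake table and one data warehouse table, which is the kind of answer a portal could surface next to the report.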