Cloud Data Governance and Catalog Scanner Capabilities for Microsoft Fabric

Last Published: Nov 14, 2024
Yaron Canari

Principal Product Manager, Fabric Platform Team, Microsoft

This blog is co-authored by Wojciech Kijas (Informatica), Tomasz Czeleń (Informatica) and Ajay Gollapalli (Informatica).

Microsoft Fabric is an increasingly popular platform not only for analytics but also as a key data source for organizations looking to fuel innovative generative AI (GenAI) applications. One of the key challenges with implementing GenAI use cases like copilots is ensuring models only access high-quality, safe, clean and trustworthy data. Otherwise, results are likely to be inaccurate or, potentially worse, based on protected or confidential data. To ensure models only access appropriate data, data governance and data lineage have become more vital than ever.

Informatica Cloud Data Governance and Catalog (CDGC) can help overcome these challenges with newly announced scanner capabilities for Microsoft Fabric. With these scanners, CDGC can analyze the data elements hosted in Microsoft Fabric in detail and build comprehensive knowledge about each object’s dependencies, relationships and business logic. This insight into data origins, transformations and usage is a key reason data lineage has become one of the core functions of modern governance solutions.

Cloud Data Governance and Catalog for Microsoft Fabric

Informatica has worked with Microsoft to add metadata scanners for all three Fabric data sources to its Cloud Data Governance and Catalog (CDGC) portfolio: Microsoft Fabric Data Warehouse, Microsoft Fabric Lakehouse and Microsoft Fabric OneLake. The scanners serve joint Microsoft and Informatica customers who want to bring Informatica’s Enterprise Data Management to the Microsoft Fabric data platform, especially for AI.

Microsoft Fabric

Microsoft Fabric provides organizations with a single platform for data services, including data engineering, data warehousing, data science, real-time analytics and business intelligence. It is built upon a single data lake, OneLake, where organizations can access their entire multi-cloud data estate.

With Cloud Data Quality and Cloud Data Governance and Catalog, every Informatica Intelligent Data Management Cloud (IDMC) user can scan the following sources: Microsoft Fabric Data Warehouse, Microsoft Fabric Lakehouse and Microsoft Fabric OneLake.

You can use CDGC to automatically harvest critical technical metadata that enables governance processes and additional analytics such as profiling and data quality. For example, consider private personal information about employees or customers. Organizations will likely want to store this information in the cloud, but they need to control and restrict access to it to maintain regulatory compliance, satisfy privacy concerns and avoid damages relating to data exposure. CDGC can help by automatically identifying private, personal, or proprietary information and ensuring only users and applications with the correct permissions can access the data. For AI applications, this means that organizations can mitigate the risk of ChatGPT-type tools accidentally surfacing personal or proprietary information to users.

Microsoft Fabric Data Warehouse

Microsoft Fabric Data Warehouse is a data warehouse designed for large-scale enterprise analytics. The Informatica metadata scanner extracts an inventory and details of the following objects:

  • Databases
  • Schemas
  • Tables
  • Views
  • Columns
  • Primary Keys
  • Functions
  • Stored Procedures

As part of programmable object analysis (stored procedures and views), the scanner extracts business logic from the raw code it identifies and translates it into corresponding metadata. The generated metadata describes the relationships between tables, datasets, columns and fields, and the logic applied to them.

Any interdependencies between objects are captured automatically, for example when one procedure calls another. The CDGC scanner crawls through the code to emulate database engine behavior, tracks how parameter values are passed and analyzes every transformation unit within the appropriate execution context. As a result, the generated lineage is accurate and precise, and it gives catalog users insights at the expected level of granularity.
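To make the idea of procedure interdependencies concrete, here is a minimal Python sketch. It is not how the CDGC scanner works internally: it simply looks for EXEC or EXECUTE statements with a regular expression, whereas the scanner described above emulates engine behavior and tracks parameters. The procedure names and SQL bodies are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical procedure bodies; names and SQL are invented for illustration.
PROCEDURES = {
    "dbo.load_orders": (
        "INSERT INTO dbo.orders SELECT * FROM stg.orders; "
        "EXEC dbo.refresh_totals;"
    ),
    "dbo.refresh_totals": "UPDATE dbo.totals SET amount = 0; EXECUTE dbo.log_run;",
    "dbo.log_run": "INSERT INTO dbo.run_log VALUES (1);",
}

# Matches EXEC or EXECUTE followed by a (possibly schema-qualified) name.
# This is a simplification: it ignores dynamic SQL, comments and string literals.
CALL_PATTERN = re.compile(r"\bEXEC(?:UTE)?\s+([\w.]+)", re.IGNORECASE)


def direct_calls(procedures):
    """Map each procedure to the set of procedures it calls directly."""
    graph = defaultdict(set)
    for name, body in procedures.items():
        for callee in CALL_PATTERN.findall(body):
            graph[name].add(callee)
    return graph


def all_dependencies(graph, start):
    """Return every procedure reachable from `start` through call edges."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for callee in graph.get(current, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


graph = direct_calls(PROCEDURES)
print(sorted(all_dependencies(graph, "dbo.load_orders")))
```

In this toy example, dbo.load_orders depends directly on dbo.refresh_totals and indirectly on dbo.log_run, which is the kind of chained dependency a lineage catalog needs to surface.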

Microsoft Fabric Lakehouse

Microsoft Fabric Lakehouse is a data architecture platform for storing, managing and analyzing structured, semi-structured and unstructured data in a single location. Informatica’s recently introduced metadata scanner delivers data lineage capabilities by scanning Lakehouse SQL endpoints and extracting details of basic Lakehouse relational objects, such as databases, schemas, tables and columns.

The scanner also supports programmable objects, such as views and stored procedures, and applies the same parsing capabilities to them as the Data Warehouse scanner.

Microsoft Fabric OneLake

OneLake is Microsoft Fabric’s multi-cloud data lake that is wired into every Fabric workload. Informatica’s metadata scanner can extract complete folder structures and file details available on the OneLake file system level. In addition, the scanner can extract structures (including hierarchical structures) and detect/discover data partitions for a specified set of file formats.
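As a rough illustration of partition discovery, the Python sketch below infers partition columns from folder names. It assumes the common key=value folder convention (for example year=2024/month=01); the article does not say which partitioning conventions or file formats the scanner recognizes, and the paths and function names here are hypothetical.

```python
from pathlib import PurePosixPath


def partition_columns(path):
    """Return the key=value folder segments of a file path as a dict."""
    columns = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns


def discover_partitions(paths):
    """Group files by dataset root and list the partition keys seen under it.

    The dataset root is assumed to be everything before the first
    key=value folder (or the parent folder if there is none).
    """
    datasets = {}
    for path in paths:
        folders = PurePosixPath(path).parts[:-1]  # drop the file name
        root_parts = []
        for segment in folders:
            if "=" in segment:
                break
            root_parts.append(segment)
        keys = datasets.setdefault("/".join(root_parts), [])
        for key in partition_columns(path):
            if key not in keys:
                keys.append(key)
    return datasets


# Hypothetical file listing, invented for illustration.
FILES = [
    "Files/sales/year=2024/month=01/part-0000.parquet",
    "Files/sales/year=2024/month=02/part-0000.parquet",
    "Files/reference/countries.csv",
]

print(discover_partitions(FILES))
```

Here the sales folder is reported with partition keys year and month, while the reference folder is reported as unpartitioned.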

The IDMC platform can integrate metadata from Microsoft Fabric with metadata from other data sources and processes, including those outside your Microsoft environment. As a result, you can achieve a holistic view of your entire data landscape. Any critical data element and its journey can be traced end-to-end through the IT environment, leaving no unanswered questions about the data, its origins and how it has been processed.

Example outputs

To show how the scanners work, let’s examine a scenario that uses an Informatica Cloud Data Integration (CDI) mapping. In this scenario, a CDI mapping transforms Microsoft OneLake data by joining files to generate an output table in Microsoft Fabric Lakehouse. Both the CDI and Microsoft Fabric sources were scanned to understand which data assets were referenced and how they were transformed. This allowed CDGC to automatically generate data lineage diagrams (Figure 1 and Figure 2).

Figure 1 - Table-level data lineage diagram

Figure 2 - Column-level data lineage diagram

CDGC can visually represent the relationships and logic built by developers or analysts. You can clearly see how data objects traveled through the example environment and how they were transformed, which helps you trust that the resulting data is clean and usable.
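Column-level lineage of this kind can be thought of as a directed graph of source-to-target column edges. The Python sketch below traces a Lakehouse column back to the OneLake file columns that feed it. It is only an illustration of the concept: the file, mapping, table and column names are hypothetical, and it does not reflect how CDGC stores or queries lineage.

```python
# Hypothetical column-level lineage edges: (source column, target column).
# Two OneLake files are joined by a CDI mapping into one Lakehouse table.
EDGES = [
    ("onelake/customers.csv:customer_id", "cdi_mapping.joiner:customer_id"),
    ("onelake/orders.csv:customer_id", "cdi_mapping.joiner:customer_id"),
    ("onelake/orders.csv:amount", "cdi_mapping.joiner:amount"),
    ("cdi_mapping.joiner:customer_id", "lakehouse.customer_orders:customer_id"),
    ("cdi_mapping.joiner:amount", "lakehouse.customer_orders:total_amount"),
]


def upstream_origins(edges, target):
    """Return the origin columns (columns with no upstream) that feed `target`.

    If `target` itself has no upstream edges, it is returned as its own origin.
    """
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, set()).add(source)

    origins = set()
    visited = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        upstream = parents.get(node)
        if upstream:
            stack.extend(upstream)
        else:
            origins.add(node)
    return origins


print(sorted(upstream_origins(EDGES, "lakehouse.customer_orders:customer_id")))
```

Walking the edges backwards in this way answers the question a lineage diagram answers visually: which original columns contributed to a given output column.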

By understanding how data has been transformed and ensuring that only safe, clean and high-quality data is consumed by models, you can be confident in the outcomes, insights and results produced by the users and applications that rely on the data. This can help optimize the outputs and outcomes of innovative AI solutions, including GenAI applications and copilots.

If you would like to learn how to accelerate cloud modernization, enable modern analytics and remove obstacles blocking GenAI adoption with the Informatica Cloud Data Governance and Catalog, contact us.

First Published: Sep 23, 2024