Intelligent Glossary Associations
"What's in a name? That which we call a rose by any other name would smell as sweet." said Juliet to Romeo in the epic tale of love between members of two warring families. The quote is the tragedy and the central struggle for both Shakespeare's famous play and the more mundane but certainly as epic modern-day Data Cataloging!
There is a lot in a name. SQL based databases and applications have many naming constraints like use of spaces, limited special characters, length of the name, use of numbers etc. Add the naming styles used by different developers and organizations to the mix, and you get names like tblEmpl, cust_nm, cust_cust_typ_id, and more infamous SAP table names like LFA1 (table for Vendor data), LFBK (table for Bank details), FBL1N (Vendor Line Items in Accounts Payable) and more.
If Application developers were the only users of data, this may have been less of a problem. But the same datasets are needed by Data Analysts, Data Engineers, Data Scientists and Line of Business users. Expecting these users to know all the technical names and database lingo is a recipe for failure of any self-service analytics program. Organization specific terminology (Revenue vs Sales, Customers vs Subscribers, Customer ID vs Customer Code etc.) further complicates discoverability. Clearly, a data catalog that indexes only these physical names will not help.
Another factor outside of the discoverability is understanding of the business context. A Data Analyst looking at the rev_YTD column needs to not only understand that it is a Revenue column but also how is it calculated? Are license and subscription revenues added the same way? Is the financial year same as calendar year? and so on.
A rose by any other name may smell as sweet but is less "discoverable" and "understandable".
To solve the above problems, (most) Data Catalogs provide the capability to associate Business Glossary terms with technical names. Most large organizations have documented business glossaries and dictionaries that include business names and context for key data elements.
However, there is a still a snag - wouldn't be an epic without it. Associating business glossary terms to technical physical assets is the most tedious task in data cataloging and data governance. Large enterprises are dealing with millions of physical data assets and doing these glossary associations manually would take either an army of people dedicated to data curation or multiple months of mind-numbing effort.
Enter the CLAIRE™ engine, Informatica's metadata-based AI hero, who stands for enterprise productivity, automation and matching star-crossed terms regardless of their names. In Informatica's Enterprise Data Catalog (EDC) v10.2.2, released in March 2019, CLAIRE can reduce the manual effort of doing these glossary associations, bringing down time required to hours from months.
CLAIRE uses the following three methods to perform the glossary associations:
- Data Domain Discovery: Data Domains are rule-based semantic labels that work as probes for finding the right columns or tables for glossary associations. These rules can be regular expressions, reference table associations (for glossaries like Customer Names, Supplier Names, etc.) or complex logic. Additionally, these rules can be both on column data or metadata(names). Using Data Domain Discovery, it is easy to create a rule once and automatically associate business glossaries for any columns/tables that match these rules.
- Similarity Discovery: EDC uses unsupervised clustering to cluster similar columns together. This machine learning methodology uses Data Overlap, Unique Value Overlap, Pattern Matches and Name Similarity as factors for computing similarity. Associating a glossary term to a column will propagate the glossary term to other columns in the similarity cluster. This ensures that manual glossary term association is required only for the first (few), post which CLAIRE learns the association and automatically does it when a similar column is indexed in the catalog.
- Name Matching: Here EDC uses name sequence alignment to automatically associate the right business glossary terms with the technical names. For example, "cust_nm" is matched automatically with "Customer Name". Here are examples of actual automatic associations done by this system on real world datasets:
In a test on a real-world customer dataset, CLAIRE was able to associate business terms to ~18,000 columns out of 21,000 columns with 99% precision. That is months of work, done in less than 10 minutes, all based just on the Name Matching technique.
CLAIRE will continue to add additional factors to make these associations more and more accurate, so that users essentially get a self-driving data catalog compared to a system that needs an army of data curators to maintain.