"What's in a name? That which we call a rose by any other name would smell as sweet." said Juliet to Romeo in the epic tale of love between members of two warring families. The quote is the tragedy and the central struggle for both Shakespeare's famous play and the more mundane but certainly as epic modern-day Data Cataloging!
There is a lot in a name. SQL based databases and applications have many naming constraints like use of spaces, limited special characters, length of the name, use of numbers etc. Add the naming styles used by different developers and organizations to the mix, and you get names like tblEmpl, cust_nm, cust_cust_typ_id, and more infamous SAP table names like LFA1 (table for Vendor data), LFBK (table for Bank details), FBL1N (Vendor Line Items in Accounts Payable) and more.
If Application developers were the only users of data, this may have been less of a problem. But the same datasets are needed by Data Analysts, Data Engineers, Data Scientists and Line of Business users. Expecting these users to know all the technical names and database lingo is a recipe for failure of any self-service analytics program. Organization specific terminology (Revenue vs Sales, Customers vs Subscribers, Customer ID vs Customer Code etc.) further complicates discoverability. Clearly, a data catalog that indexes only these physical names will not help.
Another factor outside of the discoverability is understanding of the business context. A Data Analyst looking at the rev_YTD column needs to not only understand that it is a Revenue column but also how is it calculated? Are license and subscription revenues added the same way? Is the financial year same as calendar year? and so on.
A rose by any other name may smell as sweet but is less "discoverable" and "understandable".
To solve the above problems, (most) Data Catalogs provide the capability to associate Business Glossary terms with technical names. Most large organizations have documented business glossaries and dictionaries that include business names and context for key data elements.
However, there is a still a snag - wouldn't be an epic without it. Associating business glossary terms to technical physical assets is the most tedious task in data cataloging and data governance. Large enterprises are dealing with millions of physical data assets and doing these glossary associations manually would take either an army of people dedicated to data curation or multiple months of mind-numbing effort.
Enter the CLAIRE™ engine, Informatica's metadata-based AI hero, who stands for enterprise productivity, automation and matching star-crossed terms regardless of their names. In Informatica's Enterprise Data Catalog (EDC) v10.2.2, released in March 2019, CLAIRE can reduce the manual effort of doing these glossary associations, bringing down time required to hours from months.
CLAIRE uses the following three methods to perform the glossary associations:
In a test on a real-world customer dataset, CLAIRE was able to associate business terms to ~18,000 columns out of 21,000 columns with 99% precision. That is months of work, done in less than 10 minutes, all based just on the Name Matching technique.
CLAIRE will continue to add additional factors to make these associations more and more accurate, so that users essentially get a self-driving data catalog compared to a system that needs an army of data curators to maintain.