How do you make more data available to more users? This is the question keeping many data professionals up at night.
I recently met up with a friend, Sean, who works at a large pharmaceutical company. His organization takes a sophisticated approach to analytics and has an advanced research arm that leverages AI and machine learning.
But the challenge they’re facing is that more people want more data. “Every day I have someone coming to me and asking for data,” he told me. They’re asking questions like, “Why aren’t you collecting this data? How can I get access to that data? Can you make sure that this data that’s created in another department is available?”
Sean is feeling more pressure than ever before, yet he keeps falling further behind. “I’m not sure there will ever be enough data,” he admitted.
Empower the organization with an ingestion-first approach
My suggestion to Sean was to consider an ingestion-first approach. The idea is simple: Can we empower everyone in the organization who owns an application, a dataset, or some other data asset to ingest that data into a data lake? And if we can, can we automate the cataloging of the data and put basic governance processes in place, making the data discoverable and available immediately?
If Sean can do those things, the advantages will be significant. There are many occasions when his company’s need for data intensifies: when they’re acquiring another company, starting a new project, or going through a massive change—all events that come up fairly frequently. In all of these cases, he can empower people with the right tooling and the right process so they can ingest the data first. (See my earlier blog post “Give the Power of Data to the People: Focus on Integration Strategy and Enablement” for more on data empowerment.)
Two critical steps to the ingestion-first approach
Ingestion can be a magic bullet—if you take the right approach. To ingest data, you don’t have to know much about how the data will be used. You don’t need to do advanced manipulation or preparation for the next stage. You just need the right tooling and the right process to move forward.
But—and this is a very important point—if all you do is ingest data without taking two basic actions afterward, you haven’t necessarily accomplished anything. You need to take two critical steps to be successful.
Step 1: Make the data discoverable. You need to catalog everything you ingested—and preferably catalog and tag it automatically—because now you have a lot of data. You need to know what data you have, understand its lineage, and so on.
Step 2: Trigger basic governance. Once you’ve cataloged the data, trigger a process of data enablement where key individuals can interact with the data: they can add comments or flag data that has to go through additional steps. Depending on the data, you may need more advanced data governance processes—but those can be triggered as well.
If you do just these two things, you are already successful and ahead of the game. Now, when somebody asks for an additional dataset, the data is already there, ready for consumption.
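As a rough sketch of how ingest, catalog, and governance fit together (all function names, tags, and the in-memory “lake” here are hypothetical, not any particular vendor’s API), the two steps might look like this:

```python
# Minimal sketch of an ingestion-first pipeline: land the data as-is, then
# immediately catalog it (Step 1) and trigger basic governance (Step 2).
# All names and heuristics below are illustrative assumptions.

def auto_tag(records):
    # Toy heuristic: tag datasets that contain a patient identifier.
    tags = set()
    for row in records:
        if "patient_id" in row:
            tags.add("clinical")
    return sorted(tags)

def trigger_governance(entry):
    # A real workflow engine would notify the data owner; here we just
    # record who should review the dataset and mark it pending.
    owner = "clinical-data-steward" if "clinical" in entry["tags"] else "unassigned"
    return {"path": entry["path"], "owner": owner, "status": "pending-review"}

def ingest(source_name, records, lake, catalog):
    """Land raw data in the lake, then catalog it and kick off governance."""
    path = f"/lake/raw/{source_name}"
    lake[path] = records                      # ingest as-is, no transformation

    # Step 1: make the data discoverable -- catalog and tag automatically.
    entry = {
        "path": path,
        "source": source_name,
        "tags": auto_tag(records),            # e.g. inferred domain
        "lineage": {"origin": source_name},
    }
    catalog[path] = entry

    # Step 2: trigger basic governance -- route to an owner for review.
    return trigger_governance(entry)

lake, catalog = {}, {}
task = ingest("trial_db", [{"patient_id": 1, "dose_mg": 5}], lake, catalog)
print(task["owner"])   # clinical-data-steward
```

The point of the sketch is that neither step needs to know how the data will eventually be used; tagging and routing happen automatically at ingestion time.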
How ingestion-first works in the real world
Let’s take the example of a pharmaceutical company to see what ingestion-first looks like in practice.
Say the organization is running a clinical trial for a new drug. To support that initiative, they’ve created a new database and they are receiving new files from a third party that is helping them on the clinical trial.
With an ingestion-first approach, they will ingest the clinical trial information from the database (whether in real time or in batch) and from the flat files into the data lake, leveraging a simple wizard. A catalog that is already scanning the lake will automatically identify the new data, categorize it, identify the domains it belongs to, tag it, and trace the lineage of where the data came from.
Automatic data governance processes will connect the data to the right business owner and the right process for data provisioning. In the case of the clinical trial data, the catalog can identify that this is clinical data, associate it with the person responsible for it, and determine what is required for that data to be made available. For clinical trial data to be made available in the data lake, it needs to undergo data quality checks and basic masking, so those processes are triggered automatically.
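One simple way to implement that kind of automatic triggering is a rule table that maps catalog tags to required governance steps. The rules, tags, and masking logic below are hypothetical placeholders, just to make the idea concrete:

```python
# Illustrative sketch: map catalog tags to the governance steps that must
# run before a dataset is provisioned. Rule names are assumptions.

GOVERNANCE_RULES = {
    "clinical": ["data_quality_check", "mask_patient_identifiers"],
    "finance": ["data_quality_check"],
}

def required_steps(tags):
    """Return the ordered, de-duplicated governance steps for a dataset."""
    steps = []
    for tag in tags:
        for step in GOVERNANCE_RULES.get(tag, []):
            if step not in steps:
                steps.append(step)
    return steps

def mask_patient_identifiers(rows):
    # Basic masking: blank out direct identifiers before provisioning.
    return [{**row, "patient_id": "***"} for row in rows]

print(required_steps(["clinical"]))
# ['data_quality_check', 'mask_patient_identifiers']
```

Because the rules key off catalog tags rather than individual datasets, any newly ingested data that the catalog tags as clinical gets the same quality and masking treatment with no extra setup.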
How to get started with ingestion-first
The easiest way to begin is to start with a pilot. In the pilot you can focus on one or two different domains and one or two ingestion patterns (such as file ingestion, database ingestion, or streaming ingestion). In our case here, the requirements are for all three ingestion patterns. Keep in mind that if you’re ingesting into a cloud data lake, you’ll want to use tools optimized to work with cloud storage.
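A pilot covering several ingestion patterns can stay simple if each pattern is just a pluggable handler behind a common entry point. This is a hypothetical sketch (the handlers here only return strings; real ones would move data):

```python
# Sketch of routing pilot sources to pattern-specific ingesters.
# Handler names and sources are illustrative, not a real product's API.

def ingest_file(path):
    return f"copied {path} to the lake"

def ingest_database(table):
    return f"loaded table {table} in batch"

def ingest_stream(topic):
    return f"subscribed to {topic} in real time"

PATTERNS = {
    "file": ingest_file,
    "database": ingest_database,
    "stream": ingest_stream,
}

# The three patterns our clinical-trial example calls for.
pilot_sources = [
    ("file", "trial_results.csv"),
    ("database", "trial_db.patients"),
    ("stream", "device-telemetry"),
]

for pattern, source in pilot_sources:
    print(PATTERNS[pattern](source))
```

Adding a new pattern later (say, API ingestion) is then a matter of registering one more handler, which keeps the pilot's scope deliberately narrow.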
Your pilot will involve people, process, and product. But I’ve put the emphasis on process and product because our tools are simple enough to enable anyone to do the ingestion.
Join us at our Data for AI and Analytics Summit in North America or EMEA to learn how intelligent and automated data management that takes advantage of cloud data warehouses and cloud data lakes helps you gain the agility, speed, cost savings, and scale to succeed.