Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems.
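This definition can be sketched in a few lines of Python. The in-memory source and target below are hypothetical stand-ins for real systems; a real pipeline would read from and write to databases, files, or APIs.

```python
# Minimal sketch: a pipeline is a set of instructions that moves data
# from a source system to a target system, processing it along the way.
# The source and target here are hypothetical in-memory stand-ins.

def extract(source):
    """Read raw records from the source system."""
    for record in source:
        yield record

def transform(records):
    """Apply processing instructions to each record."""
    for record in records:
        yield {"name": record["name"].strip().title()}

def load(records, target):
    """Deliver processed records to the target system."""
    for record in records:
        target.append(record)

source = [{"name": "  ada lovelace "}, {"name": "GRACE HOPPER"}]
target = []
load(transform(extract(source)), target)
# target → [{'name': 'Ada Lovelace'}, {'name': 'Grace Hopper'}]
```

Generators keep the flow lazy, so records stream through one at a time rather than being materialized all at once.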
My last blog explained how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing.
There are many kinds of data processing pipelines. For example, a pipeline may:
Consumers or “targets” of data pipelines may include:
Below are examples of data processing pipelines that are created by technical and non-technical users:
As a data engineer, you may run the pipelines in batch or streaming mode – depending on your use case. Standardizing names of all new customers once every hour is an example of a batch data quality pipeline. Validating the address of a customer in real time as part of approving a credit card application is an example of a real-time data quality pipeline.
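The two modes can be sketched with a pair of simple data quality rules. The rules and record shapes below are illustrative assumptions, not a real data quality engine:

```python
# Sketch of the two data quality modes described above.

def standardize_name(name):
    """Data quality rule: collapse whitespace and normalize casing."""
    return " ".join(name.split()).title()

def validate_address(address):
    """Data quality rule: minimal completeness check for an address."""
    return all(address.get(field) for field in ("street", "city", "zip"))

# Batch mode: run once an hour over all new customer records.
new_customers = [{"name": "jane   DOE"}, {"name": " john smith"}]
for customer in new_customers:
    customer["name"] = standardize_name(customer["name"])

# Real-time mode: validate a single record as it arrives,
# e.g. while approving a credit card application.
application = {"street": "1 Main St", "city": "Springfield", "zip": "01101"}
approved = validate_address(application)
```

Note that the same rule logic can serve both modes; only the trigger (an hourly schedule vs. an incoming event) differs.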
You may also receive complex structured and unstructured documents, such as NACHA and EDI files, SWIFT messages, and HIPAA transactions. You can receive documents from partners for processing, or prepare documents to send out to partners. This is an example of a B2B data exchange pipeline.
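As a flavor of what such a pipeline does on intake, here is a simplified sketch that routes fixed-width, NACHA-style records by their record type code (the first character of each 94-character line). This is not a full NACHA parser; field layouts beyond the type code are omitted:

```python
# Route NACHA-style fixed-width records by record type code so that
# downstream steps can process each record kind appropriately.

RECORD_TYPES = {
    "1": "file header",
    "5": "batch header",
    "6": "entry detail",
    "8": "batch control",
    "9": "file control",
}

def classify_records(lines):
    """Tag each 94-character record with its type for later routing."""
    return [(RECORD_TYPES.get(line[:1], "unknown"), line) for line in lines]

sample = [
    "1" + "0" * 93,   # file header record (padded placeholder)
    "6" + "0" * 93,   # entry detail record (padded placeholder)
]
tagged = classify_records(sample)
```

A production pipeline would go on to validate field positions, checksums, and batch totals before handing records to partners or internal systems.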
Data matching and merging is a crucial technique of master data management (MDM). An MDM pipeline processes data from different source systems to find duplicate or matching records and merges them, in batch or real time, into a golden record.
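A minimal match-and-merge sketch follows. The fuzzy-match threshold and the survivorship rule (prefer the longer non-empty value) are illustrative assumptions; real MDM systems use far richer matching and survivorship logic:

```python
# Sketch of MDM match-and-merge: detect that two source-system records
# refer to the same customer, then merge them into a golden record.
from difflib import SequenceMatcher

def is_match(a, b, threshold=0.85):
    """Fuzzy-match two records on the customer name."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

def merge(a, b):
    """Survivorship rule: for each field, keep the longer non-empty value."""
    golden = {}
    for key in set(a) | set(b):
        values = [v for v in (a.get(key, ""), b.get(key, "")) if v]
        golden[key] = max(values, key=len) if values else ""
    return golden

crm = {"name": "Jon Smith", "phone": "555-0100"}
billing = {"name": "John Smith", "email": "jsmith@example.com"}
golden = merge(crm, billing) if is_match(crm, billing) else None
```

Here the two names are similar enough to match, so the golden record combines the phone from one system with the email from the other.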
Data pipelines are equally important to data scientists, including citizen data scientists. A data science project requires finding, exploring, cleansing, and integrating data before creating or selecting machine learning models; those models are then tuned, tested, and deployed to execute in real time or batch at scale – yet another example of a data processing pipeline.
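The cleanse-train-deploy flow can be sketched end to end. The "model" below is a deliberately tiny mean-threshold classifier so the pipeline stages stay in focus; a real project would use an ML library:

```python
# Sketch of a data science pipeline: cleanse data, train a model,
# then score records in batch or in real time with the same model.

def cleanse(rows):
    """Drop rows with missing values before modeling."""
    return [r for r in rows if r["amount"] is not None]

def train(rows):
    """Toy 'model': flag amounts above the training mean as high-value."""
    mean = sum(r["amount"] for r in rows) / len(rows)
    return lambda amount: amount > mean

raw = [{"amount": 10.0}, {"amount": None}, {"amount": 90.0}]
model = train(cleanse(raw))

# The deployed model can score a batch of records or a single live event.
batch_scores = [model(a) for a in (5.0, 80.0)]
realtime_score = model(120.0)
```

The key point is that the same trained artifact serves both execution modes, just as with the data quality rules earlier.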
Data processing pipelines have been in use for many years – read data, transform it in some way, and output a new data set. Regardless of use case, persona, context, or data size, a data processing pipeline must connect, collect, integrate, cleanse, prepare, relate, protect, and deliver trusted data at scale and at the speed of business.
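The stage list above can be sketched as a chain of functions applied in order. Each stage name here is a placeholder for real platform logic:

```python
# Compose the pipeline stages into one flow; each placeholder stage
# simply records that it ran, standing in for real transformation logic.
from functools import reduce

def stage(name):
    def run(data):
        return data + [name]
    return run

STAGES = ["connect", "collect", "integrate", "cleanse",
          "prepare", "relate", "protect", "deliver"]

pipeline = [stage(name) for name in STAGES]
result = reduce(lambda data, step: step(data), pipeline, [])
```

Modeling each stage as a function makes the overall pipeline a simple composition, which is also how many orchestration tools represent it.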
In the next blog, I’ll focus on key capabilities of the design tools that make data processing pipelines successful.
Create a data processing pipeline in the cloud – sign up for a free 30-day trial of Informatica Intelligent Cloud Services: https://www.informatica.com/trials