Process Files Quickly and Efficiently in Your Data Lake
This blog is co-authored by Vinay Bachappanavar, Senior Product Manager.
We are excited to announce a new cloud data integration feature called incremental file load, an optimized way to process files continuously and efficiently as new data arrives in cloud storage such as Amazon S3, Microsoft Azure Data Lake Storage (ADLS), and Google Cloud Storage.
Most customers we see are implementing cloud data warehouse and data lake (DWDL) architectural patterns to move data out of source systems. As they build a cloud DWDL, they use object stores like Amazon S3, ADLS, or Google Cloud Storage as their data lake, and Snowflake, Amazon Redshift, or Azure Synapse as their cloud data warehouse for analytics. In this post, we will focus on processing files easily and efficiently within a data lake.
Using Data Lakes for Cloud Storage
Most companies have a variety of data sources, including internet of things (IoT) and edge devices, messaging sources, relational databases, mainframes, and modern cloud apps like Salesforce, Marketo, and Workday. Organizations want to derive value from these diverse data sources by producing actionable and meaningful insights for their business. While most centralize their data in a data warehouse, many also implement modern DWDL architectures, using cloud storage as their data lake so that machine learning (ML) frameworks and Python or R libraries can easily access the data.
The data from these source systems arrives in a variety of formats. For example, data from relational databases is structured, data from APIs is usually in XML or JSON, and data from edge devices is semi-structured. Consequently, the data inside the lake must be enriched, standardized in an open format such as Apache Parquet or Apache Avro, and made accessible for various use cases.
Benefits of Incremental File Load
Incremental file load is an enhanced way to process files quickly as new data arrives from cloud storage. This new feature allows you to:
- Easily identify files as they land with a built-in, metadata-driven framework
- Scale and process massive datasets cost-effectively and reliably in your own cloud network using built-in elasticity
- Automatically identify new files in subfolders as well as partitioned directories on cloud storage
- Enable or disable incremental file load via a simple check box
- Ensure there is no data loss or data duplication with built-in awareness of partially written files
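To make the discovery behavior in the list above more concrete, here is a minimal Python sketch of recursive file discovery that skips partially written files. It is an illustration of the general idea only, not how incremental file load is implemented. The temp-file suffixes, the 60-second quiet period, and the function names are all assumptions made for this example, and it reads a local folder rather than a cloud object store.

```python
import time
from pathlib import Path
from typing import List, Optional

# Hypothetical conventions chosen for this sketch, not taken from the product.
TEMP_SUFFIXES = (".tmp", ".part")  # names writers might use for in-flight files
QUIET_SECONDS = 60                 # a file must be unmodified this long to count as complete


def is_complete(path: Path, now: float) -> bool:
    """Treat a file as fully written if it is not a hidden or temp file
    and has not been modified within the quiet period."""
    if path.name.startswith((".", "_")) or path.name.endswith(TEMP_SUFFIXES):
        return False
    return now - path.stat().st_mtime >= QUIET_SECONDS


def discover_files(root: Path, now: Optional[float] = None) -> List[Path]:
    """Recursively list complete files under root, including subfolders and
    partition-style directories such as year=2024/month=01."""
    now = time.time() if now is None else now
    return sorted(p for p in root.rglob("*") if p.is_file() and is_complete(p, now))
```

Calling `discover_files(Path("/data/raw"))` (a made-up path) would return every settled file under that folder, however deeply it is nested. The quiet-period check is a simple heuristic; a real object store would more likely rely on its own completion signals.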
The benefits are clear. Now let’s walk through how incremental file load works, in three easy steps.
Step #1: Configure the Data Ingestion Process
Informatica offers the industry’s first cloud-native unified mass ingestion solution with Informatica Intelligent Cloud Services’ (IICS) Cloud Mass Ingestion, which can ingest data from various sources. It’s also very easy to use with a simple wizard-driven unified experience for building flows to ingest data from batch sources like files, applications, relational databases, and real-time sources like CDC, IoT systems, and other streaming sources. And it provides a consistent real-time monitoring and lifecycle management experience for jobs so that you can manage them from a single console.
The first step in using the incremental file load feature is to configure the data ingestion process. Let’s say you want to ingest data from an entire database schema into a cloud data lake in raw form:
- First, configure your source (database schema) connection. You can optionally configure rules to filter which tables are ingested, or specify actions such as trimming spaces from columns
- Then, configure the target, which in this case is the cloud data lake
- Next, choose the output file format (Avro, Parquet, etc.)
- From here, configure “schema drift” to automatically detect and handle source schema changes at the target
- Then, configure runtime properties like schedule and load type (initial, incremental, or both)
- Lastly, deploy the mass ingestion job
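The schema drift step in the list above is easier to picture with a small example. The following Python sketch shows one way to detect drift by comparing the schema currently at the target with the schema seen at the source. It is only an illustration under simple assumptions: schemas are plain dictionaries of column name to type name, and `SchemaDrift` and `detect_drift` are hypothetical names, not part of Cloud Mass Ingestion.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    added: dict = field(default_factory=dict)    # columns new in the source: name -> type
    removed: list = field(default_factory=list)  # columns no longer present in the source
    retyped: dict = field(default_factory=dict)  # name -> (old_type, new_type)

    @property
    def detected(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_drift(target: dict, source: dict) -> SchemaDrift:
    """Compare the schema known at the target with the current source schema."""
    drift = SchemaDrift()
    for col, typ in source.items():
        if col not in target:
            drift.added[col] = typ
        elif target[col] != typ:
            drift.retyped[col] = (target[col], typ)
    drift.removed = [c for c in target if c not in source]
    return drift


target_schema = {"id": "int", "name": "string", "zip": "string"}
source_schema = {"id": "long", "name": "string", "email": "string"}
drift = detect_drift(target_schema, source_schema)
```

Here `drift.added` holds the new `email` column, `drift.removed` lists `zip`, and `drift.retyped` records that `id` changed from `int` to `long`. How each kind of change is then handled at the target is a policy decision that this sketch leaves out.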
Step #2: Process Data on the Data Lake
Once the raw data is ingested into the lake, the next step is to incrementally process new data as it lands in cloud storage and make it ready for consumption by ML or analytics. This is a typical workflow in data engineering workloads. Today, it is very challenging to process these files as soon as they arrive in the raw zone.
Case in point: Customers are forced to build complex frameworks to identify new files, move them to a processing directory, trigger their ETL mapping, and finally move the processed files into a different directory. These frameworks are often error-prone, making recovery a challenge in case of failures. So, customers need advanced file processing techniques that identify new files by listing the directory and tracking which files have already been processed.
Informatica incremental file load is an optimized file source that addresses the above limitations and provides an efficient and seamless way for data teams to process the data as it lands on the data lake.
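For readers who want to see what listing a directory and tracking processed files involves, here is a minimal Python sketch of that general technique. It is not Informatica’s implementation. The JSON checkpoint file, the function names, and the `process` callback are assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_processed(checkpoint: Path) -> set:
    """Return the set of file names recorded as processed in earlier runs."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def save_processed(checkpoint: Path, processed: set) -> None:
    """Write to a temp file and rename it, so a crash cannot leave a half-written checkpoint."""
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps(sorted(processed)))
    tmp.replace(checkpoint)


def run_incremental(listing: Iterable[str], checkpoint: Path,
                    process: Callable[[str], None]) -> list:
    """Process only files not seen before, recording each one after it succeeds."""
    processed = load_processed(checkpoint)
    new_files = [f for f in sorted(listing) if f not in processed]
    for f in new_files:
        process(f)  # if this raises, f is not recorded and will be retried next run
        processed.add(f)
        save_processed(checkpoint, processed)
    return new_files
```

Because a file is recorded only after `process` returns, a failed run never silently drops a file. The trade-off is that a crash between processing and recording causes that one file to be processed again, so `process` should be safe to repeat.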
Step #3: Leverage a Data Warehouse for Analytics
Next, use a cloud data warehouse or lakehouse for analytics. The modern DWDL has many benefits over the traditional on-premises data warehouse. For example, it lowers the total cost of ownership and enhances self-service capabilities without compromising on performance or security.
Informatica recommends using Advanced Pushdown Optimization (APDO) to move data from cloud storage to the data warehouse. APDO optimizes performance by pushing mapping logic down to the data warehouse’s native functions. However, mappings that use data quality transformations, Python transformations, address verifiers, and similar logic will benefit tremendously from CDI-Elastic. In those scenarios, you can still leverage incremental file load to incrementally load data from your cloud data lake to your cloud data warehouse.
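The idea behind pushdown can be illustrated with a small, generic Python sketch. It uses an in-memory SQLite database as a stand-in for a cloud data warehouse, and the table names and the trim, uppercase, and filter logic are invented for this example. It does not show how APDO generates or executes SQL. It only shows mapping logic being expressed as SQL, so the database engine does the work and rows are not pulled out to be transformed elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
conn.executescript("""
    CREATE TABLE staged_orders (id INTEGER, customer TEXT, amount REAL);
    CREATE TABLE orders_clean (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO staged_orders VALUES
        (1, '  alice ', 10.0), (2, 'BOB', -5.0), (3, 'carol', 7.5);
""")

# Pushed-down mapping: the filter and the trim/uppercase expressions are written
# as SQL, so the engine applies them with its native functions in one statement.
conn.execute("""
    INSERT INTO orders_clean (id, customer, amount)
    SELECT id, UPPER(TRIM(customer)), amount
    FROM staged_orders
    WHERE amount > 0
""")
conn.commit()

rows = conn.execute(
    "SELECT id, customer, amount FROM orders_clean ORDER BY id"
).fetchall()
```

After this runs, `rows` contains the two cleaned orders with positive amounts. Logic that cannot be expressed in the warehouse’s SQL dialect, such as a custom Python transformation, is the kind of case where a separate processing engine is needed instead.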
Voila! You now know how to quickly process files in your data lake using incremental file load.
Ready to Learn More?
To see how prebuilt advanced transformations and zero-code orchestrations can help you build enterprise integration workloads, sign up for our free 30-day Cloud Data Integration trial.