Data Ingestion and Predictive Analytics Using Spark Structured Streaming and Informatica

Last Published: Aug 05, 2021 |
Vishwanath Belur

IoT and streaming data are increasingly becoming a competitive differentiator for enhancing customer experience, improving operational efficiency, and making business decisions in real time. According to Gartner, “By 2022, more than half of major new business systems will incorporate continuous intelligence that uses real-time context data to improve decisions.”[1]

With the adoption of streaming data, enterprises are looking to address streaming analytics use cases that were not possible before. Streaming and IoT data management includes the capabilities to ingest, manage, and act on a variety of real-time data. Apache Spark brings massive scale and, with the recently introduced Structured Streaming, helps solve real-world streaming analytics use cases. Adoption of cloud-native services for data lakes and compute is also growing, and Databricks, with its scalable engine, provides the ability to process huge volumes of data.

Key Streaming Use Cases

There are two key streaming use case patterns that we hear from customers:

  1. Load the data onto a data lake so that it can be used for data science and advanced analytics projects
  2. Run pre-created predictive models on streaming data in real time as it moves through the pipeline, for example for fraud detection

Informatica offers end-to-end streaming data management, from ingesting streaming and IoT data, to applying reasoning on the data, to finally acting on it. All of this is provided within the same platform, so customers don’t have to manage different pipelines for batch and streaming.


Let’s look at how you can solve for both of these key use cases using Informatica’s multi-latency data management solution.


Ingestion of Streaming Data

The first step in solving your streaming use case is to ingest the IoT data from a variety of sources into Kafka or a data lake. In this example, we’re ingesting streaming data from IoT sources to both Kafka and a data lake.
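The fan-out described above, where each event goes to both a message queue and a data lake, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: an in-memory queue stands in for a Kafka topic, a local directory stands in for the data lake, and the function name, field names (device, ts, temp_c), and the dt=YYYY-MM-DD partition layout are all hypothetical. It is not how Informatica's ingestion products are implemented.

```python
import json
import queue
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(events, topic, lake_root):
    """Send each event to a message queue and to a date-partitioned lake path."""
    for event in events:
        line = json.dumps(event, sort_keys=True)
        # Stand-in for publishing to a Kafka topic (assumption: in-memory queue).
        topic.put(line)
        # Partition lake files by the event's UTC date, e.g. dt=2019-08-25/.
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        part_dir = Path(lake_root) / f"dt={day}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "events.jsonl", "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


# Hypothetical sensor readings; ts is seconds since the Unix epoch (UTC).
sample_events = [
    {"device": "sensor-1", "ts": 1566691200, "temp_c": 21.5},
    {"device": "sensor-2", "ts": 1566694800, "temp_c": 22.1},
    {"device": "sensor-1", "ts": 1566777600, "temp_c": 20.9},
]

with tempfile.TemporaryDirectory() as lake_root:
    topic = queue.Queue()
    ingest(sample_events, topic, lake_root)
    # Count the lines written under each date partition.
    line_counts = {
        p.name: len((p / "events.jsonl").read_text(encoding="utf-8").splitlines())
        for p in Path(lake_root).iterdir()
    }
```

Writing the same serialized record to both targets keeps the real-time path (the queue) and the analytics path (the lake) consistent with each other.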

Informatica offers both cloud-ready and cloud-native streaming and IoT data ingestion solutions from streaming sources like Kafka, IoT, web logs, etc. into cloud data lakes such as Amazon S3 and Azure Data Lake Storage (ADLS) or into messaging systems such as Kafka and Amazon Kinesis. Informatica Edge Data Streaming (EDS) and Ingestion at Scale provide a wizard-based experience for designing the flow and real-time monitoring for managing the jobs.



Batch Processing on IoT Data for Advanced Analytics

It is important to get the raw IoT data ready for analytics using various aggregations and other transformations so that data scientists and data analysts can make use of the data directly in their analytics.

Informatica Big Data Management (BDM) is a cloud-ready big data processing solution. It helps customers perform transformations and aggregations on data stored in the lake, using the highly scalable Apache Spark engine to process large volumes of data at very high performance. Big Data Management enables customers to run the processing jobs on Spark either on premises or in the cloud. Informatica helps customers leverage big data technologies while keeping the business logic abstracted from the underlying technologies.
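To make "aggregations on the data stored in the lake" concrete, here is a minimal sketch of the kind of batch roll-up a data scientist might want: one summary row per device per hour. It uses plain Python rather than Spark so it runs anywhere, and the function name and the field names (device, ts, temp_c) are assumptions made for illustration, not part of any Informatica mapping.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_summary(readings):
    """Aggregate raw readings into one row per (device, hour)."""
    buckets = defaultdict(list)
    for r in readings:
        # Truncate the event timestamp to the start of its UTC hour.
        hour = datetime.fromtimestamp(r["ts"], tz=timezone.utc).strftime("%Y-%m-%dT%H:00Z")
        buckets[(r["device"], hour)].append(r["temp_c"])
    return [
        {
            "device": device,
            "hour": hour,
            "count": len(vals),
            "avg_temp_c": round(sum(vals) / len(vals), 2),
            "max_temp_c": max(vals),
        }
        for (device, hour), vals in sorted(buckets.items())
    ]


# Hypothetical raw readings as they might sit in the lake.
raw = [
    {"device": "sensor-1", "ts": 1566691200, "temp_c": 21.0},
    {"device": "sensor-1", "ts": 1566692100, "temp_c": 23.0},
    {"device": "sensor-1", "ts": 1566694800, "temp_c": 22.0},
    {"device": "sensor-2", "ts": 1566691500, "temp_c": 19.5},
]
summary = hourly_summary(raw)
```

On a Spark engine the same logic would typically be expressed as a group-by on device and a truncated timestamp, with the engine distributing the work across the cluster.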

Real-Time Analytics on IoT Data for Operationalizing ML Models

To solve the real-time streaming analytics part of the use case, we need the ability to apply real-time enrichments to the data, including the ability to run machine learning models on the streaming data.
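As a rough illustration of running a pre-created model on data in flight, the sketch below scores each transaction as it arrives and enriches it with a fraud score and a flag. The logistic-regression coefficients, the feature names (amount, foreign, night), and the 0.5 threshold are all hypothetical values chosen for the example; in practice the model would be trained offline and loaded into the pipeline.

```python
import math

# Hypothetical coefficients standing in for a model trained offline.
WEIGHTS = {"amount": 0.004, "foreign": 1.5, "night": 0.8}
BIAS = -4.0


def fraud_probability(txn):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * txn[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(transactions, threshold=0.5):
    """Yield each transaction enriched with a score and a flag, one at a time."""
    for txn in transactions:
        p = fraud_probability(txn)
        yield {**txn, "fraud_score": p, "flagged": p >= threshold}


# Two hypothetical transactions: a small domestic daytime purchase and a
# large foreign purchase at night.
incoming = [
    {"id": "t1", "amount": 50, "foreign": 0, "night": 0},
    {"id": "t2", "amount": 1200, "foreign": 1, "night": 1},
]
scored = list(score_stream(incoming))
```

Because score_stream is a generator, each record is scored as soon as it is read instead of waiting for a full batch, which is the property that matters for use cases like fraud detection.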

Informatica Big Data Streaming (BDS) is a cloud-ready continuous stream processing solution that addresses real-time parsing and transformation use cases on streaming data, including the ability to operationalize ML models on that data. It uses the power of the Spark Streaming engine to perform analytics on streaming data at very high volumes. Big Data Streaming adopted Spark Structured Streaming in the latest version so that customers can address real-world streaming use cases, including event-time windowing and the handling of late-arriving events.
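Event-time windowing and late-arrival handling are easier to follow with a small example. The sketch below counts events in 60-second tumbling windows keyed on the time the event occurred, not the time it arrived, and uses a watermark (the latest event time seen minus an allowed lateness) to decide when an event is too late to count. It is a simplified, pure-Python model of the idea: the constants and function name are assumptions, and engines such as Spark typically advance the watermark between micro-batches rather than after every event.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window length (assumed for the example)
ALLOWED_LATENESS = 30    # how far behind the newest event a record may be


def windowed_counts(event_times):
    """Count events per event-time window; drop events behind the watermark.

    `event_times` holds event timestamps in seconds, in ARRIVAL order,
    so out-of-order values represent late-arriving events.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time in event_times:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end <= watermark:
            # The window is already closed: the event is too late to count.
            dropped.append(event_time)
            continue
        counts[window_start] += 1
    return dict(counts), dropped


# Arrival order: 50 arrives after 65 (late but accepted),
# 10 arrives after 100 (too late, dropped).
counts, dropped = windowed_counts([5, 20, 65, 50, 100, 10, 130])
```

Here the event at second 50 still lands in the 0 to 60 window because the watermark is only at 35 when it arrives, while the event at second 10 is dropped because the watermark has already moved to 70, past the end of its window.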




Informatica offers a multi-latency data management platform for addressing customers’ batch and streaming use cases. Big Data Streaming and Big Data Management use the same design, development, and monitoring interface, so customers don’t need to build and monitor separate pipelines for batch and streaming. This also helps developers reuse business logic across batch and streaming mappings with zero learning curve.

Check out our Next-Gen Analytics product portfolio to see how our products and services address customer use cases in streaming analytics.

You can follow our discussion of streaming data management in this joint webinar with Databricks: “Keys to Building End-to-End Intelligent Data Pipelines for AI & ML Projects.”

Visit us at Kafka Summit San Francisco (September 30 – October 1) booth #S26 to learn more about streaming analytics at scale with Kafka and Informatica.

[1] Adopt Stream Data Integration to Meet Your Real-Time Data Integration and Analytics Requirements, by Ehtisham Zaidi, Eric Thoo, W. Roy Schulte, 15 March 2019

First Published: Aug 25, 2019