Over the last several months we have made significant enhancements to Amazon Redshift support to address emerging customer use cases. These include new cloud services designed for specific use cases as well as general improvements to our existing Redshift connectivity. Together with our powerful data integration features and the industry’s broadest connectivity, they enable you to configure data flows into your Redshift data warehouse with Informatica’s easy-to-use, cloud-based design tools and to load data from any source, using the load patterns you prefer and the compute engine best suited to your use case.
Major new services and enhancements include:
- Cloud Mass Ingestion into Amazon S3 and Redshift
- Streaming ingestion into S3 and Redshift via Kinesis
- Serverless Spark-based processing
- Databricks support
- Spectrum support
- General connectivity enhancements
As more and more customers implement data lakes on S3 and data warehouses on Redshift, we have developed services and features that support the most commonly observed load patterns:
- Loading data into an S3 data lake and then into a Redshift data warehouse, or
- Loading data directly into Redshift from a wide range of sources, including ERPs, CRMs, databases, file storage systems, and IoT endpoints.
More information about each of these new features follows below.
Cloud Mass Ingestion services
Using our new Cloud Mass Ingestion (CMI) services, you can upload large amounts of data from various sources into your S3 data lake. You can also ingest data directly into your Redshift data warehouse. Mass ingestion tasks differ from data integration tasks: they upload data in bulk, as a whole, whereas data integration tasks typically transform data record by record. While data integration tasks are necessary when you need to transform the data, it is very common to first upload data from its source as-is to either a data lake or a data warehouse staging layer. The mass ingestion services are designed specifically to do this efficiently and include features like Schema Drift to handle changes in the source data structures.
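The bulk-load pattern described above (stage files in S3, then load them as a whole into Redshift) is commonly implemented with Redshift's COPY command. As a minimal sketch, independent of Informatica's own tooling, a helper that builds such a COPY statement might look like this; the table, bucket, and IAM role names are hypothetical placeholders:

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         file_format: str = "PARQUET") -> str:
    """Build a Redshift COPY command that bulk-loads files staged in S3.

    COPY moves data in bulk (whole files at a time), which is what makes
    it a better fit for mass ingestion than record-by-record inserts.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS {file_format};"
    )

# Hypothetical example: the stage location and role ARN are placeholders.
stmt = build_copy_statement(
    "staging.orders",
    "s3://my-data-lake/raw/orders/",
    "arn:aws:iam::123456789012:role/MyRedshiftLoadRole",
)
```

A load tool would execute `stmt` over a JDBC/ODBC connection to the cluster after the files have been staged.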
Streaming Ingestion service
There is also a new Streaming Ingestion service, which can be used to ingest streaming data such as logs, clickstream data, and Kafka messages. You can write this data to Kinesis, S3, or Redshift. Streaming Ingestion can also apply transformations at the edge.
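As an illustrative sketch (not Informatica's implementation), pushing a clickstream event onto a Kinesis data stream with the standard boto3 client could look like the following; the stream name and event fields are assumptions made up for the example:

```python
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event to compact JSON bytes for a Kinesis record."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


def send_clickstream_event(stream_name: str, event: dict,
                           partition_key: str) -> None:
    """Put one record onto a Kinesis data stream (needs AWS credentials)."""
    import boto3  # imported lazily so encode_event works without boto3
    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=stream_name,
        Data=encode_event(event),
        PartitionKey=partition_key,
    )

# Hypothetical usage:
# send_clickstream_event("clickstream-events",
#                        {"page": "/home", "user": "u1"},
#                        partition_key="u1")
```

Records landing on the stream can then be delivered onward to S3 or Redshift, as described above.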
Serverless Spark-based processing
As raw data is loaded into a data lake, it needs to be curated and processed further to make it ready for data science projects, machine learning (ML) algorithm training, or loading into a data warehouse for enterprise analytics. In most cases this involves applying complex transformations and algorithms over very large amounts of data. Since the needs of such projects vary, you need a flexible way to provision and de-provision the resources needed to process it. Informatica Cloud Data Integration Elastic (CDIE) can process and transform such data from S3 and load it into Redshift without your having to provision an engine, while still using Informatica’s easy-to-use tools. Informatica uses CLAIRE-based auto-scaling and auto-tuning to spin up a Spark-based cluster using Kubernetes on AWS, and you are charged only for the duration of use. This lets you curate, filter, process, and load large amounts of data from the data lake to a data warehouse in a very flexible way.
Support for Databricks
For both the S3 and Redshift connectors, we will support Databricks Spark as an engine with our upcoming release of Informatica Data Engineering (formerly called Big Data Management). This allows you to take advantage of Databricks’s powerful, performant data pipelines when you need to process large amounts of data at enterprise scale (learn more about our offerings for Databricks).
Support for Redshift Spectrum
As more and more users look to augment their Redshift data warehouse with data in an S3 data lake, we have extended support for external tables created using Redshift Spectrum. With it, you do not need to load all of your data into the data warehouse: you can read data from your data lake on S3 through external tables just as you read from Redshift tables or views. Such external tables can also be joined with existing Redshift tables, creating a combined data set that you might want to process for a downstream system. Our support for Spectrum makes this easy.
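To make the external-table pattern concrete, here is a hedged sketch of the kind of Spectrum SQL involved, held in Python strings for illustration; the schema, table, column, bucket, and role names are all made up for the example:

```python
# Hypothetical Spectrum setup: register an external schema backed by the
# data catalog, define an external table over Parquet files in S3, then
# join it with a local Redshift table.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA spectrum_lake
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';
""".strip()

CREATE_EXTERNAL_TABLE = """
CREATE EXTERNAL TABLE spectrum_lake.clicks (
    user_id    VARCHAR(64),
    page       VARCHAR(256),
    clicked_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/clicks/';
""".strip()

# Join the S3-backed external table with a local Redshift table,
# producing a combined data set without loading the clicks into Redshift.
JOIN_QUERY = """
SELECT u.user_name, c.page, c.clicked_at
FROM spectrum_lake.clicks AS c
JOIN public.users AS u ON u.user_id = c.user_id;
""".strip()
```

Queries like `JOIN_QUERY` read the S3 data in place via Spectrum while the `users` rows come from the warehouse itself.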
General connector enhancements
In addition to the above, we have made several enhancements to our general Redshift connectivity. These include support for custom JDBC URLs; Redshift as a target for CDC-based loads, with restart/recovery capability; full pushdown optimization to Redshift, which takes advantage of Redshift’s compute power to implement an Extract, Load, and Transform (ELT) pattern; and several performance improvements. The inherent design of our interactions with Redshift also lets you take advantage of Redshift’s cluster-resizing features, both Classic and Elastic resize. We always fetch the latest metadata about the cluster, its nodes, slices, etc., and determine load parameters based on that, so any changes you make to your Redshift cluster configuration are picked up automatically the next time you write data to or read data from Redshift.
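Informatica's actual load-parameter logic is not public, but the general principle of deriving load parameters from fresh cluster metadata can be sketched. Redshift's COPY parallelizes across slices, so a common heuristic is to stage a file count that is a multiple of the slice count; the function name and the 128 MB default below are illustrative assumptions, not documented behavior:

```python
import math


def staged_file_count(num_slices: int, total_size_mb: float,
                      target_mb_per_file: int = 128) -> int:
    """Pick a staged-file count that is a multiple of the slice count,
    so every slice receives an equal share of the COPY workload.

    num_slices would come from freshly fetched cluster metadata (e.g.
    after an Elastic or Classic resize), which is why re-reading it
    before each load picks up configuration changes automatically.
    """
    files_needed = max(1, math.ceil(total_size_mb / target_mb_per_file))
    # Round up to the next multiple of the slice count.
    return math.ceil(files_needed / num_slices) * num_slices
```

For example, after resizing from a 4-slice to an 8-slice cluster, the same 100 MB load would be re-split into 8 files instead of 4 on the next run, with no change to the task definition.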
As Redshift’s features and our customers’ usage patterns evolve, we are ensuring that our Redshift connectivity keeps pace. We have several enhancements on the near-term roadmap: additional AWS/Redshift authentication methods, stored procedure support, Spectrum enhancements using the Glue catalog, broader KMS support, and more. Stay tuned!
If you’re going to re:Invent, be sure to see us at booth #1305. You can book a meeting now to meet 1:1 with one of our experts and learn about the latest AI-powered technology to accelerate AWS migration at scale. Learn more about our offerings for AWS.