Given the exponential growth of data-driven applications and the benefits of the cloud, modern enterprises increasingly use object storage systems, or cloud data lakes, for advanced analytics and AI applications. Cloud data lakes give organizations additional flexibility by democratizing access to data for virtually all data consumers. Like books in a library, data in a data lake is organized and stored as datasets in files, with directory structures providing self-service access.
Although this traditional method has many benefits, needs have changed. Modern applications now require:
- Simultaneous data access and changes by multiple applications written in different programming languages
- Transaction management (atomicity, consistency, isolation, and durability [ACID] compliance)
- Schema management and evolution of data
- The ability to track and audit changes
- Version control and time travel
- And much more
Although a metadata catalog (or metastore) on a data lake can help define schema, location and more, it has limitations when serving the demands of modern applications.
This is where applying table formats to data becomes extremely useful. Table formats explicitly define a table, its metadata and the files that compose the table. They help organize both structured and unstructured data in a cloud data lake and help organizations meet the requirements of modern applications. This approach is becoming popular among data engineers; let’s find out why.
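To make the idea concrete, here is a deliberately simplified toy in Python showing what a table format tracks: a schema plus immutable snapshots of the files that compose the table, which is what makes version control and time travel possible. The class and file names below are hypothetical illustrations, not Apache Iceberg's actual metadata layout.

```python
class ToyTable:
    """Toy sketch of a table format: schema + immutable file snapshots."""

    def __init__(self, schema):
        self.schema = schema
        self.snapshots = []  # each snapshot is an immutable list of data files

    def commit(self, data_files):
        # A commit produces a new snapshot; older snapshots stay readable.
        self.snapshots.append(list(data_files))
        return len(self.snapshots) - 1  # snapshot id

    def files_as_of(self, snapshot_id=None):
        # "Time travel": read the file list of any historical snapshot.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = ToyTable(schema=["id: long", "event_ts: timestamp"])
v0 = table.commit(["s3://lake/data/file-0001.parquet"])
v1 = table.commit(["s3://lake/data/file-0001.parquet",
                   "s3://lake/data/file-0002.parquet"])
print(table.files_as_of(v0))  # the table as it looked at the first commit
print(table.files_as_of())    # the latest snapshot
```

Because every commit is a new snapshot rather than an in-place edit, readers can query the table as of any earlier version, which is the essence of the time travel and audit requirements listed above.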
Why Data Engineers Are Turning to Apache Iceberg
Apache Iceberg is a high-performance, open-source table format designed specifically for large-scale data lakes with petabytes of data. It brings the reliability and simplicity of SQL tables in a relational database management system (RDBMS) to big data in a data lake. Apache Iceberg also lets organizations build an open data architecture, which eliminates vendor and technology lock-in.
Here are key capabilities and benefits of Apache Iceberg that make it a hit among data engineers:
- Open and flexible: Apache Iceberg is an open table format that lets multiple applications and data processing tools update the same tables directly and concurrently. This gives developers more flexibility to build high-performance, data-driven applications at scale.
- Easy to manage: Apache Iceberg provides transactional consistency and supports schema evolution. It improves data engineers’ productivity because schema changes no longer force them to rewrite tables. It also empowers them to build data pipelines that deliver high-quality, trusted data for analytics consumption with minimal maintenance.
- Simplified data access and superior performance: Apache Iceberg does not require user-maintained partition columns. Instead, it handles the tedious and error-prone task of producing partition values automatically. This delivers high-performance queries that do not depend on the table’s physical layout.
- Out-of-the-box support for major data processing engines and frameworks: Apache Iceberg is designed to work with commonly used data processing frameworks and tools. It also offers connectors and APIs that can seamlessly integrate with popular frameworks, such as Spark, Flink, Presto and Hive. This enables data engineers and analysts to make the most of existing tools and systems without making significant changes. It also improves data engineers’ productivity since they don’t have to learn new tools.
- Active support from the open-source community: Apache Iceberg is an open-source table format with active support and contributions from community members. Data engineers can get their questions answered quickly by posting in community channels, and they benefit from regular feature upgrades.
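The "hidden partitioning" capability above is worth a quick sketch. Iceberg derives partition values from column data via declared transforms, so users never maintain partition columns by hand. The Python below mimics Iceberg's `day` and `bucket` transforms in spirit only; in particular, the real bucket transform uses a 32-bit Murmur3 hash, not Python's built-in `hash()`, and the function names here are illustrative assumptions.

```python
from datetime import datetime


def day_transform(ts: datetime) -> str:
    # Partition by calendar day, derived automatically from a timestamp column.
    return ts.strftime("%Y-%m-%d")


def bucket_transform(value: int, n_buckets: int) -> int:
    # Hash-partition an id column into n buckets (simplified stand-in hash;
    # Iceberg specifies Murmur3 so all engines agree on bucket assignment).
    return hash(value) % n_buckets


# The table format computes the partition tuple for each row on write;
# queries then prune files by partition without users knowing the layout.
row = {"id": 42, "event_ts": datetime(2023, 7, 4, 12, 30)}
partition = (day_transform(row["event_ts"]),
             bucket_transform(row["id"], 16))
print(partition)
```

Because the transforms are recorded in table metadata rather than baked into directory names, the partition scheme can even evolve later without rewriting old data files.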
These capabilities make Apache Iceberg an attractive choice for data engineers and organizations that want to enhance their data lake infrastructure and optimize data processing workflows. The cloud ecosystem vendors tend to agree. Read on to learn more.
Cloud Ecosystems Are Embracing Apache Iceberg
As cost-effective, durable and flexible cloud storage has proliferated, it has put immense pressure on businesses to perform high-performance analytics on the structured and unstructured data stored within it. Given Apache Iceberg’s impressive features and ease of use, it appears to be a great fit for the lakehouse era. In fact, Iceberg is being widely adopted by the cloud ecosystem vendors:
- BigLake, Google Cloud's data lake storage engine, and Cloudera started supporting Apache Iceberg late last year.
- At the Snowflake Summit this past June, Snowflake introduced “Iceberg tables,” which let organizations use their own storage in the Apache Iceberg format, whether the data is managed by Snowflake or externally. Best part? You still benefit from Snowflake’s ease of use, performance and unified governance.
- Databricks unveiled Universal Format (UniForm) for Apache Iceberg compatibility with Delta tables during its 2023 Data and AI Summit.
- Amazon Web Services (AWS) selected the Apache Iceberg table format to extend the reach of its Redshift data warehouse to data lakes.
- Watsonx.data, IBM’s new data lakehouse offering, also supports Apache Iceberg.
Now that we better understand why Apache Iceberg is so widely accepted by top cloud providers, let’s review how Informatica can help you take your cloud data lake and modern data architecture initiatives, such as data fabric, data mesh and lakehouse, to the next level.
Jumpstart Your Cloud Data Lake Projects With Informatica and Apache Iceberg
Informatica Intelligent Data Management Cloud (IDMC) provides an intelligent, agnostic and comprehensive data management platform for cloud data lakes. With IDMC, data engineers can construct streamlined and forward-looking data management applications and pipelines without relying on expensive, inefficient approaches like manual coding, piecing together individual products, or solutions restricted to specific ecosystems. Plus, IDMC can be integrated with all the major cloud ecosystems. Apache Iceberg is just the latest innovation in data lakes IDMC supports, joining Hadoop Distributed File System (HDFS), Delta, Parquet, Microsoft OneLake and others.
Below are some IDMC key capabilities that seamlessly work with Apache Iceberg:
- CLAIRE, our FinOps-enabled AI engine, supports reading, writing and processing data within an Apache Iceberg table. CLAIRE automatically selects the optimal execution mode (i.e., native, Spark extract, load, transform [ELT] or SQL ELT) based on the data pipeline, which can help optimize costs.
- The unified, intuitive design interface lets you build data pipelines effectively with Apache Iceberg tables. This code-free, GUI-based interface improves data engineers’ productivity since they don’t need to write custom code to build data pipelines.
- Replication and synchronization capabilities can synchronize and replicate data in real time from various data sources into cloud ecosystems with underlying Apache Iceberg tables. This provides real-time data to data scientists for building and deploying AI/ML models at scale to put AI into action.