What Is Apache Iceberg? Understanding Iceberg Tables

Jeffrey Erickson | Senior Writer | November 17, 2025

Iceberg tables were created by Netflix engineers who had pushed the streaming service's real-time analytics beyond what its Apache Hive-based data warehouses could handle. Their solution? Develop a table format that lived in the bulk data storage layer itself. They dubbed the format Iceberg, and the new system did indeed provide a more scalable and predictable platform for managing massive, frequently updated data stores.

Seeing that the Iceberg technology solved an issue commonly faced by users of large, complex data lakehouses, Netflix open sourced the project and donated it to the Apache Software Foundation in 2018. Iceberg quickly found an enthusiastic user base and in 2020 became a top-level Apache project. It's now maintained by a broad community of developers and is embraced by hyperscale cloud data management systems, such as Oracle Autonomous AI Database. Here's a look at what Iceberg can do and how to use it.

What Is Apache Iceberg?

Apache Iceberg is an open source table format designed to simplify the management of vast data lakes and data lakehouses while improving query performance.

Iceberg tables differ from traditional relational tables found in databases such as Postgres, MySQL, or Oracle. Relational tables store both data and metadata inside the database that processes them and are well suited to structured application data; the database can also enforce strict relationships between tables. Iceberg tables, on the other hand, store both data and metadata in a file system storage layer, such as your local file system, Amazon S3, Google Cloud Storage, or Oracle Object Storage. This separation of storage from compute decouples data processing from the data itself and gives end users the flexibility to choose the processing engine that is right for their specific needs.

Iceberg tables facilitate data analytics by combining the scalability and flexibility of data lakes with the reliability and performance of traditional data warehouses, making Iceberg tables a popular format for data lakehouses. For example, they support both real-time analytics and batch processing workloads on vast amounts of data while also providing ACID-compliant transactions. ACID transactions help ensure data integrity by providing atomicity, consistency, isolation, and durability to database transactions and are fundamental to relational databases and data warehouses. Iceberg tables extend this integrity to data lakehouse transactions.

Iceberg tables also provide flexibility, allowing you to specify different file formats for different workloads. For batch analytics, Iceberg supports columnar formats, like Parquet or ORC, and row-based formats, such as Avro. To use Parquet files with Iceberg, for example, you would create an Iceberg table and configure it to use Parquet as the data file format using tools provided by Iceberg. Iceberg tables also support functions for efficient querying, such as data partitioning and pruning, which reduce the amount of data scanned per query.
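The effect of partitioning and pruning can be illustrated with a small Python sketch. Everything here is hypothetical (the file names and the `plan_scan` helper are invented for the example, not an Iceberg API): the idea is that because table metadata records each data file's partition value, a query planner can skip files that cannot match a query's filter.

```python
# Toy illustration of partition pruning (not the real Iceberg API).
# Table metadata maps each data file to the partition value it holds,
# so a planner can skip files that cannot match the filter.

data_files = {
    "events-2025-01.parquet": {"month": "2025-01"},
    "events-2025-02.parquet": {"month": "2025-02"},
    "events-2025-03.parquet": {"month": "2025-03"},
}

def plan_scan(files, month):
    """Return only the files whose partition matches the predicate."""
    return [name for name, part in files.items() if part["month"] == month]

# A query filtered on month = '2025-02' reads one file instead of three.
print(plan_scan(data_files, "2025-02"))
```

A real engine performs this pruning using Iceberg's manifest files, which store per-file partition values and column statistics, but the planning principle is the same.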

Another strength of the Iceberg table format is its advanced data versioning features, such as snapshots and time travel. Each change to the table is recorded as a new snapshot, and time travel allows users to query the table as it existed at any point in its history. This feature is highly useful for auditing, debugging, and compliance purposes—and, of course, rolling back changes if necessary.
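The snapshot model behind time travel can be sketched in a few lines of plain Python. This is a toy model, not Iceberg's actual implementation (the `Table` class and its methods are invented for illustration): the point is that every commit produces a new immutable snapshot while older snapshots remain queryable.

```python
# Toy model of Iceberg-style snapshots and time travel (illustrative only).
# Each write produces a new immutable snapshot; old snapshots stay queryable.

class Table:
    def __init__(self):
        self.snapshots = []          # ordered history of immutable states

    def commit(self, rows):
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + rows)   # new snapshot, old untouched
        return len(self.snapshots) - 1          # snapshot id

    def read(self, snapshot_id=None):
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1   # latest by default
        return self.snapshots[snapshot_id]

t = Table()
s0 = t.commit(["a"])
t.commit(["b"])
print(t.read())     # latest state: ['a', 'b']
print(t.read(s0))   # time travel to the first snapshot: ['a']
```

Rolling back a bad change is then just a matter of pointing the table back at an earlier snapshot, which is exactly what Iceberg's rollback operations do at the metadata level.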

What Are Iceberg Tables?

Iceberg tables are an open source table format designed for large-scale analytics in data lakes. Iceberg tables are self-describing, meaning they contain metadata about their schemas and data, which can simplify querying and management.

Key Takeaways

  • Apache Iceberg is an open source table format designed to simplify management and improve query performance for big data workloads.
  • Iceberg tables contain their own metadata about schemas and data, allowing them to be stored separately from the compute resources that run the analytics.
  • Iceberg tables can be accessed and updated concurrently by different data processing engines.
  • Robust versioning features and support for time travel mean you can query data as it existed at any point in the past.
  • Iceberg supports ACID transactions, bringing high levels of data integrity and consistency to data lakehouse management.

Apache Iceberg Explained

Iceberg tables are an open source table format that's managed by the Apache Software Foundation and designed for large-scale data analytics in data lakehouses. Apache Iceberg tables address the challenges of managing and querying large data sets in environments where multiple users or processes are writing to the same data set.

To accomplish this, Iceberg tables provide several key architectural features and management benefits. First, Iceberg tables are self-describing, meaning they contain their own metadata about their schemas and data. They reside in a cloud storage layer, such as Amazon S3, Oracle Object Storage, Google Cloud Storage, or Hadoop Distributed File System (HDFS), rather than being part of a database or other query engine. This provides many advantages. For example, because they are held separate from the compute needed to process queries, Iceberg tables can be used with your data processing engine of choice, such as Apache Spark, Apache Flink, Trino, or an enterprise data management system such as Oracle Autonomous AI Database.

Iceberg tables provide a range of helpful features for managing and querying large, complex data lakehouses, such as data versioning and ACID transactions. Moreover, Iceberg's schema evolution capabilities let organizations adapt their data schemas over time without the need for complex and time-consuming table rewrites. Robust data versioning features include time travel capabilities, which help maintain a clear and traceable history of data changes and allow users to query data as it existed at a previous point in time.

Iceberg tables offer flexibility in terms of data formats and processing engines, so engineers can select the data storage format best suited for each workload, such as batch analytics or transactions.

Data security in Iceberg table architectures can be achieved through a combination of the native features delivered by the Apache Software Foundation and those provided by cloud storage vendors and the data management systems accessing and querying the Iceberg table. The result is a mix of data encryption, access controls, data masking, and other data governance features, as well as replication and synchronization capabilities for recovery from failures or attacks.

How Does Apache Iceberg Work?

Apache Iceberg is an open table format designed to help manage and query petabyte-scale data sets efficiently. It was developed to address the limitations of existing file formats for data lakes, which didn’t provide a way to manage data at the table level.

Unlike traditional tables, Iceberg tables were built to handle the complexities of big data sets that incur frequent updates, deletions, and schema evolution. Iceberg has proven to be a robust and efficient way to manage data, whether you’re performing batch processing, real-time streaming, or interactive queries. Here’s how.

  • Open table format: Iceberg tables reside in the data storage layer and contain metadata that tracks the state of the table, including the location and status of data files.
  • Metadata management: Metadata for the table is stored in your choice of metadata catalog, such as Hive Metastore for the Hadoop environment, or another catalog, such as AWS Glue for typical cloud storage. These tools provide a centralized and structured repository for table definitions, schemas, and other metadata, and catalogs provide robust metadata management and easy integration with various data processing engines.
  • Powerful versioning: This metadata is versioned, meaning that each change to a table—such as adding or removing data—creates a new version of the metadata, allowing you to track the history of the table and roll back to previous states if needed. One of the key features in this process is Iceberg's support for ACID transactions. This means that operations such as inserts, updates, and deletes are atomic and consistent, ensuring that the table remains in a valid state even if multiple operations are happening simultaneously. Iceberg achieves this by using a snapshot model, where each snapshot represents a consistent view of the table at a specific point in time. When you perform a write operation, for example, Iceberg creates a new snapshot that includes the changes, and the previous snapshot remains unchanged until the new one is committed. These snapshots enable a function called time travel, which allows you to query data as it existed at a point in the past. This is helpful for debugging, compliance operations, and other use cases.
  • Schema evolution: Iceberg also supports schema evolution, which allows you to add, remove, or modify columns in the table without having to rewrite the entire data set—a must for big data environments where schemas often change over time. When you alter the schema, Iceberg updates the metadata to reflect the changes, and the data files are modified only as needed, minimizing the overhead of schema changes.
  • Choice of file format: In Iceberg tables, data is stored in files, typically in formats such as Parquet, ORC, or Avro that are optimized for efficient storage and querying based on your use case.
  • Choice of processing framework: Iceberg integrates with a wide range of data processing frameworks and engines, such as open source Apache Spark, Apache Flink, and Trino, or enterprise data platforms, such as the Oracle Autonomous AI Database. This integration allows you to query Iceberg tables as if they were native tables, leveraging the performance and scalability of Iceberg while maintaining the flexibility and power of your existing data processing pipelines.
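The schema evolution step described above can be sketched as a toy Python model. The table dictionary and helper functions here are invented for illustration and are not Iceberg's real metadata format; what the sketch shows is the key property: adding a column touches only the schema history in metadata, and old data files are simply reinterpreted on read rather than rewritten.

```python
# Toy sketch of metadata-only schema evolution (illustrative, not Iceberg's
# actual implementation). Adding a column appends to the schema history;
# existing data files are reinterpreted on read instead of being rewritten.

table = {
    "schemas": [["id", "name"]],                       # schema versions
    "files": [{"schema": 0, "rows": [(1, "ada")]}],    # written under schema 0
}

def add_column(tbl, col):
    tbl["schemas"].append(tbl["schemas"][-1] + [col])  # metadata change only

def read(tbl):
    latest = tbl["schemas"][-1]
    out = []
    for f in tbl["files"]:
        for row in f["rows"]:
            rec = dict(zip(tbl["schemas"][f["schema"]], row))
            # Columns added after the file was written read back as None.
            out.append({c: rec.get(c) for c in latest})
    return out

add_column(table, "email")   # no data files touched
print(read(table))           # [{'id': 1, 'name': 'ada', 'email': None}]
```

Real Iceberg tracks schemas by id in the table metadata and resolves columns by field id rather than by position, which is what makes renames and drops safe as well, but the read-time reinterpretation idea is the same.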

Benefits of Using Apache Iceberg

The many benefits of Apache Iceberg make it a compelling choice for organizations looking to manage and process large data sets efficiently and reliably.

  • ACID transactions: For applications that require high data integrity or real-time analytics, Apache Iceberg supports ACID transactions to ensure that data operations are reliable and consistent.
  • Time travel and versioning: For common processes, like auditing and debugging, Iceberg allows you to maintain a record of multiple versions of your table. This allows for time travel queries, where you can see data as it existed at any point in the past.
  • Integration with data processing engines: Iceberg can be used with popular data processing engines like Apache Spark, Apache Flink, and Trino and is easily managed with enterprise data management engines, such as Oracle Autonomous AI Database. This can help eliminate silos between data management providers and enable you to leverage the best data querying and processing tools in your data pipelines.
  • Schema evolution: Iceberg tables store schema changes in metadata files, so you can modify the structure of your data schema over time without disrupting existing data or queries. This feature enables you to add, remove, or rename columns and even change data types, all while maintaining the integrity and consistency of your data. Existing queries will simply use the schema that was active at the time they were made.
  • A growing ecosystem: Iceberg integrates with a wide range of data processing frameworks and engines, including Spark, Flink, and Trino, and cloud-based data warehouses, such as BigQuery, Amazon Redshift, Snowflake, and Oracle AI Database, making it a versatile choice for modern data architectures.
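The atomic-commit behavior behind the ACID guarantees above can be sketched as an optimistic compare-and-swap on a catalog pointer. This is a simplified toy (the `Catalog` class is invented for the example), but it mirrors the core idea: a commit succeeds only if the table hasn't changed since the writer last read it; otherwise the writer must refresh its view and retry.

```python
# Toy sketch of Iceberg-style atomic commits (illustrative only). The catalog
# holds a pointer to the current table metadata; a commit swaps that pointer
# only if no other writer has committed in the meantime.

class Catalog:
    def __init__(self):
        self.current = 0                  # version of the current metadata

    def commit(self, expected, new):
        """Atomic compare-and-swap: succeed only from the expected version."""
        if self.current != expected:
            return False                  # another writer committed first
        self.current = new
        return True

cat = Catalog()
base = cat.current
print(cat.commit(base, base + 1))   # writer A wins: True
print(cat.commit(base, base + 1))   # writer B, using a stale base: False
```

Because readers only ever follow the current pointer, they see either the old snapshot or the new one in full, never a half-applied write, which is how concurrent engines can safely share one table.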

Apache Iceberg Use Cases

Apache Iceberg is a high performance table format for big data workloads that supports a range of data types and multiple query engines as well as ACID transactions. Here are five common use cases for Apache Iceberg.

  • Data lake management: Iceberg was originally designed to help manage large data lakes by providing a structured and efficient way to store, query, and manage petabytes of data.
  • Real-time analytics: Iceberg can be used in real-time analytics pipelines to ensure that data is consistently and reliably available for analysis. Its support for ACID transactions and snapshots helps maintain integrity and consistency, even when data is being updated frequently.
  • Data versioning and auditing: Iceberg's versioning capabilities allow you to keep track of changes to your data over time. This is particularly useful for auditing purposes, as you can easily revert to previous versions of the data or analyze historical changes.
  • ETL workflows: In extract, transform, and load, or ETL, workflows, Iceberg can serve as a robust and scalable target for data transformation and loading processes. Its ability to handle large volumes of data and support efficient data manipulation operations makes it a valuable tool in data engineering pipelines.
  • Machine learning and data science: In machine learning and data science tasks, where data needs to be accessed and processed in a consistent and efficient manner, Iceberg tables provide fast access to data and support the iterative nature of model training and validation.

Who Uses Apache Iceberg?

Iceberg tables were created by engineers at Netflix to bring more speed and flexibility to streaming analytics and real-time recommendation engines, and the format excels at this sort of use case. Now an open source project run by the Apache Software Foundation, Iceberg is still used by Netflix, as well as a growing number of organizations, including the following:

  • Snowflake: Snowflake was an early user of Apache Iceberg, using it to enhance its data lakehouse architecture by allowing data to be stored in a common, open format and accessed by a wide range of platforms.
  • Oracle: Oracle uses Apache Iceberg to optimize data storage and querying in its cloud data warehousing solutions. Oracle also optimized its Autonomous AI Database query engine to read and update Iceberg tables on any cloud storage.
  • Apple: Apple is a growing customer of Apache Iceberg tables, which it uses to efficiently manage and query its large data sets. The company has become a frequent contributor to the Apache Iceberg project.
  • Airbnb: Airbnb employs Apache Iceberg to improve the performance of the firm’s data warehouse architecture.
  • Lyft: Lyft uses Apache Iceberg to manage its data lakes and maintain data consistency across its operations.
  • Salesforce: Salesforce leverages Apache Iceberg to enhance its data lakes and warehousing services, especially as its large and varied customer base requires interoperability with a wide range of data processing frameworks and services.

Optimize Performance with Apache Iceberg and Autonomous AI Database

Oracle Autonomous AI Database supports Apache Iceberg tables. If your data sets are already in Iceberg format on a different cloud, they can be easily read by Oracle Autonomous AI Database, reducing data duplication and enhancing the flexibility of your operations. Now you can take advantage of Iceberg tables for managing large, complex data sets and optimizing query performance while enjoying cross-platform data accessibility with your Oracle data management platform. There, you can build scalable AI-powered apps using your choice of large language model (LLM) and deploy in the cloud or your data center.

Apache Iceberg has forever transformed the data management landscape. It arrived as companies like Netflix were struggling to manage and query massive data sets and mixed data types. As these conditions have become more common, Iceberg tables have become a staple of the modern data ecosystem. Look for the trend to continue as more businesses look to streamline their data pipelines and derive insights from big data analytics.

Is your data infrastructure set up to handle similarity search and other AI initiatives? Our ebook lays out a plan to build a data foundation robust enough to support AI success.

Apache Iceberg FAQs

Is Apache Iceberg better than Delta Lake?

Neither Delta Lake nor Apache Iceberg can be considered generally superior. Both are table formats for managing data sets for analysis. A simple way to think about the differences is that Delta Lake does best within the Spark ecosystem, while Iceberg offers wider compatibility and works well with a range of data management engines.

What is Apache Iceberg used for?

Apache Iceberg is used to simplify the management and querying of very large, complex, multiplatform data sets, unifying them into a single data layer for analytics.