Apache Iceberg

Apache Iceberg is a high-performance, open-source table format specification designed for managing huge analytical datasets stored in data lakes (like AWS S3, Google Cloud Storage, Azure Blob Storage). It's not a storage engine itself but a specification defining how data files are organized and how metadata is tracked, bringing the reliability and structure of traditional databases directly to distributed file systems and object stores.

Iceberg was created at Netflix to address the challenges of managing massive, evolving datasets on cloud storage and is now a top-level Apache Software Foundation project. It's a foundational technology enabling modern Data Lakehouse architectures, including Streaming Lakehouses.

The Problem: Limitations of Traditional Data Lake Tables

Before table formats like Iceberg, managing tables directly on data lakes (often using just directory structures, sometimes with Hive metastore conventions) faced significant issues:

  1. Reliability: Concurrent writes could lead to data corruption or inconsistent views. Atomic operations were difficult to guarantee.
  2. Performance: Listing large numbers of files in directories (common in partitioned tables) was slow and expensive on object stores. Query planning required costly file system interactions.
  3. Schema Evolution: Modifying table schemas (adding/dropping/renaming columns) was risky and could break existing data or queries.
  4. Partitioning Issues: Changing the partitioning scheme of a table required rewriting the entire dataset, and partitioning by derived values (like 'month(timestamp)') forced users to compute and query an explicit partition column, which was error-prone.
  5. Consistency: Readers could see partial results from writers mid-operation.

Apache Iceberg was designed specifically to solve these problems by managing metadata independently from the data files.

Core Concepts and How Iceberg Works

Iceberg's core idea is to manage table state through immutable metadata layers, tracked via snapshots (a queryable view of this hierarchy is sketched after the list):

  • Table Metadata File: The entry point for any Iceberg table. Contains information about the table schema, current snapshot ID, partitioning spec, sort order, and pointers to previous metadata files (for history).
  • Snapshot: Represents the complete state of a table at a specific point in time. Each write operation (commit) creates a new snapshot. A snapshot points to a manifest list file.
  • Manifest List: A file listing the manifest files associated with a snapshot. It includes metadata about each manifest, such as partition value ranges, allowing query engines to quickly prune entire manifests (and thus many data files) without reading them.
  • Manifest File: A file listing a subset of the data files belonging to the table, along with per-file metadata like column-level statistics (min/max values, null counts), partition membership, and file paths. This enables efficient pruning of individual data files during query planning.
  • Data Files: The actual data, typically stored in formats like Parquet, Avro, or ORC. Iceberg tracks these files but doesn't dictate their internal format.
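
Most query engines expose this hierarchy as queryable metadata tables. As a minimal sketch in Spark SQL, assuming an Iceberg table named db.events registered in the session catalog (the table and its columns are hypothetical):

  -- Snapshots: one row per commit, each pointing to its manifest list
  SELECT snapshot_id, committed_at, operation, manifest_list
  FROM db.events.snapshots;

  -- Manifests in the current snapshot, with partition summaries for pruning
  SELECT path, added_data_files_count, partition_summaries
  FROM db.events.manifests;

  -- Data files with per-file, column-level statistics
  SELECT file_path, record_count, lower_bounds, upper_bounds
  FROM db.events.files;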

Write/Read Path:

  1. Write: When data is written, new data files are created. New manifest files are written listing these data files. A new manifest list is created pointing to the relevant manifests. Finally, a new table metadata file is atomically swapped in, pointing to the new manifest list and marking the new snapshot as current. This atomic swap provides transactional guarantees.
  2. Read: Query engines start from the current table metadata file to locate the current snapshot's manifest list, prune irrelevant manifests using the partition ranges recorded there, prune individual data files using the column statistics and partition info in the remaining manifests, and finally read only the surviving data files.
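
To make the atomic commit concrete, the following hedged Spark SQL sketch (same hypothetical db.events table as above) appends rows and then reads back the snapshot that the commit created:

  -- Each successful write commits exactly one new snapshot
  INSERT INTO db.events VALUES (1, 'click', TIMESTAMP '2024-01-01 10:00:00');

  -- The most recent row is the snapshot produced by the INSERT above
  SELECT snapshot_id, operation, committed_at
  FROM db.events.snapshots
  ORDER BY committed_at DESC;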

Key Features:

  • Snapshot Isolation & ACID Transactions: Each write creates a new, immutable snapshot. Reads always use a specific snapshot, ensuring consistency. The atomic swap of the metadata pointer provides ACID guarantees for commits.
  • Scalable Metadata: Avoids costly directory listing operations by tracking files in manifest files. Metadata is hierarchical, allowing for efficient pruning.
  • Full Schema Evolution: Safely supports adding, dropping, renaming, updating, and reordering columns without rewriting data files. Changes are tracked via unique field IDs in the metadata (see the SQL sketch after this list).
  • Partition Evolution: Allows changing the partitioning scheme of a table without rewriting old data. Old data remains readable with its original partitioning, while new data uses the new scheme.
  • Hidden Partitioning: Defines partitions using transformations on source columns (e.g., 'hours(timestamp)', 'bucket(16, category)'), hiding the physical layout from users, who query using the original source columns. Iceberg manages the relationship automatically.
  • Time Travel & Rollback: Allows querying previous snapshots of the table (querying data as of a specific time or snapshot ID) and easily rolling back the table state to a prior snapshot.
  • Data Compaction & Maintenance: Provides mechanisms and specifications for background tasks like compacting small files into larger ones or removing orphaned files without impacting concurrent readers/writers.
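
Several of these features map directly onto SQL. The sketch below uses Iceberg's Spark DDL extensions; table names and the snapshot ID are hypothetical, and exact syntax varies by engine and version:

  -- Hidden partitioning: partition by transforms of source columns
  CREATE TABLE db.events (id BIGINT, category STRING, ts TIMESTAMP)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, category));

  -- Schema evolution: metadata-only changes, no data files rewritten
  ALTER TABLE db.events ADD COLUMN country STRING;
  ALTER TABLE db.events RENAME COLUMN category TO event_type;

  -- Partition evolution: new data uses the new spec, old data is untouched
  ALTER TABLE db.events DROP PARTITION FIELD days(ts);
  ALTER TABLE db.events ADD PARTITION FIELD hours(ts);

  -- Time travel: query the table as of a snapshot ID or a timestamp
  SELECT * FROM db.events VERSION AS OF 123456789;
  SELECT * FROM db.events TIMESTAMP AS OF '2024-01-01 00:00:00';

  -- Rollback and compaction via stored procedures
  CALL spark_catalog.system.rollback_to_snapshot('db.events', 123456789);
  CALL spark_catalog.system.rewrite_data_files(table => 'db.events');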

Key Benefits

  • Reliability on Data Lakes: Enables ACID transactions and consistent views for concurrent operations.
  • Improved Performance: Efficient file pruning during query planning significantly speeds up queries on large tables compared to directory-based listings.
  • Safe Schema & Partition Evolution: Adapts to changing data and requirements without expensive rewrites or risking data corruption.
  • Abstraction: Hides physical layout details (partitioning, file locations) from users.
  • Engine Agnosticism: Designed to work with various query engines (Spark, Trino, Flink, Presto, etc.).
  • Time Travel Capabilities: Essential for auditing, debugging, and reproducibility.

Common Use Cases

  • Building reliable and performant large-scale analytical tables on data lakes.
  • Replacing fragile Hive-style tables.
  • Implementing robust batch ETL/ELT pipelines.
  • Serving as the storage foundation for a Data Lakehouse or Streaming Lakehouse.
  • Providing a reliable target (sink) for streaming data ingestion.

Iceberg and RisingWave

Apache Iceberg is a critical component in realizing the Streaming Lakehouse vision with RisingWave. RisingWave provides a dedicated Apache Iceberg sink connector.

This allows RisingWave to:

  1. Continuously process streaming data using Streaming SQL.
  2. Maintain fresh results in Materialized Views.
  3. Efficiently write these results or changes into Iceberg tables stored on cloud object storage, as sketched below.
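
For illustration, a RisingWave Iceberg sink definition might look like the sketch below. The connector name ('iceberg') is real, but the exact parameter set (e.g., warehouse.path, database.name, table.name, and catalog settings) depends on the RisingWave version and your catalog setup, so treat the details as assumptions:

  CREATE SINK events_sink FROM events_mv
  WITH (
      connector = 'iceberg',
      type = 'upsert',                              -- or 'append-only'
      primary_key = 'id',                           -- required for upsert sinks
      warehouse.path = 's3://my-bucket/warehouse',  -- hypothetical bucket
      database.name = 'db',
      table.name = 'events'
  );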

This enables a unified architecture where RisingWave handles the real-time processing and serving layer, while Iceberg provides the reliable, scalable, and queryable long-term storage foundation on the lake, accessible by both RisingWave and other batch query engines.
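
For example, once RisingWave has materialized results into the table, any engine with access to the same catalog and warehouse can query it with plain batch SQL (hypothetical names as before):

  -- Read the RisingWave-maintained table from Spark, Trino, etc.
  SELECT event_type, COUNT(*) AS events
  FROM db.events
  GROUP BY event_type;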

Related Glossary Terms

  • Data Lakehouse
  • Streaming Lakehouse
  • Open Table Format
  • ACID Transactions
  • Schema Evolution (in Lakehouse)
  • Time Travel (in Lakehouse)
  • Data Lake Partitioning
  • Cloud Object Storage
  • Apache Hudi
  • Delta Lake
  • Manifest File / Manifest List (Conceptually)
  • Snapshot (Conceptually)