Apache Iceberg has become the de facto standard for open table formats by providing ACID transactions, time travel, and schema evolution. However, the very mechanics that enable these features, namely snapshots and separate delete files, introduce significant performance overhead over time. For a streaming database like RisingWave, which writes data continuously, this overhead accumulates rapidly. To address it, we designed a dedicated compaction architecture that solves the "small file" problem and metadata bloat. This post explores our engineering approach to compaction, how it powers write modes such as Copy-on-Write, and the benchmark results that validate this architectural choice.
The Cost of Streaming Writes
To understand why compaction is necessary, we have to look at how Iceberg handles updates. Every time data is written or rewritten, Iceberg generates a new Snapshot. While this architecture enables time-travel queries, it creates specific problems for long-running streaming jobs.
First, there is Metadata Bloat. As snapshots accumulate, the table's metadata grows, increasing the overhead of query planning. Second, there is Fragmentation. Streaming data often arrives in micro-batches, and each commit can be split across partitions and parallel writers, so writing 100MB of data across 10 commits results in dozens of small files rather than a few optimized Parquet files.
The situation gets more complex with deletes. Iceberg supports row-level deletes by writing "Delete Files" (position or equality deletes) rather than rewriting the data immediately. When an external engine queries the table, it must read both the original data file and the delete file to filter out removed rows. This is Read Amplification: as delete files accumulate, read performance degrades significantly.
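To make read amplification concrete, the Rust sketch below shows the merge that every reader must perform under this scheme. The struct and field names are illustrative placeholders, not Iceberg's actual metadata structures.

```rust
use std::collections::HashSet;

// Illustrative row shape: a position within its data file, a key column
// targeted by equality deletes, and the remaining payload.
struct Row {
    position: u64,
    key: i64,
    payload: String,
}

/// Filter one data file's rows against the delete files that reference it.
/// Every accumulated delete file widens these sets and slows the scan down.
fn apply_deletes(
    data_rows: Vec<Row>,
    position_deletes: &HashSet<u64>, // row positions marked deleted in this file
    equality_deletes: &HashSet<i64>, // keys deleted anywhere in the table
) -> Vec<Row> {
    data_rows
        .into_iter()
        .filter(|row| !position_deletes.contains(&row.position))
        .filter(|row| !equality_deletes.contains(&row.key))
        .collect()
}
```

Compaction removes this per-query cost by folding the delete files back into the data files ahead of time, as described below.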
Integrating the Engine
We chose to build our compaction logic by integrating DataFusion, a Rust-based query execution framework, directly into the RisingWave codebase, deployed as a specialized compaction worker. This integration allows us to leverage vectorized execution while maintaining tight control over resource scheduling and storage I/O.
The implementation operates on a continuous cycle: identifying fragmented files or expired snapshots, merging them in memory, and writing optimized files back to storage.
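The sketch below outlines that cycle. The `Catalog` trait and its methods are hypothetical placeholders used purely to illustrate the control flow; they are not RisingWave's internal interfaces or the iceberg-rust API.

```rust
use std::time::Duration;

struct DataFile {
    path: String,
    size_bytes: u64,
    delete_files: usize, // number of delete files referencing this data file
}

// Hypothetical table-maintenance interface, for illustration only.
trait Catalog {
    fn list_data_files(&self) -> Vec<DataFile>;
    fn rewrite(&self, inputs: &[DataFile]) -> DataFile; // merge and commit a new snapshot
    fn expire_snapshots(&self, max_age: Duration);      // prune old metadata
}

const SMALL_FILE_THRESHOLD: u64 = 64 * 1024 * 1024; // 64 MB, an assumed cutoff

fn maintenance_cycle(catalog: &impl Catalog, retention: Duration) {
    // 1. Identify fragmented files or files burdened by delete files.
    let candidates: Vec<DataFile> = catalog
        .list_data_files()
        .into_iter()
        .filter(|f| f.size_bytes < SMALL_FILE_THRESHOLD || f.delete_files > 0)
        .collect();

    // 2. Merge them in memory and write optimized files back to storage.
    if !candidates.is_empty() {
        let _optimized = catalog.rewrite(&candidates);
    }

    // 3. Drop snapshots that fall outside the retention window.
    catalog.expire_snapshots(retention);
}
```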

We utilize three distinct maintenance strategies:
Snapshot Expiration: The first line of defense. The system prunes obsolete snapshots based on retention settings (for example, keeping only the most recent days or versions), freeing up storage and shrinking the search space for metadata operations.

Bin-packing: To solve fragmentation, the engine identifies small data files (e.g., multiple 32MB chunks) and combines them into a single file of the target size. This reduces the number of file handles required during scans (a simplified sketch follows below).

Delete File Compaction: The most critical operation for read performance. The engine merges delete files with their corresponding data files, physically removing the deleted rows. While this causes write amplification, it eliminates read amplification for downstream consumers.
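
To make the bin-packing step concrete, here is a minimal first-fit-decreasing planner in Rust. It is a sketch of the general technique rather than RisingWave's actual planning code, and the 512 MB target mirrors the benchmark setting later in this post.

```rust
struct FileMeta {
    path: String,
    size_bytes: u64,
}

const TARGET_FILE_SIZE: u64 = 512 * 1024 * 1024; // assumed 512 MB target

/// Group small files into batches whose combined size approaches the target,
/// so that each batch can be rewritten as one large Parquet file.
fn plan_bins(mut files: Vec<FileMeta>) -> Vec<Vec<FileMeta>> {
    // Largest-first ordering makes the greedy fill tighter.
    files.sort_by(|a, b| b.size_bytes.cmp(&a.size_bytes));

    let mut bins: Vec<Vec<FileMeta>> = Vec::new();
    let mut bin_sizes: Vec<u64> = Vec::new();

    for file in files {
        // Place the file into the first bin that still has room for it.
        if let Some(i) = bin_sizes
            .iter()
            .position(|used| used + file.size_bytes <= TARGET_FILE_SIZE)
        {
            bin_sizes[i] += file.size_bytes;
            bins[i].push(file);
        } else {
            bin_sizes.push(file.size_bytes);
            bins.push(vec![file]);
        }
    }
    bins
}
```

Each resulting bin is then rewritten as a single large file, which is what cuts down the number of file handles a scan has to open.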

Write Modes: Merge-on-Read vs. Copy-on-Write
This integrated architecture enables different trade-offs between write latency and read performance, allowing users to choose between two specific write modes.
In the default Merge-on-Read (MoR) mode, RisingWave prioritizes ingestion speed. It appends data and delete files quickly, accepting that the table will become fragmented. The compaction engine runs in the background to clean up this "mess" eventually.
For scenarios where read performance is paramount, we support Copy-on-Write (CoW). In this mode, we guarantee that the table exposed to readers is always clean, meaning it contains no delete files and consists of optimized Parquet chunks. We achieve this via a Two-Branch Architecture (Ingestion Branch vs. Main Branch). The writer appends to the ingestion branch, while the compactor runs in the background, atomically committing clean snapshots to the main branch. This ensures the writer is never blocked.
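The single-process sketch below models only this coordination pattern, with hypothetical in-memory types standing in for real Iceberg branch commits; it is meant to show why the writer never waits on the compactor, not how the branches are actually committed.

```rust
use std::sync::Mutex;

#[derive(Default)]
struct Table {
    // Ingestion branch: raw micro-batch output, including delete files.
    ingestion: Mutex<Vec<String>>,
    // Main branch: the clean view readers query; large files, no delete files.
    main: Mutex<Vec<String>>,
}

impl Table {
    /// Writer path: append to the ingestion branch and return immediately.
    fn append(&self, file: String) {
        self.ingestion.lock().unwrap().push(file);
    }

    /// Compactor path: drain the ingestion branch, rewrite its contents into
    /// optimized files, and publish the result to the main branch in one step.
    fn compact(&self) {
        let pending: Vec<String> = self.ingestion.lock().unwrap().drain(..).collect();
        if pending.is_empty() {
            return;
        }
        let optimized = format!("optimized-file({} inputs merged)", pending.len());
        self.main.lock().unwrap().push(optimized);
    }
}
```

Because the writer only ever touches the ingestion branch, compaction latency stays off the ingestion path, and readers of the main branch never observe a half-compacted state.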

In RisingWave, you can use the following SQL statement to create an Iceberg table with the Copy-on-Write mode.
```sql
create table t_cow(a int primary key, b int)
with(
    commit_checkpoint_interval = 1,
    snapshot_expiration_max_age_millis = 0,
    write_mode = 'copy-on-write'
) engine = iceberg;
```
Benchmark: Architectural Efficiency
Building an integrated compaction engine rather than relying on external jobs was a strategic architectural decision. To validate the efficiency of this design, we benchmarked the RisingWave compaction engine against a standard Apache Spark setup.
For the detailed benchmark numbers and results, please see here.
Test Environment
To ensure a fair comparison, both engines were executed on identical hardware.
- Hardware: AWS m5.4xlarge (16 vCPU, 64GB RAM).
- Storage: General Purpose SSD (EBS).
- Spark Config: Tuned for the instance size (40GB executor memory) to prevent premature OOMs. The detailed config is as follows:
```
spark.executor.memory = 40g
spark.executor.memoryOverhead = 8g
spark.driver.memory = 12g
spark.memory.fraction = 0.8
spark.memory.storageFraction = 0.2
spark.sql.shuffle.partitions = 1000
```
- RisingWave Config: Execution Parallelism: 64, Output Parallelism: 64
Scenario 1: Bin-packing (Small File Compaction)
The Objective: Merge ~17,000 small, fragmented files into a few large, optimized files (no deletes involved).
- Dataset: ~193 GB (17,358 files).
- Target: 512 MB file size.
| Metric | RisingWave (Iceberg Compaction Engine) | Apache Spark | Difference |
| --- | --- | --- | --- |
| Duration | 277 seconds (~4.6 min) | 1,533 seconds (~25.5 min) | ~5.5x Faster |
| Input Files | 17,358 | 17,358 | - |
| Est. Compute Cost | ~$0.06 | ~$0.33 | ~82% Cheaper |
The Analysis:
The 5.5x speedup highlights the "Distributed Tax." Spark incurs significant overhead for JVM startup and task coordination, whereas RisingWave’s embedded engine utilizes the hardware immediately. We observed similar results with ZSTD-compressed data (~5.2x speedup), confirming that framework overhead is the dominant bottleneck.
Scenario 2: High-Complexity Compaction (Deletes)
The Objective: Test the Copy-on-Write capability. The engine must load massive metadata (position & equality deletes), filter rows, and rewrite data.
- Input: 60,000 total files (20k data + 20k position deletes + 20k equality deletes).
| Scenario | RisingWave Result | Apache Spark Result |
| --- | --- | --- |
| High Entropy (40k Delete Files) | SUCCESS (490s) | FAILED (Out of Memory) |
| Massive Metadata (20GB Equality Deletes) | SUCCESS | FAILED (Out of Memory) |
The Analysis:
We attempted to run this workload on Spark using the same hardware configuration, but the job failed to complete, terminating with Out-of-Memory (OOM) errors.
This result demonstrates that RisingWave's Rust-based architecture is significantly more memory-efficient: it processed the same massive metadata load on a standard mid-sized node, showing that Copy-on-Write is production-ready.
Operational Simplicity
During the tests, RisingWave used the 16 cores of the m5.4xlarge instance efficiently (averaging roughly 10 active cores) with a stable memory footprint (~35% utilization).
This efficiency means users do not need to manage a separate, complex Spark cluster for maintenance. A dedicated compaction worker within the RisingWave cluster can handle heavy workloads with a significantly lower hardware footprint.
Summary
Iceberg compaction in RisingWave is designed to be a "set it and forget it" maintenance system. By unifying stream processing with advanced table maintenance, we have built an engine that automatically handles the entropy of streaming data. It converts fragmented, delete-heavy tables into optimized, query-ready datasets with a fraction of the hardware footprint required by traditional batch engines.
See the complete benchmark report here.

