Data Lake Compaction

Data Lake Compaction is an essential maintenance and optimization process performed on tables within Data Lakes and Data Lakehouses, particularly those managed by Open Table Formats like Apache Iceberg, Delta Lake, and Apache Hudi. It involves rewriting multiple small data files within the table into fewer, larger data files.

The primary goal of compaction is to mitigate the "small files problem" often encountered in big data systems, thereby improving query performance and reducing metadata management overhead.

The Problem: Small Files in Data Lakes

Streaming ingestion, frequent updates, or certain batch processing patterns can lead to the accumulation of numerous small files in data lake tables. This causes several issues:

  1. Query Performance Degradation:
    • Metadata Overhead: Query engines need to track, open, and manage metadata for every file, and a large number of files significantly increases this overhead. Open table formats reduce the cost with file-level metadata, but excessive file counts still slow down query planning.
    • Read Throughput: Reading many small files from cloud object storage often involves more latency (opening connections, initial requests) compared to reading the same amount of data from fewer large files. Filesystem listings (if applicable) also become slow.
    • Reduced Parallelism Benefits: Processing frameworks often parallelize work at the file level. Too many small files can lead to inefficient task scheduling and underutilization of resources.
  2. Metadata Management Issues: Open table formats like Iceberg maintain metadata listing all data files. An excessive number of files bloats this metadata, making snapshot management and query planning less efficient.
  3. Storage Inefficiency (Minor): While less significant, some storage systems might have minimum object charges or slight overhead per object.

How Compaction Works

Compaction processes typically perform the following steps, often as a background job (a simplified sketch follows the list):

  1. Identify Files for Compaction: The process selects a set of small files within a table or partition based on predefined criteria (e.g., file size below a threshold, number of files in a partition).
  2. Read Data: It reads the data from the selected small files.
  3. Rewrite Data: It writes the combined data into one or more new, larger files (aiming for an optimal target file size, e.g., 256MB, 512MB, 1GB).
  4. Update Table Metadata: Critically, the process updates the table's metadata (within the rules of the specific Open Table Format) to atomically:
    • Remove references to the original small files.
    • Add references to the new, larger files. This atomic metadata update ensures that concurrent readers are never affected: they always see a consistent view of the table, either entirely before or entirely after the compaction commit.
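
To make the flow concrete, here is a minimal, format-agnostic Python sketch of the four steps. The DataFile record, the size thresholds, and the read_rows/write_rows/commit_swap helpers are hypothetical placeholders rather than APIs of any specific table format; real implementations delegate these steps to the format's libraries and the engine running the job.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    size_bytes: int

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # files below ~128 MB are candidates (assumed)
TARGET_FILE_SIZE = 512 * 1024 * 1024      # aim for ~512 MB output files (assumed)

def read_rows(path: str) -> list[dict]:
    """Placeholder: read one data file (e.g., a Parquet file) into rows."""
    return []

def write_rows(rows: list[dict], target_size_bytes: int) -> list[str]:
    """Placeholder: write rows into one or more files of roughly the target size."""
    return ["compacted-00000.parquet"]

def commit_swap(removed: list[str], added: list[str]) -> None:
    """Placeholder: the atomic metadata commit that swaps file references."""
    print(f"commit: removed {len(removed)} files, added {len(added)} files")

def compact(files: list[DataFile]) -> None:
    # 1. Identify files for compaction (here: anything under the size threshold)
    small = [f for f in files if f.size_bytes < SMALL_FILE_THRESHOLD]
    if len(small) < 2:
        return  # nothing worth rewriting
    # 2. Read the data from the selected small files
    rows = [row for f in small for row in read_rows(f.path)]
    # 3. Rewrite the combined data into fewer, larger files
    new_paths = write_rows(rows, TARGET_FILE_SIZE)
    # 4. Atomically update table metadata: drop old references, add new ones
    commit_swap(removed=[f.path for f in small], added=new_paths)

compact([DataFile(f"part-{i}.parquet", 8 * 1024 * 1024) for i in range(10)])
```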

Compaction Strategies:

  • Bin Packing: A simple strategy that combines files to reach the target size without necessarily ordering the data within the new files (a greedy grouping sketch follows this list).
  • Sort-Based Compaction: Combines small files and sorts the data within the new, larger files by specified columns. This can further improve query performance when queries frequently filter or group by the sorted columns (better data skipping and predicate pushdown), but it typically requires more resources during the compaction job.
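
As an illustration of the bin-packing idea (not any particular engine's planner), the greedy sketch below groups candidate file sizes into bins of roughly the target size; each resulting bin corresponds to one rewritten output file.

```python
def bin_pack(file_sizes: list[int], target: int) -> list[list[int]]:
    """Greedily group file sizes into bins of roughly `target` bytes."""
    bins: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        # close the current bin once adding another file would exceed the target
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Eight ~60 MB files packed toward a 256 MB target become two output groups.
MB = 1024 * 1024
print(bin_pack([60 * MB] * 8, target=256 * MB))
```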

Triggering Compaction:

Compaction can be triggered manually, scheduled periodically (e.g., nightly), or sometimes automatically by the system based on heuristics.
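
For the automatic case, a system might apply a simple heuristic such as the toy check below: trigger compaction for a partition only once it has accumulated enough small files to justify the rewrite. The thresholds and function name are illustrative assumptions, not any product's actual policy.

```python
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # bytes; "small" cut-off (assumed)
MIN_SMALL_FILES = 20                      # minimum backlog before compacting (assumed)

def should_compact(file_sizes: list[int]) -> bool:
    """Return True when a partition has accumulated enough small files."""
    small_count = sum(1 for size in file_sizes if size < SMALL_FILE_THRESHOLD)
    return small_count >= MIN_SMALL_FILES

# Example: 25 files of ~10 MB each would trigger a compaction run.
print(should_compact([10 * 1024 * 1024] * 25))  # True
```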

Key Benefits

  • Improved Query Performance: Significantly reduces metadata overhead and optimizes read operations, leading to faster queries.
  • Efficient Metadata Management: Keeps the table metadata (manifests in Iceberg) leaner and more manageable.
  • Better Resource Utilization: Enables query engines and processing frameworks to work more efficiently with larger files.
  • Stable Table Performance: Prevents gradual performance degradation due to the accumulation of small files.

Compaction in Open Table Formats

  • Apache Iceberg: Provides actions and APIs for compaction (e.g., 'rewriteDataFiles'). Users typically run these actions using engines like Spark or Flink (see the example after this list). Iceberg's atomic metadata commits ensure compaction doesn't interfere with reads/writes.
  • Delta Lake: Has an 'OPTIMIZE' command (often with 'ZORDER' for sorting) that performs compaction.
  • Apache Hudi: Compaction is an integral part of Hudi, especially for Merge-on-Read tables, where it merges delta log files into new base files. Hudi offers different scheduling and execution options for compaction.
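
For example, both Iceberg's rewrite procedure and Delta Lake's OPTIMIZE command can be invoked from a Spark session. The sketch below assumes a Spark environment already configured with the Iceberg and Delta Lake extensions and an Iceberg catalog named 'my_catalog'; catalog, table, and column names are placeholders, and exact procedure options vary by version.

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with the Iceberg and Delta Lake extensions;
# all catalog/table/column names below are placeholders.
spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Apache Iceberg: bin-pack small data files via the rewrite_data_files procedure.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack'
    )
""")

# Delta Lake: OPTIMIZE compacts small files; ZORDER additionally clusters rows
# by the given column to improve data skipping.
spark.sql("OPTIMIZE db.events_delta ZORDER BY (event_time)")
```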

Compaction in Streaming Lakehouses (RisingWave Context)

When using RisingWave to sink data into an Iceberg table in a Streaming Lakehouse architecture, compaction becomes important:

  • Streaming Sinks & Small Files: Streaming sinks writing frequent, small batches of data can inherently create small files in the target Iceberg table.
  • Coordination: Compaction jobs need to run alongside the continuous writes from the streaming sink. Open Table Formats like Iceberg handle this concurrency safely through their atomic metadata operations.
  • RisingWave's Role: RisingWave itself doesn't typically perform compaction on the target Iceberg table; that is usually handled by a separate batch job using Spark or Flink. However, the data written by RisingWave's Iceberg sink benefits directly from periodic compaction of the lakehouse table. Maintaining the health of the Iceberg table through compaction ensures that downstream queries (whether from RisingWave reading its own output or from other batch engines) remain performant. A sketch of such a maintenance job follows.
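
A periodic maintenance job for such a table might look like the sketch below, which uses Iceberg's rewrite_data_files procedure with a where predicate so only recently written partitions are rewritten. The catalog name 'lakehouse', the table 'analytics.rw_sink_events', the partition column, and the cutoff value are assumptions about the deployment, not RisingWave or Iceberg defaults, and the available procedure options depend on the Iceberg version in use.

```python
from pyspark.sql import SparkSession

# Hypothetical nightly maintenance job for an Iceberg table that a RisingWave
# sink writes to continuously; catalog/table/partition names are placeholders.
spark = SparkSession.builder.appName("rw-sink-compaction").getOrCreate()

spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'analytics.rw_sink_events',
        -- limit the rewrite to partitions the sink touched recently
        where => 'event_date_int >= 20240101'
    )
""")
```

In practice a job like this is scheduled by an external orchestrator while the RisingWave sink keeps writing; Iceberg's atomic commits keep the two from conflicting.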

Related Glossary Terms

  • Small Files Problem (Concept)
  • Open Table Format
  • Apache Iceberg / Hudi / Delta Lake
  • Data Lakehouse
  • Streaming Lakehouse
  • Cloud Object Storage
  • Query Performance (Concept)
  • Merge-on-Read (MoR) / Copy-on-Write (CoW) (Hudi Concepts)