Blazing Fast Iceberg Queries with Apache DataFusion in RisingWave

Introduction

Iceberg makes the lakehouse look pleasantly “clean”: the table format is open, the data lives in object storage, and in principle you can swap query engines without paying a huge migration tax or getting locked into a vendor. The catch is that once you scale to 100GB, 1TB, or more, and your queries start hitting S3 directly, what determines the experience is usually not “can I switch engines,” but “can the engine I’m using actually handle this.” Latency is real, Parquet files do not magically organize themselves, and the combination of scans, projections, join blow-ups, plus all the internal data-format shuffling inside an execution engine can turn “I just want to run a SQL query” into a tug-of-war over resources.

For a long time, RisingWave focused more on the ingestion and transformation side of the lakehouse: reliably writing incremental data into Iceberg and keeping processing pipelines durable and continuous. But as more users started running analytical queries directly on Iceberg, a very practical expectation became hard to ignore: people want to get as much done as possible in a single system, rather than ingesting and transforming in RisingWave and then switching to another engine for analysis. That put the problem squarely on us. RisingWave needed a batch execution engine that could really deliver, not something that merely “works.”

That is the most important change in RisingWave 2.8: we replaced our batch execution engine with Apache DataFusion. DataFusion is a Rust query engine built around Apache Arrow as its core in-memory data model, and it is particularly good at running OLAP-heavy workloads fast through vectorized execution of scans, joins, and aggregations. In this post, we use 100GB TPC-H (Iceberg tables stored on S3) as the benchmark to explain why RisingWave chose DataFusion, what changed in performance and behavior, and what is actually happening behind the OOM cases we saw on a handful of join-heavy queries.

What role does DataFusion play inside RisingWave?

When people hear “RisingWave uses DataFusion,” the first assumption is often that we outsourced the entire batch execution layer to DataFusion. We deliberately did not. Inside RisingWave, DataFusion is more like a very strong OLAP core. We let it do what it does best: run vectorized execution on Arrow and push heavy operators like scans, joins, and aggregations as fast as possible. Meanwhile, SQL semantics, plan generation, and the overall optimization framework remain driven by RisingWave, because we do not want the system to grow a second SQL world that is “almost the same but not quite.” Iceberg reads go through the Rust Iceberg SDK, which produces Arrow RecordBatches, and we try to feed that data stream into DataFusion as directly as possible, shortening the path and avoiding pointless copying.
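
To make that handoff concrete, here is a minimal, self-contained sketch of feeding Arrow RecordBatches straight into DataFusion. This is illustrative only, not RisingWave's integration code: the batch below stands in for what an Iceberg scan would produce, and the table and column names are made up.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Stand-in for batches produced by an Iceberg scan.
    let schema = Arc::new(Schema::new(vec![
        Field::new("l_orderkey", DataType::Int64, false),
        Field::new("l_returnflag", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1, 2, 3])),
        Arc::new(StringArray::from(vec!["A", "N", "R"])),
    ];
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    // No format conversion: the Arrow batches are registered and queried as-is.
    let ctx = SessionContext::new();
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    ctx.register_table("lineitem", Arc::new(table))?;

    ctx.sql("SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag")
        .await?
        .show()
        .await?;
    Ok(())
}
```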

Of course, not everything can run end-to-end inside DataFusion yet. Some expressions still fall back to RisingWave’s expression framework, which is more streaming-oriented. That means on those paths, data needs to be converted between Arrow RecordBatch and RisingWave’s DataChunk. We have no romantic view of that conversion. It is overhead, it is a performance tax. But compared to the alternative of splitting SQL semantics into two different systems “for speed,” we would rather pay this tax for now, because one of RisingWave’s core promises is consistency: the same SQL should produce the same results whether it runs on real-time streams or on historical data in Iceberg.

This is where streaming plus batch becomes genuinely difficult. Building two engines with two semantics is easy, and many systems do exactly that, which leaves users constantly explaining “why the same SQL behaves differently here and there.” We do not want to go down that path. The strategy today is straightforward: RisingWave guarantees semantics and planning, DataFusion provides the batch execution foundation, and we steadily push more expressions and execution down into DataFusion over time, gradually eliminating unnecessary format conversions and moving the architecture from “works well” to “works well, and cleanly.”

Benchmark: 100GB TPC-H, with data on S3

We ran this benchmark on a 100GB TPC-H Iceberg dataset. All tables are stored on S3, and compute runs in the same region. That choice is intentional. We did not want this to turn into a “who is faster on a local SSD” contest, because in reality many Iceberg pain points come from object storage latency, throughput, and file layout, not from the peak bandwidth of sequential disk reads.

On the RisingWave side, we used 16 vCPUs. For a roughly comparable reference point, we also ran DuckDB with a similar resource budget, and explicitly turned off its external file cache so the comparison would not become “who caches better locally.” We used a storage-backed Iceberg catalog. All 22 standard TPC-H queries were warmed up before timing, mainly to reduce first-run noise and bring the measurements closer to how people actually run recurring reports in production.
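
For the warm-up-then-time methodology, a harness in this spirit might look like the sketch below. It is a hypothetical example, not our benchmark code: it assumes RisingWave's PostgreSQL-compatible endpoint on the default port 4566 and the tokio-postgres crate, and load_tpch_queries is a placeholder for loading the 22 query texts.

```rust
use std::time::Instant;
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // RisingWave speaks the PostgreSQL wire protocol; these connection
    // details are defaults, adjust for your deployment.
    let (client, conn) =
        tokio_postgres::connect("host=localhost port=4566 user=root dbname=dev", NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = conn.await {
            eprintln!("connection error: {e}");
        }
    });

    for (name, sql) in load_tpch_queries() {
        client.simple_query(sql).await?; // warm-up run, result discarded
        let start = Instant::now();
        client.simple_query(sql).await?; // timed run
        println!("{name}: {:?}", start.elapsed());
    }
    Ok(())
}

// Placeholder: in practice, read the 22 TPC-H query files from disk.
fn load_tpch_queries() -> Vec<(&'static str, &'static str)> {
    vec![("q1", "SELECT 1")]
}
```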

Overall, after switching to DataFusion, RisingWave’s batch query performance moved into a noticeably higher tier than before, with many queries dropping significantly in runtime. That said, the story is not purely positive: on a handful of join-heavy queries (Q3, Q5, Q18, Q21), we hit OOM under the current memory budget. From an engineering perspective, this is not surprising. Replacing the execution core with DataFusion solves a large part of the speed problem, but turning it into a robust cloud OLAP experience at larger scale requires stronger memory resilience and spillable execution paths.

Why did performance improve so much?

Saying “DataFusion is faster” does not really explain the magnitude of what we observed. The real change is that the execution path was rewritten, especially where Iceberg scan output enters the execution engine and how much “format hauling” we pay along the way.

In RisingWave’s default engine, the Rust Iceberg SDK outputs arrow_array::RecordBatch, while RisingWave’s internal batch operators consume DataChunk. Two different in-memory representations mean you need a conversion step in the middle, and in the worst case that conversion degenerates into row-by-row copying. On small datasets this is just wasted CPU. Once you reach 100GB scale, it can easily become a dominant cost: you are no longer spending cycles on joins and aggregations, you are spending them repeatedly reshaping the same data.
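
As an illustration of that tax, the sketch below converts a columnar RecordBatch into row-oriented values. The toy Row type stands in for the worst-case path the text describes; RisingWave's actual DataChunk is its own columnar format and is more involved. The point is the shape of the work: per-row downcasts and copies before any join or aggregation has even started.

```rust
use arrow_array::{Array, Int64Array, RecordBatch, StringArray};

// Toy stand-in for a different internal representation.
struct Row {
    orderkey: i64,
    returnflag: String,
}

fn batch_to_rows(batch: &RecordBatch) -> Vec<Row> {
    let keys = batch
        .column(0)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("int64 column");
    let flags = batch
        .column(1)
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("utf8 column");

    // Row-by-row copying: every value is touched and reallocated once,
    // purely to change representation, not to compute anything.
    (0..batch.num_rows())
        .map(|i| Row {
            orderkey: keys.value(i),
            returnflag: flags.value(i).to_string(),
        })
        .collect()
}
```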

The first immediate win from DataFusion is that Arrow RecordBatch is its native in-memory format. Whatever the Iceberg SDK produces, DataFusion can consume directly. That removes the heavy conversion chain between scan and execution operators. It looks like “one fewer step,” but for cloud Iceberg workloads this step is often a permanent tax. Once it is gone, CPU starts doing the work you actually care about, such as joins and aggregations.

The second win comes from the execution model itself. DataFusion’s batch execution is columnar and vectorized. Many of its core operators run tight loops over batches of data, reducing per-row interpreter overhead and making it easier to take advantage of CPU cache behavior and SIMD. For scan-heavy and aggregation-heavy queries, this difference is direct and structural. It is not a tuning trick. It is simply a better fit for OLAP workloads.
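
To make “vectorized” concrete, here is what a columnar kernel looks like from the outside, using the arrow crate’s aggregate kernel rather than DataFusion internals: one call processes an entire column in a tight loop instead of interpreting row by row.

```rust
use arrow::array::Int64Array;
use arrow::compute::sum;

fn main() {
    let prices = Int64Array::from(vec![100, 250, 75, 310]);

    // One call sums the whole column; the kernel's inner loop is
    // branch-light and amenable to SIMD and cache-friendly access.
    let total = sum(&prices);
    assert_eq!(total, Some(735));
}
```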

The last piece is operator quality for joins and aggregations. TPC-H includes a lot of join-heavy queries, and join efficiency often determines your overall throughput. DataFusion’s hash join and hash aggregation implementations are more mature, and when the working set fits in memory, execution can be extremely efficient. Put together, the improvement is the combined effect of a cleaner scan path, a more vectorized execution model, and faster heavy operators. That is the core reason the performance shift is so noticeable.
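
Since join memory behavior comes up again below, it helps to see the basic shape of a hash join. The following is a deliberately simplified toy, not DataFusion’s implementation (which is partitioned and vectorized): it shows that the build side is fully materialized into an in-memory hash table before probing, which is exactly where memory pressure concentrates.

```rust
use std::collections::HashMap;

fn hash_join(build: &[(i64, String)], probe: &[(i64, i64)]) -> Vec<(i64, String, i64)> {
    // Build phase: the entire build side lives in memory. If it is too
    // large (or a join-order mistake upstream inflated it), this is where
    // the memory budget gets hit.
    let mut table: HashMap<i64, Vec<&String>> = HashMap::new();
    for (key, payload) in build {
        table.entry(*key).or_default().push(payload);
    }

    // Probe phase: stream the probe side and emit matching rows.
    let mut out = Vec::new();
    for (key, qty) in probe {
        if let Some(matches) = table.get(key) {
            for payload in matches {
                out.push((*key, (*payload).clone(), *qty));
            }
        }
    }
    out
}
```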

OOM: why did it happen, and what does it mean?

It is easy to talk about performance improvements. It would be dishonest to hand-wave the OOM cases. Under our current resource configuration, a few queries did hit the memory ceiling, and that exposes a clear reality: DataFusion today still leans toward “keep intermediate state in memory and finish the job.” When join or aggregation intermediates grow beyond the memory budget, graceful degradation mechanisms are not mature enough yet, especially spill-to-disk paths that remain stable under pressure. This creates a classic engineering tradeoff: if the working set fits in memory, DataFusion can be very fast; if it does not, it usually does not slow down and limp through, it simply fails.
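
DataFusion does expose an explicit memory budget that turns unbounded growth into a known limit and lets operators with spill support (such as sorts) fall back to disk. A sketch is below; the exact names (RuntimeConfig, with_memory_limit, new_with_config_rt) have shifted across DataFusion releases, so treat this as an assumption about a recent version, not RisingWave’s actual configuration.

```rust
use std::sync::Arc;

use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::{SessionConfig, SessionContext};

fn context_with_budget() -> datafusion::error::Result<SessionContext> {
    // Cap execution memory at 4 GiB: operators that support spilling can
    // fall back to disk, others report a resource error at a known limit
    // instead of exhausting the process.
    let runtime = RuntimeEnv::new(
        RuntimeConfig::new().with_memory_limit(4 * 1024 * 1024 * 1024, 1.0),
    )?;
    Ok(SessionContext::new_with_config_rt(
        SessionConfig::new(),
        Arc::new(runtime),
    ))
}
```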

There is also a subtle but important factor: planning. RisingWave generates the logical plan and performs the main optimizations, and DataFusion applies additional optimizations on the execution side. With that division of responsibilities, memory usage becomes extremely sensitive to join order. In multi-way joins, a suboptimal join order can amplify intermediate results early, forcing later operators to hold much larger hash tables or aggregation states and pushing queries that might otherwise succeed into OOM. For example, joining two large fact tables before applying a selective dimension filter can produce an intermediate result orders of magnitude larger than the final output. We insist on RisingWave-driven planning to preserve consistent semantics between streaming and batch, but that also means our planner must become more “DataFusion-aware,” with stronger cost modeling and join reordering to avoid creating huge intermediates at the logical layer.

So we do not treat these OOM cases as “DataFusion is bad.” We see them as a very clear engineering signal: the direction is right, and the performance foundation is in place, but if we want this to be a production-ready lakehouse batch execution path, memory resilience has to catch up. Our next steps follow a few concrete directions: make joins and aggregations spillable and improve overall memory-aware behavior; make join planning more robust to reduce intermediate explosions; and push more expressions down into DataFusion so we can eliminate unnecessary conversions between Arrow and DataChunk, making the system not only faster, but also more resilient.

Strengths and limitations today: do not idealize the system

So far, this integration has already proven its value in Iceberg-on-S3 workloads that put execution fundamentals to the test. The most obvious change is the cleaner scan path: Iceberg produces Arrow, DataFusion consumes Arrow, and the execution pipeline avoids a lot of useless data reshaping. DataFusion’s columnar vectorized execution also improves CPU efficiency, especially for OLAP core operations like scans and aggregations. More importantly, we did not sacrifice RisingWave’s core goal just to adopt DataFusion: the same SQL can run against real-time streams and historical Iceberg data with consistent semantics. For users, that “one semantic layer across streaming and batch” is harder to build than raw speed, and it is worth keeping.

But if you think of this as “one perfectly unified engine,” that is too idealistic. Memory resilience is a real gap today. When join or aggregation intermediates get large, the lack of mature spill-to-disk paths can still push a small number of queries into OOM. Plan quality matters just as much. In multi-way joins, join order can make intermediates blow up and memory pressure multiply, which means our planner must remain semantics-correct while also becoming more DataFusion-aware. And there is still the format conversion tax from expression fallback: some expressions are evaluated in RisingWave’s framework, which forces Arrow–DataChunk conversion overhead. We will continue pushing this down and shrinking that surface area over time.

Conclusion

With DataFusion in RisingWave 2.8, we took a major step forward for batch queries on Iceberg over S3. We finally have a batch execution core that fits cloud lakehouse workloads much better, and the overall experience moves up a tier. At the same time, this change also makes another truth more obvious: speed is never the whole story. Long-term production reliability comes from resilience under memory pressure, spillable execution paths, and smarter, more stable planning.

Our next work is focused and practical. We will strengthen spilling and memory-aware execution so join-heavy queries do not fall apart when intermediates grow; we will make the planner more DataFusion-aware and use better join orders to avoid intermediate explosions; and we will continue expanding pushdown into DataFusion, reducing unnecessary conversions between Arrow and RisingWave’s internal formats and shaving off performance tax over time. The goal is simple: not only faster, but fast and stable.
