You Asked for a Unified Streaming-Batch Engine. Here It Is.
If you run streaming pipelines on RisingWave and batch queries on a separate engine, v2.8 changes the equation. This release adds a DataFusion-powered query engine that lets you run batch SQL directly on your Iceberg tables inside RisingWave, without spinning up Spark or Trino for ad-hoc analysis.
But that is just the headline. v2.8 also makes backfilling materialized views dramatically faster with snapshot backfill (now on by default), gives you per-job configuration so you can tune one pipeline without touching others, and introduces streaming vector search for AI workloads.
Here is what matters and why.
Query Iceberg Tables Directly with DataFusion
The problem: You use RisingWave to stream data into Apache Iceberg. But when you need to run an ad-hoc query or investigate a data issue, you switch to Spark or Trino. That is a separate cluster to maintain, a different SQL dialect to remember, and extra latency when you just want to check something.
What changed: RisingWave v2.8 embeds Apache DataFusion as its batch query engine for Iceberg tables. You can now run SELECT, aggregate, window function, and join queries on your Iceberg tables using the same RisingWave connection you already have.
This is not a toy integration. The DataFusion engine supports:
- Scalar, aggregate, and window functions
- EXPLAIN plans for query debugging
- Iceberg table statistics for query optimization
- Native decimal type handling
- Union and dedup operations
What this means for you: One engine for both streaming and batch on your lakehouse. Write a streaming pipeline that sinks to Iceberg, then query the same table for reporting or debugging, all from the same psql session. No separate Spark cluster. No context switching.
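As a sketch, assuming an Iceberg table named page_views that a RisingWave pipeline sinks into (table and column names are illustrative), an ad-hoc batch query might look like this:

```sql
-- Ad-hoc batch query on an Iceberg table, executed by the embedded
-- DataFusion engine (table and column names are illustrative)
SELECT
  user_id,
  count(*) AS views,
  rank() OVER (ORDER BY count(*) DESC) AS view_rank  -- window function
FROM page_views
GROUP BY user_id
ORDER BY views DESC
LIMIT 10;

-- Inspect the batch plan while debugging a slow query
EXPLAIN SELECT count(*) FROM page_views;
```

The same connection that manages your streaming jobs runs these queries, so there is no dialect switch between the streaming and batch sides.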
For teams building on the streaming lakehouse pattern, this eliminates an entire layer of infrastructure.
Snapshot Backfill Is Now the Default
The problem: When you create a materialized view on a table with millions of existing rows, RisingWave used to replay the entire changelog to bring the view up to date. For large tables, this could take hours and consume significant resources.
What changed: Snapshot backfill is now enabled by default. Instead of replaying history row by row, RisingWave takes a consistent snapshot of the existing data and loads it in bulk, then switches to incremental processing for new changes.
You also get new lifecycle controls:
- Rate limiting -- throttle the backfill to avoid overwhelming your cluster during peak hours
- Cancel and drop -- stop a backfill job that is taking too long without leaving behind orphaned state
- Serverless backfill -- offload the backfill workload to avoid impacting running streaming jobs
What this means for you: Creating materialized views on large existing datasets is dramatically faster. If you have been avoiding materialized views on big tables because the bootstrapping cost was too high, try again in v2.8.
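To sketch the rate-limiting control, assuming the backfill_rate_limit session variable described in the RisingWave docs (the table, view, and limit value here are illustrative):

```sql
-- Throttle backfill for materialized views created in this session
-- so bootstrapping does not overwhelm the cluster (value is illustrative)
SET backfill_rate_limit = 1000;

CREATE MATERIALIZED VIEW daily_totals AS
SELECT product_id, sum(amount) AS total
FROM orders
GROUP BY product_id;
```

Once the snapshot is loaded, the view switches to incremental processing and the rate limit no longer constrains steady-state throughput.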
Per-Job Configuration: Tune One Pipeline Without Breaking Others
The problem: RisingWave's configuration was system-level only. Changing a setting affected every streaming job in the cluster, so if one pipeline needed a different join encoding or a larger state cache, you had to make a global change.
What changed: v2.8 introduces ALTER .. SET CONFIG for individual streaming jobs. Each job can now have its own configuration overrides, visible in the new rw_streaming_job_config system table and the dashboard.
```sql
-- Tune join encoding for a specific materialized view
ALTER MATERIALIZED VIEW my_heavy_join SET CONFIG 'streaming_join_encoding_type' = 'compact';

-- Check what overrides exist
SELECT * FROM rw_streaming_job_config;
```
What this means for you: You can tune hot-path pipelines aggressively without worrying about side effects on other jobs. This is especially useful in multi-tenant or mixed-workload deployments where different pipelines have different resource profiles.
Iceberg v3 Delete Vectors and Schema Evolution
The Iceberg integration got two major upgrades that remove friction from production workflows.
Schema Evolution Without Pipeline Rebuilds
Previously, adding a column to your Iceberg sink meant dropping and recreating the pipeline. v2.8 supports schema changes for both exactly-once and non-exactly-once Iceberg sinks. Add a column in your upstream table, and the sink adapts automatically.
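A minimal sketch, assuming an upstream table named orders feeding an Iceberg sink (names are illustrative):

```sql
-- Add a column upstream; the Iceberg sink adapts to the new schema
-- instead of requiring a drop-and-recreate of the pipeline
ALTER TABLE orders ADD COLUMN discount DOUBLE PRECISION;
```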
Iceberg v3 Delete Vectors
RisingWave now reads and writes Iceberg v3 delete vectors, which mark individual rows for deletion without rewriting entire data files. This means faster compaction and more efficient reads for tables with frequent updates.
More Iceberg Improvements
- IAM role support for S3 and Glue -- use AWS assume-role instead of static credentials
- Google authentication for REST catalog -- connect to Google-managed Iceberg catalogs
- JDBC catalog with AWS assume role -- for enterprise setups using JDBC-based catalogs
- Automatic expired file cleanup -- Iceberg tables no longer accumulate orphaned data files
What this means for you: Your Iceberg pipelines are now more flexible (schema changes), more efficient (delete vectors), and easier to secure (IAM roles instead of long-lived credentials).
Streaming Vector Search for AI Workloads
The problem: You have embedding vectors flowing through your streaming pipeline, and you want to find the most similar items in real time, as each new vector arrives. Previously, you had to sink to an external vector database and query it separately.
What changed: v2.8 adds stream vector index lookup. You can now build a vector similarity index on a table and query it within a streaming pipeline.
What this means for you: If you are building RAG pipelines, recommendation engines, or real-time anomaly detection with embeddings, you can keep the entire workflow inside RisingWave. No external vector database needed for the streaming path.
Watermark TTL: Smarter State Management
The problem: Streaming jobs accumulate state over time. For event-time windowed queries, old state for windows that will never fire again wastes memory and slows down checkpointing.
What changed: Watermark TTL lets you define how long state is retained based on event-time watermarks. RisingWave automatically cleans up state for events that are older than the watermark threshold, whether the watermark is in the primary key or a value column.
What this means for you: Lower memory usage and faster checkpoints for time-windowed streaming jobs. If you have pipelines that process timestamped events (logs, IoT telemetry, clickstreams), watermark TTL keeps your state size under control without manual intervention.
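Watermarks are declared on the table or source; a sketch of an event-time pipeline whose windowed state becomes eligible for cleanup once the watermark passes (table name, columns, and interval values are illustrative):

```sql
-- Declare an event-time watermark: events more than 5 seconds behind
-- the maximum observed event_time are considered late
CREATE TABLE clicks (
  user_id INT,
  url VARCHAR,
  event_time TIMESTAMP,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) APPEND ONLY;

-- Tumbling-window aggregate; state for windows the watermark has
-- already passed can be cleaned up rather than retained forever
CREATE MATERIALIZED VIEW clicks_per_minute AS
SELECT window_start, count(*) AS clicks
FROM tumble(clicks, event_time, INTERVAL '1 minute')
GROUP BY window_start;
```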
Snowflake as a Source
You can now ingest data directly from Snowflake into RisingWave streaming pipelines. This is useful for teams that have historical or reference data in Snowflake and want to join it with real-time streams in RisingWave.
See the Snowflake source documentation for setup details.
Adaptive Parallelism
The problem: Parallelism was fixed per streaming job at creation time. If the workload changed, you had to rescale manually.
What changed: v2.8 adds adaptive parallelism strategies that automatically adjust compute resources for streaming jobs based on current workload. You also get separate parallelism controls for the backfill phase, so bootstrapping a new materialized view does not starve your production pipelines.
What this means for you: Less manual tuning. RisingWave adjusts to traffic patterns without you changing configurations every time load shifts.
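Parallelism can be switched per job with ALTER ... SET PARALLELISM; a brief sketch (the job name is illustrative):

```sql
-- Let RisingWave scale this job with the workload
ALTER MATERIALIZED VIEW my_heavy_join SET PARALLELISM = ADAPTIVE;

-- Or pin a fixed parallelism where predictability matters more
ALTER MATERIALIZED VIEW my_heavy_join SET PARALLELISM = 8;
```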
Better Observability Across the Board
Debugging streaming jobs got significantly easier in v2.8:
- Job-level CPU profiling -- see exactly how much CPU each streaming job consumes, not just node-level metrics
- Reorganized Grafana dashboards -- metrics are now grouped by component with cleaner panel layouts. Barrier metrics, streaming metrics, and alerts are separated into their own sections
- Richer diagnostics -- the diagnose output now includes view definitions, license status, hostname info, and streaming job tables
- Backfill progress tracking -- rw_ddl_progress now shows backfill type and whether serverless backfill is active
- Slow DDL notifications -- get alerted when CREATE TABLE or ALTER operations take longer than expected
What this means for you: When something goes wrong (or runs slowly), you can pinpoint the issue faster. The new CPU profiling alone is worth the upgrade if you run multiple streaming jobs on shared compute.
CDC Improvements You Will Notice
- Configurable queue sizes -- avoid JVM OOM errors by tuning debezium.max.queue.size and the new max.queue.size.in.bytes per CDC source
- Binlog offset monitoring -- track how far behind your CDC pipeline is from the upstream database
- PostgreSQL geometry type -- CDC from PostGIS-enabled databases now captures geometry columns
- Reset source command -- restart a CDC source from a clean state without dropping and recreating it
- Per-source publications -- automatically created publications are now scoped to individual sources, avoiding conflicts in shared databases
FAQ
What is the DataFusion engine in RisingWave v2.8?
DataFusion is an open-source query engine that RisingWave v2.8 embeds for running batch SQL queries on Iceberg tables. It supports scalar functions, aggregates, window functions, and EXPLAIN plans. This lets you query your lakehouse tables directly from RisingWave without a separate batch engine like Spark or Trino.
Do I need to change anything for snapshot backfill?
No. Snapshot backfill is enabled by default in v2.8. Any new materialized view you create will automatically use snapshot backfill when applicable. You can still control it with rate limiting or disable it for specific cases.
Can I upgrade from v2.7 to v2.8 without downtime?
RisingWave supports rolling upgrades between minor versions. Check the upgrade documentation for version-specific instructions and any breaking changes.
How does per-job configuration work?
Use ALTER MATERIALIZED VIEW <name> SET CONFIG '<key>' = '<value>' to override system-level settings for a specific streaming job. View all overrides with SELECT * FROM rw_streaming_job_config. Changes take effect after the next barrier.
Conclusion
RisingWave v2.8 is a significant step toward a unified streaming-batch platform:
- DataFusion engine eliminates the need for a separate batch query tool on your Iceberg lakehouse
- Snapshot backfill by default makes bootstrapping materialized views on large tables practical
- Per-job configuration lets you tune individual pipelines without global side effects
- Iceberg v3 delete vectors and schema evolution reduce friction in production lakehouse pipelines
- Streaming vector search brings AI/embedding workloads into the streaming engine
This release contains 460+ commits from the RisingWave community. For the full changelog, see the v2.8.0 release notes on GitHub.
Ready to try v2.8? Get started with RisingWave in 5 minutes. Quickstart →
Join our Slack community to ask questions and connect with other stream processing developers.

