Introduction
The data lakehouse promises one thing: to combine the flexibility of data lakes with the reliability of data warehouses. Apache Iceberg delivers on the storage side with ACID transactions, schema evolution, and time travel on open file formats. But most lakehouse architectures still rely on batch pipelines that refresh data every 15 minutes to several hours. Your "real-time lakehouse" is only as real-time as your slowest ETL job.
RisingWave closes that gap. As a streaming database, it processes data continuously and can sink results directly into Apache Iceberg tables. This means your lakehouse reflects changes within seconds of them happening upstream, not after the next scheduled batch run.
This guide walks through building a real-time lakehouse architecture using RisingWave and Apache Iceberg. You will learn how to stream data through RisingWave, transform it with SQL, and sink the results into Iceberg tables for downstream analytics. All SQL examples are tested on RisingWave v2.8.0.
Why a Real-Time Lakehouse?
Traditional lakehouse architectures use batch tools like Apache Spark or dbt to load and transform data on a schedule. This works well for historical analysis, but it creates blind spots for operational decisions:
- Dashboards lag behind reality: A 30-minute batch cycle means your supply chain dashboard misses a warehouse outage for half an hour.
- ML features go stale: Feature stores refreshed hourly feed models with outdated signals, degrading prediction accuracy.
- Alerting is slow: A batch pipeline that detects anomalies every 15 minutes is 15 minutes too late for fraud detection.
A real-time lakehouse architecture adds a streaming layer that continuously processes events and writes fresh results into your lakehouse. The batch layer still handles backfills and heavy transformations, but the streaming layer keeps the most critical data current.
Where RisingWave Fits
RisingWave acts as the streaming compute layer in this architecture. It ingests events from sources like Kafka, performs SQL transformations (aggregations, joins, windowing), and sinks the processed results directly into Apache Iceberg tables. Downstream tools like Trino, Spark, or DuckDB then query the Iceberg tables for analytics.
The data flow looks like this:
graph LR
A[Event Sources] --> B[Kafka]
B --> C[RisingWave]
C --> D[Apache Iceberg]
D --> E[Trino / Spark / DuckDB]
For a detailed overview of RisingWave's architecture and how it handles streaming data, see the RisingWave architecture documentation.
Setting Up the Streaming Pipeline
Let's build a concrete example: a real-time lakehouse for e-commerce event data. We will ingest order events into RisingWave, compute aggregations, and sink results to Iceberg.
Step 1: Create Source Tables in RisingWave
First, create a table in RisingWave to receive order events. In production, this would typically be a Kafka source, but for this tutorial we use a regular table with direct inserts:
CREATE TABLE order_events (
event_id INT,
order_id INT,
customer_id INT,
product_category VARCHAR,
amount DECIMAL,
event_type VARCHAR,
region VARCHAR,
event_time TIMESTAMP
);
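In production, the same schema would typically be declared as a Kafka-backed table instead. A hedged sketch of what that looks like; the broker address, topic name, and startup mode are placeholders for your environment:

```sql
-- Hypothetical Kafka-backed version of order_events; broker and topic
-- names are placeholders, and the message format is assumed to be JSON.
CREATE TABLE order_events_kafka (
    event_id INT,
    order_id INT,
    customer_id INT,
    product_category VARCHAR,
    amount DECIMAL,
    event_type VARCHAR,
    region VARCHAR,
    event_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'order-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```

For this tutorial, the plain table with direct inserts keeps the setup self-contained.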
Insert sample event data:
INSERT INTO order_events VALUES
(1, 1001, 201, 'electronics', 899.99, 'purchase', 'us-east', '2026-03-28 09:00:00'),
(2, 1002, 202, 'clothing', 49.99, 'purchase', 'us-west', '2026-03-28 09:05:00'),
(3, 1003, 203, 'electronics', 1299.99, 'purchase', 'eu-west', '2026-03-28 09:10:00'),
(4, 1004, 201, 'home', 199.99, 'purchase', 'us-east', '2026-03-28 09:15:00'),
(5, 1001, 201, 'electronics', 899.99, 'refund', 'us-east', '2026-03-28 09:30:00'),
(6, 1005, 204, 'clothing', 79.99, 'purchase', 'eu-west', '2026-03-28 09:35:00'),
(7, 1006, 205, 'electronics', 549.99, 'purchase', 'us-west', '2026-03-28 09:40:00'),
(8, 1007, 206, 'home', 329.99, 'purchase', 'us-east', '2026-03-28 09:45:00');
Step 2: Create Materialized Views for Aggregation
Before sinking to Iceberg, transform the raw events into useful aggregations. Create a materialized view that computes revenue by category and region, excluding refunds:
CREATE MATERIALIZED VIEW mv_category_revenue AS
SELECT
product_category,
region,
COUNT(*) AS purchase_count,
SUM(amount) AS total_revenue
FROM order_events
WHERE event_type = 'purchase'
GROUP BY product_category, region;
Verify the materialized view:
SELECT * FROM mv_category_revenue ORDER BY total_revenue DESC;
Expected output:
product_category | region | purchase_count | total_revenue
------------------+---------+----------------+---------------
electronics | eu-west | 1 | 1299.99
electronics | us-east | 1 | 899.99
electronics | us-west | 1 | 549.99
home | us-east | 2 | 529.98
clothing | eu-west | 1 | 79.99
clothing | us-west | 1 | 49.99
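To see incremental maintenance in action, insert one more purchase and re-query the view. RisingWave updates the affected aggregates in place, with no rebuild; the FLUSH ensures the write is visible to the immediately following query:

```sql
INSERT INTO order_events VALUES
(9, 1008, 207, 'clothing', 25.00, 'purchase', 'us-west', '2026-03-28 09:50:00');

-- Make sure the insert has been processed before reading the view
FLUSH;

SELECT * FROM mv_category_revenue
WHERE product_category = 'clothing' AND region = 'us-west';
```

The clothing / us-west row should now show purchase_count 2 and total_revenue 74.99.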
Step 3: Sink to Apache Iceberg
RisingWave supports sinking data directly to Apache Iceberg using the CREATE SINK command. This sends the materialized view results (or raw table data) to an Iceberg table, keeping it continuously updated.
Here is the syntax for creating an Iceberg sink:
CREATE SINK sink_category_revenue FROM mv_category_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'product_category, region',
catalog.type = 'storage',
catalog.name = 'demo',
warehouse.path = 's3://my-lakehouse/warehouse',
s3.endpoint = 'https://s3.amazonaws.com',
s3.region = 'us-east-1',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
database.name = 'analytics',
table.name = 'category_revenue'
);
Key parameters explained:
- connector = 'iceberg': Specifies the Iceberg sink connector.
- type = 'upsert': Uses upsert mode so that updated aggregation results overwrite previous values for the same key. Use 'append-only' if you only need inserts.
- primary_key: Columns that uniquely identify each row for upsert operations.
- catalog.type: The Iceberg catalog type. Options include 'storage', 'rest', 'hive', and 'glue'.
- warehouse.path: The S3 (or compatible) path where Iceberg data files are stored.
- database.name / table.name: The target Iceberg database and table.
For the complete list of Iceberg sink parameters, see the RisingWave Iceberg sink documentation.
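Aggregations are not the only thing worth sinking. To land the raw event stream as an immutable log in Iceberg, an append-only sink works; here is a hedged sketch, where force_append_only (an option documented for RisingWave sinks, verify for your version) tells RisingWave to emit only inserts from a table that could otherwise produce updates or deletes:

```sql
-- Sketch: land raw events as an append-only Iceberg event log.
CREATE SINK sink_raw_events FROM order_events
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    catalog.type = 'storage',
    catalog.name = 'demo',
    warehouse.path = 's3://my-lakehouse/warehouse',
    s3.endpoint = 'https://s3.amazonaws.com',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    database.name = 'analytics',
    table.name = 'raw_order_events'
);
```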
Using Different Iceberg Catalogs
RisingWave supports multiple Iceberg catalog types depending on your infrastructure:
REST Catalog (recommended for production):
CREATE SINK sink_to_iceberg_rest FROM mv_category_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'product_category, region',
catalog.type = 'rest',
catalog.name = 'my_catalog',
catalog.uri = 'http://iceberg-rest-catalog:8181',
database.name = 'analytics',
table.name = 'category_revenue'
);
AWS Glue Catalog:
CREATE SINK sink_to_iceberg_glue FROM mv_category_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'product_category, region',
catalog.type = 'glue',
catalog.name = 'glue_catalog',
warehouse.path = 's3://my-lakehouse/warehouse',
s3.region = 'us-east-1',
database.name = 'analytics',
table.name = 'category_revenue'
);
Architecture Patterns
Pattern 1: Streaming Aggregation to Iceberg
This is the simplest pattern. Raw events flow through RisingWave, get aggregated in a materialized view, and the aggregated results are sunk to Iceberg. Downstream query engines read the Iceberg tables.
Best for: dashboards, reporting tables, pre-aggregated metrics.
Pattern 2: Stream Enrichment and Sink
Join streaming events with reference data before sinking. For example, enrich order events with customer information:
CREATE TABLE customer_profiles (
customer_id INT,
segment VARCHAR,
lifetime_value DECIMAL
);
INSERT INTO customer_profiles VALUES
(201, 'enterprise', 15000.00),
(202, 'smb', 2500.00),
(203, 'enterprise', 22000.00),
(204, 'consumer', 800.00),
(205, 'smb', 3200.00),
(206, 'consumer', 1500.00);
CREATE MATERIALIZED VIEW mv_enriched_orders AS
SELECT
o.order_id,
o.product_category,
o.amount,
o.event_type,
o.region,
o.event_time,
c.segment AS customer_segment,
c.lifetime_value
FROM order_events o
JOIN customer_profiles c ON o.customer_id = c.customer_id;
SELECT order_id, product_category, amount, customer_segment FROM mv_enriched_orders ORDER BY order_id LIMIT 5;
Expected output:
order_id | product_category | amount | customer_segment
----------+------------------+---------+------------------
1001 | electronics | 899.99 | enterprise
1001 | electronics | 899.99 | enterprise
1002 | clothing | 49.99 | smb
1003 | electronics | 1299.99 | enterprise
1004 | home | 199.99 | enterprise
Note that order 1001 appears twice: both its purchase and refund events join to the same customer profile. This enriched data can then be sunk to Iceberg for analytics teams to query with full customer context.
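A hedged sketch of such a sink; since an order can emit both a purchase and a refund event, order_id alone is not unique, so the upsert key combines order_id and event_type (the storage-catalog and path settings mirror the earlier examples and are placeholders):

```sql
-- Sketch: sink the enriched join results to Iceberg.
CREATE SINK sink_enriched_orders FROM mv_enriched_orders
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'order_id, event_type',
    catalog.type = 'storage',
    catalog.name = 'demo',
    warehouse.path = 's3://my-lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'enriched_orders'
);
```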
Pattern 3: Multi-Sink Fan-Out
RisingWave can sink the same materialized view to multiple destinations. Send aggregated metrics to Iceberg for batch analytics, and simultaneously to PostgreSQL for application queries:
-- Sink to Iceberg for the data lake
CREATE SINK sink_metrics_iceberg FROM mv_category_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'product_category, region',
catalog.type = 'storage',
warehouse.path = 's3://my-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'category_revenue'
);
-- Sink to PostgreSQL for the application
CREATE SINK sink_metrics_pg FROM mv_category_revenue
WITH (
connector = 'jdbc',
jdbc.url = 'jdbc:postgresql://pg-host:5432/app_db',
table.name = 'category_revenue',
type = 'upsert',
primary_key = 'product_category, region'
);
Querying Your Real-Time Lakehouse
Once data lands in Iceberg, any engine that supports Iceberg can query it. Here are examples:
Trino:
SELECT product_category, SUM(total_revenue)
FROM iceberg.analytics.category_revenue
GROUP BY product_category;
DuckDB (after installing and loading the iceberg extension):
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://my-lakehouse/warehouse/analytics/category_revenue');
Apache Spark:
SELECT * FROM analytics.category_revenue WHERE region = 'us-east';
The data in these Iceberg tables updates continuously as RisingWave processes new events, giving your analytics teams access to fresh data without waiting for batch jobs. For more on querying options, check the Apache Iceberg documentation.
FAQ
What is a real-time lakehouse?
A real-time lakehouse is a data architecture that combines the open storage format of a data lake (like Apache Iceberg) with streaming ingestion that keeps data continuously fresh. Instead of relying solely on batch ETL, a streaming layer processes events in real time and writes results directly to the lakehouse.
How does RisingWave sink data to Apache Iceberg?
RisingWave uses the CREATE SINK command with connector = 'iceberg' to continuously write processed data to Iceberg tables. It supports upsert and append-only modes, multiple catalog types (REST, Glue, Hive, storage), and writes data in Parquet format to S3-compatible storage.
Can I use RisingWave with an existing Iceberg catalog?
Yes. RisingWave supports REST catalogs, AWS Glue, Hive Metastore, and storage-based catalogs. If you already have an Iceberg catalog running, point RisingWave's sink configuration to it using the appropriate catalog.type and catalog.uri parameters.
What is the latency from event ingestion to Iceberg availability?
End-to-end latency depends on your RisingWave checkpoint interval and the Iceberg commit frequency. With default settings, data typically becomes available in Iceberg within 10-30 seconds of arriving in RisingWave. You can tune the checkpoint interval for lower latency at the cost of more frequent small files.
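As a hedged sketch of that tuning (parameter names and defaults vary by RisingWave version, so verify against the docs for yours): the checkpoint cadence is a system parameter, and the Iceberg sink can batch several checkpoints into a single commit to limit small files:

```sql
-- Checkpoint on every barrier for lower end-to-end latency
-- (system parameter; confirm the name for your RisingWave version).
ALTER SYSTEM SET checkpoint_frequency = 1;

-- Conversely, in the sink's WITH clause, committing to Iceberg only
-- every N checkpoints trades latency for fewer, larger data files:
--   commit_checkpoint_interval = '10'
```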
Conclusion
Building a real-time lakehouse with RisingWave and Apache Iceberg gives you the best of both worlds: continuously fresh data with the open, queryable storage format that modern analytics tools expect. Key takeaways:
- RisingWave's CREATE SINK with the Iceberg connector streams processed data directly into Iceberg tables
- Upsert mode keeps aggregation tables current by overwriting stale rows, while append-only mode works for event logs
- Materialized views let you transform and aggregate data with SQL before it reaches the lakehouse
- Multiple catalog types (REST, Glue, Hive, storage) integrate with your existing infrastructure
- Downstream engines like Trino, Spark, and DuckDB can query the Iceberg tables immediately
For a complete walkthrough of the Iceberg sink configuration options, see the RisingWave Iceberg sink documentation.
Ready to try this yourself? RisingWave Cloud is free to start, no credit card required. Sign up ->
Join our Slack community to ask questions and connect with other stream processing developers.

