Modern digital products live and die by user behavior signals. Every page view, click, add-to-cart, and purchase carries immediate business value, but only if it’s processed while it still matters.
Teams increasingly need:
Live traffic visibility, showing what users are doing right now
Conversion funnels that update continuously
Session-level KPIs without batch lag
Open analytics storage for BI, experimentation, and AI
Low operational overhead without Kafka-heavy pipelines
Traditional batch ETL introduces minutes or hours of delay, while Kafka-based streaming stacks add infrastructure complexity that many teams don’t actually need. This is where the RisingWave Events API provides a simpler alternative.
By ingesting events directly over HTTP, processing them in real time with Postgres-style materialized views, and persisting results to Apache Iceberg, you get a fully real-time clickstream pipeline with no Kafka required.
When Kafka is the right choice (and how RisingWave fits)
Kafka still offers real, unique advantages when you need a shared, organization-wide event backbone rather than just a buffer into a single destination like RisingWave.
First, the Kafka ecosystem is enormous. Kafka Connect provides a standard framework for sources and sinks, along with proven operational patterns, and there is a deep catalog of connectors and integrations maintained across the community and vendors. This is especially useful when Kafka needs to sit in the middle of many systems, not just a single database like RisingWave.
Second, Kafka shines when you want long retention, near-permanent storage, and replay as a design primitive. With tiered storage, Kafka can separate compute from storage and offload older log segments to remote or object storage. This lets you retain far more history cost-effectively while keeping the Kafka consumption model intact, which is ideal for replaying historical data, backfills, audits, disaster recovery, and onboarding new consumers later.
Third, Kafka is the natural fit when you need strong data contracts and schema evolution across many producers and consumers. A Schema Registry gives teams a central place to store and validate schemas (commonly Avro, Protobuf, or JSON Schema), enforce compatibility rules, and let clients resolve schemas by ID at runtime. This is especially valuable for large distributed organizations where formats evolve over time.
If you do need those Kafka-native benefits, the “best of both worlds” approach is to keep Kafka as the backbone and ingest into RisingWave via the Kafka connector. RisingWave supports creating Kafka-backed sources and tables and can integrate with schema registry for formats like Avro, so you can preserve your Kafka platform investments while still getting RisingWave’s streaming SQL and materialized views downstream.
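For illustration, a Kafka-backed source with schema-registry-resolved Avro might look roughly like this. This is a minimal sketch, not part of the demo below; the topic, broker, and registry addresses are placeholders.
CREATE SOURCE user_events_kafka
WITH (
    connector = 'kafka',
    topic = 'user-events',                       -- placeholder topic name
    properties.bootstrap.server = 'kafka:9092',  -- placeholder broker address
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://schema-registry:8081'  -- placeholder Schema Registry URL
);
Downstream of a source like this, the same materialized views and Iceberg tables shown in the rest of this post apply unchanged.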
Use case and architecture overview: Real-time clickstream analytics to Iceberg
In this blog, we build a complete real-time clickstream analytics pipeline that:
Ingests clickstream events from an online platform via HTTP using the RisingWave Events API, supporting both JSON and high-throughput NDJSON batches
Models core entities such as users, sessions, devices, campaigns, and pages using RisingWave tables
Enriches events in real time through continuous joins using RisingWave materialized views
Computes live session-level KPIs and funnels, including engagement, duration, revenue, and funnel progression
Streams enriched data into Apache Iceberg using a Lakekeeper REST catalog and MinIO, keeping everything queryable from RisingWave, Spark, Trino, BI tools, or any other Iceberg-compatible query engine, since the data is stored in an open table format
All with simple, Postgres-style SQL, minimal services, and no message broker like Kafka.

Ingest clickstream data with RisingWave Events API
To show how you can use the RisingWave Events API and fit it into your architecture, we’ll walk through an end-to-end demo that is fully runnable from the RisingWave awesome-stream-processing repository.
Clone the repo and start the stack
First, clone the RisingWave awesome-stream-processing repository, navigate to the demo directory, and start the stack. The Docker Compose setup brings up the following services:
RisingWave (localhost:4566)
Events API (localhost:8000)
Lakekeeper (REST Iceberg catalog)
MinIO (S3-compatible object store)
A clickstream HTTP producer sending NDJSON batches
git clone https://github.com/risingwavelabs/awesome-stream-processing.git
cd awesome-stream-processing/07-iceberg-demos/clickstream_events_api_to_iceberg
docker compose up -d
Connect to RisingWave
Since the Docker Compose file starts a RisingWave instance at localhost:4566, use psql to connect as shown below:
psql -h localhost -p 4566 -d dev -U root
Model clickstream tables (Events API Ingestion)
Instead of Kafka topics, events are sent directly via HTTP:
POST /v1/events?name=<table>
Supports JSON and NDJSON (an example request appears after the table definitions below)
Append-only tables for high throughput
Core tables
We create the following append-only tables to optimize performance: users, devices, sessions, campaigns, page_catalog, and clickstream_events. These tables are continuously populated by the HTTP producer.
-- Users
CREATE TABLE users (
user_id BIGINT,
full_name VARCHAR,
email VARCHAR,
country VARCHAR,
signup_time TIMESTAMPTZ,
marketing_opt_in BOOLEAN,
ingested_at TIMESTAMPTZ
) WITH (appendonly = 'true');
-- Devices
CREATE TABLE devices (
device_id VARCHAR,
device_type VARCHAR,
os VARCHAR,
browser VARCHAR,
user_agent VARCHAR,
ingested_at TIMESTAMPTZ
) WITH (appendonly = 'true');
-- Sessions
CREATE TABLE sessions (
session_id VARCHAR,
user_id BIGINT,
device_id VARCHAR,
session_start TIMESTAMPTZ,
ip_address VARCHAR,
geo_city VARCHAR,
geo_region VARCHAR,
ingested_at TIMESTAMPTZ
) WITH (appendonly = 'true');
-- Campaigns
CREATE TABLE campaigns (
campaign_id VARCHAR,
source VARCHAR,
medium VARCHAR,
campaign VARCHAR,
content VARCHAR,
term VARCHAR,
ingested_at TIMESTAMPTZ
) WITH (appendonly = 'true');
-- Page catalog
CREATE TABLE page_catalog (
page_url VARCHAR,
page_category VARCHAR,
product_id VARCHAR,
product_category VARCHAR,
ingested_at TIMESTAMPTZ
) WITH (appendonly = 'true');
-- Clickstream events
CREATE TABLE clickstream_events (
event_id VARCHAR,
user_id BIGINT,
session_id VARCHAR,
event_type VARCHAR, -- page_view/click/add_to_cart/checkout_start/purchase
page_url VARCHAR,
element_id VARCHAR,
event_time TIMESTAMPTZ, -- client event time
referrer VARCHAR,
campaign_id VARCHAR,
revenue_usd DOUBLE PRECISION,
ingested_at TIMESTAMPTZ -- ingest time
) WITH (appendonly = 'true');
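These tables are populated by the demo’s HTTP producer, but any HTTP client can post batches the same way. Here is a rough sketch of what such a request might look like; the content-type header, the subset of fields, and the payload values are illustrative assumptions, and the Events API port comes from the demo stack.
# Illustrative only: post a small NDJSON batch to the clickstream_events table.
# The exact headers accepted by the Events API may differ; see the demo producer for the real client.
curl -X POST 'http://localhost:8000/v1/events?name=clickstream_events' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary '{"event_id":"e-1001","user_id":42,"session_id":"s-9001","event_type":"page_view","page_url":"/home","event_time":"2025-01-01T10:00:00Z"}
{"event_id":"e-1002","user_id":42,"session_id":"s-9001","event_type":"add_to_cart","page_url":"/product/123","element_id":"add-to-cart-btn","event_time":"2025-01-01T10:00:07Z"}'
Columns omitted from a payload simply land as NULLs, which keeps the ingestion path tolerant of sparse events.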
Materialized views: Real-time enrichment (Core logic)
MV 1 — Fully enriched clickstream stream
This materialized view joins events → users → sessions → devices → pages → campaigns. It stays continuously updated as new events arrive in the tables.
CREATE MATERIALIZED VIEW clickstream_joined_mv AS
SELECT
e.event_id,
e.event_time,
e.ingested_at,
e.event_type,
e.user_id,
u.full_name,
u.country,
u.marketing_opt_in,
s.session_id,
s.session_start,
s.geo_city,
s.geo_region,
d.device_type,
d.os,
d.browser,
e.page_url,
p.page_category,
p.product_id,
p.product_category,
e.element_id,
e.referrer,
c.source AS campaign_source,
c.medium AS campaign_medium,
c.campaign AS campaign_name,
e.revenue_usd
FROM clickstream_events e
LEFT JOIN users u ON e.user_id = u.user_id
LEFT JOIN sessions s ON e.session_id = s.session_id
LEFT JOIN devices d ON s.device_id = d.device_id
LEFT JOIN page_catalog p ON e.page_url = p.page_url
LEFT JOIN campaigns c ON e.campaign_id = c.campaign_id;
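As a quick sanity check, you can query the materialized view directly and watch newly enriched rows appear. This is an ad-hoc query, not part of the demo scripts.
SELECT event_time, event_type, full_name, device_type, page_category, campaign_source, revenue_usd
FROM clickstream_joined_mv
ORDER BY event_time DESC
LIMIT 5;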
MV 2 — Real-time session KPIs and funnels
This materialized view computes per-session metrics in real time, including first/last event time, session duration, event counts by type (page views, clicks, add-to-cart, checkout start, purchases), total revenue, and a purchase flag.
CREATE MATERIALIZED VIEW session_kpi_mv AS
SELECT
s.session_id,
s.user_id,
u.country,
d.device_type,
MIN(e.event_time) AS first_event_time,
MAX(e.event_time) AS last_event_time,
EXTRACT(EPOCH FROM (MAX(e.event_time) - MIN(e.event_time)))::BIGINT AS session_duration_seconds,
SUM(CASE WHEN e.event_type = 'page_view' THEN 1 ELSE 0 END) AS page_views,
SUM(CASE WHEN e.event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
SUM(CASE WHEN e.event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart,
SUM(CASE WHEN e.event_type = 'checkout_start' THEN 1 ELSE 0 END) AS checkout_start,
SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
COALESCE(SUM(e.revenue_usd), 0) AS revenue_usd,
CASE WHEN SUM(CASE WHEN e.event_type='purchase' THEN 1 ELSE 0 END) > 0 THEN TRUE ELSE FALSE END AS did_purchase
FROM sessions s
LEFT JOIN clickstream_events e ON e.session_id = s.session_id
LEFT JOIN users u ON s.user_id = u.user_id
LEFT JOIN devices d ON s.device_id = d.device_id
GROUP BY s.session_id, s.user_id, u.country, d.device_type;
Now, let’s run some real-time clickstream queries on the tables and materialized views we created.
Live traffic (last 5 minutes)
This query shows per-minute event volume and revenue for the last five minutes of traffic.
SELECT
date_trunc('minute', event_time) AS minute,
COUNT(*) AS events,
SUM(revenue_usd) AS revenue
FROM clickstream_events
WHERE event_time > now() - interval '5 minutes'
GROUP BY 1;
Conversion rate by device
This query calculates the session-level conversion rate for each device type based on completed purchases.
SELECT
device_type,
COUNT(*) FILTER (WHERE did_purchase) AS converted,
COUNT(*) AS sessions,
ROUND(100.0 * COUNT(*) FILTER (WHERE did_purchase) / COUNT(*), 2) AS conversion_pct
FROM session_kpi_mv
GROUP BY device_type;
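Since session_kpi_mv already counts each funnel stage per session, the overall funnel can be read with a single aggregation. A sketch of such a query (not one of the demo’s bundled queries):
SELECT
    COUNT(*) FILTER (WHERE page_views > 0)     AS sessions_viewed,
    COUNT(*) FILTER (WHERE add_to_cart > 0)    AS sessions_added_to_cart,
    COUNT(*) FILTER (WHERE checkout_start > 0) AS sessions_started_checkout,
    COUNT(*) FILTER (WHERE purchases > 0)      AS sessions_purchased
FROM session_kpi_mv;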
Persist Real-time analytics to Apache Iceberg
We store the processed and enriched data in a native, RisingWave-managed Iceberg table, using a self-hosted Lakekeeper catalog.
Create an Iceberg Connection (Lakekeeper + MinIO)
This creates an Iceberg REST catalog connection (Lakekeeper) backed by MinIO S3 storage, so RisingWave can read and write Iceberg tables in the specified warehouse.
CREATE CONNECTION lakekeeper_catalog_conn
WITH (
type = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://lakekeeper:8181/catalog/',
warehouse.path = 'risingwave-warehouse',
s3.access.key = 'hummockadmin',
s3.secret.key = 'hummockadmin',
s3.path.style.access = 'true',
s3.endpoint = 'http://minio-0:9301',
s3.region = 'us-east-1'
);
This gives RisingWave a fully spec-compliant Iceberg REST catalog, without Glue, Nessie, or vendor lock-in.
Enable Iceberg engine
This sets the Iceberg engine connection in RisingWave to use the previously defined Lakekeeper catalog.
SET iceberg_engine_connection = 'public.lakekeeper_catalog_conn';
Create a native Iceberg table
This creates a RisingWave-managed Iceberg table to store enriched clickstream data.
CREATE TABLE clickstream_joined_iceberg (
event_id VARCHAR,
event_time TIMESTAMPTZ,
event_type VARCHAR,
user_id BIGINT,
session_id VARCHAR,
device_type VARCHAR,
page_url VARCHAR,
campaign_source VARCHAR,
revenue_usd DOUBLE PRECISION
)
WITH (commit_checkpoint_interval = 1)
ENGINE = iceberg;
Your Iceberg table stays continuously fresh. The commit_checkpoint_interval setting controls how many checkpoints RisingWave waits between commits to the Iceberg table; a value of 1 commits on every checkpoint, which is roughly once per second with the default checkpoint interval.
Stream into Iceberg continuously
This populates the Iceberg table with the enriched data from the materialized view. Since the Iceberg table keeps only a subset of the view’s columns, the columns are listed explicitly:
INSERT INTO clickstream_joined_iceberg
SELECT
    event_id,
    event_time,
    event_type,
    user_id,
    session_id,
    device_type,
    page_url,
    campaign_source,
    revenue_usd
FROM clickstream_joined_mv;
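A one-off INSERT ... SELECT copies the view’s contents at the moment it runs. To keep the Iceberg table continuously in sync as new events arrive, one option is to create a sink into the table instead; a minimal sketch using RisingWave’s sink-into-table syntax with the same column subset:
CREATE SINK clickstream_to_iceberg INTO clickstream_joined_iceberg AS
SELECT
    event_id,
    event_time,
    event_type,
    user_id,
    session_id,
    device_type,
    page_url,
    campaign_source,
    revenue_usd
FROM clickstream_joined_mv;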
You can query the native Iceberg table to verify the results:
SELECT * FROM clickstream_joined_iceberg LIMIT 5;
Query the same data from Spark, Trino, or DuckDB
Because the data is stored in Apache Iceberg, you can query it directly from Spark, Trino, DuckDB, or any other Iceberg-compatible query engine using the Lakekeeper REST catalog, with no exports and no duplication.
This enables:
Cross-engine analytics and access
Feature extraction for ML
Historical and real-time joins
Long-term retention with open formats
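As one concrete illustration of cross-engine access, a Spark SQL session could attach the same Lakekeeper catalog roughly like this. This is a hedged sketch: the Iceberg package versions, the host-side ports, and the namespace under which RisingWave registers the table are assumptions rather than values taken from the demo.
# Illustrative only: query the Iceberg table from Spark via the Lakekeeper REST catalog.
# Assumes Lakekeeper is reachable from the host at localhost:8181 and MinIO at localhost:9301,
# and that the table is registered under a namespace such as "public" (check your catalog).
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1 \
  --conf spark.sql.catalog.lakekeeper=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakekeeper.type=rest \
  --conf spark.sql.catalog.lakekeeper.uri=http://localhost:8181/catalog/ \
  --conf spark.sql.catalog.lakekeeper.warehouse=risingwave-warehouse \
  --conf spark.sql.catalog.lakekeeper.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.lakekeeper.s3.endpoint=http://localhost:9301 \
  --conf spark.sql.catalog.lakekeeper.s3.path-style-access=true \
  --conf spark.sql.catalog.lakekeeper.s3.access-key-id=hummockadmin \
  --conf spark.sql.catalog.lakekeeper.s3.secret-access-key=hummockadmin \
  -e "SELECT count(*) FROM lakekeeper.public.clickstream_joined_iceberg;"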
Strategic Benefits
This architecture delivers:
True real-time ingestion, analytics, storage, and low-latency serving
Zero Kafka operational overhead
Postgres-style, SQL-only stream processing
Open analytics storage (Iceberg)
Streaming to downstream analytics stores such as ClickHouse, StarRocks, BigQuery, and Snowflake
Seamless integration with BI tools such as Superset, Grafana, and Metabase
AI/ML integration for feature engineering and extraction
Conclusion
In this blog, we built a fully real-time clickstream analytics pipeline using RisingWave + Events API that ingests clickstream data over HTTP (JSON / NDJSON), enriches events in real time with materialized views, and computes live session KPIs and funnels. We then persisted the results continuously to Apache Iceberg and queried the same data from RisingWave, Spark, or Trino.
This approach shows how RisingWave lets teams build low-latency, open, and operationally simple analytics pipelines without Kafka, without batch lag, and without lock-in.

