Giving Stream Processing a “Memory” — The Secret of Stateful Operations

Giving Stream Processing a “Memory” — The Secret of Stateful Operations

Imagine you’re running a coffee shop, and there’s a camera at the entrance, continuously capturing every customer who walks in. These snapshots are your event stream — one after another, playing endlessly.

But here’s the challenge, if you want to know:

  • How many times has this customer visited today?

  • How much has he spent this week?

  • What drink did he order five minutes after being seated?

You can’t just look at the current snapshot; you need to remember what happened before. This ability to retain history is what we call a stateful operation.

What is State?

In a streaming system, state refers to the historical information that the system needs to keep when processing new events.

Stateful processing relies on this historical information to compute results.

Stateless vs Stateful Processing:

Stateless:
Event1 ─→ [Process] ─→ Result1
Event2 ─→ [Process] ─→ Result2  (processed independently, no memory of Event1)
Event3 ─→ [Process] ─→ Result3

Stateful:
Event1 ─→ [Process + State] ─→ Result1
                 ↓
Event2 ─→ [Process + State] ─→ Result2  (remembers Event1)
                 ↓
Event3 ─→ [Process + State] ─→ Result3  (remembers Event1 and Event2)

Examples of Stateful Operations

Common stateful operations include:

1. JOIN

Keeps part of Stream A and Stream B in memory until a match is found, then outputs the result.

Stream JOIN Example:

Orders Stream:     [Order1] ──┐
                              ├── [JOIN State] ── [Complete Order]
Payments Stream:   [Payment1]─┘
                        ↑
                   Need to store:
                   • Unmatched orders
                   • Unmatched payments

2. GROUP BY / Aggregation

Maintains cumulative statistics for each group.

Aggregation Example:

User Events ─→ [Aggregation State] ─→ User Statistics
                      ↑
               Store per user:
               • Visit count
               • Total spending
               • Last activity

3. Window Aggregation

Computes metrics like average, max, or sum over a time window.

Window Aggregation:

Events ─→ [Window State] ─→ Window Results
              ↑
        Store in window:
        • All events in the timeframe
        • Intermediate calculations

Ways to Store State

In a streaming system, state is usually stored in one of two ways:

1. Local State Store

Local State Store:

Stream Processor ──┬── Processing Logic
                   └── Local RocksDB
                           ↓
                    Checkpoint to HDFS

Characteristics:

  • Example: RocksDB, co-located with the processing node

  • Low latency, ideal for high-frequency queries

  • Requires additional checkpoint for fault tolerance

2. External State Store

External State Store:

Stream Processor ── Network ── Redis/Cassandra/TiKV
                                      ↑
                               Shared across nodes

Characteristics:

  • Examples: Redis, Cassandra, TiKV

  • Easier to scale horizontally and share state across nodes

  • Higher latency, network cost must be considered

Industry Status

Mainstream streaming systems like Flink primarily use Local State Stores:

ApproachLatencyScalabilityShareabilityFault Tolerance ComplexityMainstream Use
Local StoreLowMediumDifficultHighFlink
External StoreMediumHighEasyLowSpecial cases

Why Local Store dominates:

  • Performance first: Streaming is extremely latency-sensitive; microsecond-level access is a huge advantage

  • Mature technology: RocksDB + Checkpointing is very stable

  • Complete ecosystem: Frameworks like Flink have built-in support, ready to use out of the box

Challenges with Local State Store

Simply storing data locally isn’t enough; you also need to address:

  • Persistence: State must survive node restarts

  • Consistency: State must remain correct in a distributed environment

  • Low-latency access: State must be readable/writable immediately when processing new events

How to achieve these three requirements?

Persistence:

RocksDB (Memory + Disk) ─── Checkpoint ──→ HDFS/S3
                                          (periodic snapshots)
           ↑                                   │
           └────── Recovery ←──────────────────┘
                 (on node restart)
  • Periodic checkpoints store RocksDB snapshots in a distributed file system

  • RocksDB keeps state in both memory and disk; on node failure, the latest checkpoint is loaded to recover state

Consistency:

Event Processing ──→ State Update ──→ Checkpoint Barrier
        │                               │
        └── Ensures event processing order ──────────┘
  • Checkpoint barriers ensure all nodes create snapshots at the same point in time

  • Guarantees a consistent view of state across distributed nodes

Low-latency access:

Hot Data (Memory) ←── RocksDB ──→ Cold Data (SSD)
     ↑                              ↑
  microsecond access               millisecond access
  • RocksDB caches hot data in memory, cold data on high-speed SSDs

  • Local access avoids network latency

Summary

State storage is the key technology that lets streaming evolve from “dumb forwarding” to intelligent computation.

Choices:

  • Local State Store (RocksDB + Checkpoint) is now the industry standard

  • Sacrifices some scalability complexity for microsecond-level performance

Mechanisms:

  • RocksDB: layered storage (memory + disk)

  • Checkpoint: ensures distributed consistency and fault recovery

  • Local access: avoids network latency, meets real-time processing needs

With state storage, streaming systems can finally support complex operations like JOINs, aggregations, and windowing.

Day 13 Preview: Moving into Streaming JOINs

On Day 13, we’ll stop treating JOINs as a “query the database once” patch and dive into Streaming JOINs. By implementing Join, GroupBy, and Window, we’ll better understand how state actually works in practice and gain a deeper understanding of stateful operations.

We’ll focus on in-memory state management to show how streaming can “remember” past events for stateful computation, without relying on RocksDB for the example implementation.

The Modern Backbone for
Real-Time Data and AI
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.