Durability (State Management)

Durability, in the context of stateful stream processing and databases, is one of the core ACID properties (Atomicity, Consistency, Isolation, Durability). It guarantees that once a transaction (or in streaming, often a state update associated with a successful Checkpoint) has been committed, it will persist permanently, even in the face of system failures like node crashes, restarts, or power outages.

Essentially, durability ensures that committed state changes are not lost. This is fundamental for building reliable systems where state represents important business logic, aggregations, or materialized results.

The Need for Durability

Stateful stream processing involves maintaining internal state (e.g., counts, sums, windowed data, join results). If this state only exists in volatile memory (RAM) on processing nodes:

Data Loss on Failure: A node crash or restart would wipe out the state held on that node, leading to incorrect results upon recovery unless the entire state can be perfectly reconstructed from the input streams (which is often inefficient or impossible for long-running streams).
Inconsistent Recovery: Without durable state, the system cannot reliably restore itself to a consistent point after a failure.

Durability ensures that the system's "memory" survives failures.

How Durability is Achieved

Durability typically relies on writing state information to persistent, non-volatile storage before acknowledging a commit or checkpoint completion:

Persistent Storage: Using storage media that retains data even when power is lost, such as:
- Hard Disk Drives (HDDs)
- Solid-State Drives (SSDs)
- Cloud Object Storage (S3, GCS, ADLS) - Common in cloud-native systems.
- Distributed File Systems (HDFS).
Write-Ahead Logging (WAL): A common database technique where changes are first written to a sequential log on durable storage before being applied to the main data structures. This ensures that even if a crash occurs mid-update, the log can be replayed to recover the state. While not always directly used in the same way by all streaming state stores, the principle of logging changes durably is related.
Checkpointing to Durable Storage: As discussed under Checkpointing, stateful stream processors periodically snapshot their entire consistent state and write these snapshots to durable storage. This is the primary mechanism for ensuring state durability in systems like RisingWave. Once a checkpoint is successfully written to durable storage, the state represented by that checkpoint is considered durable.
Replication: Often combined with durable storage, replicating data across multiple storage devices or nodes further enhances durability against hardware failures. Cloud object storage typically handles this replication transparently.

Durability vs. Availability

Durability (data survives failures) is distinct from High Availability (system remains operational during failures). A system might have durable state storage but could still experience downtime during recovery (low availability). Conversely, a highly available system might achieve failover quickly but could lose recent state if it wasn't made durable before the failure. Robust systems aim for both.

Durability in RisingWave

RisingWave achieves durability for its streaming state primarily through its Checkpointing mechanism and its Hummock state store:

Hummock & Cloud Object Storage: Hummock, RisingWave's state store, is designed to use durable Cloud Object Storage (like S3 or GCS) as its persistent backend.
Checkpoint Persistence: During checkpointing, snapshots of operator state are asynchronously written via Hummock to the configured object storage.
Commit Point: A checkpoint is only considered complete and successful after all its constituent state snapshots have been durably persisted in the object storage backend.
Recovery from Durable State: Upon failure, RisingWave recovers by loading the latest successfully completed checkpoint from the durable object storage via Hummock.

This ensures that even if compute nodes fail catastrophically, the committed state of materialized views, joins, and aggregations managed by RisingWave is not lost, allowing the system to recover to a consistent and durable point in time. The durability guarantees of the underlying cloud object storage (e.g., S3's 99.999999999% durability) are inherited by RisingWave's state.

Durability (State Management)

The Need for Durability

How Durability is Achieved

Durability vs. Availability

Durability in RisingWave

Related Glossary Terms