Checkpointing

Checkpointing is a fundamental fault tolerance mechanism used in stateful stream processing systems like RisingWave. It's the process of periodically taking consistent snapshots of the distributed state of a running streaming job and persisting these snapshots to durable storage (like cloud object storage or a distributed file system).

If a failure occurs (e.g., a processing node crashes), the system can restore its internal state from the latest successfully completed checkpoint and resume processing from the corresponding point in the input streams, minimizing data loss and ensuring consistency. Checkpointing is crucial for achieving exactly-once state semantics in the face of failures.

The Problem: Losing State in Streaming

Stateful stream processing operations (like aggregations, joins over time windows, or maintaining materialized views) inherently require remembering information from past events. This internal 'memory' is the operator's state. If a node running such an operator crashes without any fault tolerance mechanism:

  1. State Loss: All the accumulated state held in that node's memory is lost.
  2. Inconsistent Results: Simply restarting the operator from the beginning of the stream would lead to incorrect results, as it wouldn't have the necessary historical context. Recomputing everything from the very beginning is often infeasible for long-running streams.

Checkpointing solves this by periodically saving a recoverable 'save point' of the entire pipeline's state.

How Checkpointing Works: Key Concepts

Modern stream processing systems typically use variations of asynchronous barrier snapshotting algorithms, inspired by the Chandy-Lamport algorithm:

  1. Coordinator Initiation: A central coordinator (e.g., RisingWave's Meta node) initiates a checkpoint periodically by injecting special messages called 'checkpoint barriers' into all input streams (sources) of the dataflow graph.
  2. Barrier Propagation: These barriers flow downstream through the dataflow graph along with regular data records. They do not overtake data records, and data records do not overtake barriers within a logical stream path.
  3. Operator State Snapshot: When an operator receives a barrier from all of its input streams:
    • It knows that all data records before the barriers have been processed.
    • It immediately takes a snapshot of its current internal state (e.g., the current count in an aggregation, the buffered records in a join).
    • It persists this state snapshot asynchronously to the configured durable state backend (e.g., RisingWave's Hummock state store writing to S3).
    • Once its state is successfully persisted, it forwards the barrier downstream to its output streams.
  4. Checkpoint Completion: When the barriers have successfully traversed the entire graph and all operators have reported their state snapshots as complete to the coordinator, the checkpoint is marked as successful. The system now knows it can recover to this consistent state.
  5. Durable Storage: The snapshots are written to a reliable, durable storage system (like AWS S3, GCS, HDFS) separate from the processing nodes. This ensures the state survives node failures.
  6. Recovery: If a failure occurs, the system stops processing. It restarts the operators, instructs them to restore their internal state from the latest completed checkpoint stored in the durable storage, and resets the input sources to start replaying data from the point just after where that checkpoint was taken. This ensures minimal reprocessing and consistent state recovery.

Consistent Cut: The core idea is that the barriers define a 'consistent cut' across the distributed pipeline. The snapshot taken by each operator corresponds to the state after processing all records before the barrier and before processing any records after the barrier, ensuring a globally consistent view of the pipeline's state at that logical point in time.

Key Benefits

  • Fault Tolerance: Enables recovery from failures without losing significant progress or internal state.
  • Exactly-Once State Semantics: Ensures that the internal state of the stream processing job reflects each input record's effect exactly once, even with failures and retries (when combined with input replay).
  • Bounded Recovery Time: Recovery time (RTO) is typically limited to the time needed to reload state and replay records since the last checkpoint, rather than reprocessing the entire stream history.
  • Bounded Data Loss: Potential data loss (RPO) is limited to the data that arrived between the last completed checkpoint and the time of failure (often configurable by checkpoint interval).

Checkpointing in RisingWave

RisingWave implements a sophisticated checkpointing mechanism tightly integrated with its architecture:

  • Meta Node Coordinator: The central Meta node orchestrates the checkpointing process, injecting barriers and tracking completion.
  • Hummock State Store: Checkpoint snapshots are persisted to RisingWave's cloud-native state store, Hummock, which uses durable object storage (like S3) as its backend. Hummock's LSM-tree architecture is optimized for efficient asynchronous snapshotting.
  • Compute Node Execution: Compute nodes execute the operators. When barriers arrive, they trigger the snapshotting of their local operator state via Hummock.
  • Barrier Alignment: RisingWave ensures barriers are correctly aligned across parallel operator instances and different inputs before state is snapshotted.
  • Asynchronous Snapshots: State snapshotting happens asynchronously in the background, minimizing the impact on foreground data processing throughput and latency.
  • Configurable Interval: Users can configure the checkpointing interval, balancing recovery time/data loss objectives (RPO/RTO) against the overhead of frequent checkpointing.

This robust checkpointing mechanism is essential for providing RisingWave's strong consistency guarantees and fault tolerance, especially for long-running, stateful materialized views and streaming pipelines.

Related Glossary Terms

  • Fault Tolerance
  • Exactly-Once Semantics (EOS)
  • Stateful Stream Processing
  • State Store (RisingWave Specific)
  • Hummock (RisingWave Specific)
  • Recovery Point Objective (RPO)
  • Recovery Time Objective (RTO)
  • Dataflow Graph
  • Consistency (State Management)
  • Cloud Object Storage
The Modern Backbone for Your
Event-Driven Infrastructure
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.