Fault Tolerance
Fault Tolerance is the property that enables a system to continue operating properly, often without interruption, in the event of the failure of one or more of its components. In the context of distributed systems like stream processing engines or databases, this means being resilient to failures such as node crashes, network partitions, or software bugs, while maintaining data integrity and processing guarantees.
Achieving fault tolerance is critical for building reliable applications, especially those handling continuous data streams or managing important state, where downtime or data loss is unacceptable.
The Need for Fault Tolerance in Distributed Systems
Distributed systems inherently face a higher probability of partial failures compared to single-machine systems. Nodes can crash, networks can become unreliable, and disks can fail. Without fault tolerance mechanisms, such failures could lead to:
- Data Loss: Losing intermediate results or committed state.
- Incorrect Results: Processing data incorrectly due to inconsistent state after a failure.
- System Downtime: Requiring manual intervention or long recovery processes.
- Violation of Guarantees: Failing to meet processing semantics like At-Least-Once or Exactly-Once.
Key Techniques for Achieving Fault Tolerance
Common strategies used in distributed systems, particularly stateful stream processing, include:
- Replication: Maintaining multiple copies of data or computation across different nodes or availability zones. If one copy becomes unavailable, others can take over. This is fundamental for High Availability (HA) and Durability.
- Example: Kafka replicates topic partitions across brokers; Hummock persists state data to object storage, which replicates it internally for durability.
- Checkpointing: Periodically taking consistent snapshots of the system's state and persisting them to durable storage. Upon failure, the system can restore from the last successful checkpoint. This is the core fault tolerance mechanism in many stateful stream processors, including RisingWave and Flink (a minimal sketch follows this list).
- Write-Ahead Logging (WAL): Logging intended changes to durable storage before applying them to the main state. If a failure occurs mid-operation, the log can be replayed on recovery so that the operation either completes or is correctly rolled back. WAL is used heavily in databases and is conceptually related to how state changes may be logged before checkpointing in some streaming systems (see the WAL sketch after this list).
- Heartbeating and Failure Detection: Nodes monitor each other (e.g., via heartbeat messages). If a node becomes unresponsive, failure detectors trigger recovery procedures (a simple failure-detector sketch also follows this list).
- Leader Election: In systems with leader/follower architectures (like Kafka brokers or some database replication setups), automatic leader election mechanisms choose a new leader if the current one fails.
- Redundant Components: Running multiple instances of critical control plane or processing components, allowing failover if one instance crashes.
- Task Retries and Scheduling: Automatically rescheduling failed processing tasks onto healthy nodes.
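To make the checkpointing idea concrete, here is a minimal Python sketch, not tied to any particular engine: an operator periodically writes a consistent snapshot of its state to durable storage (a local directory stands in for object storage here) and, after a crash, a fresh instance restores from the latest snapshot. The class name and file layout are invented for illustration.

```python
import glob
import json
import os


class CheckpointingOperator:
    """Toy stateful operator that snapshots its state to durable storage."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        os.makedirs(checkpoint_dir, exist_ok=True)
        self.state = {}   # e.g., running counts per key
        self.epoch = 0    # monotonically increasing checkpoint id

    def process(self, key):
        # Update in-memory state; this is the data we must not lose.
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self):
        # Persist a consistent snapshot; write to a temp file and rename so a
        # crash mid-write never leaves a corrupt "latest" checkpoint behind.
        self.epoch += 1
        path = os.path.join(self.checkpoint_dir, f"ckpt-{self.epoch:08d}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"epoch": self.epoch, "state": self.state}, f)
        os.replace(tmp, path)

    def restore(self):
        # After a failure, load the most recent successful checkpoint.
        checkpoints = sorted(glob.glob(os.path.join(self.checkpoint_dir, "ckpt-*.json")))
        if not checkpoints:
            return  # nothing to restore; start from empty state
        with open(checkpoints[-1]) as f:
            snapshot = json.load(f)
        self.epoch = snapshot["epoch"]
        self.state = snapshot["state"]


if __name__ == "__main__":
    import tempfile

    ckpt_dir = tempfile.mkdtemp()
    op = CheckpointingOperator(ckpt_dir)
    for key in ["a", "b", "a"]:
        op.process(key)
    op.checkpoint()                          # state {"a": 2, "b": 1} is now durable
    recovered = CheckpointingOperator(ckpt_dir)
    recovered.restore()                      # simulate a restart after a crash
    print(recovered.state)                   # {'a': 2, 'b': 1}
```

Anything processed after the last snapshot is lost unless the input can be replayed, which is why checkpointing is usually paired with replayable sources and persisted source offsets.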
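Write-ahead logging applies the same "make it durable first" idea at a finer granularity. The sketch below (again illustrative Python, not any specific database's implementation) appends each intended change to an append-only log and forces it to disk before mutating the in-memory state, so a restart can rebuild the state by replaying the log.

```python
import json
import os


class WalStore:
    """Toy key-value store: log the change durably, then apply it in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._recover()

    def put(self, key, value):
        record = json.dumps({"op": "put", "key": key, "value": value})
        # 1. Append the intent to the log and force it to disk...
        with open(self.log_path, "a") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ...only then apply it to the in-memory state.
        self.state[key] = value

    def _recover(self):
        # Replay the log on startup; a crash between steps 1 and 2 is
        # repaired here, because the logged operation is re-applied.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "put":
                    self.state[record["key"]] = record["value"]


if __name__ == "__main__":
    import tempfile

    log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = WalStore(log_path)
    store.put("balance", 42)
    # Simulate a restart: a fresh process rebuilds state from the log.
    print(WalStore(log_path).state)  # {'balance': 42}
```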
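Failure detection can be as simple as tracking the last heartbeat received from each node and suspecting any node that exceeds a timeout. A toy Python sketch (the timeout value and node names are arbitrary):

```python
import time


class HeartbeatMonitor:
    """Toy failure detector: suspect a node if no heartbeat arrives within a timeout."""

    def __init__(self, timeout_secs=5.0):
        self.timeout_secs = timeout_secs
        self.last_seen = {}   # node id -> timestamp of last heartbeat

    def heartbeat(self, node_id):
        # Called whenever a heartbeat message arrives from a node.
        self.last_seen[node_id] = time.monotonic()

    def dead_nodes(self):
        # Nodes whose last heartbeat is older than the timeout are suspected
        # failed; a real system would trigger recovery or rescheduling here.
        now = time.monotonic()
        return [node for node, seen in self.last_seen.items()
                if now - seen > self.timeout_secs]


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_secs=0.1)
    monitor.heartbeat("compute-node-1")
    time.sleep(0.2)              # the node goes silent longer than the timeout
    print(monitor.dead_nodes())  # ['compute-node-1']
```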
Fault Tolerance vs. High Availability (HA)
While related, Fault Tolerance and High Availability are distinct:
- Fault Tolerance: The ability to withstand failures without data loss or correctness issues. Recovery might involve some downtime.
- High Availability: The ability to remain operational and accessible during failures, minimizing downtime. HA often relies heavily on fault tolerance mechanisms like replication and automatic failover.
A system can be fault-tolerant (recovering correctly after a crash) but not highly available (taking significant time to recover). Conversely, a system might fail over quickly (HA) but could lose recent data if it wasn't made durable before the failure (lacking fault tolerance for that data). Robust systems aim for both.
Fault Tolerance in RisingWave
RisingWave is designed with fault tolerance as a core principle, leveraging several techniques:
- Checkpointing: As described under Checkpointing, RisingWave periodically takes consistent snapshots of distributed operator state and persists them to its Hummock state store. This is the primary mechanism ensuring stateful computations (materialized views, joins, aggregations) can recover correctly after failures.
- Hummock State Store: Hummock persists checkpoints and state data to durable Cloud Object Storage (S3, GCS, etc.), which itself provides high durability through internal replication. This ensures state survives compute node failures.
- Compute Node Recovery: If a compute node fails, RisingWave's Meta service detects the failure and reschedules the tasks previously running on that node onto other available compute nodes. These tasks restore their state from the last successful checkpoint in Hummock.
- Meta Node High Availability: The Meta node itself can be configured for High Availability (using etcd or an embedded Raft implementation) to prevent it from being a single point of failure for the cluster's control plane.
- Source Offset Persistence: Source connectors track their consumption progress (e.g., Kafka offsets). This progress is typically included in the checkpoints, ensuring that upon recovery, RisingWave resumes consuming from the correct position in the input stream, consistent with the restored state (essential for EOS).
These mechanisms work together to ensure that RisingWave can recover from failures automatically, maintain the integrity of its internal state, and uphold its processing guarantees (aiming for Exactly-Once Semantics for state).
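Why it matters that source offsets live inside the same checkpoint as operator state can be shown with a small conceptual sketch. This is plain Python invented for illustration, not RisingWave code: because the offset and the state are restored together, recovery replays exactly the records the restored state has not yet seen, avoiding both data loss and double counting.

```python
def run_pipeline(records, snapshot=None):
    """Conceptual sketch: the source offset and operator state are captured
    in one snapshot, so restoring them together keeps recovery consistent."""
    state = {"count": 0}
    offset = 0
    if snapshot is not None:
        state = snapshot["state"]
        offset = snapshot["offset"]   # resume exactly where the state left off

    for i, record in enumerate(records[offset:], start=offset):
        state["count"] += record["value"]
        offset = i + 1                # next offset to read after this record

    # A checkpoint captures both pieces atomically (here: one dict).
    return {"state": state, "offset": offset}


if __name__ == "__main__":
    stream = [{"value": v} for v in (1, 2, 3, 4)]
    ckpt = run_pipeline(stream[:2])        # processed offsets 0..1, then "crashed"
    resumed = run_pipeline(stream, ckpt)   # replays only offsets 2..3
    print(resumed["state"])                # {'count': 10}: no loss, no double counting
```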
Related Glossary Terms
- Checkpointing
- High Availability (HA)
- Durability (State Management)
- Consistency (State Management)
- Exactly-Once Semantics (EOS) / At-Least-Once Semantics
- Stateful Stream Processing
- State Store (RisingWave Specific)
- Hummock (RisingWave Specific)
- Cloud Object Storage
- Replication (Concept)
- Recovery Point Objective (RPO)
- Recovery Time Objective (RTO)