
Kappa Architecture

The Kappa Architecture is a software architecture pattern for data processing that aims to handle both real-time and historical (batch) data analysis using a single stream processing engine and an immutable log as the source of truth. It was proposed as a simpler alternative to the more complex Lambda Architecture, which uses separate batch and speed layers.

The core idea of the Kappa Architecture is to treat all data as a stream and use a stream processing system to perform all computations. If historical data needs to be reprocessed (e.g., due to a bug fix in the processing logic or a new analytical requirement), the stream processing job is simply rerun from the beginning of the immutable log.

Key Principles

  1. Everything is a Stream: All data, whether real-time or historical, is treated as an ordered sequence of events (a stream).
  2. Immutable Log as Source of Truth: An append-only, immutable log (e.g., Apache Kafka, Apache Pulsar) stores all incoming data indefinitely or for a very long retention period. This log is the single source of truth.
  3. Stream Processing Engine for All Computations: A single stream processing engine (like RisingWave, Apache Flink, or ksqlDB) is responsible for all data transformations, aggregations, and analytics. There is no separate batch processing layer.
  4. Reprocessing from the Log for Batch/Historical Analysis: To recompute historical results or apply new logic to old data, the stream processing job is redeployed to read from the log starting from an earlier offset (potentially the very beginning).
  5. Serving Layer: Results from the stream processing engine are typically written to a serving database or data store optimized for fast queries (e.g., materialized views in RisingWave, key-value stores, search indexes, or traditional databases).
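The principles above can be sketched as a minimal Kappa-style pipeline in RisingWave SQL. This is an illustrative sketch: the topic, broker address, and column names are hypothetical, and the exact connector options may vary by RisingWave version.

```sql
-- Immutable log as source of truth: a Kafka topic holding raw events.
-- (Topic name, broker address, and schema are illustrative.)
CREATE SOURCE click_events (
    user_id BIGINT,
    url VARCHAR,
    event_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'  -- read from the beginning of the log
) FORMAT PLAIN ENCODE JSON;

-- One stream processing engine for all computations: the same SQL
-- definition handles both historical replay and live updates.
CREATE MATERIALIZED VIEW clicks_per_user AS
SELECT user_id, COUNT(*) AS clicks
FROM click_events
GROUP BY user_id;
```

The materialized view doubles as the serving layer: it is incrementally maintained as new events arrive and can be queried like an ordinary table.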

Kappa vs. Lambda Architecture

The Lambda Architecture consists of three layers:

  • Batch Layer: Manages the master dataset (immutable) and precomputes batch views from all data.
  • Speed Layer (Streaming Layer): Processes incoming data in real-time to provide low-latency updates and compensate for the batch layer's latency. Results are often approximate or partial.
  • Serving Layer: Merges results from the batch and speed layers to respond to queries.

Kappa Architecture Simplifies Lambda by:

  • Removing the Batch Layer: Eliminates the need to maintain and operate a separate codebase and infrastructure for batch processing.
  • Unifying Logic: The same processing logic (defined in the stream processor) is used for both real-time and historical data. This reduces code duplication and the potential for inconsistencies between batch and streaming results.
  • Operational Simplicity: Managing one processing pipeline is generally simpler than managing two.

Advantages of Kappa Architecture

  • Simplicity: Fewer moving parts and a single codebase for processing logic.
  • Consistency: Reduces the risk of discrepancies between real-time and batch views because the same engine and logic are used.
  • Agility: Easier to update processing logic and recompute results, as only one system needs to be managed.
  • Scalability of Log: Relies on the scalability of modern distributed logs like Kafka.

Challenges and Considerations

  • Reprocessing Time: Reprocessing large historical datasets through a stream processor can be time-consuming, although stream processing engines are becoming increasingly efficient at this. The feasibility depends on the data volume, processing complexity, and the stream processor's capabilities.
  • Log Retention: The immutable log needs to store data for as long as reprocessing might be required. This can have storage cost implications, though tiered storage in log systems can help.
  • Stream Processor Capabilities: The chosen stream processing engine must be robust, scalable, and capable of handling both high-velocity real-time streams and large-scale reprocessing from historical data. It needs efficient state management and fault tolerance.
  • Tooling: Mature tooling for managing reprocessing jobs, monitoring progress, and handling schema evolution in the log is important.
  • Suitability for All Use Cases: For extremely large historical datasets that require complex, multi-stage batch processing not easily expressed in a single streaming job, a hybrid approach or the Lambda Architecture may still be preferable, though such cases are becoming less common.

Kappa Architecture with RisingWave

RisingWave aligns well with the principles of the Kappa Architecture:

  • Stream-Native: RisingWave is designed from the ground up as a stream processing system.
  • SQL for Logic: Users define processing logic using SQL, which can be applied to continuous streams.
  • Materialized Views for Serving: RisingWave's materialized views serve as an efficient serving layer, providing fresh, incrementally updated results.
  • Source Replay (Conceptual): While RisingWave itself doesn't store the raw input log indefinitely (that's the role of systems like Kafka), it can process data from such logs. If reprocessing is needed, one would typically configure the source (e.g., Kafka source) to read from an earlier offset, and RisingWave would recompute its materialized views.
  • Efficient State Management: Hummock, RisingWave's state store, is designed for efficient stateful stream processing, which is crucial for complex computations during both real-time processing and reprocessing.
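Source replay as described above might look like the following, assuming a Kafka source whose startup position is fixed at creation time (so reprocessing means recreating the source, or standing up a parallel one). All names are illustrative.

```sql
-- Hypothetical replay flow: drop and recreate the source so it reads
-- from an earlier point in the log; dependent materialized views are
-- then recomputed over the full retained history.
DROP SOURCE IF EXISTS click_events CASCADE;  -- CASCADE drops dependent views

CREATE SOURCE click_events (
    user_id BIGINT,
    url VARCHAR,
    event_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'kafka:9092',
    -- replay from the earliest retained offset; a timestamp-based start
    -- may also be available depending on the connector version
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- Re-declare the same logic (possibly with a bug fix); it now runs
-- over the full history before catching up to the live stream.
CREATE MATERIALIZED VIEW clicks_per_user AS
SELECT user_id, COUNT(*) AS clicks
FROM click_events
GROUP BY user_id;
```

Because the same SQL is reused for replay and live processing, the recomputed results are consistent with what the original pipeline would have produced under the corrected logic.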

By using RisingWave with a durable event log like Kafka or Pulsar, users can implement a Kappa-style architecture to unify their real-time and historical data processing needs with a consistent SQL-based approach.
