From Lambda to Kappa: The Evolution of Stream Processing Systems

Remember last time we asked: “Why not just move all the logic to streaming? Wouldn’t that be faster?”

The Dual-Track Problem of Lambda Architecture

In a Lambda architecture, our data pipelines follow a dual-track approach:

Lambda Architecture:

Raw Data ┬ Batch Layer ─ Batch Views ────┐
         │                               ├ Serving Layer ─ Query Results
         └ Speed Layer ─ Real-time Views ┘

Responsibilities of each layer

  • Batch Layer: Handles batch processing, ensures final accuracy, and recomputes full datasets overnight

  • Speed Layer: Handles real-time processing, provides up-to-date results

  • Serving Layer: Merges outputs from both layers for queries

The cost

  • Logic must be maintained twice (once for batch, once for streaming)

  • Two sets of infrastructure are required (Hadoop/Spark + Flink/Kafka)

  • Any change in requirements forces engineers to update both layers
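To make the "logic maintained twice" cost concrete, here is a minimal sketch (hypothetical event shapes, not from the series' codebase) of the same revenue-per-product aggregation written twice: once as a batch job over the full dataset, once incrementally for the speed layer. Any change in requirements must be applied to both:

```python
from collections import defaultdict

# Hypothetical order event: (product, amount)

# Batch layer: recompute totals over the full dataset (e.g. an overnight job)
def batch_revenue(all_orders):
    totals = defaultdict(float)
    for product, amount in all_orders:
        totals[product] += amount
    return dict(totals)

# Speed layer: the SAME business logic, rewritten incrementally for streaming
class StreamingRevenue:
    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, product, amount):
        self.totals[product] += amount

orders = [("latte", 5.0), ("espresso", 3.0), ("latte", 5.0)]
print(batch_revenue(orders))

speed = StreamingRevenue()
for product, amount in orders:
    speed.on_event(product, amount)  # same numbers, second codebase to maintain
```

Both code paths must stay in sync forever, which is exactly the maintenance burden Kappa removes.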

The Simplified Idea of Kappa Architecture

The idea behind Kappa architecture is straightforward: “Remove the batch layer and handle everything with streaming.”

Kappa Architecture:

Raw Data ── Event Log (Kafka) ── Stream Processing ── Results

Benefits: Only one set of logic, one system, drastically reducing maintenance overhead.

An Upgraded Coffee Shop Example

Imagine moving all coffee shop data processing logic to streaming:

Kappa Architecture applied to a Coffee Shop:

Order Events ── Kafka ── Stream Processing
                              │
                              ▼
                        ┌────────────┐
                        │ • JOIN     │
                        │ • GROUP BY │
                        │ • ORDER BY │
                        │ • TopN     │
                        └────────────┘
                              │
                              ▼
                        Result Tables
                      (Pre-computed Results)
                              │
                              ▼
                          Dashboard

Processing flow:

  1. Kafka ingests order events

  2. Stream processing handles them in real time

  3. Results are written directly to Result Tables

  4. Dashboards query pre-computed results — no need to JOIN, GROUP BY, or ORDER BY, latency < 50ms

Impact: Even at peak hours with hundreds of queries per second, the database is never overloaded because everything reads from a single result table.
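The flow above can be sketched in a few lines (hypothetical names and event shape; a real deployment would read from Kafka and write to a database table). The key point is that the GROUP BY happens once per event on the write path, so the dashboard's read path is a plain key lookup:

```python
from collections import defaultdict

# Hypothetical in-memory "result table": product -> order count
result_table = defaultdict(int)

def process_event(event):
    """Stream-processing step: pre-aggregate on ingestion.
    The GROUP BY work happens here, once per event, not at query time."""
    result_table[event["product"]] += 1

def dashboard_query(product):
    """Dashboard read path: a plain key lookup, no JOIN/GROUP BY/ORDER BY."""
    return result_table[product]

for e in [{"product": "latte"}, {"product": "latte"}, {"product": "mocha"}]:
    process_event(e)

print(dashboard_query("latte"))  # 2
```

Hundreds of concurrent dashboard queries now cost a dictionary lookup each, which is why the database never becomes the bottleneck.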

The Core Challenge of Kappa: Stateful Operations

Think about it: “How do we handle JOINs, GROUP BYs, or sliding window computations?”

Exactly — the key step to adopting Kappa is enabling stateful operations.

Why is state needed?

  • JOINs: Need to buffer recent events from one stream so they can be matched against newly arriving events from the other

  • GROUP BY / Aggregations: Must maintain intermediate aggregates

  • Window Functions: Need to track all events within the window
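The JOIN case shows why state is unavoidable. A minimal sketch (hypothetical event shapes) of a stream-stream join on `order_id`: events that arrive before their counterpart must be buffered, and those buffers are state:

```python
# Hypothetical stream-stream JOIN: match order events with payment events
# by order_id. Unmatched events are buffered -- the buffers ARE the state.
orders_buffer = {}    # state: order_id -> order event awaiting a payment
payments_buffer = {}  # state: order_id -> payment event awaiting an order

def on_order(order):
    oid = order["order_id"]
    if oid in payments_buffer:
        return {**order, **payments_buffer.pop(oid)}  # join hit
    orders_buffer[oid] = order                        # buffer until the match arrives
    return None

def on_payment(payment):
    oid = payment["order_id"]
    if oid in orders_buffer:
        return {**orders_buffer.pop(oid), **payment}  # join hit
    payments_buffer[oid] = payment
    return None

print(on_order({"order_id": 1, "item": "latte"}))  # None -- still waiting
print(on_payment({"order_id": 1, "amount": 5.0}))  # joined record
```

If the process restarts and the buffers are lost, in-flight joins are silently dropped, which is why production systems persist this state rather than keeping it only in memory.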

Stateful Operations Challenge:

Event1 ──┐
Event2 ──┼── [State Storage] ── Computed Results
Event3 ──┘        ↑
                  │
            Must remember:
            • Past events
            • Intermediate results
            • Window boundaries

The key challenge: Without state management, we cannot fully move multi-table JOINs or aggregation logic from databases into stream processing.
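A sliding-window count makes the "must remember" list above tangible. In this sketch (a hypothetical operator, not the series' framework API), the deque of past timestamps is the state; lose it and the count is wrong:

```python
from collections import deque

class SlidingWindowCount:
    """Stateful operator: count events in the last `window_seconds` seconds
    (window boundary inclusive). The deque of timestamps IS the state."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.timestamps = deque()  # state: timestamps of events still in the window

    def on_event(self, ts):
        self.timestamps.append(ts)
        # Evict events that have fallen outside the window boundary
        while self.timestamps and self.timestamps[0] < ts - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

op = SlidingWindowCount(window_seconds=60)
print(op.on_event(0))    # 1
print(op.on_event(30))   # 2
print(op.on_event(90))   # 2  (the event at t=0 has been evicted)
```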

Giving Stream Processing “Memory”

From Day 12 onwards, we’ll dive into what stateful operations are and introduce a simple State Store in the Simple Streaming framework. This allows the system not just to blindly process events, but to remember what happened — just like a database.

This is the crucial step that brings our streaming pipeline fully into the world of Kappa architecture.
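As a preview of where Day 12 is headed, a State Store can be as simple as a key-value map with `get`/`put` (a hypothetical sketch; the actual interface in the Simple Streaming framework may differ). Even this much is enough to express a stateful GROUP BY COUNT:

```python
class StateStore:
    """Minimal in-memory key-value state store (hypothetical API sketch).
    This is what gives a streaming operator its 'memory'."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value

# A stateful GROUP BY COUNT built on top of the store:
def count_by_key(store, event_key):
    store.put(event_key, store.get(event_key, 0) + 1)
    return store.get(event_key)

store = StateStore()
count_by_key(store, "latte")
print(count_by_key(store, "latte"))  # 2
```

Production systems add persistence and fault tolerance on top of this idea, but the core abstraction is just keyed reads and writes between events.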
