Join Design in Stream Processing Architecture

Our real-time dashboard was once like a simple coffee shop counter—it showed only how many cups were sold today (orders). Each morning, the owner would glance at the numbers and feel confident that business was going well.

Then the focus shifted to what was actually being sold—lattes or Americanos, large or small.

That’s the challenge we face: moving from basic order counts to deeper analysis based on detailed order data (orders_detail).

Note: All code in this article is pseudo-code for educational and conceptual illustration. Focus on understanding the architecture and logic, not on running the exact code. To keep today’s focus clear, we’ll simplify some implementation details from previous days and concentrate on the checkpoint mechanism itself.

The Three Layers of the Lambda Architecture

You may remember the Lambda architecture we discussed earlier:

Batch Layer: Processes historical data in full to ensure accuracy
Speed Layer: Handles real-time data streams to ensure low latency
Serving Layer: Merges results from both layers and serves queries

So far, our focus has been on how the Speed Layer handles real-time data efficiently. Now the question is: when the owner asks for more complex analytics, where should JOINs happen?

A Design Decision: Keeping the Speed Layer Pure

Think of a highway toll booth. If every car had to stop for a detailed inspection and paperwork, traffic would grind to a halt. The Speed Layer is like that toll booth—its job must stay simple.

Key principle: no JOINs in the Speed Layer.

The Speed Layer only does three things:

simple cleansing → fast writes → done.

When the owner opens the dashboard and clicks “View order details,” that’s when the magic happens—in the Serving Layer:

SELECT
    o.order_id,
    o.customer_id,
    o.order_time,
    o.total_amount,
    od.product_name,
    od.quantity,
    od.unit_price,
    SUM(od.quantity * od.unit_price) AS line_total
FROM
    orders AS o
JOIN
    orders_detail AS od ON o.order_id = od.order_id  -- JOIN happens here
GROUP BY
    o.order_id, od.product_name
ORDER BY
    o.order_time DESC;

The Real Challenge: Complex OLAP Logic

This architecture looks simple on the surface, but the real test is this: how strong are your SQL skills?

As the owner’s questions grow more sophisticated, the road quickly becomes rocky.

Scenario Upgrade: Advanced Business Questions

The owner says:

“I want to see sales trends for each product at different times of day—and segment them by customer type.”

Breaking Down a Complex JOIN Query

-- Complex multi-table JOIN + analytical query
WITH hourly_product_sales AS (
    SELECT
        od.product_name,
        DATE_FORMAT(o.order_time, '%H') AS hour_of_day,
        c.customer_type,
        SUM(od.quantity) AS qty_sold,
        SUM(od.quantity * od.unit_price) AS revenue,
        COUNT(DISTINCT o.customer_id) AS unique_customers
    FROM
        orders o
    JOIN
        orders_detail od ON o.order_id = od.order_id
    JOIN
        customers c ON o.customer_id = c.customer_id
    WHERE
        o.order_time >= DATE_SUB(NOW(), INTERVAL 30 DAY)
    GROUP BY
        od.product_name, hour_of_day, c.customer_type
),
product_rankings AS (
    SELECT
        product_name,
        hour_of_day,
        customer_type,
        qty_sold,
        revenue,
        unique_customers,
        ROW_NUMBER() OVER (
            PARTITION BY hour_of_day, customer_type
            ORDER BY revenue DESC
        ) AS sales_rank
    FROM hourly_product_sales
)
SELECT
    pr.product_name,
    pr.hour_of_day,
    pr.customer_type,
    pr.revenue,
    pr.sales_rank,
    ROUND(
        pr.revenue / SUM(pr.revenue) OVER (
            PARTITION BY pr.hour_of_day, pr.customer_type
        ) * 100,
        2
    ) AS revenue_percentage
FROM
    product_rankings pr
WHERE
    pr.sales_rank <= 5
ORDER BY
    pr.hour_of_day,
    pr.customer_type,
    pr.sales_rank;

And this is just the appetizer. Real-world OLAP reports often involve:

Deeply nested subqueries
Complex window functions
Crosstab and pivot analysis
Period-over-period comparisons and trend analysis

If you choose this architecture, you’ll need to master:

Performance Tuning

Table partitioning strategies
Composite indexing
Query execution plan analysis

Summary: A Common Choice for Early-Stage Teams

This story reflects a common approach many teams take when implementing a Lambda architecture: keep the Speed Layer simple, and push complex logic down to the database layer.

This “store first, JOIN later” strategy is a double-edged sword.

Pros

A minimal Speed Layer with fast processing
Lower development overhead and faster time to market
Independent scaling and operation of each layer

Real Challenges

Extremely high demands on SQL expertise
Difficulty building complex OLAP reports
Deep performance-tuning knowledge required

The owner now has access to richer analytical insights, while engineers handle the complexity behind the scenes with increasingly sophisticated SQL.

Day 10 Preview: Moving JOINs into Streaming

As you tune performance, you’ll quickly discover that JOINs are among the most expensive operations in a database.

In the data warehousing world, a common solution is the wide table: instead of joining at query time, you pre-flatten all required fields into a single table.

This idea isn’t limited to offline warehouses—it also applies to real-time processing. If we can join multiple tables during streaming and produce a real-time wide table, downstream queries become faster and lighter, significantly reducing OLAP pressure.

In the next article, we’ll explore how to design real-time wide tables in streaming systems, how they reduce JOIN overhead in the Serving Layer, and the trade-offs involved.