Modern real-time data systems rely heavily on Change Data Capture (CDC) to power downstream stream processing. CDC continuously captures INSERT, UPDATE, and DELETE operations from databases, allowing downstream systems to subscribe to these event streams to:
Build real-time reports and monitoring dashboards
Drive risk control and alerting logic
Synchronize data to data warehouses, search engines, or caches
Support applications like real-time analytics, recommendations, and trade matching
However, a critical issue is often overlooked: transactional boundaries.
Database writes are transactional. But when these changes are broken down into individual CDC events, downstream systems might split a single transaction, leading to the processing and exposure of "half-finished" data. Such transient inconsistencies can cause fluctuating reports, false alarms, and even the ingestion of incomplete data into data warehouses.
This article details the common problems with transactional boundaries when processing CDC streams. We will then explore how to recognize and respect these boundaries in practice to guarantee atomicity and consistency, ensuring that materialized views and join queries always reflect the true, complete state of the database.
Two Examples
Consider a typical payment scenario where a user completes a transfer, and the system executes a transaction to:
Debit an account
Credit another account
Insert a record into an audit table
If we send each UPDATE and INSERT event downstream as soon as it's received, a materialized view might see the debit before the credit, or the audit log before the balances are updated. For real-time risk control and monitoring systems, this could mean:
False alarms: A risk control rule sees an anomalous balance after the debit but before the credit, mistakenly flagging it as a fund loss.
Incorrect analysis: A real-time report shows a decrease in total assets, leading to wrong decisions by operations or traders.
Data corruption: Writing data to a warehouse mid-transaction might only capture half of the transaction, breaking consistency and foreign key constraints.
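To make the hazard concrete, here is a minimal Python sketch of naive per-event processing of the transfer above. The account names and amounts are invented for illustration; the point is that a downstream rule reading state between events observes a total that never existed in the database:

```python
# Simulate naive per-event CDC processing of a single transfer transaction.
# Account names and amounts are illustrative, not from any real schema.

balances = {"alice": 100, "bob": 50}
TOTAL = sum(balances.values())  # invariant: total assets never change

# The transfer is one database transaction, but CDC emits three events.
events = [
    ("UPDATE", "alice", -30),    # debit
    ("UPDATE", "bob", +30),      # credit
    ("INSERT", "audit_log", 0),  # audit record (no balance change)
]

violations = 0
for op, key, delta in events:
    if op == "UPDATE":
        balances[key] += delta
    # A downstream risk rule reads state after EVERY event:
    if sum(balances.values()) != TOTAL:
        violations += 1  # "fund loss" false alarm fires here

print(violations)                       # 1: debit visible before credit
print(sum(balances.values()) == TOTAL)  # True: consistent only at the end
```

The single violation is exactly the transient "total assets shrank" state that triggers false alarms.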
Let's look at another join scenario. Suppose we have a real-time job that joins an orders table with a payments table to calculate Gross Merchandise Volume (GMV):
SELECT o.order_id, o.user_id, p.amount
FROM orders o
JOIN payments p ON o.order_id = p.order_id;
If a transaction first inserts an order and then a payment record, but the CDC stream is split by a Barrier—with the order landing in one Epoch and the payment in the next—the join's intermediate state could return an empty result for a cycle because the payment hasn't arrived yet. A downstream dashboard would show a temporary drop in GMV, triggering unnecessary alerts or manual investigations.
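A simplified Python model of this split makes the dip visible. The table and column names follow the query above; the epoch mechanics are heavily abstracted, and the order ID and amount are invented:

```python
# Model a streaming join whose input is chopped into epochs by barriers.
# If an order and its payment land in different epochs, the epoch in
# between sees an order with no matching payment, so GMV drops to zero.

orders, payments = {}, {}  # join state, keyed by order_id

def gmv():
    # Inner join on order_id, summing payment amounts for matched orders.
    return sum(p for oid, p in payments.items() if oid in orders)

# Epoch N: only the order insert made it in before the barrier.
orders["o1"] = {"user_id": "u1"}
gmv_epoch_n = gmv()   # the payment has not arrived yet

# Epoch N+1: the payment insert arrives.
payments["o1"] = 42
gmv_epoch_n1 = gmv()

print(gmv_epoch_n)    # 0  -> dashboard shows a temporary GMV drop
print(gmv_epoch_n1)   # 42 -> corrected one epoch later
```

The zero is not wrong data in the database; it is a state the database never committed, manufactured by the epoch boundary.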
This problem is particularly critical in domains like finance, payments, e-commerce, and logistics:
Transaction Monitoring: Trade matching, payment processing, and risk validation must be based on complete transactions, not partial data.
Real-time Reporting: Metrics like GMV, inventory levels, and warehouse movements require transactional consistency; otherwise, they will fluctuate, misleading operations.
Risk Control & Alerting: Partial transactions can trigger false alarms, leading to frozen funds or intercepted orders, which negatively impacts the user experience.
Therefore, preserving transactional atomicity and order within CDC streams is fundamental to ensuring the reliability of real-time computation and analysis.
Barriers and Epochs: The "Commit Points" of Stream Processing
To understand the problem of transactional boundaries, we must first understand how stream processing systems organize data and ensure consistency. Unlike batch processing, stream processing deals with an infinite stream of events, making it impossible to simply "commit" everything at once as in a database transaction. To achieve consistency and fault tolerance, these systems introduce the concept of a Barrier.
A Barrier is a special control message injected into the data stream. It carries no business data but propagates through the entire stream graph. When an operator receives a barrier, it first processes all data in its buffer, then snapshots its state, and finally passes the barrier downstream. Once all operators in the topology have received the barrier, it signifies that all data up to that point has been fully processed and can be safely made visible. The current state can also be persisted as a consistent checkpoint.
The data between two consecutive barriers is called an Epoch. You can think of it as a "micro-batch" in stream processing:
Within an Epoch, an operator's state is continuously updated, but the output is buffered until the epoch is complete.
The boundary of an Epoch is the barrier, which acts like a commit point in a database.
This design allows stream processing systems to achieve exactly-once semantics (each event's effect is applied to state exactly once) and recover from the latest checkpoint after a failure, avoiding data loss or duplicate processing.
However, Barriers are typically injected based on a fixed schedule (e.g., every second) or event count (e.g., after every N records). They are completely unaware of when a transaction in the upstream database begins or ends. If a transaction's updates span a barrier, they are split into two Epochs, causing temporary inconsistencies in downstream states and materialized views. This is the "half-a-transaction" problem.
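The mismatch can be sketched in a few lines of Python. The timestamps and barrier interval are made up; the point is purely that fixed-interval barriers assign a transaction's events to epochs with no regard for where the transaction begins or ends:

```python
# Barriers are injected every INTERVAL_MS, oblivious to transactions.
# Each event is stamped with the epoch it falls into; a transaction
# whose events straddle a barrier ends up split across two epochs.

INTERVAL_MS = 1000  # illustrative barrier interval

def epoch_of(ts_ms):
    # Epoch = how many barriers have been injected before this event.
    return ts_ms // INTERVAL_MS

# One transaction: debit at t=950ms, credit at t=1010ms.
txn_events = [("debit", 950), ("credit", 1010)]

epochs = {name: epoch_of(ts) for name, ts in txn_events}
print(epochs)                     # debit -> epoch 0, credit -> epoch 1
print(len(set(epochs.values())))  # 2: the transaction was split
```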
The Technical Challenge: Conflicts Between Barriers and Transactions
We now understand the roles of Barriers and Epochs—they guarantee exactly-once semantics and recoverability, but they are oblivious to upstream transactional boundaries. Herein lies the problem: Barriers are injected periodically, while transaction commits are unpredictable. If a transaction spans two Barriers, it gets split into two Epochs.
Imagine a typical bank transfer: first a debit, then a credit, then an audit log entry. If the debit event falls into one Epoch, while the credit and log events fall into the next, downstream materialized views and joins will show incorrect results for a period: Account A's balance decreases, Account B's balance remains unchanged, and the total assets mysteriously shrink. This not only triggers false risk alerts but can also corrupt the downstream data warehouse by causing ETL jobs to write incomplete data.
Cross-table transactions make things even worse. CDC often splits changes from different tables into separate, parallel streams. Without unified coordination, a join operator might see an update from Table A before seeing the corresponding update from Table B, causing the join to return empty or partially matched results in some windows. For GMV calculations, order-payment reconciliation, or inventory-shipment consistency analysis, such transient inconsistencies are unacceptable.
Therefore, the real challenge is this: how can we aggregate all events of a transaction into the same Epoch to ensure atomic commits, without breaking the fault-tolerance semantics of stream processing? This also requires handling cross-table transactions by merging changes from multiple tables into a single logical stream; otherwise, downstream joins cannot guarantee consistency.
RisingWave's Solution: Perceiving and Respecting Transactional Boundaries
To ensure that CDC streams correctly reflect the transaction semantics of the upstream database, RisingWave's design considers several key aspects: ensuring input stream order, preventing Barriers from splitting transactions, and handling cross-table transactions, schema evolution, and fault tolerance. Let's dive into how RisingWave tackles these challenges.
1. Acquiring a Strictly Ordered Transaction Stream
The first step is to ensure the input stream is correctly ordered. While the change log of an upstream database is inherently ordered, middleware like a Debezium Connector with Kafka can split changes from different tables into different topics. With transaction metadata in a separate topic, the downstream system must reassemble the transaction, which adds complexity and introduces risks of out-of-order events and delays.
RisingWave embeds the Debezium Embedded Engine, running the parsing logic directly within the connector node. This provides a single, strictly linear stream of events whose order is identical to the commit order of the upstream database, complete with transaction markers like BEGIN, COMMIT, and ROLLBACK. As a result, the downstream system doesn't need any extra "assembly" logic; all transaction events are naturally complete and contiguous.
Technically, this step eliminates the complexities of multi-threaded consumption and multi-topic merging, allowing for precise perception of transactional boundaries. This is the foundation for everything that follows.
2. Pausing Barriers to Keep Transactions in the Same Epoch
With an ordered transaction stream, the next problem is preventing barriers from splitting transactions. RisingWave's approach is straightforward: pause the propagation of barriers during a transaction.
Specifically, when the Source Executor receives a BEGIN message, it enters a "transaction mode." All incoming events are buffered in memory instead of being immediately dispatched downstream. If a Barrier arrives mid-transaction, its propagation is deferred until the transaction completes. Only upon receiving a COMMIT message are all buffered events atomically flushed and the deferred Barrier released. Downstream operators then immediately consume these events, update their state, and generate a new Epoch snapshot.
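This buffering behavior can be sketched as a small Python state machine. The message and marker names are simplified, and the real Source Executor is considerably more involved; this only illustrates the "buffer on BEGIN, defer barriers, flush on COMMIT" idea:

```python
# Sketch of a source that buffers transaction events and defers barriers,
# so that every transaction lands in exactly one epoch.

class TxnAwareSource:
    def __init__(self):
        self.in_txn = False
        self.buffer = []           # events of the in-flight transaction
        self.pending_barriers = []
        self.emitted = []          # list of (events, barriers) emissions

    def on_event(self, ev):
        if ev == "BEGIN":
            self.in_txn = True
        elif ev == "COMMIT":
            # Flush atomically, then release any deferred barriers.
            self.in_txn = False
            events, self.buffer = self.buffer, []
            barriers, self.pending_barriers = self.pending_barriers, []
            self.emitted.append((events, barriers))
        elif self.in_txn:
            self.buffer.append(ev)       # hold until COMMIT
        else:
            self.emitted.append(([ev], []))

    def on_barrier(self, b):
        if self.in_txn:
            self.pending_barriers.append(b)  # defer: don't split the txn
        else:
            self.emitted.append(([], [b]))

src = TxnAwareSource()
for msg in ["BEGIN", "debit", "credit"]:
    src.on_event(msg)
src.on_barrier("barrier-1")   # arrives mid-transaction: deferred
src.on_event("audit_log")
src.on_event("COMMIT")        # debit, credit, audit flushed together

print(src.emitted)            # one atomic emission, barrier after it
```

All three events reach downstream in a single emission, with the barrier released only afterward, so the whole transaction sits inside one epoch.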
The key advantages of this design are its simplicity and correctness:
Atomicity: All events within a transaction are either completely invisible or become visible all at once.
Consistency: Materialized views and joins always reflect the true state of the database, without any intermediate "half-transaction" states.
Recoverability: Since Barriers only advance after a transaction is complete, checkpoints always align with transaction boundaries. Upon recovery, transactions won't be lost or re-applied.
Of course, this approach has a trade-off: a particularly large transaction can delay Barrier propagation, causing an Epoch to last longer than usual, which in turn delays checkpoints and downstream visibility. However, we believe this is a reasonable trade-off. A delay of a few hundred milliseconds, or even a few seconds, is more acceptable than exposing partial transactions.
3. Multi-Table CDC Source for Cross-Table Transaction Guarantees
Ensuring single-table transactionality is not enough. In real-world business scenarios, multi-table transactions are common: updating an order's status, inserting a payment record, and modifying inventory all within a single transaction. If these tables are ingested using different sources, the events within the transaction will be scattered across different streams, and downstream joins cannot guarantee consistency.
To address this, RisingWave introduces the multi-table CDC source. Users can specify multiple tables within a single source, and RisingWave will use one connection to subscribe to their change logs. This way, all events within a transaction appear in the same physical stream with their order fully preserved.
For example:
CREATE SOURCE my_cdc WITH (
connector = 'postgres-cdc',
database.name = 'mydb',
table.name = 'public.orders, public.payments, public.inventory'
);
CREATE TABLE orders (...) FROM SOURCE my_cdc TABLE 'public.orders';
CREATE TABLE payments (...) FROM SOURCE my_cdc TABLE 'public.payments';
CREATE TABLE inventory (...) FROM SOURCE my_cdc TABLE 'public.inventory';
Now, if a transaction updates all three tables simultaneously, these changes will be buffered and flushed together, landing in the same Epoch. Downstream joins and aggregations will see an atomically consistent result.
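A small Python contrast illustrates why the single physical stream matters. The change records are invented; the point is that per-table sources shred one commit-ordered log into independent streams that must be re-coordinated, while a multi-table source keeps the transaction contiguous:

```python
# One cross-table transaction, in commit order (contents illustrative):
log = [
    ("orders",    "UPDATE o1 SET status = 'paid'"),
    ("payments",  "INSERT p1 amount = 42"),
    ("inventory", "UPDATE sku9 SET qty = qty - 1"),
]

# Per-table sources: three independent streams, each with its own
# connection and barriers; cross-table ordering is lost.
per_table = {}
for table, change in log:
    per_table.setdefault(table, []).append(change)
print(len(per_table))  # 3 streams that downstream must re-coordinate

# Multi-table source: one physical stream, one connection, one barrier
# sequence; the transaction stays a contiguous block.
merged = [change for _, change in log]
print(merged)
```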
4. Streaming Graph Implementation Details
From an implementation perspective, this means that when RisingWave builds a streaming graph, it places the source node and all its dependent materialized views into the same "transaction domain." A Barrier is intercepted upon entering this domain and only propagates downstream after the transaction is complete. This ensures the integrity of the transaction and keeps checkpoints consistently aligned.
RisingWave's Meta Service manages the lifecycle of these "transaction domains." During a DDL transaction, if a user creates multiple tables that reference the same CDC source, the system recognizes this and runs them under the same physical source job, sharing a single connection and barrier stream.
5. Schema Evolution and Transactional Boundaries
Transaction awareness also simplifies schema evolution. When an upstream table's schema changes (e.g., adding a new column), a user can execute ALTER TABLE in RisingWave. The system will replan the execution graph at the next transaction boundary, ensuring the schema change and data processing remain consistent. This avoids the awkward situation where half a transaction uses the old schema and the other half uses the new one.
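A toy Python sketch shows the boundary-aligned replan. The schema representation (a list of column names) and event names are purely illustrative; what matters is that a pending change takes effect only between transactions, never inside one:

```python
# Apply a pending schema change only at transaction boundaries, so no
# single transaction is processed half with the old schema and half
# with the new one. (Schema representation is purely illustrative.)

schema = ["order_id", "amount"]
pending_alter = None
applied_with = []  # records the schema width each event was processed with

def request_alter(new_schema):
    global pending_alter
    pending_alter = new_schema   # takes effect at the next boundary

def process_txn(events):
    global schema, pending_alter
    if pending_alter:            # boundary: replan before the next txn
        schema, pending_alter = pending_alter, None
    for ev in events:            # the whole transaction sees one schema
        applied_with.append((ev, len(schema)))

process_txn(["txn1-ev1", "txn1-ev2"])               # old schema
request_alter(["order_id", "amount", "currency"])   # ALTER arrives
process_txn(["txn2-ev1", "txn2-ev2"])               # new schema

print(applied_with)  # txn1 entirely on 2 columns, txn2 entirely on 3
```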
6. Trade-offs: Large Transactions and Latency
While pausing Barriers guarantees transactional atomicity, it has an inevitable side effect: if a transaction is very large, the time needed to receive and buffer all its events might exceed a normal Barrier interval. This forces events that could have been spread across multiple epochs to be processed atomically in a single, longer epoch, thus increasing the system's end-to-end latency.
We evaluated several options during the design phase:
Propagate Barriers early: Send the Barrier downstream first and then send the data after the transaction commits. This maintains a fixed checkpoint interval but exposes an incomplete transaction downstream, destroying consistency.
Optimistic execution with transaction rollback: Send events downstream for preliminary computation, confirm visibility upon transaction commit, or roll back otherwise. This requires almost all stateful operators to support multi-version storage and rollback logic, which is extremely complex.
Strict buffering with delayed Barriers: Simple and direct, ensuring a one-to-one alignment between transactions and Epochs at the cost of increased latency.
RisingWave chose the third option because it guarantees consistency without introducing additional state management complexity. We found that in real-world scenarios, large and cross-table transactions are limited, and most transactions complete within a normal Barrier interval, so the added latency is acceptable. For the rare, extremely large transaction, trading latency for atomicity is a reasonable engineering trade-off.
7. Recovery and Fault Tolerance
Another benefit of this design is simpler recovery logic. When the system fails, RisingWave recovers from the snapshot corresponding to the most recent Barrier, which is always located after a transaction boundary. This ensures that committed transactions are not re-applied and that no partial transaction data is lost upon recovery. In contrast, if a Barrier were to land in the middle of a transaction, recovery would require special logic to skip events that have already been sent, making the process complex and error-prone.
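The recovery invariant can be sketched in a few lines of Python. The log contents and crash point are invented; the point is that checkpoint offsets only ever sit just after a COMMIT, so replay always resumes at a transaction boundary:

```python
# Because checkpoints align with transaction boundaries, recovery simply
# resumes from the last checkpointed offset: no committed transaction is
# re-applied, and no transaction is replayed half-way.

log = [
    "BEGIN", "debit", "credit", "COMMIT",   # txn 1
    "BEGIN", "insert", "audit", "COMMIT",   # txn 2
]

# Checkpoints are only taken at offsets just after a COMMIT.
checkpoint_offsets = [i + 1 for i, m in enumerate(log) if m == "COMMIT"]

last_checkpoint = checkpoint_offsets[0]   # crash happened after txn 1
replayed = log[last_checkpoint:]          # recovery re-reads from here

print(replayed[0])   # "BEGIN": replay starts at a transaction boundary
print(replayed)      # exactly txn 2, whole and unduplicated
```

Had a checkpoint landed mid-transaction, replay would have to skip already-emitted events; aligning checkpoints with COMMITs makes that logic unnecessary.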
Conclusion: Making CDC Reliable
Correctly processing CDC data isn't just about pushing individual change events downstream; it's about ensuring that the combination of these changes is consistent with the true state of the database. Without respecting transactional boundaries, downstream systems may see intermediate states that never actually existed, leading to fluctuating reports, false risk alerts, data inconsistencies, and even business incidents.
RisingWave's design philosophy is simple:
Use the Debezium Embedded Engine to obtain a strictly ordered transaction stream.
Pause Barriers during transactions to align transactions with Epochs.
Support multi-table CDC sources to handle cross-table transactions within a single logical stream.
This way, materialized views and join queries always reflect the true state of the database; what you see at any moment is a complete, atomic result. This enhances data reliability and makes real-time analytics, risk control, trade matching, and monitoring scenarios dependable.
For engineers, this means you can confidently use RisingWave to drive real-time reports, risk control, and trading decisions without worrying about "half a transaction" wandering through your system and breaking business logic. In other words, we bring the atomicity of databases to the streaming world.
Try RisingWave Today
Download the open-source version of RisingWave to deploy on your own infrastructure.
Get started quickly with RisingWave Cloud for a fully managed experience.
Talk to Our Experts: Have a complex use case or want to see a personalized demo? Contact us to discuss how RisingWave can address your specific challenges.
Join Our Community: Connect with fellow developers, ask questions, and share your experiences in our vibrant Slack community.