Change Data Capture (CDC)
Change Data Capture (CDC) is a set of software design patterns used to determine and track the data that has changed in a source data store (typically a database) so that action can be taken using the changed data. CDC captures row-level changes (INSERTs, UPDATEs, DELETEs) as they happen and makes these change events available in a stream format suitable for consumption by downstream systems.
Instead of performing resource-intensive bulk data loads or frequent polling queries on the source database, CDC allows applications to react to data modifications in near real-time, efficiently, and reliably.
The Problem: Accessing Database Changes
Traditionally, getting data changes out of a database involved methods like:
- Timestamp Columns: Querying tables for rows with an 'updated_at' timestamp greater than the last check time. This misses deletes and can be inefficient on large tables.
- Triggers: Adding database triggers to capture changes into separate changelog tables. This adds overhead to the source database transactions.
- Batch Diffing: Periodically comparing snapshots of the source table to identify changes. This introduces latency and can be computationally expensive.
- Application-Level Dual Writes: Modifying applications to write changes both to the database and to a message queue. This increases application complexity and risks inconsistencies if one write fails.
These methods often suffer from performance issues, inaccuracy (missing deletes), added load on the source system, or tight coupling between applications.
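The "missing deletes" pitfall of timestamp polling can be shown with a small sketch. The table layout, column name `updated_at`, and `poll_changes` helper below are all illustrative, not any particular database's API:

```python
from datetime import datetime, timedelta

# Simulated source table: rows carrying an 'updated_at' column (illustrative only).
t0 = datetime(2024, 1, 1, 12, 0, 0)
table = {
    1: {"name": "alice", "updated_at": t0},
    2: {"name": "bob", "updated_at": t0},
}

def poll_changes(table, last_checked):
    """Timestamp-based polling: return rows touched since last_checked."""
    return [row for row in table.values() if row["updated_at"] > last_checked]

last_checked = t0

# An UPDATE is visible to the next poll...
table[1]["updated_at"] = t0 + timedelta(seconds=30)

# ...but a hard DELETE removes the row entirely, leaving nothing to match.
del table[2]

changed = poll_changes(table, last_checked)
print([r["name"] for r in changed])  # only 'alice'; bob's delete is invisible
```

The poll sees the update but has no way to learn that bob's row ever existed, which is exactly the gap log-based CDC closes.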
How CDC Works: Common Approaches
CDC solutions typically work by tapping into the database's internal mechanisms for tracking changes, minimizing the impact on the source system's performance:
- Log-Based CDC (Most Common): This approach reads the database's transaction log (e.g., PostgreSQL's WAL, MySQL's Binlog, SQL Server's Transaction Log, MongoDB's Oplog). The transaction log contains a detailed, ordered record of all changes made to the database. CDC tools parse these logs, interpret the low-level log entries, and reconstruct them into meaningful row-level change events (often in a standardized format like the Debezium event format). This is generally the lowest-impact and most comprehensive method.
- Trigger-Based CDC: Uses database triggers, as mentioned earlier, but often managed by a dedicated CDC tool.
- Query-Based CDC: Some systems might still use timestamp columns or queryable database features (like SQL Server's Change Tracking) but often face limitations compared to log-based approaches.
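The core mechanics of the log-based approach can be sketched in a few lines: the database appends every committed change to an ordered log, and the CDC reader tracks its position in that log so it can resume after a restart. The log entries and LSN (log sequence number) fields below are invented for illustration and do not match any real database's log format:

```python
# Simulated append-only transaction log, in commit order.
transaction_log = [
    {"lsn": 1, "op": "INSERT", "table": "users", "row": {"id": 1, "name": "alice"}},
    {"lsn": 2, "op": "UPDATE", "table": "users", "row": {"id": 1, "name": "alicia"}},
    {"lsn": 3, "op": "DELETE", "table": "users", "row": {"id": 1}},
]

def read_changes(log, from_lsn):
    """Yield change events strictly after the given log position, in order."""
    for entry in log:
        if entry["lsn"] > from_lsn:
            yield entry

# First read consumes everything; the reader then checkpoints the last LSN
# seen, so a restart resumes exactly where it left off instead of re-reading.
events = list(read_changes(transaction_log, from_lsn=0))
checkpoint = events[-1]["lsn"]
print(len(events), checkpoint)  # 3 events, checkpoint at LSN 3
```

Because the log is ordered and includes deletes, this approach captures every change exactly once per checkpointed position, with no polling load on the source tables.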
Popular CDC Tools:
- Debezium: A leading open-source distributed platform for log-based CDC, supporting various databases (PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, etc.). It typically publishes change events to Kafka.
- Vendor-Specific Tools: Many database vendors offer their own CDC features or tools (e.g., Oracle GoldenGate, AWS DMS, GCP Datastream).
Output Format: Change events usually contain information like:
- The type of operation ('c' for create/insert, 'u' for update, 'd' for delete, 'r' for read/snapshot).
- The state of the row before the change (for updates/deletes).
- The state of the row after the change (for inserts/updates).
- Source metadata (database, table, timestamp, transaction ID, log position).
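A change event shaped loosely after the Debezium envelope might look like the following. The field names (`op`, `before`, `after`, `source`) follow Debezium's convention, but the payload values and the `describe` helper are invented for illustration:

```python
# An update event: 'before' holds the old row state, 'after' the new one.
update_event = {
    "op": "u",  # 'c' insert, 'u' update, 'd' delete, 'r' snapshot read
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "source": {"db": "shop", "table": "customers", "ts_ms": 1700000000000, "lsn": 901234},
}

def describe(event):
    """Render a human-readable summary of a change event."""
    labels = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}
    src = event["source"]
    return f'{labels[event["op"]]} on {src["db"]}.{src["table"]}'

print(describe(update_event))  # update on shop.customers
```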
Key Benefits
- Low Latency: Captures changes as they occur, enabling near real-time data propagation.
- Low Source Impact: Log-based CDC typically has minimal performance overhead on the source database compared to frequent polling or triggers.
- Reliability & Completeness: Captures all changes (inserts, updates, deletes) accurately based on the transaction log.
- Decoupling: Downstream consumers react to the change stream without directly querying or impacting the source database.
- Efficiency: Transmits only the changes, not entire tables or partitions.
Common Use Cases
- Real-time Analytics: Feeding database changes into stream processors or data warehouses for immediate analysis.
- Data Replication: Replicating data to other databases, caches, or search indexes.
- Cache Invalidation: Triggering updates or invalidations in caches based on database changes.
- Data Synchronization: Keeping multiple systems synchronized.
- Audit Logging: Creating detailed audit trails of data modifications.
- Streaming ETL/ELT: Building real-time data pipelines originating from operational databases.
- Microservices Data Exchange: Enabling data sharing between microservices based on database events.
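Several of these use cases (replication, cache sync, search indexing) reduce to the same pattern: apply each change event to a downstream store keyed by primary key. Here is a minimal, hypothetical sketch of a consumer keeping a local cache in sync; the event shape is simplified for illustration:

```python
# Hypothetical downstream consumer: keep a key-value cache in sync with a
# source table by applying CDC events in order.
cache = {}

def apply_event(cache, event):
    """Apply one row-level change event to the cache (upsert or evict)."""
    key = event["key"]
    if event["op"] in ("c", "u", "r"):   # insert/update/snapshot read: upsert
        cache[key] = event["after"]
    elif event["op"] == "d":             # delete: evict the stale entry
        cache.pop(key, None)

events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "alice"}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "bob"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "alicia"}},
    {"op": "d", "key": 2, "after": None},
]
for e in events:
    apply_event(cache, e)

print(cache)  # {1: {'id': 1, 'name': 'alicia'}}
```

Because deletes arrive as explicit events, the cache can evict stale entries instead of serving rows that no longer exist in the source.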
CDC and RisingWave
Change Data Capture is a primary mechanism for getting operational data from existing databases into RisingWave for real-time stream processing:
- Direct CDC Connectors: RisingWave offers built-in source connectors specifically designed for CDC from databases like PostgreSQL and MySQL. These connectors often implement log-based CDC directly, reading the database's transaction log (e.g., WAL) and converting changes into a stream consumable by RisingWave's SQL engine, without requiring an intermediary message broker like Kafka.
- Kafka-Based CDC (via Debezium): A very common pattern is to use Debezium to capture changes from a source database and publish them as Avro or JSON events to Kafka topics. RisingWave can then consume these Debezium-formatted events from Kafka using its standard Kafka source connector, recognizing the Debezium event structure.
Using CDC allows RisingWave to treat database tables effectively as streams, enabling powerful use cases like building real-time materialized views on top of operational database tables, joining database data with other streams, or sinking processed results to downstream systems as the source data changes.
Related Glossary Terms
- Debezium
- Transaction Log (Conceptual)
- Real-time Data Processing
- Event Streaming Platform (ESP)
- Apache Kafka
- RisingWave Source
- Streaming ETL/ELT
- Microservices (in Streaming)