Data is the critical resource of our age. Today, organizations work with massive amounts of data — generated by myriad applications — which they often want to use for data analytics and business intelligence (BI) systems to help drive decisions that lead to growth.
Change data capture (CDC) is a process that captures changes made in a database, and ensures that those changes are replicated to a destination such as a data warehouse.
CDC decreases the resources required for the ETL process: instead of moving the entire dataset, it ingests only the data that has changed since the previous ETL operation, either by reading a source database’s binary log (binlog) or by relying on trigger functions.
Why data replication is necessary
Transactional database data must be replicated into a BI database or data warehouse because analytic databases are far more performant than transactional databases for executing large analytic workloads. In a transactional database, data is stored, updated, and accessed by row; an analytic database, by contrast, has a column-based structure, where each column of data is stored together. This design makes it fast to operate on large sets of data within a given column.
One way to replicate a source database to a destination is to do a full replication — or bulk load update — of the source to the destination. In batch processing, the system groups and processes data according to a schedule, or when certain conditions are met. This method is simple and relatively easy to implement, and it keeps the destination updated with the latest records. The downside of this method is that it uses a lot of system resources, making it impractical for large data sets.
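As a minimal sketch of why full replication is resource-intensive, the snippet below re-copies every source row to the destination on each run, no matter how little has changed. The in-memory tables and row values are invented for illustration.

```python
# Sketch of full (bulk load) replication: each run copies the entire
# source table to the destination, regardless of what has changed.
# Python dicts and lists stand in for real database tables.

def full_replicate(source_rows, destination):
    """Replace the destination's contents with a full copy of the source."""
    destination.clear()
    destination.update({row["id"]: row for row in source_rows})
    return len(source_rows)  # rows transferred: always the whole dataset

source = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]
warehouse = {}

transferred = full_replicate(source, warehouse)
# Even if only one row changed since the last run, every row moves again,
# which is why this approach scales poorly for large datasets.
```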
Change data capture methods explained
The more data there is, the more complicated the replication becomes, because new data is constantly being added, and existing data is constantly changing. Every action taken by a business’s customers, employees, suppliers, and partners is potentially one or more rows of data to replicate. In contrast to bulk data updates, continuous CDC results in faster updates and more efficient scaling as more data becomes available for analysis.
You can accomplish CDC with continuous streaming for real-time updates, or asynchronously by looking at the source periodically and replicating the changed data.
| | Log-based CDC | Trigger-based CDC | Building a CDC script |
| --- | --- | --- | --- |
| **Benefits** | Reliability: No missed changes. No impact on the transactions at the source application, and no additional SQL load on the system. Continuous CDC monitors changes in the transaction log and streams changes to the destination in real time. | Implementation: Developer can define triggers to capture changes and then generate a changelog. Reliability: Considered reliable and detailed; changes can be captured almost as fast as they are made. | Implementation: Implemented by developer at the SQL level. |
| **Drawbacks** | Implementation: Works only with databases that support log-based CDC. | Implementation: Triggers must be defined for each table, which may impact resources if multiple tables must be replicated. Reliability: Triggers may be disabled when certain operations take place; users may turn off triggers; triggers capture changes only to the data, not to the table definition. System impact: Transactions in the application may be slowed by the extra load of trigger-based capture, and the CDC process must go to the source database at intervals, which puts an additional load on the system. | Implementation: Building and maintaining a script may be challenging. Reliability: If there are table schema changes, data integrity could be an issue. System impact: Data is retrieved from the source database, which can put an additional load on the system. |
Enterprise databases store all changes in a transaction log for system recovery in the event of a crash. Every change users make to the data is written to the transaction logs. A log-based CDC solution reads the changes that go into the logs and pushes them to the destination data warehouse in real time.
Log-based CDC provides high reliability. The transaction logs exist to ensure that the database can recover to a consistent state so there are no missed changes. Additionally, log-based CDC is an asynchronous process. The transaction logs exist separately from the database records (which means the process has no impact on the source database transactions) and there is no need to write additional procedures that put more of a load on the system.
Log-based, continuous CDC operates with minimal latency. It monitors the changes in the transaction log, and ensures that those changes are streamed to the destination (or multiple destinations).
A drawback of log-based CDC is that log formats are proprietary, and many lack sufficient documentation. Another possible drawback is that transaction logs are archived frequently, and CDC software must read the logs before they are archived, or must be able to read the archived logs.
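The log-reading loop described above can be sketched in Python, with an in-memory list standing in for the transaction log. (Real binlogs are binary, proprietary, database-specific formats; the entries and field names here are invented.)

```python
# Sketch of log-based CDC: read change events from a transaction log
# and apply them to a destination table, remembering the position so
# the next poll resumes where this one stopped.

def apply_log_entries(log, destination, start_position=0):
    """Replay log entries from start_position; return the new position."""
    for position in range(start_position, len(log)):
        entry = log[position]
        if entry["op"] in ("insert", "update"):
            destination[entry["id"]] = entry["row"]
        elif entry["op"] == "delete":
            destination.pop(entry["id"], None)
    return len(log)  # high-water mark for the next read

transaction_log = [
    {"op": "insert", "id": 1, "row": {"name": "alice"}},
    {"op": "update", "id": 1, "row": {"name": "alicia"}},
    {"op": "insert", "id": 2, "row": {"name": "bob"}},
    {"op": "delete", "id": 2, "row": None},
]
warehouse = {}
position = apply_log_entries(transaction_log, warehouse)
# The destination ends up consistent with the source without ever
# querying the source tables directly.
```

Because the loop reads from the log rather than from the source tables, it imposes no extra load on the application's transactions, which is the key benefit noted above.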
Triggers are software functions written to capture changes based on events. Most triggers run when changes are made to a table’s data, using SQL syntax such as “BEFORE UPDATE” or “AFTER INSERT.”
Triggers can impede performance because they run on the database while data changes are being made. With every transaction, changes are recorded in a separate table (as well as in the transaction log), so the system is slowed by the extra load of trigger-based capture.
Latency also may be an issue. At specified intervals, the CDC code retrieves the changes from the trigger tables, which puts a load on the system. Also, triggers must be defined for each table, which may impact operational resources if multiple tables must be replicated.
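The mechanics of trigger-based capture can be sketched in Python, with a callback standing in for an AFTER INSERT / AFTER UPDATE trigger that writes each change to a separate changelog table. All table and function names here are invented for illustration.

```python
# Sketch of trigger-based CDC: every write to the main table also
# fires a "trigger" that records the change in a changelog table,
# which is the per-transaction overhead described above.

changelog = []  # stands in for the separate trigger-populated table

def after_write_trigger(table_name, row_id, new_row):
    """Invented stand-in for an AFTER INSERT / AFTER UPDATE trigger."""
    changelog.append({"table": table_name, "id": row_id, "row": new_row})

def write_row(table, table_name, row_id, new_row):
    table[row_id] = new_row                           # the normal write...
    after_write_trigger(table_name, row_id, new_row)  # ...plus trigger cost

customers = {}
write_row(customers, "customers", 1, {"name": "alice"})
write_row(customers, "customers", 1, {"name": "alicia"})

# Later, at a specified interval, the CDC job drains the changelog,
# which is the additional load on the source system noted above.
pending = changelog[:]
changelog.clear()
```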
Write a CDC script
A developer can create a CDC solution at the application level that relies on a key field or fields to indicate that data in a row has changed. This might require changing the schema to add a datetime field to indicate when the record was created or updated, a version number, or a boolean status indicator.
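A sketch of that approach, assuming the schema has been extended with an `updated_at` datetime column (the table, column names, and values are invented):

```python
# Sketch of application-level CDC using a key field: rows carry an
# "updated_at" timestamp, and each run replicates only the rows
# modified since the previous run's high-water mark.
from datetime import datetime

def changed_since(rows, last_sync):
    """Return rows whose updated_at is newer than the last sync time."""
    return [row for row in rows if row["updated_at"] > last_sync]

orders = [
    {"id": 1, "total": 10, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "total": 25, "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "total": 40, "updated_at": datetime(2023, 1, 9)},
]

last_sync = datetime(2023, 1, 3)
delta = changed_since(orders, last_sync)  # only the rows changed since then
# Advance the high-water mark for the next run.
last_sync = max(row["updated_at"] for row in delta)
```

Note one limitation of this sketch that motivates the reliability caveat above: a timestamp column alone cannot detect deleted rows, so a production script would need a soft-delete flag or similar marker as well.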
Optimize ETL with CDC
In general, log-based continuous CDC is highly reliable. Capturing database changes with event-based triggers is mostly reliable and easy for a software engineer to set up. Writing a CDC script at the application level requires more development time than other techniques, and improper code could result in changes being misrecorded or not captured at all.
Stitch integrations require no coding, and Stitch supports CDC through binary log files in MySQL, PostgreSQL, and Oracle databases. The Stitch platform is reliable, extensible, and can support nearly any data source.
Stitch is free to try, and setup takes minutes. Sign up now and start replicating your data today.