Log-based Incremental Replication is a replication method in which Stitch identifies modifications to records - including inserts, updates, and deletes - using a database’s binary log files. A binary log file is a record of events that occur within a database.
In this guide, we’ll cover:
- How it works (with examples),
- When it should be used,
- Limitations of this Replication Method, and
- How to enable it for your database integration
How Log-based Incremental Replication works
There are two types of binary log replication: Statement and row-based. Stitch uses a row-based approach, which means that when rows are modified via a supported event type, the entire row is written to the binary log file as a log message.
During a replication job, Stitch iterates over each message in sequence, or in the same order that the messages were written to the binary log. Data is extracted for any messages that correspond to tables that have been tracked for replication. This means that all tables using Log-based Incremental Replication are extracted from the source database simultaneously as one seamless process.
To ensure only new and updated data is selected for replication, Stitch will use a binary log’s Log Sequence Number (LSN) to ‘bookmark’ its place in a log file. This value is used to resume replicating data where a previous replication job ended.
Historical replication jobs
During the historical replication job for a table using Log-based Incremental Replication, two things will happen:
- Stitch will retrieve and store the maximum LSN from the binary logs.
Stitch will use a
SELECT-based approach to replicate the table in full. If this were a SQL query, it would look like this:
SELECT column_you_selected_1, column_you_selected_2, [...] FROM schema.table
Ongoing replication jobs
After the historical replication of a table is complete, Stitch will read updates for the table from the database’s binary logs. During ongoing replication jobs using Log-based Incremental Replication, a few things will happen:
- Using the maximum LSN from the previous job - in this case, the historical replication job - Stitch begins reading log messages in the binary file. Data for tables set to replicate is extracted.
- At the end of the replication job, Stitch bookmarks its place in the log file by storing its current LSN.
- During the next replication job, Stitch will resume reading data with a greater LSN than the LSN from the previous job.
- At the end of the replication job, Stitch bookmarks its place in the log file again.
When Log-based Incremental Replication should be used
Log-based Incremental Replication may be a good fit if:
- The database is a MySQL or PostgreSQL-backed database that supports Log-based Incremental Replication.
- Data is contained in a table, not a view.
- Modifications to records are made only using supported event types.
- The structure of the table changes infrequently, if at all. Refer to the Limitations section below for more info.
- You’re aware that, for PostgreSQL, only master instances support Log-based Incremental Replication and that retaining binary log files will increase the database’s disk space usage.
If Log-based Incremental Replication isn’t appropriate, Key-based Incremental Replication may be a suitable alternative.
Limitations of Log-based Incremental Replication
Before you select Log-based Incremental Replication as the Replication Method for a table, you should be aware of the limitations this method can have. Being aware of these limitations can help prevent data discrepancies, replication issues, and ensure your data is replicated in the most efficient manner possible.
The limitations of Log-based Incremental Replication are:
Limitation 1: Only available for certain MySQL and PostgreSQL databases
Log-based Incremental Replication is available only for certain MySQL and PostgreSQL-backed databases. While the original implementations of MySQL and PostgreSQL databases support Log-based Incremental Replication some cloud versions may not.
In the table below are the MySQL and PostgreSQL databases Stitch supports and whether Log-based Incremental Replication can be used in Stitch.
indicates that if the database/instance type meets the Minimum version requirement, Log-based Incremental Replication can be used in Stitch.
indicates that the database/instance type cannot be used in Stitch, even if the Minimum version requirement is met. This may be due to:
- The provider not having support for binary logging (MySQL) or logical replication (PostgreSQL), which is what Stitch uses to perform Log-based Incremental Replication.
- The provider not allowing server settings to be configured in the manner Stitch requires. Refer to the documentation for the database for configuration requirements.
- The provider not allowing binary logging on read replicas.
Note: If public-facing information about the lack of support is available, a link to it will display next to the icon.
Support for MySQL databases
|Replication master instance||Replication read replica||Replication minimum version|
|Amazon Aurora (MySQL) RDS||5.6.2|
|Amazon MySQL RDS||5.6.2|
|Google CloudSQL MySQL||5.6.2|
Support for PostgreSQL databases
|Replication master instance||Replication read replica||Replication minimum version|
|Amazon PostgreSQL RDS (v1.0)||(link)||9.4|
|Google CloudSQL PostgreSQL (v1.0)||(link)||(link)|
Limitation 2: Only works with specific database event types
Log-based Incremental Replication reads data from a database’s binary log and then replicates the changes. In order to replicate data, the event that caused a change to the data must be written to the binary log.
For MySQL and PostgreSQL-backed databases, only the following event types are written to a database’s binary log:
This means that if data is modified using an event type not listed here, it won’t be written to the database’s binary log or subsequently detected by Log-based Incremental Replication.
For example: If data in a table is modified using
ALTER, the changes won’t be written to the binary log or identified by Stitch.
Limitation 3: Structural changes require manual intervention
Any time the structure of a source table changes, you’ll need to reset the table from the Table Settings page. This will queue a full re-replication of the table and ensure that structural changes are correctly captured.
Structural changes can include adding new columns, removing columns, changing a data type, etc. Resetting the table is required due to how messages in binary logs are structured and how Stitch’s integrations validate table schemas when extracting data. When a structural change occurs without a table being reset, an extraction error similar to the following will surface in the Extraction Logs:
Error persisting data for table "[table_name]": Record 0 did not conform to schema
For this reason, Stitch recommends using Log-based Incremental Replication with tables that have structures that don’t change frequently.
Schema violation errors, explained
Messages in binary logs are ordinal and don’t contain field information. This means that column values in log messages are in the same order as the columns in the source table, but the messages don’t contain data about which values belong to each column.
Because of this and how extraction is performed using Log-based Incremental Replication, changes in a table’s structure will cause extraction errors.
For example, consider this table:
The values for these fields might look like this in a binary log:
Stitch’s MySQL and PostgreSQL integrations use JSON schema validation to ensure that values in log messages are attributed to the correct fields when data is loaded into your destination. For this reason, schema changes in a source - whether it’s changing a column’s data type or re-ordering columns - will cause an extraction error to occur.
If the column order or data types of a source table change in any capacity, the integration will not persist new or updated records that use this updated schema, as it does not have a means of attributing values to their proper columns based on the ordinal set when compared to the expected schema that was previously detected.
As Log-based Incremental Replication is a seamless process, an extraction error for any one table in a replication job will disrupt replication from any other tables using Log-based Incremental Replication.
Let’s look at an example. Here’s the same table from before, now with a new
And the values in the binary log:
Where Stitch previously detected three columns, the log messages now contain data for four columns. Because the log messages don’t contain field information and are read in order, Stitch would be unable to determine what column the
19 values are for.
Limitation 4: Cannot be used with views
Log-based Incremental Replication can’t be used with database views, as modifications to views are not written to binary log files.
Stitch recommends using Key-based Incremental Replication instead, where possible.
Limitation 5: Logs can age out and stop replication (MySQL)
Binary log files, by default, are not stored indefinitely on a database server. The amount of time a log file is stored depends on the database’s log retention settings.
Log retention settings specify the amount of time before a binary log file is automatically removed from the database server. When a binary log file is removed from the server before Stitch can read from it, replication will be unable to proceed. When this occurs, an extraction error similar to the following will surface in the Extraction Logs:
Unable to replicate binlog stream because the following binary log(s) no longer exist: [binary_log_file_name]
To resolve the error, you’ll need to reset the integration from the Integration Settings page. Note: This is different than resetting an individual table.
This error can be caused by a few things:
- The binary log file is purged before historical replication completes. This is because the max LSN is saved at the start of historical replication jobs, so Stitch knows where to begin reading from the binary logs after historical data is replicated.
- The log retention settings (
binlog_expire_logs_seconds) are set to too short of a time period. Stitch recommends a minimum of 3 days, but 7 days is preferred to account for resolving potential issues without losing logs.
- Any critical error that prevents Stitch from replicating data, such as a connection issue that prevents Stitch from connecting to the database or a schema violation. If the error persists past the log retention period, the log will be purged before Stitch can read it.
Limitation 6: Will increase source disk space usage (PostgreSQL)
PostgreSQL Log-based Incremental Replication uses PostgreSQL’s logical replication feature. Logical replication uses a replication slot, which represents a stream of changes made to a given database.
According to PostgreSQL’s documentation, replication slots will prevent removal of required resources even if no connection is using them:
Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.
This means that binary log files (Write Ahead Log (WAL), for PostgreSQL) aren’t removed from the replication slot until they’re consumed. In this case, until Stitch reads them during a replication job.
While Stitch will issue a
flush_lsn command after messages have been read, an increase in disk space usage is to be expected when using Log-based Incremental Replication due to how PostgreSQL replication slots function. The amount of disk space usage depends on the number of updates made to the database, how quickly Stitch proceeds with replication, and whether any errors that prevent replication arise.
The greatest increase in disk space usage typically occurs during the switch from historical replication (
SELECT-based replication) to consuming the database’s binary logs. Disk space usage may spike, but it typically levels off over time.
Note: If you decide to permanently disable Log-based Incremental Replication for your PostgreSQL database, remove the replication slot to prevent further unnecessary disk space consumption.
Limitation 7: Can only be used with a master instance (PostgreSQL)
For PostgreSQL-backed databases, Log-based replication will only work on master instances due to a feature gap in PostgreSQL 10. Based on their forums, PostgreSQL is working on adding support for using logical replication on a read replica to a future version.
Otherwise, we recommend monitoring the instance’s disk space usage during the first few replication jobs to minimize any negative impact on your database’s performance.
Limitation 8: Multiple connections to a replication slot can cause data loss in Stitch (PostgreSQL)
As previously mentioned, Log-based Incremental Replication for PostgreSQL requires a replication slot which Stitch can read binary log files from. When changes are made to a database, they are written to the binary log file in the replication slot.
Each change to the database is written to the database’s replication slot exactly once.
As binary log files are removed from the replication slot once they’re consumed, this means that once the change is read, the record of it is purged.
If multiple connections - whether it’s multiple integrations in Stitch, or connections elsewhere - are using the same replication slot, data loss can occur as each connection will only receive some of the updates made to the database.
According to PostgreSQL’s documentation:
A logical slot will emit each change just once in normal operation… Multiple independent slots may exist for a single database. Each slot has its own state, allowing different consumers to receive changes from different points in the database change stream. For most applications, a separate slot will be required for each consumer. A logical replication slot knows nothing about the state of the receiver(s). It’s even possible to have multiple different receivers using the same slot at different times; they’ll just get the changes following on from when the last receiver stopped consuming them. Only one receiver may consume changes from a slot at any given time.
This means that if one connection reads the changes from the replication slot, Stitch will only be able to extract the changes from when the other connection stopped consuming them.
To avoid data loss caused by this scenario, Stitch recommends creating a dedicated replication slot for PostgreSQL database you want to connect.
Enable Log-based Incremental Replication
Using Log-based Incremental Replication requires a specific database configuration. Instructions for configuring the required settings varies from database to database.
For setup instructions, refer to the documentation for your database: