Log-based Incremental Replication is a replication method in which Stitch identifies modifications to records - including inserts, updates, and deletes - using a database’s binary log files. A log file is a record of events that occur within a database.

In this guide, we’ll cover:

  1. Some useful terminology,
  2. How it works (with examples),
  3. When it should be used,
  4. Limitations of this Replication Method, and
  5. How to enable it for your database integration

Log-based Incremental Replication terminology

  • Log file - A file in a database containing a list of changes made to the database. Log files are made up of log messages.
  • Log message - A single change made to a database. For example: An UPDATE to a record.
  • Log position ID - A unique identifier corresponding to the position of a log message in a log file. These values are incremental, increasing as log messages are generated.

    • In Microsoft SQL Server, this is called the Change Tracking Version. As Microsoft’s documentation notes, this concept is similar to rowversion.
    • In MongoDB, this is a field in the database log named ts. The ts field is a combination of a timestamp and an ordinal (integer counter) value. For example: "ts": Timestamp(1412180887, 1) The timestamp is in seconds since the Unix epoch, and the ordinal is used to differentiate between entries that occured during the same second.
    • In MySQL and PostgreSQL, this is called a Log Sequence Number (LSN).
    • In Oracle, this is called a System Change Number (SCN).
  • Replication job - The three-step process by which Stitch replicates data. A replication job includes three distinct steps: Extraction, preparation, and loading.
  • Historical replication job - A Stitch replication job that replicates historical data.

In addition to these general terms, each database refers to its log replication feature by a different name. Stitch uses the following database features to perform Log-based Incremental Replication:


How Log-based Incremental Replication works

There are two types of log replication: Statement and row-based. Stitch uses a row-based approach, which means that when rows are modified via a supported event type, the entire row is written to the log file as a log message.

During a replication job, Stitch iterates over each log message in sequence, or in the same order that the log messages were written to the log. Data is extracted for any log messages that correspond to tables that have been tracked for replication. This means that all tables using Log-based Incremental Replication are extracted from the source database simultaneously as one seamless process.

To ensure only new and updated data is selected for replication, Stitch will use a log’s log position ID to ‘bookmark’ its place in a log file. This value is used to resume replicating data where a previous replication job ended.

Historical replication jobs

During the historical replication job for a table using Log-based Incremental Replication, two things will happen:

  1. Stitch will retrieve and store the maximum log position from the logs.
  2. Stitch will use a SELECT-based approach to replicate the table in full. If this were a SQL query, it would look like this:

    SELECT id,
           column_you_selected_1,
           column_you_selected_2,
           [...]
      FROM schema.table
     WHERE id < [max primary key value]
     ORDER BY id
    

    Note: Stitch will automatically omit the WHERE clause in this statement if the table has no Primary Key or a Primary Key that is not sortable.

Ongoing replication jobs

After the historical replication of a table is complete, Stitch will read updates for the table from the database’s logs. During ongoing replication jobs using Log-based Incremental Replication, a few things will happen:

  1. Using the maximum log position ID from the previous job - in this case, the historical replication job - Stitch begins reading log messages in the binary file. Data for tables set to replicate is extracted.
  2. At the end of the replication job, Stitch bookmarks its place in the log file by storing its current log position ID.
  3. During the next replication job, Stitch will resume reading data with a greater log position ID than the log position ID from the previous job.
  4. At the end of the replication job, Stitch bookmarks its place in the log file again.
  5. Repeat.

When Log-based Incremental Replication should be used

Log-based Incremental Replication may be a good fit if:

  1. The database is a Microsoft SQL Server, MongoDB, MySQL, Oracle, and PostgreSQL-backed database that supports Log-based Incremental Replication.
  2. Data is contained in a table, not a view.
  3. Modifications to records are made only using supported event types.
  4. The structure of the table changes infrequently, if at all. Refer to the Limitations section below for more info.
  5. You’re aware that, for PostgreSQL, only master instances support Log-based Incremental Replication and that retaining binary log files will increase the database’s disk space usage.

If Log-based Incremental Replication isn’t appropriate, Key-based Incremental Replication may be a suitable alternative.


Limitations of Log-based Incremental Replication

Before you select Log-based Incremental Replication as the Replication Method for a table, you should be aware of the limitations this method can have. Being aware of these limitations can help prevent data discrepancies, replication issues, and ensure your data is replicated in the most efficient manner possible.

The limitations of Log-based Incremental Replication are:

Limitation 1: Only available for certain databases

Log-based Incremental Replication is available only for certain Microsoft SQL Server, MongoDB, MySQL, Oracle, and PostgreSQL-backed databases. While the original implementations of these databases support Log-based Incremental Replication some cloud versions may not.

In the table below are the databases Stitch supports and whether Log-based Incremental Replication can be used in Stitch for each one.

  • indicates that if the database/instance type meets the Minimum version requirement, Log-based Incremental Replication can be used in Stitch.

  • indicates that the database/instance type cannot be used in Stitch, even if the Minimum version requirement is met. This may be due to:

    • The provider not having support for binary logging (MySQL) or logical replication (PostgreSQL), which is what Stitch uses to perform Log-based Incremental Replication.
    • The provider not allowing server settings to be configured in the manner Stitch requires. Refer to the documentation for the database for configuration requirements.
    • The provider not allowing binary logging on read replicas.

    Note: If public-facing information about the lack of support is available, a link to it will display next to the icon.

Support for Microsoft SQL Server databases

Master instance Read replica Minimum version
Amazon Microsoft SQL Server RDS
v1.0
2012
Microsoft Azure SQL Database
v1.0
Microsoft SQL Server
v1.0
2012

Support for MongoDB databases

Master instance Read replica Minimum version
MongoDB
v11-01-2016
MongoDB Atlas
v1.0

Support for MySQL databases

Master instance Read replica Minimum version
Amazon Aurora MySQL RDS 5.6.2
Amazon MySQL RDS 5.6.2
Google CloudSQL MySQL 5.6.2
Magento
MariaDB 5.6.2
MySQL 5.6.2

Support for Oracle databases

Master instance Read replica Minimum version
Amazon Oracle RDS 8.0
Oracle 8.0

Support for PostgreSQL databases

Master instance Read replica Minimum version
Amazon Aurora PostgreSQL RDS
v1.0
10.6
Amazon PostgreSQL RDS
v1.0
(link) 9.4
Google CloudSQL PostgreSQL
v1.0
(link) (link)
Heroku
v1.0
PostgreSQL
v1.0
(link) 9.4

Limitation 2: Only works with specific database event types

Log-based Incremental Replication reads data from a database’s log and then replicates the changes. In order to replicate data, the event that caused a change to the data must be written to the log.

Stitch will read the following event types from logs:

  • DELETE
  • INSERT
  • UPDATE

This means that if data is modified using an event type not listed here, it won’t be written to the database’s log or subsequently detected by Log-based Incremental Replication.

For example: If data in a table is modified using ALTER, the changes won’t be written to the log or identified by Stitch.

Limitation 3: Structural changes require manual intervention (Microsoft SQL Server, MySQL, PostgreSQL, Oracle)

Any time the structure of a source table changes, you’ll need to reset the table from the Table Settings page. This will queue a full re-replication of the table and ensure that structural changes are correctly captured.

Structural changes can include adding new columns, removing columns, changing a data type, etc. Resetting the table is required due to how messages in logs are structured and how Stitch’s integrations validate table schemas when extracting data. When a structural change occurs without a table being reset, an extraction error similar to the following will surface in the Extraction Logs:

Error persisting data for table "[table_name]": Record 0 did not conform to schema

For this reason, Stitch recommends using Log-based Incremental Replication with tables that have structures that don’t change frequently.

Schema violation errors, explained

Messages in database logs are ordinal and don’t contain field information. This means that column values in log messages are in the same order as the columns in the source table, but the messages don’t contain data about which values belong to each column.

Because of this and how extraction is performed using Log-based Incremental Replication, changes in a table’s structure will cause extraction errors.

For example, consider this table:

id name type
1 Finn human
2 Jake dog
3 Bubblegum princess

The values for these fields might look like this in a binary log:

1,Finn,human,2,Jake,dog,3,Bubblegum,princess

Stitch’s Microsoft SQL Server, MySQL, Oracle, and PostgreSQL integrations use JSON schema validation to ensure that values in log messages are attributed to the correct fields when data is loaded into your destination. For this reason, schema changes in a source - whether it’s changing a column’s data type or re-ordering columns - will cause an extraction error to occur.

If the column order or data types of a source table change in any capacity, the integration will not persist new or updated records that use this updated schema, as it does not have a means of attributing values to their proper columns based on the ordinal set when compared to the expected schema that was previously detected.

As Log-based Incremental Replication is a seamless process, an extraction error for any one table in a replication job will disrupt replication from any other tables using Log-based Incremental Replication.

Let’s look at an example. Here’s the same table from before, now with a new age column:

id name age type
1 Finn 15 human
2 Jake 9 dog
3 Bubblegum 19 princess

And the values in the log:

1,Finn,15,human,2,Jake,9,dog,3,Bubblegum,19,princess

Where Stitch previously detected three columns, the log messages now contain data for four columns. Because the log messages don’t contain field information and are read in order, Stitch would be unable to determine what column the 15, 9, and 19 values are for.

Limitation 4: Cannot be used with views

Log-based Incremental Replication can’t be used with database views, as modifications to views are not written to log files.

Stitch recommends using Key-based Incremental Replication instead, where possible.

Limitation 5: Logs can age out and stop replication (Microsoft SQL Server, MySQL, and Oracle)

Log files, by default, are not stored indefinitely on a database server. The amount of time a log file is stored depends on the database’s log retention settings.

Log retention settings specify the amount of time before a log file is automatically removed from the database server. When a log file is removed from the server before Stitch can read from it, replication will be unable to proceed.

When this occurs, an extraction error similar to the following will surface in the Extraction Logs:

  • For MySQL databases:
    Unable to replicate binlog stream because the following binary log(s) no longer exist: [binary_log_file_name]
    
  • For Oracle databases:
    ORA-01291: missing logfile
    

To resolve the error, you’ll need to reset the integration from the Integration Settings page. Note: This is different than resetting an individual table.

This error can be caused by a few things:

  1. The log file is purged before historical replication completes. This is because the maximum log position ID is saved at the start of historical replication jobs, so Stitch knows where to begin reading from the database logs after historical data is replicated.
  2. The log retention settings are set to too short of a time period. Stitch recommends a minimum of 3 days, but 7 days is preferred to account for resolving potential issues without losing logs.

  3. Any critical error that prevents Stitch from replicating data, such as a connection issue that prevents Stitch from connecting to the database or a schema violation. If the error persists past the log retention period, the log will be purged before Stitch can read it.

Limitation 6: Will increase source disk space usage (PostgreSQL)

PostgreSQL Log-based Incremental Replication uses PostgreSQL’s logical replication feature. Logical replication uses a replication slot, which represents a stream of changes made to a given database.

According to PostgreSQL’s documentation, replication slots will prevent removal of required resources even if no connection is using them:

Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.

This means that log files (Write Ahead Log (WAL), for PostgreSQL) aren’t removed from the replication slot until they’re consumed. In this case, until Stitch reads them during a replication job.

While Stitch will issue a flush_lsn command after messages have been read, an increase in disk space usage is to be expected when using Log-based Incremental Replication due to how PostgreSQL replication slots function. The amount of disk space usage depends on the number of updates made to the database, how quickly Stitch proceeds with replication, and whether any errors that prevent replication arise.

The greatest increase in disk space usage typically occurs during the switch from historical replication (SELECT-based replication) to consuming the database’s logs. Disk space usage may spike, but it typically levels off over time.

Note: If you decide to permanently disable Log-based Incremental Replication for your PostgreSQL database, remove the replication slot to prevent further unnecessary disk space consumption.

Limitation 7: Can only be used with a master instance (PostgreSQL)

For PostgreSQL-backed databases, Log-based replication will only work on master instances due to a feature gap in PostgreSQL 10. Based on their forums, PostgreSQL is working on adding support for using logical replication on a read replica to a future version.

If you’re concerned about the increase in disk space usage and the impact this may have, consider connecting a read replica and using Key-based Incremental Replication instead.

Otherwise, we recommend monitoring the instance’s disk space usage during the first few replication jobs to minimize any negative impact on your database’s performance.

Limitation 8: Multiple connections to a replication slot can cause data loss in Stitch (PostgreSQL)

As previously mentioned, Log-based Incremental Replication for PostgreSQL requires a replication slot which Stitch can read log files from. When changes are made to a database, they are written to the log file in the replication slot.

Each change to the database is written to the database’s replication slot exactly once.

As log files are removed from the replication slot once they’re consumed, this means that once the change is read, the record of it is purged.

If multiple connections - whether it’s multiple integrations in Stitch, or connections elsewhere - are using the same replication slot, data loss can occur as each connection will only receive some of the updates made to the database.

According to PostgreSQL’s documentation:

A logical slot will emit each change just once in normal operation… Multiple independent slots may exist for a single database. Each slot has its own state, allowing different consumers to receive changes from different points in the database change stream. For most applications, a separate slot will be required for each consumer. A logical replication slot knows nothing about the state of the receiver(s). It’s even possible to have multiple different receivers using the same slot at different times; they’ll just get the changes following on from when the last receiver stopped consuming them. Only one receiver may consume changes from a slot at any given time.

This means that if one connection reads the changes from the replication slot, Stitch will only be able to extract the changes from when the other connection stopped consuming them.

To avoid data loss caused by this scenario, Stitch recommends creating a dedicated replication slot for PostgreSQL database you want to connect.


Enable Log-based Incremental Replication

Using Log-based Incremental Replication requires a specific database configuration. Instructions for configuring the required settings varies from database to database.

For setup instructions, refer to the documentation for your database:



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.