Replication Methods define how Stitch will replicate your data during a replication job. While SaaS integration tables have their Replication Methods defined by Stitch, you can define how tables in database integrations are replicated.

Replication Methods are extremely important - we can’t stress this enough. They’ll not only directly impact your row count, but incorrectly defined methods can also cause data discrepancies.

Stitch employs three methods to replicate data from your data sources:


Log-based Incremental Replication

Log-based Incremental Replication is a replication method in which Stitch identifies modifications to records - including inserts, updates, and deletes - using a database’s binary log files. A binary log file is a record of events that occur within a database.

There are two types of binary log replication: statement and row-based. Stitch uses a row-based approach, which means that when rows are modified, the entire row is written to the binary log file. Stitch then reads the changes from the binary log and replicates the appropriate records.

Using Log-based Incremental Replication requires a specific database configuration and is only available for MySQL-based databases. For setup instructions, refer to the database integration documentation for your database.


Key-based Incremental Replication

Key-based Incremental Replication is a replication method in which Stitch identifies new and updated data using a column called a Replication Key. As this method will greatly reduce latency and data usage, we highly recommend using it where ever possible.

To identify new and updated data for replication, Stitch uses what is called a Replication Key.

Replication Keys

When Stitch replicates your data, it will store the last recorded maximum value in the Replication Key column and compare it against the data source - not what’s in your data warehouse - to identify new/updated data.

Any row with a Replication Key value greater than or equal to the stored value is where Stitch will begin the next replication job.

Updated At Incremental Replication

The updated_at method of Incremental Replication uses a timestamp or datetime column (set as the Replication Key) that’s updated whenever information in a row is changed. Stitch will compare the last recorded MAX value in this column with what’s in the data source to identify new and updated data for replication.

This method works for tables that will both add new rows and update existing ones.

For example: the Replication Key for the customers table below is the updated_at column.

During the initial replication job, the following rows are replicated:

| id [PK] | name        | age | updated_at          |
|---------+-------------+-----+---------------------|
| 1       | Finn        | 14  | 2017-01-01 14:34:23 |
| 2       | Jake        | 6   | 2017-01-01 14:34:23 |
| 3       | Beamo       | 1   | 2017-01-01 14:34:23 |
| 4       | Bubblegum   | 16  | 2016-12-31 02:45:53 |

When the replication job completes, Stitch will store the MAX value in the updated_at column (2017-01-01 14:34:23) to identify data for replication.

After the completion of the initial job but before the next job starts, these changes are made in the data source:

| id [PK] | name        | age | updated_at          |
|---------+-------------+-----+---------------------|
| 1       | Finn        | 15  | 2017-01-02 15:00:18 |   // age is changed in existing row
| 2       | Jake        | 6   | 2017-01-01 14:34:23 |
| 3       | Beamo       | 1   | 2017-01-01 14:34:23 |
| 4       | Bubblegum   | 16  | 2016-12-31 02:45:53 |
| 5       | King        | 300 | 2017-01-02 15:10:42 |   // new row is added

When the next replication job begins, Stitch will compare the stored updated_at value of 2017-01-01 14:34:23 against what’s in the data source to select rows to replicate. Anything greater than or equal to this value will be selected.

During the next replication job, the following rows are replicated:

| id [PK] | name  | age | updated_at          |
|---------+-------+-----+---------------------|
| 1       | Finn  | 15  | 2017-01-02 15:00:18 |
| 2       | Jake  | 6   | 2017-01-01 14:34:23 |
| 3       | Beamo | 1   | 2017-01-01 14:34:23 |
| 5       | King  | 300 | 2017-01-02 15:10:42 |

Instead of replicating the entire table again, only the rows with Replication Key values that were greater than or equal to the last value saved by Stitch were replicated. Notice that the record for Bubblegum (4) was not selected for replication because of its Replication Key value.

While there may be a small amount of duplication when using this method (it’s to ensure Stitch doesn’t miss any data), it’s the most efficient way to use Stitch.

Append-Only Incremental Replication

Append-Only Replication is when only new data is added, or appended, to a table. Existing rows are never updated. If a value is changed in an existing record, a new row with the new data will be appended to the end of the table.

This means that there can be many different rows in a table with the same Primary Key, each representing what that record was at that moment in time. Stitch provides the _sdc_sequence column to help distinguish the age of the records.

Our example table contains the following rows to start with:

| Name | Age | Type  | Magic   | _sdc_sequence |
|------+-----+-------+---------+---------------|
| Finn | 14  | Human | False   | 1473773877524 |
| Jake | 6   | Dog   | True    | 1473773877524 |

These rows are replicated during the initial sync of the table. Some time later, new rows are added to the table:

| Name      | Age | Type     | Magic  | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Bubblegum | 16  | Princess | True   | 1473776585195 |
| Beamo     | 1   | Game Bot | False  | 1473776585195 |

During the next replication job, the new rows are appended to the table, bringing the total row count to 4:

| Name      | Age | Type     | Magic  | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Finn      | 14  | Human    | False  | 1473773877524 |
| Jake      | 6   | Dog      | True   | 1473773877524 |
| Bubblegum | 16  | Princess | True   | 1473776585195 |  // new rows are appended
| Beamo     | 1   | Game Bot | False  | 1473776585195 |

But what happens when an existing row is updated? If Finn’s age were updated to 15, the existing row wouldn’t be updated - instead, an entirely new row would be appended to the end of the table:

| Name      | Age | Type     | Magic  | _sdc_sequence |
|-----------+-----+----------+--------+---------------|
| Finn      | 14  | Human    | False  | 1473773877524 |
| Jake      | 6   | Dog      | True   | 1473773877524 |
| Bubblegum | 16  | Princess | True   | 1473776585195 |
| Beamo     | 1   | Game Bot | False  | 1473776585195 |
| Finn      | 15  | Human    | False  | 1473827984228 |  // updated data as new row

Deleted Records & Incremental Replication

Depending on how records are deleted (hard vs. soft), deletes may not be captured when your data is replicated.

Hard Deletes

Stitch is unable to capture hard deletes because it looks for values that are greater than or equal to the recorded value in the Replication Key column. If a record is hard deleted, there won’t be a value to compare against and the delete won’t be detected.

Soft Deletes

Soft deletes occur when a record remains in the table but uses a flag to indicate deletion. In some cases this is a boolean column. In others this may be a timestamp or datetime column called deleted_at which in turn updates the updated_at value, thus allowing Stitch to identify the change.


Full Table Replication

Full Table Replication means that Stitch will replicate the entire contents of a table on every replication attempt. As this Replication Method can cause latency and quickly use up your monthly row limit, it’s the most inefficient way to use Stitch.

We recommend using Incremental Replication if the table in question contains any timestamped or datetime columns.

Note that Stitch does not currently support Full Table Replication for Mongo integrations.


Learn about SaaS Integration Replication Methods

As previously mentioned, Replication Methods can currently only be defined for tables in database integrations.

To learn more about the Replication Methods used by SaaS integration tables, refer to the Schema section in the SaaS integration docs.



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Tags: replication