Key-based Incremental Replication is a method of replication that replicates new or updated data from a data source. In this guide, we’ll cover:

  1. How it works (with examples),
  2. When it should be used,
  3. Limitations of this Replication Method, and
  4. How to enable it for your integration

How Key-based Incremental Replication works

When using Key-based Incremental Replication, Stitch uses a column called a Replication Key to identify new and updated data in a table for replication. A Replication Key is a timestamp, date-time, or integer column that exists in a source table.

When Stitch replicates a table using Key-based Incremental Replication, a few things will happen:

  1. During a replication job, Stitch stores the maximum value of a table’s Replication Key column.
  2. During the next replication job, Stitch will compare saved value from the previous job to Replication Key column values in the source.
  3. Any rows in the table with a Replication Key greater than or equal to the stored value are replicated.
  4. Stitch stores the new maximum value from the table’s Replication Key column.
  5. Repeat.

Let’s use a SQL query as an example:

SELECT replication_key_column,
       column_you_selected_1,
       column_you_selected_2,
       [...]
  FROM schema.table
 WHERE replication_key_column >= 'last_saved_maximum_value'

Data extraction and Replication Key types

Below are examples of how different Replication Key types impact the data extracted using Key-based Incremental Replication.

Note: These examples only demonstrate how data is extracted from a data source, not how it will be loaded into your destination.

In this example, we’ll demonstrate how key-based-incremental works with a timestamp (or date-time) Replication Key column for initial and ongoing replication jobs.

Let’s assume the Replication Key column for the table in this example is a column named updated_at.

Initial replication job

During the initial job:

  • In the source: A table in the source contains three records: 1, 2, 3. Every time a record is created or modified, its updated_at value is updated.

  • During extraction: Depending on the integration type, the initial extraction for a table using key-based-incremental will vary slightly:

    • For a database integration:
      1. Stitch replicates the entire table in full.
      2. Stitch saves updated_at: 2018-07-27 12:07:00 as the maximum Replication Key value.
    • For a SaaS integration:
      1. Stitch uses the Start Date defined in the Integration Settings as the initial Replication Key (updated_at) value.
      2. Records with an updated_at value greater than or equal to the Start Date are replicated.
      3. Stitch saves updated_at: 2018-07-27 12:07:00 as the maximum Replication Key value.
Source records
id updated_at active
1 2018-07-27 12:00:00 false
2 2018-07-27 12:05:00 false
3 2018-07-27 12:07:00 false
Extracted records
id updated_at active
1 2018-07-27 12:00:00 false
2 2018-07-27 12:05:00 false
3 2018-07-27 12:07:00 false

Second replication job

During the second job:

  • In the source: Before the next replication job, record 2 is updated and a new record, 4, is added to the table in the source. The updated_at values for these records are updated.

  • During extraction: Stitch replicates records 3, 2, and 4, as the records’ updated_at values are either greater than or equal to the last saved maximum Replication Key value of updated_at: 2018-07-27 12:07:00 from the previous replication job.

    Record 3 is selected because its updated_at value is equal to the maximum Replication Key value from the previous replication job.

    If this were a SQL query, it would look like this:

    SELECT *
      FROM sample_table
     WHERE updated_at >= '2018-07-27 12:07:00'
    
Source records
id updated_at active
1 2018-07-27 12:00:00 false
2 2018-07-28 09:12:00 true
3 2018-07-27 12:07:00 false
4 2018-07-28 06:53:00 false
Extracted records
id updated_at active
2 2018-07-28 09:12:00 true
3 2018-07-27 12:07:00 false
4 2018-07-28 06:53:00 false

In this example, we’ll demonstrate how key-based-incremental works with an auto-incrementing integer Replication Key column.

For this table, the Replication Key is a column named customer_id. The value of this column increments every time a new record is added to the table, but it is never updated after the record is created.

Initial replication job

During the initial job:

  • In the source: A table in the source contains three records: 1, 2, 3.

    The customer_id value is auto-incremented when a new record is added to the table. This column is not updated, even if changes are made to the record.

  • During extraction:

    1. Stitch replicates the entire table in full.
    2. Stitch saves customer_id: 3 as the maximum Replication Key value.
Source records
customer_id name active
1 Finn false
2 Jake false
3 Bubblegum false
Extracted records
customer_id name active
1 Finn false
2 Jake false
3 Bubblegum false

Second replication job

During the second job:

  • In the source: Before the next replication job, record 2 is updated and a new record, 4, is added to the table in the source.

  • During extraction:

    1. Stitch replicates records 3 and 4, as the records’ customer_id values are either greater than or equal to the last saved maximum Replication Key value of customer_id: 3 from the previous replication job. Note that:
      • Record 3 is selected because its customer_id value is equal to the maximum Replication Key value from the previous replication job.
      • Record 2 is not selected because its customer_id value wasn’t updated when the record was modified, and was therefore less than the last saved maximum Replication Key value.
    2. Stitch saves customer_id: 4 as the new maximum Replication Key value.

    If this were a SQL query, it would look like this:

    SELECT *
      FROM sample_table
     WHERE customer_id >= 3
    
Source records
customer_id name active
1 Finn false
2 Jake true
3 Bubblegum false
4 Beamo false
Extracted records
customer_id name active
3 Bubblegum false
4 Beamo false

When Key-based Incremental Replication should be used

Aside from Log-based Replication where it’s supported, Key-based Incremental Replication is the most efficient method for replicating your data. If Log-based Replication is unavailable for your source, Key-based Incremental Replication may be a good fit if:

  1. A table contains a modification timestamp column, which is updated when the record changes
  2. Records aren’t hard deleted from the source table. Refer to the Limitations section below for more info.

Note: In the case of SaaS integrations, Stitch will use Key-based Incremental Replication whenever possible. Refer to the Schema section of any integration’s documentation for the Replication Method and Replication Key(s) used by specific tables.


Limitations of Key-based Incremental Replication

Before you select Key-based Incremental Replication as the Replication Method for a table, you should be aware of the limitations this method can have. Being aware of these limitations can help prevent data discrepancies and ensure your data is replicated in the most efficient manner possible.

The limitations of Key-based Incremental Replication are:

Limitation 1: Works best with a modification timestamp column

While an integer column can be used as a Replication Key, Key-based Incremental Replication functions best with a modification timestamp Replication Key. Unlike an auto-incrementing integer, a modification timestamp allows Stitch to identify both new and updated records for replication.

Limitation 2: Hard deletes aren't captured

Hard deletes aren’t able to be replicated with Key-based Incremental Replication. This is due to the usage of Replication Keys to identify data for replication.

When a record is hard deleted, or entirely removed from a source, its Replication Key value is also removed. Without a Replication Key value to check, Stitch can’t identify the change and update the record in the destination. This means that the record will remain in the destination.

Refer to the Deleted Record Handling guide for a more detailed explanation and examples.

Limitation 3: Duplication in replication

Due to the inclusive nature of Replication Keys, there will be some duplication during the extraction process. This is because Stitch checks for values that are greater than or equal to the last saved maximum Replication Key value.

Because of this approach, the record or records with Replication Key values equal to the maximum value will be selected for extraction during subsequent jobs. Most of the time, the number of re-replicated rows will be small. If, however, a bulk update occurs and a large number of records all have the same Replication Key value, you could see a high amount of rows being replicated during every replication job until a greater Replication Key value is detected.

Example

In this example, we’ll use a customers table with a Replication Key column named updated_at.

  1. In a database, you run a process that updates 100 records in the customers table.
  2. These records’ updated_at values are updated to 2018-11-01 00:00:00.
  3. During the next replication job - or Job 1 - Stitch extracts 101 records from the source customers table:
    • The 100 updated records with updated_at values of 2018-11-01 00:00:00, and
    • The one record with an updated_at value equal to the last saved maximum value from the previous replication job.
  4. Stitch saves the new maximum Replication Key value as updated_at: 2018-11-01 00:00:00.
  5. No records are updated between Job 1 and the next job.
  6. When Job 2 begins, Stitch again extracts the 100 records with updated_at: 2018-11-01 00:00:00 because their Replication Key values are equal to the last saved maximum value.

Until a record with a greater updated_at value is added to the customers table, Stitch will continue to extract all records with updated_at: 2018-11-01 00:00:00 values.

Bulk update handling

To avoid the above scenario, add a single record with a greater Replication Key value at the end of a bulk update. This will ensure that the maximum Replication Key value Stitch saves will only be equal to one record instead of many.

For example: If one of the 100 updated records in the customers table had had an updated_at value of 2018-11-01 00:00:01, Stitch would have saved this as the maximum Replication Key value. Then, during Job 2, only one record - instead of 100 - would have been re-replicated.


Enable Key-based Incremental Replication

Key-based Incremental Replication is available for use with any Stitch integration. Depending on the type of integration, enabling this Replication Method will vary:

  • Database integrations: To use Key-based Incremental Replication, a table must contain a column suitable for use as a Replication Key. Note: For MongoDB integrations, there are additional considerations for Replication Keys. Refer to the MongoDB Replication Keys guide for more info.

  • SaaS integrations: With the exception of Salesforce, no configuration is required on your part. Replication Methods are pre-defined for every table set to replicate. Stitch will use Key-based Incremental Replication whenever possible to ensure your data is replicated accurately and efficiently.

  • Webhook integrations: No configuration is required on your part. As webhook data is sent to Stitch in real-time, only new records are ever replicated from a webhook source. This can be thought of as using Key-based Incremental Replication with a Replication Key of created_at.



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Tags: replication