Replication Keys are columns that Stitch uses to identify new and updated data for replication. When you set a table to use Incremental Replication, you’ll also need to define a Replication Key for that table.
As improperly setting Replication Keys can cause data discrepancies, latency, and high row counts, it’s important to understand how they work, what makes a good key, and the gotchas associated with them.
Important: Is this a MongoDB integration?
If so, refer to the Selecting Mongo Replication Keys guide, as Replication Keys work a little differently for Mongo integrations.
Replication Keys and Primary Keys
When it comes to replicating your data, there are a lot of ‘keys’ involved. It can be difficult to keep them all straight, but aside from Replication Keys, there’s one more you should keep in mind: Primary Keys.
In Stitch, Replication Keys and Primary Keys serve two different purposes:
Replication Keys are used during the Extraction phase of the replication process - or when Stitch is querying your data source - to identify new and updated data for replication.
In the Stitch app, Replication Keys have a next to the column name.
Primary Keys are used during the last step of the replication process, which is when Stitch loads replicated data into your destination. Primary Keys identify unique rows within a table and ensure that only the most recently updated version of that record appears in your destination.
In the Stitch app, Primary Keys have a next to the column name.
While a column can sometimes be used as both a Replication Key and a Primary Key, these are not necessarily always the same column.
Replication Key requirements
To use Key-based Incremental Replication, a table must contain one of the following column types to be used as the Replication Key:
|Data type||Available for||Notes|
|FLOAT||MongoDB v1+ integrations|
|INT64||MongoDB v1+ integrations|
|NUMBER||Oracle v1+ integrations|
|OBJECTID||MongoDB v1+ integrations|
|UUID||MongoDB v1+ integrations|
When Stitch replicates your data, it will store the last recorded maximum value in the Replication Key column and compare it against the data source - not what’s in your destination - to identify new/updated data.
Any row with a Replication Key value greater than or equal to the stored value is where Stitch will begin the next replication job.
Recommendations and gotchas
While a column only need be an
timestamp to be a Replication Key, we have some recommendations (and things you should keep in mind) when selecting a column to be a Replication Key.
Replication Key recommendations
For tables where existing records are updated: We’re big fans of using
modified_atcolumns as Replication Keys. This is the best way to ensure that both new records and updates to existing records are captured.
For Append-Only tables: We recommend using a unique, auto-incrementing integer as the Replication Key for these types of tables. However, a
timestampcolumn may also be suitable.
Replication Key gotchas
Before selecting a Replication Key for a table, there are a few things you should keep in mind:
NULLvalues in the Replication Key column will only be replicated during the first extraction of an integration. This means subsequent extractions will not capture rows where the Replication Key is
NULL. Stitch uses the Replication Key column to detect new and updated data - without it, data can’t be correctly detected and replicated.
If the Replication Key field is entirely
NULL, the entire source table will be extracted during each job until a non-
NULLvalue is received and stored as a bookmark.
Auto-incrementing integers are only suitable Replication Keys for Append-Only tables. If you want to use an auto-incrementing integer column as the Replication Key for your table, ensure that the table is Append-Only.
If an auto-incrementing integer is used and existing records are updated, Stitch won’t detect the new data if the values in the Replication Key column don’t also change. This can lead to data discrepancies.
- Replication Keys for MongoDB work a little differently than they do for other integrations. Check out the Selecting MongoDB Replication Keys guide for more info.
- Stitch will not capture hard deletes in tables that use Key-based Incremental Replication.
Data discrepancies and row count impact
Replication Keys are one of the single most important aspects of data replication. Because they’re so important, we felt these two points merited their own section:
Incorrectly selecting a Replication Key can cause data discrepancies. For example: you set the Replication Key for an Append-Only table to an
idcolumn, which is an auto-incrementing integer. Existing rows in this table are updated, but the
idcolumn never changes after the record is created. Stitch will not detect the updated values because the
idcolumn hasn’t changed.
Incorrectly selecting a Replication Key can impact your row count. For example: you set the Replication Key column for a table to a
BIGINTcolumn that’s used in a boolean fashion, meaning it contains 0s and 1s. Every time the values in this column change, the row will be re-replicated to your data warehouse and count against your monthly limit.
If you encounter a data discrepancy, we recommend you start by verifying that the Replication Method and Key for the table are properly set. For further assistance, check out the Data Discrepancy Troubleshooting Guide.
Defining Replication Keys
After you set a table to replicate and select Key-based Incremental Replication as the Replication Method, you’ll need to select a column to be used as the Replication Key for the table.
After you select a column from the drop-down, click the Update Settings button.
Changing existing Replication Keys
Changing an existing Replication Key for a table is simple - just open up the Table Settings page for the table and select the new Replication Key column from the drop-down menu.
Note: When you change a table’s Replication Key, Stitch will queue a full re-replication of the table’s data. We do this to ensure that there aren’t any gaps because of the Replication Key switch.
Resetting Replication Keys
Replication Keys can be reset for database and (most) SaaS integrations.
There may be times when you need to fully replicate a table (or tables) that usually update incrementally. If, for example, you add a new column and want to backfill data for already-replicated rows, forcing a full re-replication of the table will populate the column for existing rows and replicate new records. You can do this by resetting Replication Keys.
Important: Before resetting Replication Keys:
- Will delete and re-create your destination tables with a full re-replication of your source data.
- Will lead to increased row counts which will count towards your limit.
- Cannot be interrupted or reversed once confirmed.
If you have questions or concerns about resetting Replication Keys, reach out to support before proceeding.
Resetting database integration Replication Keys
Replication Keys in database integrations can be reset at the integration or the table level.
- At the integration level, the reset will clear the replication key value for ALL tables AND queue a full re-replication for all tables in the integration.
- At the table level, the reset will clear the replication key value AND queue a full re-replication for that table only.
To reset Replication Keys, do the following:
- Click into the integration from the Stitch Dashboard page.
- To reset the entire integration: Click the Settings link and skip to step 3.
- To reset a table: Locate the table you want and click into it. Click the Table Settings link, located near the top right corner, and proceed to step 3.
- Scroll down to the Reset Replication Keys section.
- Click the Reset Keys button.
- When prompted, click OK to confirm.
- A Success! message will display at the top of the page.
At this point, a full re-replication of the integration or table will be queued. Note: If there is a large volume of data to be replicated, it may take some time before you see the changes in your data warehouse.
Resetting SaaS integration Replication Keys
Resetting the Replication Keys for a SaaS integration is done by changing the Historical Sync date in the Integration Settings page. When this date is changed, all saved values will be overwritten AND a full re-replication of the integration will be queued.
Note: This feature may not be available for some integrations. Because this approach uses date-based replication, some integrations may be incompatible. For example: Pardot doesn’t support date-based replication, meaning this feature will not be available for Pardot connections.
Changing the Historical Sync date has its own set of considerations and gotchas. Please refer to the Syncing Historical SaaS Data guide for more info.
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.