Replication Keys are columns that Stitch uses to identify new and updated data for replication. When you set a table to use Incremental Replication, you’ll also need to define a Replication Key for that table.
As improperly setting Replication Keys can cause data discrepancies, latency, and high row counts, it’s important to understand how they work, what makes a good key, and the gotchas associated with them.
Important: Is this a MongoDB integration?
If so, refer to the Selecting Mongo Replication Keys guide, as Replication Keys work a little differently for Mongo integrations.
Replication Keys vs. Primary Keys
When it comes to replicating your data, there are a lot of ‘keys’ involved. It can be difficult to keep them all straight, but aside from Replication Keys, there’s one more you should keep in mind: Primary Keys.
In Stitch, Replication Keys and Primary Keys serve two different purposes:
Replication Keys are used during the Extraction phaseof the replication process - or when Stitch is querying your data source - to identify new and updated data for replication.
In the Stitch app, Replication Keys have a next to the column name.
Primary Keys are used during the last step of the replication process, which is when Stitch loads replicated data into your data warehouse. Primary Keys identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.
In the Stitch app, Primary Keys have a next to the column name.
While a column can sometimes be used as both a Replication Key and a Primary Key, these are not necessarily always the same column.
Replication Key Requirements
To use Incremental Replication, a table must contain one of the following column types to be used as the Replication Key:
- a large integer, which includes
When Stitch replicates your data, it will store the last recorded maximum value in the Replication Key column and compare it against the data source - not what’s in your data warehouse - to identify new/updated data.
Any row with a Replication Key value greater than or equal to the stored value is where Stitch will begin the next replication attempt.
Recommendations and gotchas
While a column only need be an
timestamp to be a Replication Key, we have some recommendations (and things you should keep in mind) when selecting a column to be a Replication Key.
Replication Key Recommendations
- For tables where existing records are updated: We’re big fans of using
modified_atcolumns as Replication Keys. This is the best way to ensure that both new records and updates to existing records are captured.
- For Append-Only tables: We recommend using a unique, auto-incrementing integer as the Replication Key for these types of tables. However, a
timestampcolumn may also be suitable.
Replication Key Gotchas
Before selecting a Replication Key for a table, there are a few things you should keep in mind:
- Rows with
NULLvalues in the Replication Key column will only be replicated during the first sync of an integration. This means subsequent syncs will not capture rows where the Replication Key is
NULL. Stitch uses the Replication Key column to detect new and updated data - without it, data can’t be correctly detected and replicated.
Auto-incrementing integers are only suitable Replication Keys for Append-Only tables. If you want to use an auto-incrementing integer column as the Replication Key for your table, ensure that the table is Append-Only.
If an auto-incrementing integer is used and existing records are updated, Stitch won’t detect the new data if the values in the Replication Key column don’t also change. This can lead to data discrepancies.
- Replication Keys for Mongo work a little differently than they do for other integrations. Check out the Selecting Mongo Replication Keys guide for more info.
- Stitch will not capture hard deletes in tables that use Incremental Replication.
Data discrepancies and row count impact
Replication Keys are one of the single most important aspects of data replication. Because they’re so important, we felt these two points merited their own section:
Incorrectly selecting a Replication Key can cause data discrepancies. For example: you set the Replication Key for an Append-Only table to an
idcolumn, which is an auto-incrementing integer. Existing rows in this table are updated, but the
idcolumn never changes after the record is created. Stitch will not detect the updated values because the
idcolumn hasn’t changed.
Incorrectly selecting a Replication Key can impact your row count. For example: you set the Replication Key column for a table to a
BIGINTcolumn that’s used in a boolean fashion, meaning it contains 0s and 1s. Every time the values in this column change, the row will be re-replicated to your data warehouse and count against your monthly limit.
If you encounter a data discrepancy, we recommend you start by verifying that the Replication Method and Key for the table are properly set. For further assistance, check out the Data Discrepancy Troubleshooting Guide.
Defining Replication Keys
After you set a table to sync and select Incremental Replication as the Replication Method, you’ll need to select a column to be used as the Replication Key for the table.
After you select a column from the drop-down, click the Update Settings button.
Changing Existing Replication Keys
Changing an existing Replication Key for a table is simple - just open up the Table Settings page for the table and select the new Replication Key column from the drop-down menu.
When you change a table’s Replication Key, Stitch will queue a full re-sync of the table’s data. We do this to ensure that there aren’t any gaps because of the Replication Key switch.
Resetting Replication Keys
Replication Keys can be reset for database and (most) SaaS integrations.
There may be times when you need to fully replicate a table (or tables) that usually update incrementally. If, for example, you add a new column and want to backfill data for already-replicated rows, forcing a full re-sync of the table will populate the column for existing rows and replicate new records. You can do this by resetting Replication Keys.
Important: Before resetting Replication Keys:
Note that this process:
- Will delete and re-create your destination tables with a full re-replication of your source data.
- Will lead to increased row counts which will count towards your limit.
- Cannot be interrupted or reversed once confirmed.
If you have questions or concerns about resetting Replication Keys, we encourage you to reach out to support before proceeding.
Resetting Database Integration Replication Keys
Replication Keys in database integrations can be reset at the integration or the table level.
- At the integration level, the reset will clear the replication key value for ALL tables AND queue a full re-sync for all tables in the integration.
- At the table level, the reset will clear the replication key value AND queue a full re-sync for that table only.
To reset Replication Keys, do the following:
Click into the integration from the Stitch Dashboard page.
- To reset the entire integration: Click the Settings link and skip to step 3.
- To reset a table: Locate the table you want and click into it. Click the Table Settings link, located near the top right corner, and proceed to step 3.
- Scroll down to the Reset Replication Keys section.
- Click the Reset Keys button.
- When prompted, click OK to confirm.
- A Success! message will display at the top of the page.
At this point, a full re-sync of the integration will be queued. Note that if there is a large volume of data to be replicated, it may take some time before you see the changes in your data warehouse.
Resetting SaaS Integration Replication Keys
Resetting the Replication Keys for a SaaS integration is done by changing the Historical Sync date in the Integration Settings page. When this date is changed, all saved values will be overwritten AND a full re-sync of the integration will be queued.
This feature may not be available for some integrations. Because this approach uses date-based replication, some integrations may be incompatible. For example: Pardot doesn’t support date-based replication, meaning this feature will not be available for Pardot connections.
Changing the Historical Sync date has its own set of considerations and gotchas. Please refer to the Syncing Historical SaaS Data guide for more info.