Stitch supports three ways to replicate from databases:
- Log-based incremental replication
- Key-based incremental replication
- Full-table replication
Log-based incremental replication
Incremental replication is an efficient way to keep tables in your data warehouse up to date. With log-based incremental replication, Stitch updates data in the destination – via inserts, updates, and deletes – using a database's binary log, which is a file that contains a record of events that occurred within the database.
There are two types of binary logs: statement-based and row-based. Statement-based replication uses SQL statements to make changes to the data at the destination. Stitch uses a row-based approach: when a row is modified, the entire row is written to the binary log file, and Stitch reads the changes from the binary log to replicate the appropriate records. Row-based replication is safer than statement-based; not all SQL statements that modify data can be replicated using statement-based replication.
Log-based incremental replication is available for Oracle and for MySQL- and PostgreSQL-based databases, including:
- Amazon Aurora
- Amazon RDS for MySQL
- Amazon RDS for PostgreSQL
- Google Cloud SQL MySQL
- Google Cloud SQL PostgreSQL
For setup instructions, refer to Stitch's database integration documentation for the database in question.
Key-based incremental replication
With key-based incremental replication, Stitch identifies new and updated data using a column that you designate as a replication key. For tables where you're updating existing records, a field like
modified_at makes a good replication key. For append-only tables, we recommend using a unique, auto-incrementing integer.
When Stitch replicates your data, it stores the maximum value in the replication key column. When you next perform a key-based incremental replication, Stitch compares data in the data source to the value in the replication key column to identify new and updated data. Stitch will begin the next replication job at the first row with a column whose value is greater than or equal to the stored value in the replication key. Stitch's Selecting Replication Keys guide can help you define replication keys for your tables.
One disadvantage of key-based replication involves deleted data. Some applications and databases implement deletion by removing records from a table – we call these hard deletes. Others keep deleted records in a table but use a field to flag a record as being deleted – an approach called soft deletes. Key-based incremental replication cannot replicate hard deletes, because a hard-deleted record is no longer in the source data, so Stitch can't compare it to what's in the destination. Soft deletes are a better option for key-based incremental replication.
When you have a choice, we recommend log-based incremental replication over key-based.
Full-table replication is just what it sounds like: Stitch replicates the entire contents of a table every time.
Full-table replication is fine for historical data loads, or in cases when you're unsure whether you can rely on the data that's already in your data warehouse.
However, full-table replication is inefficient. Because you're replicating each row, including ones that are already present in the destination, it can take a lot of time and eat into your monthly row allowance. We recommend using incremental replication whenever practical.
One final point: Stitch lets you apply different replication methods to different tables in the same schema. Visit our documentation on replication methods for more specifics.
Sign up for Stitch and try it for free for 14 days. Take that time to explore the most appropriate replication methods for your use cases.