Append-Only Replication is a type of Incremental Replication where newly replicated data is appended to the end of a table. Existing rows are not updated - updates are added to the end of the table as new rows.
For tables using Incremental Replication, Stitch currently loads data into Google BigQuery in an append-only fashion. This means that as time goes on, tables will wind up containing many different versions of the same row.
Data stored this way can provide insights and historical details about how those rows have changed over time - creating a timeline of the status changes of an order record, for example - but in some cases, you might just want the latest version of the table.
Grab the latest version of every row
In each Stitch-generated integration table, you’ll see a few columns prepended with
_sdc. The column we’ll focus on here is the
_sdc_sequence column. This column is a Unix epoch (down to the millisecond) attached to the record during replication and can help determine the order of all the versions of a row.
Stitch uses these sequence values in a few places to correctly order rows for loading, but it can be also used to grab the latest version of a record in an append-only table.
Let’s take a look at an example. Assume we have an
orders table that contains:
- A Primary Key of
- The system
_sdccolumns added by Stitch, and
- Other order attribute columns
If you wanted to create a snapshot of the latest version of this table, you could run a query like this:
SELECT DISTINCT o.* FROM [stitch-analytics-bigquery-123:ecommerce.orders] o INNER JOIN ( SELECT id, MAX(_sdc_sequence) AS seq, MAX(_sdc_batched_at) AS batch FROM [stitch-analytics-bigquery-123:ecommerce.orders] GROUP BY id) oo ON o.id = oo.id AND o._sdc_sequence = oo.seq AND o._sdc_batched_at = oo.batch
This approach uses a subquery to get a single list of every row’s Primary Key, maximum sequence number, and maximum batched at timestamp. Since it’s possible to have duplicate records in your warehouse, the query also selects only distinct records of the latest version of the row. It then joins the original table to the Primary Key, maximum sequence, and maximum batched at, which makes all other column values available for querying.
Create views in your data warehouse
To make this easier, you can turn queries like the one above into a view. We recommend this approach because a view will encapsulate all the logic and simplify the process of querying against the latest version of your data.
For more info on creating views in the data warehouses Stitch supports, check out these docs: