This guide is applicable to:
- Destinations configured to use Append-Only Loading, or
- Tables configured to use Append-Only Loading, such as Google Ads’ Report tables
When data is loaded using Append-Only Loading, existing records aren’t updated, but instead appended to tables as new rows. This means that as time goes on, tables will contain different versions of the same record, reflecting how the record has changed over time.
While data stored this way can provide insights and historical details, sometimes you may just want the latest version of a record. In this guide, we’ll cover:
Before using this guide
Before using this guide, note that:
- You may need to modify the queries in this guide to use them yourself
Stitch Support’s expertise lies in replicating data, and as such does not provide data analysis or querying assistance. We can, however, help with data discrepancies.
If you’d like assistance with analysis or business intelligence solutions, we recommend reaching out to one of our analytics partners.
Using system columns to identify record versions
Every table created by Stitch contains columns prepended with
_sdc. These are system columns created and used by Stitch to load data into your destination.
For this guide, we’ll focus on just two columns:
A Unix epoch (in milliseconds) that indicates the order in which data points were considered for loading.
Stitch uses this column’s values in a few places to correctly order rows for loading, but it can be also used to retrieve the latest version of a record from an Append-Only table.This is the primary column our strategy will use.
Timestamp indicating when Stitch loaded the batch the record was a part of into the destination.Our strategy will use this column as a "tie breaker."
Retrieving the latest version of every record
Let’s take a look at an example. Assume we have an
orders table that contains:
- A Primary Key of
- The system
_sdccolumns added by Stitch, and
- Other order attribute columns
Only using _sdc_sequence
If you wanted to create a snapshot of the latest version of this table, you could run a query like this using
SELECT DISTINCT orders.* FROM [stitch-analytics-bigquery-123:ecommerce.orders] orders INNER JOIN ( SELECT id, MAX(_sdc_sequence) AS sequence FROM [stitch-analytics-bigquery-123:ecommerce.orders] GROUP BY id ) latest_orders ON orders.id = latest_orders.id AND orders._sdc_sequence = latest_orders.sequence
Here’s what’s happening in this query:
- The subquery retrieves a list of every record’s Primary Key and maximum
- The outer query selects distinct versions of the latest version of every record.
- Lastly, the outer query joins the table to the list retrieved by the subquery, which makes all other columns available for querying.
Using _sdc_batched_at as a tie breaker
If only using
_sdc_sequence doesn’t yield the desired results, we recommend using
_sdc_batched_at as a “tie breaker”:
SELECT DISTINCT orders.* FROM [stitch-analytics-bigquery-123:ecommerce.orders] orders INNER JOIN ( SELECT id, MAX(_sdc_sequence) AS sequence, MAX(_sdc_batched_at) as batched_at FROM [stitch-analytics-bigquery-123:ecommerce.orders] GROUP BY id ) latest_orders ON orders.id = latest_orders.id AND orders._sdc_sequence = latest_orders.sequence AND orders._sdc_batched_at = latest_orders.batched_at
_sdc_batched_at value indicates the time that Stitch loaded the batch containing the record into the destination. Selecting a record’s maximum
_sdc_sequence values excludes versions of the record from older batches from the results.
Create views in your destination
To make this easier, you can turn queries like the one above into a view. We recommend this approach because a view will encapsulate all the logic and simplify the process of querying against the latest version of your data.
Refer to the documentation for your destination for more info on creating views: