When data is loaded using Append-Only Loading, existing records aren’t updated, but instead appended to tables as new rows. This means that as time goes on, tables will contain different versions of the same record, reflecting how the record has changed over time.

While data stored this way can provide insights and historical details, sometimes you may just want the latest version of a record. In this guide, we’ll cover:


Before using this guide

Before using this guide, note that:

  • You may need to modify the queries in this guide to use them yourself
  • Stitch Support’s expertise lies in replicating data, and as such does not provide data analysis or querying assistance. We can, however, help with data discrepancies.

    If you’d like assistance with analysis or business intelligence solutions, we recommend reaching out to one of our analytics partners.


Using system columns to identify record versions

Every table created by Stitch contains columns prepended with _sdc. These are system columns created and used by Stitch to load data into your destination.

For this guide, we’ll focus on just two columns:

Column name Description
_sdc_sequence

A Unix epoch (in milliseconds) that indicates the order in which data points were considered for loading.

Stitch uses this column’s values in a few places to correctly order rows for loading, but it can be also used to retrieve the latest version of a record from an Append-Only table.

This is the primary column our strategy will use.
_sdc_batched_at

Timestamp indicating when Stitch loaded the batch the record was a part of into the destination.

Our strategy will use this column as a "tie breaker."

Retrieving the latest version of every record

Let’s take a look at an example. Assume we have an orders table that contains:

  • A Primary Key of id,
  • The system _sdc columns added by Stitch, and
  • Other order attribute columns

Only using _sdc_sequence

If you wanted to create a snapshot of the latest version of this table, you could run a query like this using _sdc_sequence:

Example only using _sdc_sequence
SELECT DISTINCT orders.*
           FROM [stitch-analytics-bigquery-123:ecommerce.orders] orders
     INNER JOIN (
                  SELECT id,
                         MAX(_sdc_sequence) AS sequence
                    FROM [stitch-analytics-bigquery-123:ecommerce.orders]
                  GROUP BY id
                ) latest_orders
             ON orders.id = latest_orders.id
            AND orders._sdc_sequence = latest_orders.sequence

Here’s what’s happening in this query:

  1. The subquery retrieves a list of every record’s Primary Key and maximum _sdc_sequence value.
  2. The outer query selects distinct versions of the latest version of every record.
  3. Lastly, the outer query joins the table to the list retrieved by the subquery, which makes all other columns available for querying.

Using _sdc_batched_at as a tie breaker

If only using _sdc_sequence doesn’t yield the desired results, we recommend using _sdc_batched_at as a “tie breaker”:

Example using _sdc_sequence and _sdc_batched_at
SELECT DISTINCT orders.*
           FROM [stitch-analytics-bigquery-123:ecommerce.orders] orders
     INNER JOIN (
                  SELECT id,
                         MAX(_sdc_sequence) AS sequence,
                         MAX(_sdc_batched_at) as batched_at
                    FROM [stitch-analytics-bigquery-123:ecommerce.orders]
                  GROUP BY id
                ) latest_orders
             ON orders.id = latest_orders.id
            AND orders._sdc_sequence = latest_orders.sequence
            AND orders._sdc_batched_at = latest_orders.batched_at

The _sdc_batched_at value indicates the time that Stitch loaded the batch containing the record into the destination. Selecting a record’s maximum _sdc_batched_at and _sdc_sequence values excludes versions of the record from older batches from the results.


Create views in your destination

To make this easier, you can turn queries like the one above into a view. We recommend this approach because a view will encapsulate all the logic and simplify the process of querying against the latest version of your data.

Refer to the documentation for your destination for more info on creating views:



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.