If you’re missing data, this is the place to start.

In this article, we’ll walk you through the common causes for data discrepancies and how to diagnose them.


Data Incompatibilities

From time to time, Stitch may run into problems when attempting to load data into your destination. When data is deemed incompatible by the destination, the record will be “rejected” and logged in a table called _sdc_rejected. What looks like missing data may actually be a compatibility issue.

If you’re missing data, the first place you should look is in the _sdc_rejected table in the integration’s schema. Click here for more info on this table and how to use it to troubleshoot.


Data Loading

While Stitch is designed to quickly and efficiently process large amounts of data, it can take some time to replicate and load your data into your destination. What looks like missing data may actually be incomplete processing, meaning Stitch hasn’t finished loading all the data.

Processing time can be affected by a variety of factors:

  • The volume of data being replicated,
  • The integration’s Replication Frequency, and
  • API quotas (for SaaS integrations).

Most data discrepancies can be solved by simply waiting and giving Stitch time to process and load the data.

You can keep an eye on an integration’s progress by using the Replication Stats dashboard.


Downtime

If a SaaS integration provider (ex: Salesforce) is undergoing maintenance or experiencing downtime, Stitch may be unable to replicate data. We recommend checking the provider’s status page for reported outages.

Stitch may also occasionally encounter performance issues. You can stay up-to-date with the latest by subscribing to our status page.


Queries

Before reporting a data discrepancy to Stitch support, we recommend that you double-check how you’re querying the data.

Make sure you're using a SQL client to directly query your destination. This will eliminate the possibility of:

  1. Report refresh lags, which can occur in visualization tools like Tableau or Mode,
  2. Third party defects or downtime, and
  3. Any other type of data delay.

Unless you specify a start date, the majority of SaaS integrations will sync historical data going back one year from the Stitch connection date.

To verify an integration's start date, click into the integration from the Stitch Dashboard. Click the Settings tab, then scroll down to the Sync Historical Data section.

Sync Historical Data section.

When querying in your destination, check that the timeframe you're querying for matches up with the integration's start date. If you find that the integration's start date was set incorrectly, you can reset it to queue a full re-sync.

Keep in mind that your destination may handle timestamp data in a specific way, even if your data source is configured to report in a specific timezone.

Consider this timestamp as an example: 2017-08-14 11:24:02 GMT-0400 (EDT)

Below is an explanation and example of how each destination will store this same data point.

Data Warehouse Explanation Example Timestamp
PostgreSQL Timestamp data is stored as TIMESTAMP WITH TIME ZONE and converted to UTC using the appropriate offset for the timezone. 2017-08-14 15:24:02
Redshift Timestamp data is stored as TIMESTAMP WITHOUT TIME ZONE and expressed as UTC without the timezone information. 2017-08-14 11:24:02

Discrepancy Consistencies

When investigating a data discrepancy, look for consistencies such as records missing over a specific timeframe or issues that only affect certain records or data types.

For example: Formula fields in Salesforce can occasionally cause data discrepancies due to how they’re updated.

We recommend checking the Additional & Integration Specific Resources section for causes of common data discrepancies and how to resolve them.


Replication Frequency

If the missing records were created very recently, or if Stitch is replicating a large historical data set, you may need to wait for an update of your data to complete before they appear in your destination.

We recommend checking the integration’s Replication Frequency, located in the Settings tab of the Integration Details page. If it’s set to a lower frequency like 12 hours, you may want to increase it temporarily to ensure Stitch kicks off a replication job sooner rather than later. If replication is running, wait for it to complete before analyzing data discrepancies.


Replication Keys

When a table in a database integration is initially set to use Key-based Incremental Replication, a Replication Key must be defined. For Stitch to accurately replicate data, Replication Keys must align with how data in the table is updated.

If the table in question is set to use Key-based Incremental Replication, keep in mind that:

  1. Stitch won’t capture hard deletes.
  2. Replication Key columns with NULL values are only replicated during an integration’s initial replication job.
  3. Records that are updated over time should use a modification timestamp to ensure updates are captured.
  4. MongoDB Replication Keys have additional considerations. For example: Multiple data types in the Replication Key column can lead to missing data.
  5. Append-Only destinations. What looks like duplicate data may actually be updated records being appended to a table. Refer to the Understanding loading behavior guide for more info and examples.

Contacting support

If the discrepancy can’t be explained by any of the points above, reach out to support. Depending on the type of discrepancy, we’ll ask you to provide us some information that will help us investigate.

Row count discrepancies describe discrepancies that affect the number of records in your destination. Complete records may be missing, duplicated, etc.

Field value discrepancies describe discrepancies that affect values in individual columns.

Row Count Discrepancies

For row count discrepancies, please provide us with the info in this section.

  1. 3-5 examples of records that exist in the source but not in your destination. You can send us the entire record, but at the very least we need:
    • The Primary Key
    • The Replication Key
    • The field with the discrepancy
  2. The results for the queries listed below, run in the data source for [source_integration_schema].[table_name]:
    • MIN(PRIMARY_KEY)
    • MAX(PRIMARY_KEY)
    • MIN(REPLICATION_KEY)
    • MAX(REPLICATION_KEY)
    • COUNT(*)
  3. The results for the queries listed below, run in the destination for [data_warehouse_schema].[table_name]:
    • MIN(PRIMARY_KEY)
    • MAX(PRIMARY_KEY)
    • MIN(REPLICATION_KEY)
    • MAX(REPLICATION_KEY)
    • COUNT(*)
  4. For SaaS integrations, whenever possible, please provide us with:
    • Raw exports showing row-level data that illustrates the discrepancy
    • Screenshots from the source integration’s UI that illustrate discrepancy
    • Exact API calls and full responses that illustrate the discrepancy (make sure you exclude your API key)

Field Value Discrepancies

For field value discrepancies, please provide us with the info in this section.

  1. 3-5 examples of discrepancies between the source integration and your destination. Please include:
    • The record's id
    • The field with the discrepancy
    • The replication key value, if applicable
  2. For database integrations:
    • Confirmation that the Replication Method is appropriately capturing changed values
    • Confirmation that the Replication Key is being appropriately populated and does not contain NULL values
  3. For SaaS integrations, whenever possible, please provide us with:
    • Raw exports showing row-level data that illustrates the discrepancy
    • Screenshots from the UI that illustrate discrepancy
    • Exact API calls and full responses that illustrate discrepancy (make sure you exclude your API key)

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.