If you’re missing data, this is the place to start.

In this article, we’ll walk you through the common causes for data discrepancies and how to diagnose them.


Data Incompatibilities

From time to time, Stitch may run into problems when attempting to load data into your destination. When data is deemed incompatible by the destination, the record will be “rejected” and logged in a table called _sdc_rejected. What looks like missing data may actually be a compatibility issue.

If you’re missing data, the first place you should look is in the _sdc_rejected table in the integration’s schema. Click here for more info on this table and how to use it to troubleshoot.


Data Loading

While Stitch is designed to quickly and efficiently process large amounts of data, it can take some time to replicate and load your data into your data warehouse. What looks like missing data may actually be incomplete processing, meaning Stitch hasn’t finished loading all the data.

Processing time can be affected by a variety of factors:

  • The volume of data being replicated,
  • The integration’s Replication Frequency, and
  • API quotas (for SaaS integrations).

Most data discrepancies can be solved by simply waiting and giving Stitch time to process and load the data.

You can keep an eye on an integration’s progress by using the Replication Stats dashboard.


Downtime

If a SaaS integration provider (ex: Salesforce) is undergoing maintenance or experiencing downtime, Stitch may be unable to replicate data. We recommend checking the provider’s status page for reported outages.

Stitch may also occasionally encounter performance issues. You can stay up-to-date with the latest by subscribing to our status page.


Queries

Before reporting a data discrepancy to Stitch support, we recommend that you double-check how you’re querying the data.

Make sure you're using a SQL client to directly query your data warehouse. This will eliminate the possibility of:

  1. Report refresh lags, which can occur in visualization tools like Tableau or Mode,
  2. Third party defects or downtime, and
  3. Any other type of data delay.

Unless you specify a start date, the majority of SaaS integrations will sync historical data going back one year from the Stitch connection date.

To verify an integration's start date, click into the integration from the Stitch Dashboard. Click the Settings tab, then scroll down to the Sync Historical Data section.

Sync Historical Data section.

When querying in your data warehouse, check that the timeframe you're querying for matches up with the integration's start date. If you find that the integration's start date was set incorrectly, you can reset it to queue a full re-sync.

Keep in mind that your data warehouse may handle timestamp data in a specific way, even if your data source is configured to report in a specific timezone.

Consider this timestamp as an example: 2017-08-14 11:24:02 GMT-0400 (EDT)

Below is an explanation and example of how each data warehouse will store this same data point.

Data Warehouse Explanation Example Timestamp
PostgreSQL Timestamp data is stored as TIMESTAMP WITH TIME ZONE and converted to UTC using the appropriate offset for the timezone. 2017-08-14 15:24:02
Redshift Timestamp data is stored as TIMESTAMP WITHOUT TIME ZONE and expressed as UTC without the timezone information. 2017-08-14 11:24:02

Discrepancy Consistencies

When investigating a data discrepancy, look for consistencies such as records missing over a specific timeframe or issues that only affect certain records or data types.

For example: formula fields in Salesforce can occasionally cause data discrepancies due to how they’re updated.

We recommend checking the Additional & Integration Specific Resources section for causes of common data discrepancies and how to resolve them.


Replication Frequency

If the missing records were created very recently, you may need to wait for an update of your data to complete before they appear in your data warehouse.

We recommend checking the integration’s Replication Frequency, located in the Settings tab of the Integration Details page. If it’s set to a lower frequency like 12 hours, you may want to increase it temporarily to ensure Stitch kicks off a replication job sooner rather than later.


Replication Keys

When a table in a database integration is initially set to use Incremental Replication, a Replication Key must be defined. For Stitch to accurately replicate data, Replication Keys must align with how data in the table is updated.

If the table in question is set to use Incremental Replication, keep in mind that:

  1. Stitch won’t capture hard deletes.
  2. Replication Key columns with NULL values are only replicated during an integration’s initial replication job.
  3. Records that are updated over time should use a modification timestamp to ensure updates are captured.
  4. Mongo Replication Keys have additional considerations. For example: multiple data types in the Replication Key column can lead to missing data.
  5. BigQuery destinations only support Append-Only Incremental Replication. What looks like duplicate data may actually be updated records being appended to a table.

Contacting Support

If the discrepancy can’t be explained by any of the points above, please reach out to support. Depending on the type of discrepancy, we’ll ask you to provide us some information that will help us investigate.

Row count discrepancies describe discrepancies that affect the number of records in your data warehouse. Complete records may be missing, duplicated, etc.

Field value discrepancies describe discrepancies that affect values in individual columns.

Row Count Discrepancies

For row count discrepancies, please provide us with the info in this section.

  1. 3-5 examples of records that exist in the source but not in your data warehouse. You can send us the entire record, but at the very least we need:
    • The Primary Key
    • The Replication Key
    • The field with the discrepancy
  2. The results for the queries listed below, run in the data source for [source_integration_schema].[table_name]:
    • MIN(PRIMARY_KEY)
    • MAX(PRIMARY_KEY)
    • MIN(REPLICATION_KEY)
    • MAX(REPLICATION_KEY)
    • COUNT(*)
  3. The results for the queries listed below, run in the data warehouse for [data_warehouse_schema].[table_name]:
    • MIN(PRIMARY_KEY)
    • MAX(PRIMARY_KEY)
    • MIN(REPLICATION_KEY)
    • MAX(REPLICATION_KEY)
    • COUNT(*)
  4. For SaaS integrations, whenever possible, please provide us with:
    • Raw exports showing row-level data that illustrates the discrepancy
    • Screenshots from the source integration’s UI that illustrate discrepancy
    • Exact API calls and full responses that illustrate the discrepancy (make sure you exclude your API key)

Field Value Discrepancies

For field value discrepancies, please provide us with the info in this section.

  1. 3-5 examples of discrepancies between the source integration and your data warehouse. Please include:
    • The record's id
    • The field with the discrepancy
    • The updated_at value, if applicable
  2. For database integrations:
    • Confirmation that the Replication Method is appropriately capturing changed values
    • Confirmation that the Replication Key is being appropriately populated and does not contain NULL values
  3. For SaaS integrations, whenever possible, please provide us with:
    • Raw exports showing row-level data that illustrates the discrepancy
    • Screenshots from the UI that illustrate discrepancy
    • Exact API calls and full responses that illustrate discrepancy (make sure you exclude your API key)

Questions? Feedback?

Did this article help? If you have questions or feedback, please reach out to us.