Missing some Mongo data in your destination? Because Mongo sorts data based on data type, Stitch may be unable to correctly identify new and updated data. If data you expect to see is missing, the root cause may be multiple data types in the collection’s Replication Key or Primary Key (_id) field.
Symptoms
Missing or stale data in the destination for Mongo-backed database integrations.
Cause
The cause of this problem is two-fold:
- Fields in Mongo may contain more than one BSON data type
- Mongo ranks data types, which affects how Mongo determines the current maximum value for a field
Replication Methods and value sorting
For Mongo-backed database integrations, Stitch uses a field’s maximum value to identify new and updated data during replication.
The field itself and how its values are used depend on the Replication Method the collection uses:
| Replication Method | Field used | Description |
| Key-based Incremental | Replication Key | Documents with a Replication Key value greater than or equal to the last saved maximum value for the Replication Key field are replicated. |
| Full Table | Primary Key (_id) | Documents with a Primary Key (_id) value less than or equal to the saved maximum _id value are replicated. This ensures that replication can resume if the replication job is interrupted. |
| Log-based Incremental | Primary Key (_id) | Applicable only to the historical replication of a collection. This is not applicable when Stitch reads updates from the database’s logs. Documents with a Primary Key (_id) value less than or equal to the saved maximum _id value are replicated. This ensures that historical replication for a collection can resume if the replication job is interrupted. |
Mongo’s data type ranking determines what the current maximum value of a field is. This, in turn, can affect how Stitch identifies and replicates data from a Mongo database.
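To make the effect of the ranking concrete, here is a minimal sketch in plain JavaScript. It is a simplified model and not Mongo’s implementation: the `TYPE_RANK` table lists only a few types in their relative sort order, and the `{ type, value }` pairs, `compareValues`, and `maxValue` are hypothetical names used for illustration.

```javascript
// Simplified model of Mongo's cross-type sort order. Only a few types are
// listed; the numbers are relative positions, not BSON type IDs.
const TYPE_RANK = { number: 1, string: 2, binData: 3, objectId: 4, date: 5 };

// Hypothetical tagged values: { type, value }.
function compareValues(a, b) {
  // Different types: the type rank alone decides the order.
  if (a.type !== b.type) return TYPE_RANK[a.type] - TYPE_RANK[b.type];
  // Same type: compare the values themselves.
  if (a.value < b.value) return -1;
  if (a.value > b.value) return 1;
  return 0;
}

// Models what a descending sort limited to one result would return.
function maxValue(values) {
  return values.reduce((max, v) => (compareValues(v, max) > 0 ? v : max));
}

const tableIds = [
  { type: "string", value: "zzz-999" },
  { type: "objectId", value: "5f1e2d3c4b5a69788796a5b4" },
  { type: "string", value: "abc-001" },
];

// The ObjectId wins, even though "zzz-999" looks larger as text.
console.log(maxValue(tableIds));
```

In this model the value’s own content only matters when both values share a type, which is why a single ObjectId can become the maximum of a field that mostly holds Strings.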
Examples
Consider these examples, which demonstrate how multiple data types in either the Replication Key or Primary Key (_id) field can cause data discrepancies.
Example: Replication Key
This example demonstrates how multiple data types in a Replication Key field can cause data discrepancies.
- A collection is set to replicate, using a field named table_id as the Replication Key. The table_id field contains both ObjectId and String data.
- A historical replication of the collection completes.
- Stitch saves the maximum value of table_id. Because Mongo ranks ObjectId data types as greater than Strings, the maximum value Stitch saves is an ObjectId value.
- New documents are added to the collection.
- During the next replication job, Stitch uses the last recorded maximum value, an ObjectId value, to identify new and updated data.
- Because ObjectIds > Strings, all documents with String values are considered to be less than the last recorded maximum value. This means Stitch won’t be able to detect these documents and replicate them.
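The Replication Key example above can be walked through with a small simulation. This is a sketch under stated assumptions, not Stitch’s actual code: the documents, the `{ type, value }` representation of table_id, and the `RANK` and `compareKeys` helpers are all hypothetical, and the only rule modelled is that ObjectId values rank above String values.

```javascript
// Relative sort positions for the two types in this example (not BSON type IDs).
const RANK = { string: 1, objectId: 2 };

function compareKeys(a, b) {
  if (a.type !== b.type) return RANK[a.type] - RANK[b.type];
  return a.value < b.value ? -1 : a.value > b.value ? 1 : 0;
}

// Hypothetical documents; table_id is modelled as a { type, value } pair.
const historical = [
  { name: "a", table_id: { type: "string", value: "1001" } },
  { name: "b", table_id: { type: "objectId", value: "5f1e2d3c4b5a69788796a5b4" } },
];

// After the historical replication, the maximum table_id is saved.
const bookmark = historical
  .map((doc) => doc.table_id)
  .reduce((max, key) => (compareKeys(key, max) > 0 ? key : max));

// New documents arrive, one per data type.
const added = [
  { name: "c", table_id: { type: "string", value: "1002" } },
  { name: "d", table_id: { type: "objectId", value: "5f1e2d3c4b5a69788796a5b5" } },
];

// Next job: replicate documents whose key is >= the saved maximum.
const replicated = [...historical, ...added].filter(
  (doc) => compareKeys(doc.table_id, bookmark) >= 0
);

console.log(bookmark.type); // "objectId"
// Documents "b" and "d" are selected; the new String document "c" is never picked up.
console.log(replicated.map((doc) => doc.name));
```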
Example: Primary Key
This example demonstrates how multiple data types in a Primary Key (_id) field can cause data discrepancies.
- A collection is set to replicate. Stitch automatically uses its _id field as the Primary Key. The _id field contains both ObjectId and UUID data.
- During the replication job, Stitch identifies and saves the maximum value of _id. In this example, it’s an ObjectId value.
- Stitch queries for all documents with an _id value less than or equal to the saved maximum _id value.
- Because Mongo considers ObjectId and UUID values to be neither greater than nor less than each other, UUID records may be excluded from the results of Stitch’s query. This means Stitch won’t be able to detect these documents and replicate them.
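The last bullet above can be illustrated with a short sketch. It models the behaviour described in this example rather than Mongo’s query engine: `matchesLte` is a hypothetical helper that treats values of different types as incomparable, and the _id values are made up.

```javascript
// Model of a "less than or equal" filter that only compares values of the
// same type: a value of another type is neither <= nor > the bound.
function matchesLte(value, bound) {
  return value.type === bound.type && value.value <= bound.value;
}

// Hypothetical _id values, tagged with their type for illustration.
const ids = [
  { type: "objectId", value: "5f1e2d3c4b5a69788796a5b4" },
  { type: "uuid", value: "0b9f3c1e-7a44-4f0e-9c55-2d6a8e1f4b73" },
  { type: "objectId", value: "5f1e2d3c4b5a69788796a5b5" },
];

// As in the example, the saved maximum is an ObjectId value.
const savedMax = ids
  .filter((id) => id.type === "objectId")
  .reduce((max, id) => (id.value > max.value ? id : max));

const fetched = ids.filter((id) => matchesLte(id, savedMax));
const skipped = ids.filter((id) => !matchesLte(id, savedMax));

console.log(fetched.length); // 2
// Only the UUID value is skipped.
console.log(skipped.map((id) => id.type));
```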
Diagnose the issue
To determine if a field contains multiple data types, you’ll run queries and compare the count of specific data type values in the Replication Key or Primary Key (_id) field to the total number of documents in the collection.
Step 1: Get a count of data types for the field
First, you’ll need to get a count of how many instances of a single data type there are in a given field in the collection.
Run the query below, replacing the following:
- nameOfCollection: The name of the collection
- keyField: This is dependent on the Replication Method the collection uses:
  - For Key-based Incremental Replication: The name of the field used as the collection’s Replication Key
  - For Log-based Incremental or Full Table Replication: This value should be _id
- knownDataTypeId: The ID of the known BSON data type used by the keyField. Refer to Mongo’s documentation for a list of BSON data type IDs.
db.<nameOfCollection>.count({<keyField>: {$type: <knownDataTypeId>}});
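If you don’t know which data types to look for, a per-type breakdown can be a useful complement to the query above. The following is a sketch that assumes your Mongo version supports the `$type` aggregation operator; `typeBreakdownPipeline` is a hypothetical helper name, and it only builds the pipeline so you can inspect it before running it.

```javascript
// Builds an aggregation pipeline that groups documents by the BSON type of
// keyField (the same placeholder used in the Step 1 query).
function typeBreakdownPipeline(keyField) {
  return [
    // The $type aggregation operator returns the type name as a string,
    // for example "objectId" or "string".
    { $group: { _id: { $type: "$" + keyField }, count: { $sum: 1 } } },
    // Most common type first.
    { $sort: { count: -1 } },
  ];
}

const pipeline = typeBreakdownPipeline("_id");
console.log(JSON.stringify(pipeline, null, 2));

// In the Mongo shell, replacing nameOfCollection as in Step 1:
// db.getCollection("nameOfCollection").aggregate(typeBreakdownPipeline("_id"));
```

More than one group in the result would suggest that the field holds more than one data type.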
Step 2: Count all records in the collection
Next, run this query to get a count of all records in the collection:
db.<nameOfCollection>.count();
Step 3: Retrieve the field's current maximum value
Next, run the following query to return the maximum value for the specified Replication or Primary Key field in the collection. This can be helpful when comparing your source database to what’s in your destination:
db.<nameOfCollection>.find().sort({<keyField>:-1}).limit(1);
Step 4: Compare the query results
Compare the results between the queries from Step 1 and Step 2.
If the results are equal, then the Replication or Primary Key field contains only one data type. The root cause may require additional investigation.
If the results aren’t equal, multiple data types in the Replication or Primary Key field may be interfering with Stitch’s replication process. Refer to the Solution section for next steps.
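The comparison itself is simple arithmetic, sketched below. `diagnose` is a hypothetical helper, and the counts passed to it are made-up numbers standing in for the results of the Step 1 and Step 2 queries.

```javascript
// typedCount: result of the Step 1 query.
// totalCount: result of the Step 2 query.
function diagnose(typedCount, totalCount) {
  if (typedCount > totalCount) {
    throw new RangeError("typedCount cannot exceed totalCount");
  }
  // Documents whose key field is not of the known data type.
  const otherTypeCount = totalCount - typedCount;
  return {
    singleType: otherTypeCount === 0,
    otherTypeCount,
  };
}

console.log(diagnose(1500, 1500)); // singleType: true, otherTypeCount: 0
console.log(diagnose(1480, 1500)); // singleType: false, otherTypeCount: 20
```

Note that a `$type` query only matches documents where the field exists, so a non-zero difference can also include documents that are missing the field entirely.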
Solution
If you’ve determined the field contains multiple data types, you have a few options:
- To continue using the collection’s current Replication Method:
  - Modify the field to only contain a single data type.
  - After this is completed in the source, reset the collection to queue a historical replication.
- To use a different Replication Method:
  - Verify that the Replication Key field (if switching to Key-based Incremental Replication) or the _id field (if switching to Full Table or Log-based Incremental Replication) only contains a single data type. Make any modifications before proceeding.
  - Configure the new Replication Method for the collection in Stitch. Changing a Replication Method automatically queues a historical replication.
If you’ve determined multiple data types aren’t causing the discrepancy, we recommend working through the Data discrepancy troubleshooting guide before contacting support.
Additionally, providing support with the info from the queries in this guide can help us investigate more quickly.