Google BigQuery is a fully managed, cloud-based big data analytics web service for processing very large read-only data sets. BigQuery was designed for analyzing data on the order of billions of rows, using a SQL-like syntax.
For more information, check out Google’s BigQuery overview.
Unlike many other cloud-based data warehouse solutions, BigQuery’s pricing model is based on usage and not a fixed-rate. This means that your bill can vary over time.
Before fully committing yourself to using BigQuery as your data warehouse, we recommend familiarizing yourself with the BigQuery pricing model and how using Stitch may impact your costs.
To set up BigQuery, Stitch requires:
- A user with full access to an existing Google Cloud Platform project within BigQuery. Stitch won’t be able to create one for you.
- Admin permissions for BigQuery and Google Cloud Storage (GCS). This includes the BigQuery Admin and Storage Admin permissions.
- Access to a project where billing is enabled and a credit card is attached. Even if you’re using BigQuery’s free trial, billing must still be enabled for Stitch to load data.
Every database has its own supported limits and way of handling data, and BigQuery is no different. The table below provides a very high-level look at what BigQuery supports, including any possible incompatibilities with Stitch’s integration offerings.
BigQuery’s full list; Stitch reserves
|Table Name Length||1,024 characters||Column Name Length||128 characters|
|Max # of Columns||10,000||VARCHAR Max Width||None|
|Nested Structure Support||Native Support||Case||Case Insensitive|
Not sure if BigQuery is the data warehouse for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.
Stitch replicates data from your sources based on the integration’s Replication Frequency and the Replication Method used by the tables in the integration. In Stitch, you have the ability to control what and how often data is replicated for the majority of integrations.
The current release of Stitch’s BigQuery destination uses Append-Only Incremental Replication.
For SaaS and database tables that use Incremental Replication, Stitch will replicate data into BigQuery in an append-only fashion. This means that data that updates an existing row will NOT overwrite the row. Instead, a new row with the new data will be appended to the end of the table.
This means that there can be many different rows in a BigQuery table with the same Primary Key, each representing what the data was at that moment in time.
Querying an append-only table requires a different strategy than you would normally use. For more info, check out the Querying Append-Only Tables guide.
The time from the sync start to data being loaded into your data warehouse can vary depending on a number of factors, especially for initial historical loads.
To learn more about Stitch’s replication process and how loading time can be affected, check out the Stitch Replication Process guide.
After you’ve successfully connected your BigQuery data warehouse to Stitch, you can start adding integrations and replicating data.
For each integration that you add to Stitch, a schema specific to that integration will be created in your data warehouse. This is where all the tables for that inegration will be stored.
Stitch will encounter dozens of scenarios when replicating and loading your data. To learn more about how BigQuery handles these scenarios, check out the Data Loading Guide for BigQuery.
Rejected Records Log
Occasionally, Stitch will encounter data that it can’t load into the data warehouse. For example: a table contains more columns than BigQuery’s allowed limit of 10,000 columns per table.
When this happens, the data will be “rejected” and logged in a table called
_sdc_rejected. Every integration schema created by Stitch will include this table as well as the other tables in the integration.