Amazon Redshift is fully managed, cloud-based data warehouse. As Redshift is built for online analytic processing and business intelligence applications, it excels at executing large-scale analytical queries. For this reason, it exhibits far better performance than traditional, row-based relational databases like MySQL and PostgreSQL.
To learn more about transactional and analytic databases and how they compare, check out our Data Strategy Guide.
Redshift is based on PostgreSQL 8.0.2 and while there are many similarities, Redshift differs in some key ways. Before you spin up a cluster, we recommend checking out our destination comparison guide to ensure you pick the best data warehouse for your needs.
In this guide, we'll walk you through:
- Redshift's pricing model,
- Some high-level limitations (including any incompatible data sources),
- How to spin up a Redshift data warehouse of your own, and
- How Stitch loads and organizes data in Redshift.
Currently, Redshift bases their pricing on an hourly rate that varies depending on the type and number of nodes in a cluster. Check out their Pricing page for an in-depth look at their current plan offerings.
So, what’s a node? A node is a single computer that participates in a cluster. Your Redshift cluster can have one to many nodes; the more nodes, the more data it can store and the faster it can process queries. Amazon currently offers four different types of nodes, each of which has its own CPU, RAM, storage capacity, and storage drive type.
The type and number of node(s) you choose when creating your cluster is dependent on your needs and dataset. We do, however, recommend you set up a multi-node configuration to provide data redundancy.
For some guidance on choosing the right number of nodes for your cluster, check out Amazon’s Determining the Number of Nodes guide.
Every database has its own supported limits and way of handling data, and Redshift is no different. The table below provides a very high-level look at what Redshift supports, including any possible incompatibilities with Stitch’s integration offerings.
|Incompatibile Sources||Possible incompatibilities. Learn more.|
Redshift’s full list; Stitch reserves
|Table Name Length||127 characters||Column Name Length||115 characters|
|Max # of Columns||1,600||VARCHAR Max Width||65K|
|Nested Structure Support||None; structures will be de-nested. Learn more.||Case||Case Insensitive|
Not sure if Redshift is the data warehouse for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.
Creating a Redshift data warehouse for Stitch involves spinning up a cluster in Amazon Web Services and creating a database in the cluster.
- Create a new AWS account & Redshift cluster. If you’re brand new to AWS, you can sign up here to create an AWS account and then use this tutorial to connect your Redshift database.
- Connect a Redshift database via an SSH tunnel. If you want to use an SSH tunnel to connect Redshift to Stitch, there are some additional steps you’ll need to complete to set up the connection.
Connect an existing Redshift instance. If you already have an AWS account and a Redshift cluster, you won’t need to complete the initial cluster provisioning steps. You will, however, still need to:
After you’ve successfully connected your Redshift data warehouse to Stitch, you can start adding integrations and replicating data.
For each integration that you add to Stitch, a schema specific to that integration will be created in your data warehouse. This is where all the tables for that inegration will be stored.
Stitch will encounter dozens of scenarios when replicating and loading your data. To learn more about how Redshift handles these scenarios, check out the Data Loading Guide for Redshift.
Rejected Records Log
Occasionally, Stitch will encounter data that it can’t load into the data warehouse. For example: a table contains more columns than Redshift’s allowed limit of 1,600 columns per table.
When this happens, the data will be “rejected” and logged in a table called
_sdc_rejected. Every integration schema created by Stitch will include this table as well as the other tables in the integration.
Stitch replicates data from your sources based on the integration’s Replication Frequency and the Replication Method used by the tables in the integration. In Stitch, you have the ability to control what and how often data is replicated for the majority of integrations.
The time from the sync start to data being loaded into your data warehouse can vary depending on a number of factors, especially for initial historical loads.
To learn more about Stitch’s replication process and how loading time can be affected, check out the Stitch Replication Process guide.
Encodings, SORT, & DIST Keys
Want to improve your query performance? You can apply encodings, SORT, and DIST keys to Stitch-created tables in your Redshift data warehouse. Even when new data is replicated, your settings will remain intact.
Did this article help? If you have questions or feedback, please reach out to us.