Like our clients, Stitch is a data-driven company. We ETL data from more than 35 different sources into our data warehouse, with replication jobs running 24 hours a day. We run dbt to model the data we load and build tables that our analytics reporting tool, Mode, relies on for both scheduled and ad hoc reporting. Our team also uses desktop SQL clients and Python scripts to accomplish various miscellaneous tasks and jobs. All of this means we’re hard on our data warehouse.
We've been using an Amazon Redshift data warehouse since we started Stitch. It was a powerful tool for our growing business. Redshift made it easy for us to unify our data and build a foundation for the internal analytics that powered lots of our critical business processes. It satisfied our company's needs, as it does for many of our customers.
But as we grew, and as our use of our data became more complex, we began having performance issues, including long query wait times, crashes, and lockups. We’d have to restart our Redshift cluster, kick off manual runs of dbt, manually cancel queries to get things moving, and limit our scheduled runs in Mode. We had report outages, leading to urgent Slack and email messages on the weekend, not to mention the cumulative hours of diminished productivity for our team as they waited for queries to complete.
We might have been able to remedy these issues if we’d hired a database administrator to actively manage Redshift. In the absence of that technical help, our crude solutions to making Redshift work better took up time that we'd rather have spent doing the jobs we were hired for. At the end of the day, we weren’t able to get the level of reliability we needed from Redshift.
To be fair, some of the issues we had are actively being addressed by the AWS team. Nevertheless, we began considering other data warehouse solutions.
How we chose Snowflake
In early 2017 we built a Stitch destination for Snowflake. We’ve seen some exciting adoption since then. Today, hundreds of companies are using Stitch and Snowflake together, and Snowflake has become a close partner of ours. You can even sign up for Stitch from within your Snowflake account. Our team has had lots of experience working with Snowflake from both a partnership and a technical standpoint. As a result, we have become hyperaware of both the ideological alignment of our companies' mission statements and the high degree of success our shared customers are having.
In addition to all of this, Talend, which acquired Stitch last year, is a Snowflake customer. Migrating to the same data warehouse that the rest of the company uses was another compelling reason for choosing Snowflake.
Snowflake's powerful features promised to help our warehouse perform well even if we began to experience heavy loads. Right away, we saw several Snowflake features that we were eager to take advantage of:
Auto-scaling — Snowflake’s warehouses (compute nodes) can scale for both performance (warehouse size) and concurrency (warehouse clusters), concurrently or independently. We decided to set up a single extra small warehouse with autoscaling up to 5 nodes. When Snowflake sees a high query load, it can automatically start adding nodes to our cluster to deal with the increased demand. That gives us more power when we need it, and less cost when we don’t.
At the time we made the decision to migrate, scaling Redshift required active management, taking time that we weren't willing to invest. Amazon has since added capabilities in this area.
Native support for structured and semistructured data — Snowflake supports the VARIANT data type, which can store values of any other type, including OBJECT and ARRAY. When Stitch loads JSON data to Snowflake, it uses the VARIANT type to store the JSON contents in a single column of a table. When you have all of your data in one table, you can use the powerful JSON querying tools to work with JSON in Snowflake.
Redshift has limited support for semistructured data types; instead, we relied on subtables, which forced us to do multiple joins to get all the data into one table again for reporting.
Snowflake Data Sharing — With Snowflake Data Sharing, we can securely share datasets with anyone in or outside of our organization. This comes in handy now that Stitch is part of Talend. When you share data in Snowflake, it doesn’t move any data from S3, other folks just get access.
It's not as easy to share data in Redshift. Giving another person access takes manual work with an S3 bucket.
And more —
- Time Travel lets us instantly roll back the entire warehouse to any point in time during a chosen retention window.
- Cloning lets us quickly clone databases, schemas, and tables without copying the data.
- Snowflake uses result caching to retrieve only data that’s different since the last time you executed your query.
The data warehouse migration process
Migrating the data warehouse on which all of our analytics reports rely was a big and scary prospect. To get started, I pointed our internal Stitch account at a new Snowflake warehouse and, within a day, Stitch had completed historical loads of almost all of our production datasets. This was great for 80% of our data migration needs.
Stitch dutifully loaded all of our historical data for SaaS tools like Zuora, Jira, and Intercom into Snowflake. However, we had a few datasets where historical data wasn’t available for Stitch to resync for various reasons — rolling deletions in the source database, for example, or manually assembled datasets. We had fresh data from these sources flowing into Snowflake, but the historical data needed to be migrated manually and inserted into the table in Snowflake where new events were being loaded. The best example of this was our Snowplow data, which we use for web analytics. This process is more complex than it sounds.
In these special cases, our consulting partner Bytecode IO helped get our data where we needed it by walking us through several tasks:
- Loading a complete Redshift snapshot to Snowflake
- Refactoring our dbt models to the proper syntax for Snowflake. Both Redshift and Snowflake are ANSI-SQL but use slightly different commands, options, and syntax, including things like handling of timezones and timestamps, window functions, datatype casting, and null handling. This was no easy job.
- Validating all dbt models and datasets to make sure the data we had in Redshift and Snowflake matched (this task took the longest)
- Inserting historical data into the “live” tables being updated by Stitch
In the special case of Snowplow data, we had two different datasets in Redshift that needed to be merged with the fresh dataset in Snowflake. The first dataset was our own self-hosted Snowplow, which had historical data. The second (and our favorite) is hosted by our friends at Fishtown Analytics and is our source of truth for up-to-date Snowplow events. Since the latter is based on the Stitch Import API, it took all of two minutes to repoint the data stream to Snowflake. After that, the historical datasets from Redshift needed to be inserted into the “live” table in Snowflake with proper attention to dates, timezones, and JSON fields from subtables in Redshift. Because Redshift doesn’t support hierarchical datatypes, some of our JSON data was split into subtables when Stitch loaded the data into Redshift. Snowflake supports JSON fields to contain those values. We needed to migrate the historical values from the subtables in Redshift into the “live” tables in Snowflake, recreating the JSON structure.
Migrating data models and reports
Once our data was in good shape, I turned my attention to our data models and our reports. Stitch has more than 160 dbt models that were written against our Redshift database. Though the SQL syntax is similar between Redshift and Snowflake, the differences are significant enough that each of our 13,000 lines of SQL had to be inspected and refactored where needed.
Once we had replicated all of our Redshift data to Snowflake, I began to migrate our Mode reports. Mode supports multiple database connections on the same organization. Once we added the Snowflake connection it was easy to go into each query and flip the data source from Redshift to Snowflake, then rerun the query. Most of the time the queries failed on the first run, so some refactoring was required. Most notably, we needed to add considerations around timezones, timestamps, join keys, and uppercase/lowercase column names.
My process for migrating a production report in Mode, which was designed to minimize report downtime, looked like this:
- Clone the report twice (a backup version and a Snowflake version).
- Refactor the queries in the Snowflake version of the report.
- Copy the succeeding refactored queries to the production report.
- To get all the charts to work, rename all fields to lower case in the query. This helps avoid needing to rebuild each chart using the uppercase versions of the column names. (Column names are uppercase in Snowflake, and lowercase in Redshift.)
- Compare migrated production report to backup clone to ensure the two match.
Migrating the Mode reports was our final task.
Today our migration is complete. We now have fresher data, lower query wait times, and less report downtime. Our dbt run used to take 45 to 60 minutes to complete in Redshift, and because that was a pretty long time, we ran dbt only twice each day to refresh our reports. In Snowflake, the same run takes 5 minutes or less, so we execute the run hourly.
We’re spending more money on Snowflake. We could have upped our spending on Redshift, and likely would have gotten performance improvements, but we think we're getting more value from our spending on Snowflake. We went with Snowflake because we want the freedom of a fully managed and optimized data warehouse without the need to invest expensive people hours. Snowflake has had automatic optimization baked in from the start, and we’re ready to benefit from their vision as we scale our business and data.
We're happy to have this migration behind us. Snowflake is capable of delivering more power than we're using, and that portends success as Stitch scales. If you're looking for a boost to your data warehouse, give Snowflake a try, and sign up for Stitch to get data flowing to it quickly and easily.