A trip through Stitch’s data pipeline

We often see new users implement Stitch in less than five minutes. While the user experience for adding a data source and a destination is simple, there’s a lot of complexity behind the scenes. Let’s pull back the curtain and see how each record gets from its source to its destination through Stitch’s data pipeline.

The journey typically begins in a SaaS application or database. From there, data is either pushed to Stitch through our API or a webhook, or pulled on a schedule by the Singer-based replication engine that Stitch runs against data sources like APIs, databases, and flat files.
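Under the hood, Singer taps and targets speak a simple line-delimited JSON protocol: a tap emits SCHEMA, RECORD, and STATE messages on stdout. Here’s a rough example of what that stream looks like for a hypothetical `users` table (the field values are made up, and the `bookmarks` shape of the STATE value is a common convention rather than a requirement of the spec):

```json
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"type": "object", "properties": {"id": {"type": "integer"}, "name": {"type": "string"}, "updated_at": {"type": "string", "format": "date-time"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada", "updated_at": "2018-06-01T00:00:00Z"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Grace", "updated_at": "2018-06-02T00:00:00Z"}}
{"type": "STATE", "value": {"bookmarks": {"users": {"updated_at": "2018-06-02T00:00:00Z"}}}}
```

The STATE message records how far replication has progressed, which is what makes scheduled, incremental pulls possible.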

[Diagram: Stitch’s data pipeline]

The data’s next stop is the Import API — a Clojure web service that accepts JSON and Transit, either one data point at a time or in large batches. The Import API validates the incoming data and authenticates the request’s API token before writing the data to a central Apache Kafka queue.
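To make that ordering concrete, here’s a rough sketch (not Stitch’s actual code) of what the accept path looks like: authenticate the token, validate the payload, and only then enqueue. The helper names (`valid-token?`, `validate-records`, `enqueue!`) and the Ring-style request map are assumptions for illustration:

```clojure
;; Hypothetical sketch of the Import API's accept path; not Stitch's actual code.
;; Assumes a Ring-style request map whose body has already been parsed from JSON or Transit.

(defn valid-token? [token]
  ;; Placeholder check: a real implementation would look the token up
  ;; and resolve it to a specific client account.
  (boolean (and token (re-matches #"[A-Za-z0-9]{32,}" token))))

(defn validate-records [records]
  ;; Placeholder structural check: every record must name a destination table
  ;; and carry a map of data.
  (and (sequential? records)
       (every? #(and (:table_name %) (map? (:data %))) records)))

(defn handle-push
  "Authenticate, validate, then enqueue. `enqueue!` stands in for the Kafka write."
  [request enqueue!]
  (let [token   (get-in request [:headers "authorization"])
        records (:body request)]
    (cond
      (not (valid-token? token))       {:status 401 :body "invalid API token"}
      (not (validate-records records)) {:status 400 :body "malformed records"}
      :else (do (doseq [r records] (enqueue! r))
                {:status 201 :body "accepted"}))))
```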

At this point, Stitch has accepted the data. Our system is architected to meet our most important service-level target: don’t lose data. To meet this goal, we replicate our Kafka cluster across three different data centers and require each data point to be written to two of them before it’s accepted. Should the write fail, the requester retries until it succeeds.
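That durability target maps naturally onto standard Kafka settings: a topic created with a replication factor of 3 and `min.insync.replicas=2`, plus a producer that uses `acks=all` and retries failed sends. Here’s a minimal producer sketch in Clojure using the official Java client; the broker addresses and topic name are placeholders, not our real infrastructure:

```clojure
;; Minimal durability-focused producer sketch (Kafka's Java client via interop).
;; Assumes the topic was created with replication.factor=3 and min.insync.replicas=2,
;; so acks=all only succeeds once at least two replicas have the write.
(import '(org.apache.kafka.clients.producer KafkaProducer ProducerRecord))

(def producer
  (KafkaProducer.
    {"bootstrap.servers" "kafka-a:9092,kafka-b:9092,kafka-c:9092" ; placeholder brokers
     "acks"              "all"        ; wait for the in-sync replicas, not just the leader
     "retries"           "2147483647" ; keep retrying transient failures
     "key.serializer"    "org.apache.kafka.common.serialization.StringSerializer"
     "value.serializer"  "org.apache.kafka.common.serialization.StringSerializer"}))

(defn accept! [client-id json-payload]
  ;; Deref the send future so the caller only reports success after the brokers ack.
  @(.send producer (ProducerRecord. "import-records" client-id json-payload)))
```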

Under normal conditions, data is read off the queue seconds later by the streamery — a multithreaded Clojure application that writes the data to files on S3 in batches, separated according to the database tables the data is destined for. We can retain data in Kafka for multiple days, ensuring nothing is lost if downstream processing is delayed. The streamery cuts a batch once it hits either a memory limit or a time limit since the previous batch was cut. Its low-latency design aims to maximize throughput while guarding against data loss or data leaking between data sets.
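The size-or-time rule is simple enough to sketch. The thresholds and function names below are illustrative rather than the streamery’s real code, and the per-table buffers stand in for its multithreaded internals:

```clojure
;; Hypothetical sketch of size-or-time batch cutting; not the streamery's actual code.
(def max-batch-bytes (* 16 1024 1024)) ; flush once roughly 16 MB is buffered (illustrative)
(def max-batch-ms    60000)            ; ...or once 60 s have passed since the last cut

(defn cut-batch? [{:keys [buffered-bytes last-cut-at]} now]
  (or (>= (or buffered-bytes 0) max-batch-bytes)
      (>= (- now (or last-cut-at now)) max-batch-ms)))

(defn add-record
  "Buffer a record under its destination table, cutting a batch when a limit is hit.
  `upload-batch!` is a placeholder for the code that writes a file to S3."
  [state table record upload-batch! now]
  (let [state (-> state
                  (update-in [table :records] (fnil conj []) record)
                  (update-in [table :buffered-bytes] (fnil + 0) (count (pr-str record))))]
    (if (cut-batch? (get state table) now)
      (do (upload-batch! table (get-in state [table :records]))
          (assoc state table {:records [] :buffered-bytes 0 :last-cut-at now}))
      state)))
```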

Batches that have been written to S3 enter the spool, a queue of work waiting to be processed by one of our loaders: Clojure applications that read data from S3 and do any processing necessary (such as converting data into the appropriate data types and structure for the destination) before loading it into the customer’s data warehouse. We currently have loaders for Redshift, Postgres, BigQuery, Snowflake, and S3. Each is a separate codebase and runtime because the preprocessing steps vary by destination. Operating them separately also allows each to scale and fail independently, which is important when one of the cloud-based destinations has downtime or undergoes maintenance.
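One way to see why each loader is its own codebase and runtime: they all share the read-from-the-spool workflow but differ in how a batch is prepared and loaded. A hypothetical Clojure protocol capturing that split (the names `Loader`, `prepare-batch`, `load-batch!`, and `RedshiftLoader` are illustrative, not Stitch’s real interfaces):

```clojure
;; Hypothetical per-destination loader interface; not Stitch's real code.
(defprotocol Loader
  (prepare-batch [this table records]
    "Coerce records into the types and structure this destination expects.")
  (load-batch! [this table prepared]
    "Load a prepared batch into the destination, e.g. via COPY for Redshift."))

(defrecord RedshiftLoader [jdbc-conn staging-bucket]
  Loader
  (prepare-batch [_ table records]
    ;; Redshift wants flat, typed columns; nested values might be serialized to strings.
    (map #(update % :payload pr-str) records))
  (load-batch! [_ table prepared]
    ;; A real loader would stage the batch on S3 and issue a COPY over jdbc-conn.
    (println "would COPY" (count prepared) "rows into" table)))

;; Each destination gets its own implementation and its own deployment, so a slow or
;; unavailable warehouse only backs up its own loader.
(let [loader (->RedshiftLoader nil "example-staging-bucket")]
  (->> [{:id 1 :payload {:name "Ada"}}]
       (prepare-batch loader "users")
       (load-batch! loader "users")))
```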

All of this infrastructure allows us to process more than a billion records per day and lets our customers scale their data volumes up or down by more than 100X at any time. Stitch customers don’t need to worry about any of this, however: they just connect a source, connect a destination, and let Stitch make sure the data ends up where it needs to be.

To see how it all works for yourself, sign up for a free 14-day trial.