We recently wrote an in-depth post that looked at how companies like Spotify, Netflix, Braintree, and many others are building their data pipelines. As a refresher, a data pipeline is the software that consolidates data from multiple sources and makes it available to be used strategically. This data typically powers internal analytics and product features.
The one thing that almost every company we researched had in common was that they built their data pipelines using open source software and lots of custom code. Braintree devoted a team of four full-time engineers for six months to get their data infrastructure off the ground. Even after the initial launch, a two-person team is required to maintain and extend the project; a fairly serious commitment.
This approach is completely appropriate if you have the resources to devote to the pipeline, and if that's the best use of your time. But companies perennially underestimate the effort necessary to build a stable system, and the ongoing maintenance required once that system is up and running.
We recently launched Stitch, an ETL service to help companies solve this problem without spending months of engineering time (setup takes 5 minutes and it’s free forever for up to 5MM events a month). Along the way, we learned quite a lot about what’s involved in building your own data pipeline, and the challenges that cause homegrown data infrastructure initiatives to be so much more involved than teams originally anticipate.
Your company is likely adding new data sources all the time (most growing businesses are) and each new integration can take anywhere from a few days to a few months to complete. Some speed bumps that can inflate the time and cost involved are:
The integration is different from what you’ve built in the past. Not every product provides a vanilla REST API. Some REST APIs are surprisingly convoluted, and some are still stuck on protocols like SOAP.
The API is not rigorously and accurately documented. Even APIs from reputable, developer-friendly companies sometimes have poor documentation.
The API has a large surface area. For example, a big CRM or ERP system might have dozens of different built-in resource endpoints along with custom resources and fields, all of which add to the complexity.
Once you have the data connections set up, they break. Once you have built more than a handful of connections, you’ll spend a significant amount of engineering time addressing breaking API changes.
For example, Facebook’s “move fast and break things” approach to development means frequent updates to their reporting APIs. It’s not like you’ll get notified in advance either, unless your team invests in building a close enough relationship with the team building the API to get a “through the grapevine” heads-up.
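Since advance notice of an API change is rare, the practical fallback is to detect the change yourself, at ingestion time, before bad records flow downstream. Here's a minimal sketch of a response-shape check; the field names are hypothetical, not from any real reporting API:

```python
# Validate that each API response row still has the shape we built the
# integration against. Field names here are illustrative placeholders.
EXPECTED_FIELDS = {"id", "campaign_name", "impressions", "spend"}

def validate_report_row(row: dict) -> list:
    """Return a list of problems with one response row (empty list = OK)."""
    problems = []
    missing = EXPECTED_FIELDS - row.keys()
    if missing:
        problems.append("missing fields: %s" % sorted(missing))
    extra = row.keys() - EXPECTED_FIELDS
    if extra:
        # New fields aren't fatal, but they often signal an upstream change
        # worth investigating before it becomes a breaking one.
        problems.append("unexpected fields: %s" % sorted(extra))
    return problems
```

In a real pipeline, a nonempty result would feed an alerting system rather than just a return value, but the principle is the same: notice the change the moment it arrives, not when a stakeholder notices the dashboard is wrong.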
The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is to never discard inputs or intermediate forms when altering data.
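One way to honor that rule is to persist the untouched input before producing any derived record, so every output can be traced back to the exact bytes that produced it. A minimal sketch, using an in-memory dict as a stand-in for durable raw storage such as S3:

```python
import hashlib
import json
import time

# Stand-in for durable raw storage (e.g. an S3 bucket keyed by digest).
RAW_LOG = {}

def archive_then_transform(raw_payload: bytes, transform) -> dict:
    """Archive the raw input first, then transform it. The output record
    carries a digest pointing back to the exact input for auditing."""
    digest = hashlib.sha256(raw_payload).hexdigest()
    RAW_LOG[digest] = {"received_at": time.time(), "payload": raw_payload}
    record = transform(json.loads(raw_payload))
    record["_source_digest"] = digest  # audit pointer back to the raw input
    return record
```

If a transformation later turns out to be buggy, the raw archive lets you replay it; without the archive, the original data is simply gone.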
But this isn’t enough to guarantee accuracy and transparency in a scalable streaming data pipeline, because of the following complexities:
Ordering: To handle massive scale with high availability, data pipelines, including ours, are often distributed systems. This means that arriving data points can take different paths through the system, which also means they can be processed in a different order than they were received. If data is being updated or deleted, processing in the wrong order will lead to bad data. Maintaining and auditing ordering is critical for keeping data accurate.
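One common defense, sketched below under the assumption that each record carries a monotonically increasing sequence number from its source, is to apply an update only if it is newer than what's already stored, so a late-arriving stale record can't clobber fresher data:

```python
def apply_update(store: dict, record: dict) -> bool:
    """Last-writer-wins by source sequence number. Returns True if the
    record was applied, False if it was stale and skipped. Assumes each
    record has an 'id' key and a monotonically increasing 'seq' key."""
    key = record["id"]
    current = store.get(key)
    if current is not None and current["seq"] >= record["seq"]:
        return False  # arrived out of order; newer data already present
    store[key] = record
    return True
```

This doesn't make the pipeline globally ordered, but it does make out-of-order delivery harmless for upserts, which is usually what matters.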
Schema evolution: What happens to your existing data when a new property is added, or an existing property is changed? Some of these changes can be destructive or leave data in an inconsistent state. For example, what happens if the pipeline starts receiving string values for a field that is expected to be an integer datatype?
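For that last case, one reasonable policy (a sketch, not the only answer) is to coerce values that still parse cleanly and quarantine the rest for human review, rather than either crashing the load or silently dropping rows:

```python
def coerce_int(value, quarantine: list):
    """Handle a field that should be an integer but may arrive as a
    string. Parseable values are coerced; unparseable ones go to a
    quarantine list for review and the field is nulled out."""
    if isinstance(value, int):
        return value
    try:
        return int(str(value).strip())
    except ValueError:
        quarantine.append(value)
        return None
```

The key property is that nothing is lost: coerced values load normally, and quarantined values remain inspectable so the schema decision can be made deliberately instead of implicitly.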
The fresher your data, the more agile your company’s decision-making can be. But, even with the growing ecosystem of low-latency stream processing tools like Apache Kafka and Spark, achieving low latency from end-to-end isn’t easy. Extracting data from APIs and databases in real-time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than a stream. Solving these problems increases the complexity of the system, meaning it will take longer to build and have a higher maintenance risk.
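A common compromise between a pure stream and bulk loads is micro-batching: buffer incoming records and flush in chunks when the buffer gets full or old. A minimal sketch (the size and age thresholds are illustrative):

```python
import time

class MicroBatcher:
    """Buffer streaming records and flush them in chunks, since
    destinations like S3 and Redshift prefer bulk loads over a
    record-at-a-time stream. Flushes when the buffer reaches
    max_size records or the oldest record is max_age seconds old."""

    def __init__(self, flush_fn, max_size=500, max_age=60.0):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self.max_age = max_age
        self.buffer = []
        self.started = None  # when the current batch began

    def add(self, record):
        if not self.buffer:
            self.started = time.monotonic()
        self.buffer.append(record)
        if (len(self.buffer) >= self.max_size
                or time.monotonic() - self.started >= self.max_age):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

Tuning max_size and max_age is exactly the latency-versus-load-efficiency trade-off described above: smaller batches mean fresher data and more load overhead.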
The more successful your company, and the more data-reliant your team becomes, the more quickly you'll go from generating thousands of rows per hour to millions of rows per second. Your pipeline needs to be architected from the ground up to scale in order to survive in this scenario. Otherwise you'll find yourself locked into an architecture that you quickly outgrow and need to rebuild to accommodate your increased data volume.
At Stitch, we have been building data pipelines since 2008 (first as RJMetrics), and our solutions have evolved to support dramatic increases in volume and variety of data. Today, we can scale to support arbitrarily large throughput, but the amount of resources it took to build and rebuild it along the way would not have been feasible for a company that didn’t focus entirely on data.
Interacting with systems you don’t control like web APIs means you might get results you don’t expect. Some examples we have seen include:
APIs like Salesforce allow custom fields to be defined, which means the structure of the data is constantly changing.
Sudden changes in the output of an API after you’ve built an integration, like UTF-8 characters you’ve never seen before.
APIs with rate limits. This is especially important if you’re using other critical tools on top of the API, which we often see with Salesforce. It’s not acceptable for your data extraction job to take your entire sales team offline for the day.
If you’re not prepared to accommodate data of different shapes and types, and quickly address changes as they happen, your pipeline will break.
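The rate-limit problem in particular is usually solved with a client-side throttle, so the extraction job voluntarily stays under its share of the quota. A minimal sliding-window sketch (the limit and window values are placeholders; a daily Salesforce-style quota would use a much larger window):

```python
import time

class RateLimiter:
    """Client-side throttle: allow at most `limit` calls per `window`
    seconds, so an extraction job can't exhaust an API quota shared
    with other critical tools. The injectable clock makes it testable."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = []  # timestamps of recent allowed calls

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True
```

A caller that gets False would sleep and retry rather than hammering the API, leaving headroom for everyone else using the same credentials.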
With data coming from such a large number of sources, failures are inevitable. The question is, how long will it take you to catch them?
Failure scenarios include:
An API is down for maintenance
API credentials expire
API calls are returning successfully, but do not actually contain any data
There’s network congestion preventing communication with an API
The pipeline destination (e.g. a data warehouse) is offline
We spend weeks testing and validating every connector we build to shake out avoidable errors. To be responsive when unavoidable issues do arise, we’ve implemented systems like:
Proactive notification directly to end-users when API credentials expire
If a third-party API reports an issue, we retry the query with exponential backoff before giving up and marking it as an error
If there’s an unexpected error in a connector, we abort the job and automatically create a ticket to have an engineer look into it
Utilizing systems-level monitoring for things like errors in networking or databases so that we’re automatically alerted to system-wide disruptions
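The retry-with-backoff behavior from that list can be sketched in a few lines; this is an illustrative version, not our production implementation (the injectable sleep function exists so it can be tested without waiting):

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)
    before re-raising the final error, which is the point at which a
    pipeline would mark the job as failed and alert someone."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * (2 ** attempt))
```

Backoff handles transient failures (maintenance windows, network congestion) gracefully, while persistent failures still get escalated instead of retried forever.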
The true safety net of this system? A rotating team of developers who use PagerDuty to handle urgent support at any time of day. That’s ultimately what it takes to avoid data discrepancies and missing data: automated alerts combined with a team of human beings dedicated to fixing issues as they happen, even if it’s 2 am.
7. If your product is your baby, internally-built tools are the stereotypical middle child
Internally-built tools never receive the same attention as your product. While supporting technologies are being iterated on literally every day, internal resources are rarely dedicated to keeping up with them and stakeholders pay the price.
One of our sales engineers was recently at AWS re:Invent, speaking with developers who had already completed a pipeline project for their companies. When he asked what part they'd built, they usually said that they'd helped get the project up and running, and now they just deal with the Kafka machines. When he asked if they knew how the data got to Kafka, or how much work had been done since the project wrapped, the responses were usually "not my department."
It’s a challenge to justify ongoing investment in a home-grown data pipeline. And without this investment, it’s all too easy for things to derail.
The real cost of building
Building a data pipeline can be a really exciting engineering challenge for your company. It certainly has been for us. But it’s a project that is never done. It requires constant dedicated engineering resources to keep data flowing while staying on top of the new technologies that will keep up with the speed and diversity of data proliferation. Most companies would rather have their engineering teams fully dedicated to their product instead of ongoing data infrastructure maintenance. If that sounds like you, you should give Stitch a try, and benefit from all the lessons we’ve learned the hard way over the last 5+ years.