ETL: build or buy?

When you first implement a data warehouse and need to populate it with data, you face a “build or buy” decision: Should we use a commercial data pipeline or should we write our own ETL code in-house? Here are a few factors that may help you decide which path makes sense for your organization.

For any “build or buy” decision, there are several factors to consider: your upfront investment cost, ongoing maintenance costs, how flexible your solution will be, and whether it can meet your specific needs.

Bear in mind that ETL comprises several pieces, including not only code to extract, transform, and load data, but also the infrastructure for making those pieces run reliably and at peak performance. That means “build or buy” doesn’t have to be one monolithic decision.

Let’s look at each of the pieces, starting with data extraction. For popular data sources, there’s no good reason to build the “extract” part of the pipeline yourself. Salesforce has more than 150,000 customers; it would be a tremendous waste of time and energy if they all wrote their own code to pull from the Salesforce API. And Salesforce is one of thousands of SaaS tools out there; the same logic applies to them all. If someone offers a high-quality, easy-to-use extraction tool, it makes sense to take advantage of it.

If you’re using less popular data sources, the decision is slightly more complicated. You may not be able to find a commercial ETL tool that has a native connector for your niche software. In that case, you can turn to an open source framework like Singer to build your own integration. By using existing Singer libraries you can build a tap for your data source much faster than if you were starting from scratch. You can also take advantage of the Singer community, which includes consultants and other users who have experience using Singer tools to build taps.
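
To make that concrete, here's a rough sketch of what a minimal Singer tap can look like in Python using the singer-python library. The API endpoint, stream name, and schema are hypothetical stand-ins for whatever niche source you're extracting from; a real tap would also handle catalogs, incremental bookmarks, and error handling.

```python
# A minimal sketch of a Singer tap for a hypothetical internal "users" API.
# The endpoint and field names are placeholders for your own source.
from datetime import datetime, timezone

import requests
import singer

SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}

def main():
    # Declare the stream and its primary key so any downstream target knows how to load it.
    singer.write_schema("users", SCHEMA, ["id"])

    # Pull rows from the (hypothetical) source API and emit them as RECORD messages.
    response = requests.get("https://internal.example.com/api/users", timeout=30)
    response.raise_for_status()
    for row in response.json():
        singer.write_record("users", row)

    # Emit a STATE message so the next run can pick up where this one left off.
    singer.write_state({"users": {"last_synced": datetime.now(timezone.utc).isoformat()}})

if __name__ == "__main__":
    main()
```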

Even if the software you need to extract from has a market of exactly one company — that is, you built it yourself for an internal use case — it still can pay to build it in Singer, to get interoperability with the rest of the Singer ecosystem and all of the utilities that Singer provides around things like metadata management, data typing, and state management. Each new tap that conforms to the Singer spec is automatically compatible with other Singer-built components, which means you can load the data from your tap into CSV files, Google Sheets, and any other Singer target. And — another benefit — anything you write to the Singer standard can be run on Stitch, which lets you offload the infrastructure management part of the pipeline.
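
Because every tap emits the same stream of JSON-encoded SCHEMA, RECORD, and STATE messages on stdout, a target only needs to understand that stream to work with any tap. The sketch below is a deliberately simplified CSV target that illustrates the idea; real targets such as target-csv also handle schema validation, data typing, and forwarding state back to the runner.

```python
# A deliberately simplified Singer target: read SCHEMA and RECORD messages from
# stdin and append records to one CSV file per stream. Real targets do much more.
import csv
import json
import sys

writers = {}  # stream name -> csv.DictWriter
files = {}    # stream name -> open file handle

for line in sys.stdin:
    message = json.loads(line)
    if message["type"] == "SCHEMA":
        stream = message["stream"]
        fieldnames = list(message["schema"]["properties"].keys())
        files[stream] = open(f"{stream}.csv", "w", newline="")
        writers[stream] = csv.DictWriter(files[stream], fieldnames=fieldnames)
        writers[stream].writeheader()
    elif message["type"] == "RECORD":
        writers[message["stream"]].writerow(message["record"])
    # STATE messages would normally be echoed back so the runner can persist them.

for handle in files.values():
    handle.close()
```

Pipe the tap sketch above into this target and you get CSV files; swap in any other spec-compliant target and the tap doesn't change at all.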

Loading is another task where it doesn’t pay to reinvent the wheel. Let’s say you’re loading to an Amazon Redshift data warehouse. Tens of thousands of companies use Redshift, and it doesn’t make sense for all of them to invest time in optimizing Redshift loader performance when they can take advantage of ETL platforms that have already done that work. As an example, with Redshift, if you’re not careful with the number of concurrent write streams and the batch size of your data inserts, you’ll hurt the performance of your analytical queries — but you wouldn’t know this until you ran into a problem.
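
To give a feel for the tuning involved, here's a rough sketch of the standard bulk-load pattern: stage a large batch of records in S3, then issue a single COPY so Redshift ingests it in parallel, rather than firing off many small INSERTs that compete with analytical queries. The bucket, table, credentials, and IAM role are placeholders, and a production loader would also handle compression, retries, and manifest files.

```python
# Sketch of bulk-loading into Redshift: stage one large batch in S3, then COPY.
# Bucket, table, credentials, and IAM role below are placeholders.
import json

import boto3
import psycopg2

BATCH_SIZE = 100_000  # flush in large batches; many small INSERTs degrade query performance

def flush_batch(records, batch_number):
    if not records:
        return
    # Stage the batch as newline-delimited JSON in S3.
    body = "\n".join(json.dumps(record) for record in records)
    key = f"staging/events/batch-{batch_number}.json"
    boto3.client("s3").put_object(Bucket="my-etl-staging-bucket", Key=key, Body=body)

    # One COPY per batch lets Redshift load the file in parallel across its slices.
    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="loader", password="replace-me")
    with conn, conn.cursor() as cur:
        cur.execute(
            f"COPY analytics.events "
            f"FROM 's3://my-etl-staging-bucket/{key}' "
            f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
            f"FORMAT AS JSON 'auto'"
        )
    conn.close()
```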

Unlike the situation with the extract and load phases, the answer to whether to build your own transformations may be specific to your company. Data transformation involves custom business logic, and may change over time. However, even here you can take advantage of some commonalities. An open source tool like dbt, which lets you run transformations from the command line, allows you to take advantage of prebuilt models for common data sources and then customize them to suit your needs.

When you build your own data pipeline, you face a host of housekeeping tasks in addition to the extract, transform, and load processing. You need to provide authentication and credential management. You need to monitor the data as it transits the pipeline and create alerts when you encounter problems. You also need to monitor data volume and autoscale to handle peaks in traffic load. None of that is fun or creative, and it takes developer time away from more valuable projects.
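
As a small illustration of what that housekeeping looks like, here's a hypothetical retry-and-alert wrapper for a pipeline step. It covers only one of the concerns above; credential rotation, volume monitoring, and autoscaling each need their own glue code and their own testing.

```python
# Hypothetical glue code: retry a pipeline step with backoff and alert if it keeps failing.
# The alert callback is a placeholder for whatever paging or chat integration you use.
import logging
import time

def run_with_retries(step, max_attempts=3, backoff_seconds=60, alert=lambda message: None):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            logging.warning("Attempt %d/%d of %s failed: %s",
                            attempt, max_attempts, step.__name__, exc)
            if attempt == max_attempts:
                alert(f"{step.__name__} failed after {max_attempts} attempts: {exc}")
                raise
            time.sleep(backoff_seconds * attempt)
```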

All of the factors we’ve looked at so far apply to initial development. Bear in mind that the cost of code maintenance can be just as significant: source APIs change, schemas evolve, and someone has to keep the pipeline code in step with them.

Finally, there’s the question of support. Do you want to assign your best engineers to support an internally built ETL pipeline? If not, do you want to carry the pager when the CEO’s dashboards go down or the data goes stale?

Nowadays, many people say that engineers shouldn’t write ETL. We agree that, for most organizations, it doesn’t make sense to reinvent the wheel. And we make it easy for you to check our thinking: Stitch is free to use for data volumes of less than five million rows per month. Test it out against one or several of your data sources and draw your own conclusions about whether it makes sense to write your own ETL code.