When should you build your own data pipeline?

The biggest and most sophisticated software companies on the planet—famously Facebook and Twitter, but others such as Etsy and Spotify as well—have built their own data pipelines, but they've done so by dedicating dozens to hundreds of the best engineers in the world to the problem.

These companies built their own pipelines because when they were initially investing in this technology it was a competitive differentiator for them. There wasn't a software vendor offering an alternative. No one else had the engineering prowess to do what they did in 2009, 2010, and 2011, and this data sophistication is how they monetized their huge user bases.

Today, there's not a compelling reason to build your own pipeline. We believe that the vast majority of online businesses today are actually trying to solve the same data challenges, over and over again, and spending far too much valuable time and energy doing it. We believe that these companies would be far better served by buying a data pipeline rather than building their own from scratch.

Ten years ago, it wasn't uncommon for an online business to build their own CRM. This wasn't an unreasonable choice back then—enterprise solutions were a poor fit for many growing companies and Salesforce.com and others hadn't yet provided an effective alternative. But today, businesses don't just decide to build their own CRM. It's too much work, it's too expensive, and the alternatives are just too good. We believe this same transition is happening today for data pipelines.

Buying a data pipeline

When setting out to purchase a data pipeline, start by answering three important questions about your data needs:

What data sources are important for you to analyze today?
What data sources will likely be added in the future?
What is your best projection of data growth?

Every aspect of your organization now has data associated with it, and this data lives in different systems: customer data, transactional data, product usage, web clickstream, advertising, email marketing, CRM, accounting, operational data, and more. Both the breadth of the data you'll need to consolidate and the specific sources that are the highest priority will shape your choice of data pipeline.
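For the third question, a rough compound-growth calculation is often enough to frame the conversation with a vendor. The sketch below is only an illustration: the starting volume, the monthly growth rate, and the function name `project_volume` are all hypothetical, and real growth is rarely this smooth.

```python
def project_volume(current_gb, monthly_growth_rate, months):
    """Compound the current data volume forward at a fixed monthly rate."""
    return current_gb * (1 + monthly_growth_rate) ** months


# Hypothetical inputs: 500 GB today, growing 5% per month.
for months in (12, 24, 36):
    projected = project_volume(500, 0.05, months)
    print(f"{months} months: about {projected:.0f} GB")
```

Swap in your own figures and compare the result for your planning horizon against the volumes a vendor already handles for other clients.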

Grow one data source at a time

One area where we see leaders get tripped up is assuming that they need to start with a big bang approach, waiting to roll out a data pipeline until it incorporates all important organizational data. We suggest, instead, an incremental approach: prioritize the data sources with the highest immediate return and get those up and running. Then, work to grow the universe of data you're consolidating one source at a time.

This approach requires that, before making any purchase decision, you build a comprehensive list of the sources you will eventually want to consolidate. Even if a given source isn't a high priority today, you need to make sure that your platform of choice will support it when you get there.

Additional criteria to use when evaluating your data pipeline purchase

In addition to being able to consolidate all the data that is relevant to your organization, there are several additional criteria that you should use when evaluating tools:

Level of implementation & maintenance effort required. How much regular engineering time will you be expected to dedicate to getting the pipeline up and running, and then maintaining it? If you'll need to dedicate employees to maintenance, be sure to factor this into your budget.

Real-time / streaming updates. How long do you have to wait to see new data?

Change data capture. Does the vendor rely on change data capture (CDC) to transfer only what has changed, or does it copy the whole dataset on every sync? Copying the whole dataset every time is not a viable option for today's huge datasets.

Scalability. Does the vendor have other clients who are processing larger volumes of data than your current and likely future needs?

Auditability. Can you inspect and audit every part of your data pipeline? Just as when building, if transparency is not built into the pipeline you buy, it will be tough for your organization to develop trust in the results.

Flexibility. When you purchase a data pipeline, do you get locked into a single suite of tools, or can you access your data via standard interfaces like SQL, ODBC, and JDBC?

Extensibility. If you have proprietary data formats that you need to support, can the vendor accommodate them?

Security & compliance. Does the vendor have an adequate plan in place to protect the security of your data? Do they have a strong track record on data security? Are they compliant with SOC 2 criteria and regulations like HIPAA and GDPR?
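To make the change data capture criterion above concrete, here is a minimal sketch contrasting a full copy with an incremental copy. It is a simplification under stated assumptions: it uses a hypothetical `updated_at` column as a high-water mark, whereas production CDC systems usually read the database's change log. The `Row` type and both function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    id: int
    value: str
    updated_at: int  # hypothetical modification timestamp


def full_copy(source):
    """Transfer every row on every sync, whether or not it changed."""
    return list(source)


def incremental_copy(source, last_synced_at):
    """Transfer only rows modified after the previous sync's high-water mark."""
    changed = [row for row in source if row.updated_at > last_synced_at]
    new_mark = max((row.updated_at for row in changed), default=last_synced_at)
    return changed, new_mark


source = [
    Row(1, "alice", updated_at=10),
    Row(2, "bob", updated_at=20),
    Row(3, "carol", updated_at=35),
]

print(len(full_copy(source)))  # every row is transferred
changed, mark = incremental_copy(source, last_synced_at=20)
print([row.id for row in changed], mark)  # only the row changed since 20
```

With the full copy, the work per sync grows with the total size of the table. With the incremental copy, it grows only with the number of changes. Note that a timestamp-based approach like this one cannot see deleted rows, which is one reason to ask a vendor exactly how they capture changes.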

Making the decision

Once you've done all of your research, it's time to make a decision. This decision will impact your organization significantly. Here's our summary of the considerations for your build-vs-buy decision:

Considerations - buying vs building your data pipeline

Consideration	Build	Buy
Technical Control	Higher	Lower
Cost of Ownership	Higher	Lower
Development Resources	Internal	External
Time to Value	Slower	Faster
Risk of Failure	Higher	Lower
Analytical Functionality	Lower	Higher

It's no secret by this point that we strongly discourage you from building your own data pipeline. The cost-benefit equation of this technical decision simply doesn't add up today in the same way that it did even a few years ago. There are plenty of areas in your data strategy where you absolutely should roll your sleeves up and get technical; we strongly caution against doing that here. Building a data pipeline is hard work and takes you away from what should be your core focus: growing your business.
