Your data pipeline is the most important part of your data strategy because it forms the foundation of all the analysis that will be built upon it. Get your data pipeline right and your engineers won’t struggle with maintenance, your analysts will be able to work efficiently, your decision makers will trust the data, and your production systems won’t be adversely impacted. Get it wrong and things get ugly. Companies have wasted many years and dollars on failed data pipeline projects.
Here's what a data pipeline looks like:
The first and most important choice you’ll make when creating your data pipeline is build vs. buy. Both are viable options, and there are times when each is appropriate. But the tradeoffs are significant, and it’s all too common that decision-makers don’t understand them before diving in.
Let’s take a look at what’s involved in building a data pipeline.
Building your own data pipeline
Historically, all organizational data lived behind the firewall in a series of SQL-based relational databases. All it took to pull together a data pipeline was SQL, shell scripts, and a cron scheduler. Dump some tables, move the files to another server, and load them into another database. Easy-peasy. Companies were used to building these types of data pipelines, and it wasn’t hard to find an engineer with experience doing it.
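For readers who never saw one, here is a minimal sketch of that dump-and-load pattern. It is written in Python with the standard library’s `sqlite3` and `csv` modules rather than shell scripts, purely so that it is self-contained. The `orders` table and the `dump_table` / `load_table` helpers are hypothetical names chosen for illustration, and in-memory databases stand in for the two servers.

```python
import csv
import io
import sqlite3


def dump_table(conn, table):
    """Dump every row of `table` to CSV text, header row first."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def load_table(conn, table, csv_text):
    """Replace the contents of `table` with the rows in `csv_text`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    with conn:  # one transaction: the full copy lands, or nothing does
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", reader
        )


def demo():
    # Hypothetical source and warehouse databases with the same schema.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    load_table(warehouse, "orders", dump_table(source, "orders"))
    return warehouse.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
```

Note that every run copies the entire table. That was acceptable when tables were small and lived in databases you controlled.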
But that was before the world changed in two important ways.
- Web-scale businesses have seen data volumes explode.
Online businesses often have hundreds of thousands or millions of users pushing hundreds of millions or billions of events every single month. Ad impressions, email clicks, page views, purchases, product interactions. This explosion in data volume is what people mean when they say “big data”—it’s a completely different scale of data collection, and requires completely different technologies to process effectively.
- Organizational data is no longer on-premises.
It used to be common for all organizational systems to live “behind the firewall”. With the advent of SaaS as the primary method for delivering innovative business software, this no longer holds. Organizations no longer have root access to the databases that their data lives in; instead, they need to build integrations that pipe data via the APIs of each product they use.
The combination of these two changes pulls the rug out from under the simple data pipelines of yesterday. SQL, shell scripts, and cron are no longer up to the task: there’s too much data, and it all speaks different languages. Today’s data pipelines need many more moving pieces to deal with these challenges.
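To give a feel for what one of those API integrations involves, here is a minimal sketch of paging through a vendor’s API while staying under a fixed call quota. Everything in it is an assumption made for illustration: `fetch_within_budget`, the `fake_api` stand-in, and the `records` / `next_cursor` response fields are hypothetical and do not describe any real product’s API.

```python
def fetch_within_budget(fetch_page, max_calls, cursor=None):
    """Pull pages until the API is exhausted or `max_calls` is used up.

    Returns (records, cursor). A non-None cursor means there is more data
    to fetch, and the caller can resume from it once the quota resets.
    """
    records = []
    for _ in range(max_calls):
        page = fetch_page(cursor)
        records.extend(page["records"])
        cursor = page.get("next_cursor")
        if cursor is None:
            break
    return records, cursor


def fake_api(cursor):
    """Hypothetical two-page API used only to exercise the sketch."""
    pages = {
        None: {"records": [1, 2], "next_cursor": "p2"},
        "p2": {"records": [3], "next_cursor": None},
    }
    return pages[cursor]
```

With a budget of one call, `fetch_within_budget(fake_api, 1)` stops after the first page and hands back the cursor `"p2"`, so the next run can pick up where this one left off instead of starting over. A real integration would also have to persist that cursor and cope with errors and schema changes.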
What is involved in building a modern data pipeline?
Here are some of the specific core challenges involved in building a modern data pipeline.
- API integration is costly to maintain. Each API is a unique snowflake, and unfortunately, they’re often not well-documented. They also change constantly: it’s not unusual to set up an API integration and then see a sudden change, like UTF-8 characters you’ve never seen before. And if you can’t keep up, your jobs inevitably start failing. Rate limiting is also an issue. The Salesforce API, for example, has a very low daily limit. Exceed it, and you’ll suddenly shut down access to all your business-critical CRM-integrated tools.
- Huge datasets are too large to copy on a regular basis. Many of the datasets you’re working with are too large to copy in full on every run. To keep the data timely, you’ll need to implement change data capture (CDC), which moves only the rows that have changed since the last run, rather than copying the whole dataset each time.
- It’s easy to underestimate data transformation requirements. Traditional data transformation technologies (the “T” in “ETL”) are simply too slow for large datasets, so you need to make sure you’re choosing something fit for the task. And even once you have a good tool, you’ll have to be ready for significant maintenance along the way. Something as simple as inconsistent date or currency reporting can result in impedance mismatch problems that you’ll need to dedicate engineering time to resolving.
- Behind-the-firewall data sources are no longer just relational databases. Even when you’re dealing with your in-house data sources, you’re looking at a proliferation of data in various formats. Ultimately, you need to consolidate all of your data into a single platform for analysis, and transforming data in a MongoDB instance into a relational structure is non-trivial.
- You have to be prepared to scale. Building a modern data pipeline involves significant scale challenges as data volumes grow from thousands of rows per hour to millions of rows per second. Working with data volumes of this size is a very specialized engineering skillset that is extremely hard to hire for.
- Complexity of the environment increases the potential for job failures. With data coming in from such a large number of sources, job failures are inevitable. You need to implement robust monitoring and alerting so that the people who use your data can see the pipeline’s status; otherwise, you risk eroding their trust.
- Complexity of the environment requires auditing for the business users to trust the data. If part of your data pipeline happens in a black box, you’ll need to build in auditability. Otherwise, your employees will have a hard time accepting its outputs and developing trust in the system.
- Data pipeline technology requires a different kind of engineering skill. Managing the various services involved requires DevOps skills: cluster management, load balancing, redundancy. At a certain point, you’re asking your engineers to build an entire product in addition to your actual product.
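To illustrate the change data capture challenge from the list above, here is a minimal sketch of its simplest, query-based form: remember the highest `updated_at` value you have copied (a "watermark") and fetch only rows newer than that. This is an assumption-laden illustration, not a production design. The `users` table, its columns, and `sync_changes` are hypothetical, and many production CDC systems read the database’s replication log instead of querying a timestamp column.

```python
import sqlite3


def sync_changes(source, target, last_seen):
    """Copy rows changed since `last_seen`; return (new_watermark, rows_copied)."""
    rows = source.execute(
        "SELECT id, email, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    with target:
        # INSERT OR REPLACE overwrites the existing row with the same primary key.
        target.executemany(
            "INSERT OR REPLACE INTO users (id, email, updated_at) VALUES (?, ?, ?)",
            rows,
        )
    new_watermark = rows[-1][2] if rows else last_seen
    return new_watermark, len(rows)


def demo():
    schema = (
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at INTEGER)"
    )
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute(schema)
    target.execute(schema)
    source.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "a@example.com", 100), (2, "b@example.com", 110)],
    )
    watermark, first_run = sync_changes(source, target, 0)
    # Only one row changes before the next run, so only one row is copied.
    source.execute(
        "UPDATE users SET email = ?, updated_at = ? WHERE id = ?",
        ("a2@example.com", 120, 1),
    )
    watermark, second_run = sync_changes(source, target, watermark)
    return first_run, second_run, watermark
```

The first run copies both rows; the second copies only the one that changed. One limit of this timestamp approach is that rows deleted from the source never show up in the query, which is one reason log-based CDC is often preferred.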
What you get for doing all of this work is ultimate control to do exactly what you want. And yes, that’s the kind of thing that will get any nerd excited, but it’s an expensive and time-consuming undertaking. What your CEO, analysts, and decision makers will care about isn’t how groundbreaking your data pipeline is, but what they can do with the data once they get it.
If you choose to build, new technologies exist to provide you with the building blocks, which is great, but putting them together successfully is non-trivial. For example, Luigi, Chronos, Azkaban, and Oozie are all job schedulers. And that’s just job schedulers! The jobs themselves can be written with MapReduce, Pig, Hive, R, Python, Spark, or any of a huge variety of data science tools.
It certainly can be done, though. There are amazing companies who have written about their successes building data pipelines. Take a look and see what you think.
How much does it cost to build a data pipeline?
This complexity translates into a high price tag. In the table below, we’ve presented the range that organizations are paying today to build their own pipelines. This data has been gathered from dozens of interviews with online businesses. On the left, we have a basic data pipeline for small data sizes. This pipeline will be less reliable, deal with fewer data sources, and smaller total data volume. On the right, we have a more advanced pipeline.
How much does it cost to build your own data pipeline?
When should you build your own data pipeline?
The biggest and most sophisticated software companies on the planet—famously Facebook and Twitter, but others such as Etsy and Spotify as well—have built their own data pipelines, but they’ve done so by dedicating dozens to hundreds of the best engineers in the world to the problem.
These companies built their own pipelines because when they were initially investing in this technology it was a competitive differentiator for them. There wasn’t a software vendor offering an alternative. No one else had the engineering prowess to do what they did in 2009, 2010, and 2011, and this data sophistication is how they monetized their huge user bases.
But today, there’s not a compelling reason to build your own pipeline. We believe that the vast majority of online businesses today are actually trying to solve the same data challenges, over and over again, and spending far too much valuable time and energy doing it. We believe that these companies would be far better served by buying a data pipeline rather than building their own from scratch.
Ten years ago, it wasn’t uncommon for an online business to build their own CRM. This wasn’t an unreasonable choice back then—enterprise solutions were a poor fit for many growing companies and Salesforce.com and others hadn’t yet provided an effective alternative. But today, businesses don’t just decide to build their own CRM. It’s too much work, it’s too expensive, and the alternatives are just too good. We believe this same transition is happening today for data pipelines.
Buying a data pipeline
When setting out to purchase a data pipeline, start by answering two important questions about your data needs:
- What data sources are important for you to analyze today?
- What data sources will likely be added in the future?
Every aspect of your organization now has data associated with it, and this data lives in different systems: customer data, transactional data, product usage, web clickstream, advertising, email marketing, CRM, accounting, operational data, and more. The breadth of the data you’ll need to consolidate as well as the specific sources that are the highest priority will impact how you build your data pipeline.
One area where we see leaders get tripped up is assuming that they need to start with a big bang approach, waiting to roll out a data pipeline until it incorporates all important organizational data. We suggest, instead, an incremental approach: prioritize the data sources with the highest immediate return and get those up and running. Then, work to grow the universe of data you’re consolidating one source at a time.
Using this approach requires that you build a comprehensive list of sources that you will eventually want to consolidate prior to making any purchase decisions. Even if a given source isn’t a high priority today, you need to make sure that your platform of choice will support it when you get there.
Additional criteria to use when evaluating your data pipeline purchase
In addition to being able to consolidate all the data that is relevant to your organization, there are several additional criteria that you should use when evaluating tools:
- Real-time / streaming updates. How long do you have to wait to see new data?
- Change data capture. Does your vendor rely on CDC, or are they copying the whole dataset every time? Copying the whole set every time is not a viable option for today’s huge datasets.
- Scalability. Does the vendor have other clients who are processing larger volumes of data than your current and likely future needs?
- Auditability. Can you see and audit every part of your data pipeline? Just as when building, if there is no transparency built into the pipeline you buy, it’ll be tough for your organization to develop trust in the results.
- Flexibility. When you purchase a data pipeline, do you get locked into a single suite of tools, or can you access your data via standard interfaces like SQL, ODBC, and JDBC?
- Security. Does the vendor have an adequate plan in place to protect your data? Do they have a track record of keeping customer data secure?
- Level of implementation & maintenance effort required. How much regular engineering time will you be expected to dedicate to getting the pipeline up and running, and then maintaining it? If you’ll need to dedicate employees to maintenance, be sure to factor this into your budget.
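One way to test the auditability criterion during an evaluation is a simple reconciliation check: count the rows in each source table, count the rows that arrived in the warehouse, and flag any table where the two disagree. The sketch below assumes you can already get both sets of counts as dictionaries. `reconcile_counts` and the table names are hypothetical, and matching row counts are a necessary sign of a faithful copy, not a sufficient one.

```python
def reconcile_counts(source_counts, warehouse_counts):
    """Return {table: (source_rows, warehouse_rows)} for every mismatch.

    A table that is missing on one side is treated as having zero rows.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(warehouse_counts)):
        expected = source_counts.get(table, 0)
        actual = warehouse_counts.get(table, 0)
        if expected != actual:
            mismatches[table] = (expected, actual)
    return mismatches


def demo():
    # Hypothetical counts gathered from the source systems and the warehouse.
    source = {"orders": 100, "users": 50, "refunds": 7}
    warehouse = {"orders": 98, "users": 50}
    return reconcile_counts(source, warehouse)
```

In this example the check reports `orders` (100 rows at the source, 98 in the warehouse) and `refunds` (never loaded at all), while `users` passes. If a vendor gives you no way to run a check like this, the pipeline is effectively a black box.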
Making the decision
Once you’ve done all of your research, it’s time to make a decision. This decision will impact your organization significantly. Here’s our summary of the considerations for your build-vs-buy decision:
Considerations - buying vs building your data pipeline
|Cost of Ownership
|Time to Value
|Risk of Failure
It’s no secret by this point that we strongly discourage you from building your own data pipeline. The cost-benefit equation of this technical decision simply doesn’t add up today the way it did even a few years ago. There are plenty of areas in your data strategy where you absolutely should roll up your sleeves and get technical; we strongly caution against doing that here. Building a data pipeline is hard work and takes you away from what should be your core focus: growing your business.