The data infrastructure meta-analysis: How top engineering organizations built their big data stacks

In the process of validating the market for the RJMetrics Pipeline launch (the platform that became Stitch), we kept running across data points that we had never anticipated getting. It turns out that the “How we built our data infrastructure at [company name]” post is approaching meme-like status on engineering blogs across the internet. We’ve found many such blog posts in the past several months, and have enjoyed reading each of them:

  1. Zulily

  2. Asana (1, 2)

  3. Spotify

  4. Looker and Viglink

  5. Pinterest

  6. Metamarkets

  7. Braintree

  8. Seatgeek

  9. Buffer

  10. Netflix

We learned a lot from these posts. They’re all “success-biased” (no one that we’ve found has chosen to write about their data infrastructure screw-ups), but they contain many useful insights from some of the smartest people building this type of technology today. In this post, I have attempted to aggregate what I found to be the most interesting trends when looking at the group as a whole.

Insights come from integrated data

Asana’s Justin Krause said it best:

First party owned data is simply the only way to achieve true business intelligence — if we can’t join data from different sources, we can’t answer questions like:

Is our marketing campaign delivering quality users?
Requires joining ad attribution data onto engagement data

Are our customer success programs successfully driving revenue expansion?
Requires joining lists — probably in our CRM — with engagement and billing data

Did our most important customer just hit a bad bug, and we need to reach out?
Requires joining error/bug logs onto customer data
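To make the first of those questions concrete, here is a minimal sketch of joining ad attribution data onto engagement data. It uses an in-memory SQLite database purely for illustration. The table names, column names, and sample rows are all hypothetical and do not come from any of the posts, and the sketch assumes one engagement row per user.

```python
import sqlite3


def campaign_quality(attribution_rows, engagement_rows):
    """Return [(campaign, users, avg_sessions)], best campaign first.

    Hypothetical schema, assumed for illustration only:
      ad_attribution(user_id, campaign)  -- which campaign acquired the user
      engagement(user_id, sessions)      -- one row per user
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ad_attribution (user_id INTEGER, campaign TEXT)")
    conn.execute("CREATE TABLE engagement (user_id INTEGER, sessions INTEGER)")
    conn.executemany("INSERT INTO ad_attribution VALUES (?, ?)", attribution_rows)
    conn.executemany("INSERT INTO engagement VALUES (?, ?)", engagement_rows)

    # LEFT JOIN keeps acquired users who never engaged; they count as 0 sessions.
    rows = conn.execute(
        """
        SELECT a.campaign,
               COUNT(*) AS users,
               AVG(COALESCE(e.sessions, 0)) AS avg_sessions
        FROM ad_attribution AS a
        LEFT JOIN engagement AS e ON e.user_id = a.user_id
        GROUP BY a.campaign
        ORDER BY avg_sessions DESC, a.campaign
        """
    ).fetchall()
    conn.close()
    return rows


# Made-up sample data: user 4 was acquired via "social" but never engaged.
attribution = [(1, "search"), (2, "search"), (3, "social"), (4, "social")]
engagement = [(1, 12), (2, 8), (3, 1)]
result = campaign_quality(attribution, engagement)
print(result)
```

The point is less the query than the precondition: neither table can answer the question alone, so both sources have to live somewhere they can be joined.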

Most of these companies focused on two primary datasets: transactional data from production databases and user engagement data from event collectors. But some specifically highlighted pulling data from other sources as well. Many of these were marketing-focused: advertising, email, A/B testing, etc. Here’s the breakdown:

Source-vendor table

S3 and Redshift are dominant; Kafka usage is growing

AWS was heavily used across these companies, with S3 the most common service: 7 of the 11 companies in this set used S3 as part of their data infrastructure. Netflix provides an excellent rundown of its reasoning for using S3 as its primary data warehouse:

Firstly, S3 is designed for 99.999999999% durability and 99.99% availability of objects over a given year, and can sustain concurrent loss of data in two facilities. Secondly, S3 provides bucket versioning, which we use to protect against inadvertent data loss (e.g. if a developer errantly deletes some data, we can easily recover it). Thirdly, S3 is elastic, and provides practically “unlimited” size. We grew our data warehouse organically from a few hundred terabytes to petabytes without having to provision any storage resources in advance.
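For a sense of scale, the durability and availability figures in that quote can be turned into back-of-the-envelope numbers. The sketch below treats them as simple annual rates, which is a simplifying assumption (they are design targets, not guarantees), and the object count is a made-up example.

```python
import math

# Rates implied by the quoted design targets.
ANNUAL_OBJECT_LOSS_RATE = 1e-11   # 1 - 0.99999999999 (eleven nines of durability)
UNAVAILABILITY = 1e-4             # 1 - 0.9999 (99.99% availability)
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost per year at the quoted durability."""
    return object_count * ANNUAL_OBJECT_LOSS_RATE


def unavailable_minutes_per_year():
    """Minutes per year an object may be unreachable at the quoted availability."""
    return UNAVAILABILITY * MINUTES_PER_YEAR


# Hypothetical bucket holding ten million objects.
loss = expected_objects_lost_per_year(10_000_000)
downtime = unavailable_minutes_per_year()
print(f"expected loss: {loss:.4f} objects/year (about one every {1 / loss:,.0f} years)")
print(f"availability budget: about {downtime:.1f} minutes/year")
```

Under these assumptions, ten million objects works out to roughly one lost object every 10,000 years, while the availability target allows a little under an hour of unavailability per year.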

Zulily is an outlier: it was the only company we surveyed to make heavy use of Google’s cloud offerings, which it uses throughout its big data stack. We didn’t find any companies using Microsoft’s cloud platform, but that could change once its Azure SQL Data Warehouse reaches general availability later this year.

Seven of the 11 companies we profiled mentioned using an analytic data warehouse. Among those that did, Redshift was by far the dominant choice. The single company not using Redshift was Zulily, which uses Google’s BigQuery. We didn’t actually come across any discussions of why a particular columnar database was chosen. We can only presume that companies either found these technologies highly substitutable and went with the default option (typically AWS) or that there is wide acceptance that Redshift is the superior platform.

Kafka is the other technology whose usage we found noteworthy. A few years ago, Kafka was not widely deployed in building data pipelines. The Netflix and Spotify writeups are good examples of this: their pipelines are heavily based on batch jobs. But as companies have more diverse data needs and begin to focus on real-time data, Kafka has become a foundational technology. Braintree, Metamarkets, and Pinterest all use Kafka as a core part of their data infrastructures.

Tech-Vendor table

Companies are aware of three separate visualization needs

Sophisticated data consumers have begun to vocalize three separate needs within data visualization:

Dashboarding

The companies in these posts have very specific requirements for dashboards and frequently have decided to build their own tools. Asana, especially, was very clear about these requirements: smoothing, annotation, parameterization, and more.

Interactive analytics

Visualization tools that support quick, iterative, collaborative data discovery. Looker was the primary tool cited by these companies, showing up in 4 out of 11 posts. In every case, usage of Looker was paired with usage of Redshift.

Stream analytics

Used to monitor real-time data streams for anomalies and trends so that immediate action can be taken. This was the least-cited need; Asana specifically calls out Interana, and others have built custom solutions.
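As a rough illustration of what monitoring a stream for anomalies can mean in practice, the sketch below flags values that sit far from the mean of a trailing window. This is a generic rolling z-score check, not the method used by any company or product mentioned here. The function name, window size, threshold, and sample data are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(stream, window=5, threshold=3.0):
    """Yield (index, value) for points far from the trailing window's mean.

    Hypothetical helper: `window` and `threshold` are illustrative defaults.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        # Only start checking once the trailing window is full.
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        # The value joins the window either way, so a spike inflates the
        # baseline until it ages out.
        recent.append(value)


# Made-up events-per-minute counts with one obvious spike.
events_per_minute = [100, 102, 98, 101, 99, 100, 250, 101, 99]
anomalies = list(detect_anomalies(events_per_minute))
print(anomalies)
```

Because the function is a generator over any iterable, the same check could in principle be applied to records as they arrive rather than to a list, which is what distinguishes this use case from dashboards built on batch data.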

Visualization hasn’t always been separated into three discrete categories. Only recently have visualization products been built specifically to serve one of these use cases; earlier products were more general-purpose. We expect this trend to continue.

Analytics-vendor table

Data infrastructures have two primary use cases

Each of the companies we’ve looked at is building an infrastructure primarily to support either business analytics or delivery of data-enabled product features.

Delivery of data-enabled product features

Frequently, data infrastructure is used to power product features. Braintree uses its data infrastructure to deliver real-time fraud detection. Pinterest uses its data infrastructure to deliver analytics to its advertisers. Netflix and Spotify famously use their data infrastructures to power content recommendation algorithms.

Business analytics

Companies use data infrastructure to power business analytics. Seatgeek, Looker, and Asana use data for funnel analysis, A/B testing, marketing optimization and more.

These two uses for data heavily drove the technology choices made throughout the pipeline. In general, companies like Spotify, Netflix, Metamarkets, and Pinterest that were heavily focused on using data to deliver product features had very specific and technical requirements for their data infrastructures. This pulled them towards technologies like Spark, Pig, and Hive. Companies that used their infrastructures to support business analytics primarily used SQL-based tools paired with columnar data stores.

Data infrastructure has geek cred

Beyond the specific implementation details, it’s clear that companies have a pain point around data infrastructure. If they didn’t, they wouldn’t have collectively spent many thousands of hours working on and blogging about solutions to that problem. That, in and of itself, was interesting and valuable information for us.

But pursue this train of thought further and you realize something else: big data tech is hot with software engineers. Amazing people at top companies have spent many hours chronicling their efforts; either they are very interested in the topic or they know who their potential recruits are. Quoting Michael Erasmus from Buffer:

“Building our new data architecture has been an amazing and fun adventure.”

When we set out to build Stitch, we knew we were solving an important problem. We never anticipated it would be a problem that engineers would find interesting — plumbing historically hasn’t exactly been the most glamorous of professions. But it’s exciting (and validating!) for us to realize that’s changing.

If you’re thinking about building out a data infrastructure for your company, start by giving Stitch a try. Building a data infrastructure is hard work, and we don’t think engineers should have to reinvent the wheel at every company. Pipeline is free for up to 5 million events a month and can support arbitrarily large data volumes.