How to build a data-driven company: From infrastructure to insights

Data stacks aren’t structured the way they used to be, and for good reason. Teams that want to gain insight from the tons of data they collect every day need to be able to perform analysis quickly and know without a doubt that they are looking at the same data in the same way as the rest of the company.

Luckily, thanks to technological advances and the advent of powerful applications, modern-day data infrastructures are built to do just that. This was the topic of yesterday’s webinar, led by Shaun McAvinney, Sales Engineer at Stitch, and Dillon Morrison, Manager, Product Marketing & Analytics at Looker. Shaun and Dillon discussed why and how companies like Buffer, SeatGeek, and Asana are investing in data, the core challenges of data integration, and examples of the powerful insights you can unlock with integrated data.

If you missed the webinar, you can download the slide deck, watch the video of the entire presentation, or read the recap below to catch the highlights.

How people used to approach data infrastructures

For the last 30 years, since the inception of data warehousing, this is how it’s been:

Historical ETL

Data is extracted from various sources, transformed, and then loaded into the data warehouse. Business intelligence teams would request that the data warehouse team run queries on that data, and the results would eventually be presented to the end user: the individual responsible for making business decisions based on that data.

If we focus on the steps after the data is loaded, we get a process that looks something like this, with departments working in silos and relying on summaries from data analysts.

Data analysis bottleneck

This process results in “data cubes”, where analytics are separated by key groupings for various departments, which presents a few problems:

  1. It’s very resource-intensive (and expensive) to manage all of the transformations and data loading.

  2. It introduces latency into the analytics process. The people looking for actionable insights actually receive only high-level analysis, which is typically too broad or inflexible to guide nimble decision-making.

  3. It restricts drilling. If an end user finds an interesting piece of information, they have to go back to the ETL team to request more data, and then wait for that request to be fulfilled.

How companies are doing it today

The process has changed to adapt to an influx of data and the need for more actionable insights delivered instantly. It looks something like this:

ELT

By adding a modeling and analytics layer, modern data infrastructures empower the end user — instead of a team of engineers — to query the actual database using a familiar language (usually SQL), visualize the results, and tweak their queries based on those results. All analytics can be performed directly on the central database.
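To make this concrete, here’s a minimal sketch (not from the webinar) of what end-user-driven querying against a central warehouse looks like. Redshift speaks the Postgres wire protocol, so the standard psycopg2 driver works; the cluster address, table, and column names below are hypothetical.

```python
# A minimal sketch: an analyst querying the central warehouse directly with SQL,
# instead of filing a request with an ETL team. The connection details, table,
# and columns are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="analytics.example.redshift.amazonaws.com",  # hypothetical cluster
    port=5439,
    dbname="warehouse",
    user="analyst",
    password="...",
)

with conn, conn.cursor() as cur:
    # Ad-hoc question: weekly signups by marketing channel.
    cur.execute("""
        SELECT date_trunc('week', created_at) AS week,
               channel,
               count(*) AS signups
        FROM signups
        GROUP BY 1, 2
        ORDER BY 1, 2
    """)
    for week, channel, signups in cur.fetchall():
        print(week.date(), channel, signups)

conn.close()
```

If the first pass raises a new question, the analyst simply edits the SQL and reruns it, rather than opening another ticket with the data warehouse team.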

Ultimately, this leads to analysis that’s more transparent, cheaper, and faster to derive insights from.

How top engineering organizations are building their analytics stack

As we highlighted in a previous post, The Data Infrastructure Meta-Analysis, building (and writing about building) your analytics stack has become something of a trend among top engineering organizations. Across these posts, we found the trend discussed above — a shift away from ETL toward ELT. Here’s how this process breaks down.

Extract (data integration)

Engineering teams from Asana and MetaMarkets both provided details on the challenges of data integration. From our own experience talking to engineering teams (and reading these “how we did it” blog posts), we’ve identified seven core challenges of data integration:

  • Connections: Every API is a unique and special snowflake; each new integration can take anywhere from a few days to a few months to complete.

  • Accuracy: Guaranteeing accuracy and transparency in a scalable streaming data pipeline is hard, because ordering data on a distributed system is extremely complex.

  • Latency: Large data stores (S3, Redshift) are optimized for batches, not streams.

  • Scale: Data will grow exponentially as your company grows. Your pipeline needs to be architected from the ground up to scale with it.

  • Flexibility: Interacting with systems you don’t control, like web APIs, means you might get results you don’t expect. If you’re not prepared to accommodate data of different shapes and types, and to quickly address changes as they happen, your pipeline will break (see the sketch after this list).

  • Monitoring: With data coming from such a large number of sources, failures are inevitable. The question is, how long will it take you to catch them?

  • Maintenance: Internally built tools never receive the same attention as your product. While the supporting technologies are iterated on every day, internal resources are rarely dedicated to keeping up with them, and stakeholders pay the price.
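As a minimal sketch of the flexibility challenge (this is illustrative, not Stitch’s implementation), the extraction step can normalize records defensively instead of assuming a fixed schema, so a new field or a changed type from a third-party API doesn’t break the load step. The field names here are hypothetical.

```python
# Illustrative sketch: records pulled from an API you don't control can gain
# fields or change types at any time, so normalize defensively rather than
# assuming a fixed schema. Field names are hypothetical.
from datetime import datetime, timezone

EXPECTED_FIELDS = {"id": str, "amount": float, "created_at": str}

def normalize(record: dict) -> dict:
    """Coerce a raw API record into the shape the pipeline expects,
    keeping unknown fields instead of dropping them on the floor."""
    row = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        value = record.get(field)
        try:
            row[field] = expected_type(value) if value is not None else None
        except (TypeError, ValueError):
            row[field] = None  # type changed upstream; surface this to monitoring
    # Anything the API added since we last looked goes into an overflow field,
    # so a brand-new attribute doesn't break downstream loading.
    row["_extra"] = {k: v for k, v in record.items() if k not in EXPECTED_FIELDS}
    row["_extracted_at"] = datetime.now(timezone.utc).isoformat()
    return row

# Example: the API started sending amount as a string and added a currency field.
print(normalize({"id": 42, "amount": "19.99", "currency": "USD",
                 "created_at": "2016-08-01T12:00:00Z"}))
```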

Our solution to this? Stitch. It streams all your data from your many data sources right into Redshift and other data warehouses.

Load (warehousing)

In our analysis, Redshift was far and away the preferred data warehouse. It was chosen by Asana, Braintree, Looker, and Buffer, and for one very good reason: speed. People are seeing dramatic improvements in query time with Redshift. A test run by Airbnb found that Redshift beat Hive hands down on each of their criteria.

Hive vs. Redshift

Research from DiamondStream also shows how much better their internal dashboards performed when built on Redshift vs. Microsoft SQL Server.

SQL Server vs. Redshift

Transform (business intelligence & analytics)

After your data is piped into Redshift, it’s ready to be transformed, queried, and visualized. This is where Looker comes in.

Data modeling layer

A data modeling layer acts as an intermediary: metrics and data transformations are defined in one place, where all users can access and understand them.

Because users are working off the same definitions and looking at the same data from your sources, you can maintain “data governance,” or adhere to precise standards that maintain analytical integrity.
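Here’s a minimal sketch of that idea (not Looker’s modeling language): each metric is defined exactly once, and every query is generated from that single definition, so two teams can’t compute “revenue” two different ways. The metric and table names are hypothetical.

```python
# Illustrative sketch of a modeling layer: metrics defined once, queries
# generated from those definitions. Metric and table names are hypothetical.
METRICS = {
    "total_revenue": "sum(amount)",
    "order_count": "count(distinct order_id)",
}

def build_query(metric: str, table: str, group_by: str) -> str:
    """Render SQL for a governed metric; undefined metrics are rejected."""
    if metric not in METRICS:
        raise ValueError(f"'{metric}' is not a defined metric")
    return (
        f"SELECT {group_by}, {METRICS[metric]} AS {metric}\n"
        f"FROM {table}\n"
        f"GROUP BY {group_by}"
    )

# Marketing and finance both ask for revenue by month and get identical SQL.
print(build_query("total_revenue", "orders", "date_trunc('month', created_at)"))
```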

Increasing the speed to insight

Building your own analytics stack takes months of intensive engineering effort. With Stitch + Looker, you can achieve the same results in a matter of weeks:

Building an analytics stack

In the past, you’d need to perform complex transformations on your data before you could even generate a report. With a tool like Stitch, you can load your data into a centralized location, then use a tool like Looker to get reliable insights that your team can act on.

You avoid data siloed across disparate applications with unequal access for users, and instead have your data centralized in a modern database, with a full analytics suite on top that any user can access.

Sign up for a free 14-day trial of Stitch.