CHAPTER 4

How do we collect the data we will need to analyze?

Data about your business is ripe for the picking. Every action taken by your customers, your employees, your suppliers, and your partners is a potentially relevant datapoint. Every click, every phone call, every product purchase (or return), and every shipment notification is an opportunity to learn more about your business.

But to learn from that data, you first need to capture it. If an event happens but never gets written to a table, it's like the tree falling in the forest with no one around to hear: can we really be sure it happened at all? As you're making decisions with data, your visibility will be limited by the data you capture.

In this chapter we'll dive into how to make sure you're capturing all the data you will need to analyze.

The value of raw data, straight from the source

First, let's get one thing straight: when you set about to instrument your business and gather data for analysis, you want raw data. Raw data is the unprocessed, recorded output of internal and external processes. It often looks like a stream of events. For example, the raw data about customer usage of a mobile app looks much like this:

timestamp	user_id	action
1/1/2020 13:56:12	913188	completed signup
1/1/2020 13:56:21	599766	login
1/1/2020 13:56:29	34603	login
1/1/2020 13:56:38	944377	article view
1/1/2020 13:56:47	734124	onboarding screen 1 view
1/1/2020 13:56:55	33712	share click
1/1/2020 13:57:04	858815	article view
...	...	...

This really isn't so different from the raw data you get from Salesforce:

timestamp	record_id	task_type	rep_id
1/1/2016 13:56:12	00QE000000neXUz	email	salesrep5@software.com
1/1/2016 13:59:05	00QE000000needA	discovery call	salesrep2@software.com
1/1/2016 14:01:58	00QE000000nelns	email	salesrep4@software.com
1/1/2016 14:04:50	00QE000000nepxG	demo	salesrep5@software.com
1/1/2016 14:07:43	00QE000000nf04N	email	salesrep1@software.com
1/1/2016 14:10:36	00QE000000nf8gf	email	salesrep1@software.com
1/1/2016 14:13:29	00QE000000nfGM0	discovery call	salesrep2@software.com
...	...	...	...

The raw data from Magento (a popular ecommerce platform) is much the same:

timestamp	customer_id	order_id	product_id	amount
1/1/2016 13:56:12	10903	155819	76	96.14
1/1/2016 13:59:05	4190	155820	4	137.64
1/1/2016 14:01:58	12859	155821	30	92.06
1/1/2016 14:04:50	30437	155822	63	71.02
1/1/2016 14:07:43	62404	155823	86	72.66
1/1/2016 14:10:36	77757	155824	97	119.55
1/1/2016 14:13:29	17251	155825	83	52.11
...	...	...	...	...

Raw data is the gold standard when data is being used for analysis. An event stream like this is “lossless” in much the same way that CD-quality audio is lossless—it doesn't strip out anything along the way, leaving the data in pristine condition. With lossless data capture, analysts can come back at any point in the future and recreate exactly what happened, in what order.
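To make the "recreate exactly what happened" point concrete, here is a minimal Python sketch that replays a few rows from the mobile app table above. The rows are deliberately shuffled, the function names are ours, and the month/day/year reading of the timestamps is an assumption.

```python
from collections import Counter
from datetime import datetime

# A few rows from the mobile app table above, deliberately out of order.
RAW_EVENTS = [
    ("1/1/2020 13:56:29", 34603, "login"),
    ("1/1/2020 13:56:12", 913188, "completed signup"),
    ("1/1/2020 13:56:38", 944377, "article view"),
    ("1/1/2020 13:56:21", 599766, "login"),
    ("1/1/2020 13:57:04", 858815, "article view"),
]

def parse(ts):
    # Assumption: timestamps are month/day/year with a 24-hour clock.
    return datetime.strptime(ts, "%m/%d/%Y %H:%M:%S")

def replay(events):
    """Return the events in the order they actually happened."""
    return sorted(events, key=lambda e: parse(e[0]))

def count_actions(events):
    """Derive a summary from the raw stream: events per action."""
    return Counter(action for _, _, action in events)

ordered = replay(RAW_EVENTS)
counts = count_actions(RAW_EVENTS)
```

The summary in `counts` can always be rebuilt from the raw rows, but the raw rows could never be rebuilt from the summary. That asymmetry is the practical reason to capture the event stream itself.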

When getting raw data, always go directly to the system of record. The more hands data passes through, the more chances there are for errors to occur. If you need to get product inventory data, get that directly from your inventory system. If you need to get web analytics data, go to your canonical web analytics system. Avoid analyzing data that you get from coworkers in spreadsheets and other ad-hoc sources if at all possible, as the data they contain almost always comes from somewhere else. Better to get your data straight from the source.

Getting data from your source data systems

Your organization's raw data sits in the systems that you trust to run your business: your website, your app, your CRM, customer support, email marketing, and more. Each of these systems is the source of truth for the part of your business that it's responsible for, and as a result, this is exactly where you want to turn when looking for answers about organizational performance.

There are three primary types of technologies that will house your organization's data:

internal databases,
software-as-a-service tools, and
instrumentation you install within your website and software.

Internal databases

Pulling data from an internal database is a matter of knowing who has the credentials to get you access. Once you can connect to the database, just plug it into your data pipeline and your data will keep flowing.

The primary data collection consideration when working with internal databases is dead simple, but often overlooked: don't delete data. Ever. Deleting data creates black holes that prevent analysts from truly understanding an area of the business. Updating the values in a given record is just as bad: updating a record is, by definition, deleting the prior version of that record. Updates happen all too frequently in production applications.

If you discover that your internal applications are deleting data that's important for analysis, you have two options: either ask your software engineers to modify the application code to avoid deletions, or implement a data pipeline that includes Change Data Capture (CDC). CDC preserves the state of a database at every point in its history so that, even if data is deleted from the production schema, it is still available for analysis. This solution is often far less invasive than re-architecting an application to avoid deletions.
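The first option, changing the application so it never overwrites or deletes, can be sketched with an append-only table. This is not CDC itself (CDC tools typically read the database's own change log), just an illustration of the "never delete" idea. The table, columns, and helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_versions (
        version_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL,
        email       TEXT,
        is_deleted  INTEGER NOT NULL DEFAULT 0,
        recorded_at TEXT NOT NULL
    )
    """
)

def record_change(customer_id, email, recorded_at, is_deleted=False):
    # Every change, including a "delete", is a new row; nothing is overwritten.
    conn.execute(
        "INSERT INTO customer_versions (customer_id, email, is_deleted, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, email, int(is_deleted), recorded_at),
    )

def current_state(customer_id):
    # The application reads only the most recent version.
    return conn.execute(
        "SELECT email, is_deleted FROM customer_versions "
        "WHERE customer_id = ? ORDER BY version_id DESC LIMIT 1",
        (customer_id,),
    ).fetchone()

def history(customer_id):
    # Analysts can still see every prior version.
    return conn.execute(
        "SELECT email, is_deleted, recorded_at FROM customer_versions "
        "WHERE customer_id = ? ORDER BY version_id",
        (customer_id,),
    ).fetchall()

record_change(10903, "old@example.com", "2016-01-01T13:56:12")
record_change(10903, "new@example.com", "2016-02-01T09:00:00")
record_change(10903, "new@example.com", "2016-03-01T09:00:00", is_deleted=True)
```

Here the application sees the customer as deleted, while `history(10903)` still returns all three versions for analysis.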

SaaS applications

Whereas you have complete control over and root access to your internal databases, your ability to access data within SaaS applications is limited to the APIs that they provide. Some SaaS applications are walled gardens, providing limited or no access to the data that they house. Increasingly, SaaS applications have APIs that allow you to programmatically access the data they contain.

We believe that API access is a critical feature for any SaaS application. Using a SaaS application that doesn't provide API endpoints to extract data significantly limits the flexibility of the platform. A review of API endpoints should be part of the review of any major SaaS purchase made within your business.

But having an API is not binary. Certain API features make a platform more or less suitable for analytical data ingestion:

What data does the API expose?
Most SaaS platforms capture a wide range of data. Make sure before making a purchase decision that all of the data that you care about can be extracted from the available endpoints.

Does the API support webhooks?
Webhooks are a near-real-time method of informing other systems of events that occur within a SaaS platform. If a SaaS product you're evaluating supports webhooks for the events you care about, that's a best-case scenario.

Does the API allow you to query records by the time they were last modified?
One of the most important features of an API for analytical purposes is the ability for records to be pulled based on when they were most recently modified. This allows developers to pull changes in a dataset rather than attempting to sync the entire thing with every update.

Does the API allow you to pull historical data?
If you have months or years of usage data within a SaaS tool, it's important to know whether or not you can retroactively go back and pull the data for your entire usage history.

What is the API rate limit and how much do calls cost?
Many SaaS providers rate-limit their APIs. This makes sense, since it protects the vendor from an unanticipated burden on its infrastructure. But if you need to pull extremely large volumes of data for analytics, you need to ensure that this is possible. You'll also need to anticipate how much these calls will cost you if that is a pricing dimension for that particular vendor.
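
To show why querying by last-modified time matters, here is a minimal Python sketch of an incremental sync. The `fetch_modified_since` function is a hypothetical stand-in for a vendor endpoint, simulated with an in-memory list so the example runs on its own. A real client would make an HTTP request there.

```python
from datetime import datetime

# Stand-in for the vendor's data; not a real API.
FAKE_REMOTE = [
    {"id": "a", "amount": 10, "modified_at": datetime(2016, 1, 1, 13, 56)},
    {"id": "b", "amount": 20, "modified_at": datetime(2016, 1, 1, 14, 1)},
]

def fetch_modified_since(cursor):
    """Hypothetical endpoint: records modified after `cursor` (all if None)."""
    return [r for r in FAKE_REMOTE if cursor is None or r["modified_at"] > cursor]

def sync(landed, cursor):
    """Append only the changed records to `landed` and return the new cursor."""
    changed = fetch_modified_since(cursor)
    landed.extend(dict(r) for r in changed)
    if changed:
        cursor = max(r["modified_at"] for r in changed)
    return cursor

landed = []
cursor = sync(landed, None)    # first run: pulls the full history
FAKE_REMOTE[0] = {"id": "a", "amount": 15, "modified_at": datetime(2016, 1, 2, 9, 0)}
cursor = sync(landed, cursor)  # second run: pulls only the changed record
```

Each run asks only for what changed since the last cursor, which keeps call volume (and any per-call cost) low. Note that the sketch appends each new version rather than overwriting the old one, so earlier values remain available for analysis.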

If you're evaluating a SaaS product that checks all of these boxes, then great—you're working with a company that empowers its users to interact with their data however they want. If not, you're giving up too much control of vital organizational data that you own. Choose a competitor who is more open.

Instrumentation you install

Your internal database and your SaaS platforms collect a lot of data: orders, customers, inventory and more. But they don't collect all of your data. What about the log of every page a visitor views on your website? What about the progress of new signups through your app's onboarding funnel? What about the application errors that your product users run into? Unless you specifically design a strategy for collecting all of this data, it simply won't exist.

You're probably familiar with instrumenting systems with analytics, although you may not have thought about it before. If you've ever installed a Google Analytics code snippet on a web page, you've done exactly this. Websites don't have an underlying mechanism to track visitors, and as a result GA asks you to install a bit of code on your website to let their system know every time someone loads a page. This code fires every time the page is loaded, and the GA servers register the event. That stream of events is what is analyzed to give you the reports you know and love.
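
The same idea applies to instrumentation you write yourself. As a rough sketch, and not how Google Analytics itself works, a tracking call might do little more than append one timestamped event to a log. The `track` function and `EVENT_LOG` list are hypothetical, and a real system would send events to a file, queue, or collector service instead of a list in memory.

```python
import json
from datetime import datetime, timezone

EVENT_LOG = []  # stand-in for a file, queue, or collector endpoint

def track(user_id, action, properties=None, now=None):
    """Record one event as a JSON line; callers fire this wherever something happens."""
    event = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "user_id": user_id,
        "action": action,
        "properties": properties or {},
    }
    EVENT_LOG.append(json.dumps(event))
    return event

# Fired from the relevant places in application code:
track(913188, "completed signup")
track(944377, "article view", {"article_id": 42})
```

The shape of each event mirrors the raw event stream shown earlier: a timestamp, who did it, and what they did.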

So, the question becomes: where are you missing data? Page load data on your website is an obvious one, but there are other types of data you may want to collect as well as other contexts you should consider. Here are some ideas:

Website events
Page views aren't the only thing that happens on a website. Things like video plays, scrolling behavior, element clicks, and logins offer a much more complete picture of user interactions.

Mobile app events
Event tracking really took off with the rise of mobile. Every mobile app is a different animal, but make sure that you're tracking the events that matter to measure behavior within yours.

Sensors
If you make a physical product with integrated sensors, sending a stream of data from these sensors can be exactly what you need to keep on top of the performance and usage of your devices.

So much more!
In a world where your iPhone tracks your daily step count, it's possible to instrument almost anything. Don't get lost in the possibilities though—focus on collecting the data you need to make critical business decisions.

The last 1%

While most of your data comes from these sources, some data may reside in spreadsheets and other files. Some of the most common examples of this include mapping data, data on your organizational goals, and ad-hoc systems that support new and evolving business processes. These less formal data sources can provide a level of richness on top of what's stored in your more formal systems. Inventory the data that lives in sources like this, even if only a few people know of its existence.

The end of digital parsimony: never delete a thing

Do you know anyone who lived through the Great Depression? Many people who survived the Great Depression are extremely conservative with their finances: living through the Depression left an indelible imprint on the way they think about the world.

Those of us who were alive in 1990, when a pricey hard drive was 10 megabytes, have a similar proclivity towards digital parsimony. We tend not to want to “waste space,” and so we avoid saving digital relics that don't seem particularly necessary. But the world has changed. Hard drives are massive. Bandwidth is plentiful. Cheap cloud services allow for virtually unlimited scalability.

Here's some quick math: Redshift, Amazon's cloud-based analytic database, allows users to store up to two petabytes of data per cluster. That's 2,000,000,000 megabytes. If you stored 20 MB of data every minute, it would take you 190 years to fill that up.
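
The arithmetic can be checked in a few lines of Python, assuming decimal units (1 petabyte = 1,000,000,000 megabytes) and 365-day years:

```python
MB_PER_PB = 1_000_000_000        # decimal units: 1 PB = 10**9 MB
capacity_mb = 2 * MB_PER_PB      # two petabytes per cluster
rate_mb_per_minute = 20
minutes_per_year = 60 * 24 * 365

minutes_to_fill = capacity_mb / rate_mb_per_minute  # 100,000,000 minutes
years_to_fill = minutes_to_fill / minutes_per_year

print(round(years_to_fill))  # 190
```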

What does that mean? As you set out to collect data on your business, don't spend much time worrying about whether you'll need a particular datapoint in the future. Just keep everything. Build your systems to not delete data, only work with SaaS applications that will allow you to access the data they store via API, and instrument every important area of your business. It's better to have the data and not need it than to need the data and not have it.
