Setting the Data Strategy for Your Growing Organization


How do we collect the data we will need to analyze?

Data about your business is ripe for the picking. Every action taken by your customers, your employees, your suppliers, and your partners is a potentially relevant datapoint. Every click, every phone call, every product purchase (or return), and every shipment notification is an opportunity to learn more about your business.

But to learn from that data, you first need to capture it. If an event happens but never gets written to a table, it’s like the tree falling in the forest with no one around to hear: can we really be sure it happened at all? As you’re making decisions with data, your visibility will be limited by the data you capture.

In this section we’ll dive into how to make sure you’re capturing all the data you will need to analyze.

The value of raw data, straight from the source

First, let’s get one thing straight: when you set out to instrument your business and gather data for analysis, you want raw data. Raw data is the unprocessed, recorded output of internal and external processes. It often looks like a stream of events. For example, the raw data about customer usage of a mobile app looks much like this:

timestamp          user_id  action
1/1/2020 13:56:12  913188   completed signup
1/1/2020 13:56:21  599766   login
1/1/2020 13:56:29  34603    login
1/1/2020 13:56:38  944377   article view
1/1/2020 13:56:47  734124   onboarding screen 1 view
1/1/2020 13:56:55  33712    share click
1/1/2020 13:57:04  858815   article view
...                ...      ...

This really isn’t so different from the raw data you get from Salesforce:

timestamp          record_id        task_type       rep_id
1/1/2016 13:56:12  00QE000000neXUz  email
1/1/2016 13:59:05  00QE000000needA  discovery call
1/1/2016 14:01:58  00QE000000nelns  email
1/1/2016 14:04:50  00QE000000nepxG  demo
1/1/2016 14:07:43  00QE000000nf04N  email
1/1/2016 14:10:36  00QE000000nf8gf  email
1/1/2016 14:13:29  00QE000000nfGM0  discovery call
...                ...              ...             ...

The raw data from Magento (a popular ecommerce platform) is much the same:

timestamp          customer_id  order_id  product_id  amount
1/1/2016 13:56:12  10903        155819    76          96.14
1/1/2016 13:59:05  4190         155820    4           137.64
1/1/2016 14:01:58  12859        155821    30          92.06
1/1/2016 14:04:50  30437        155822    63          71.02
1/1/2016 14:07:43  62404        155823    86          72.66
1/1/2016 14:10:36  77757        155824    97          119.55
1/1/2016 14:13:29  17251        155825    83          52.11
...                ...          ...       ...         ...

Raw data is the gold standard when data is being used for analysis. An event stream like this is “lossless” in much the same way that CD-quality audio is lossless—it doesn’t strip out anything along the way, leaving the data in pristine condition. With lossless data capture, analysts can come back at any point in the future and recreate exactly what happened, in what order.
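To make the idea of replaying an event stream concrete, here is a minimal sketch in Python. It uses the sample rows from the mobile app table above; the function names are illustrative assumptions, and it assumes the timestamps are written month/day/year.

```python
from collections import Counter
from datetime import datetime

# Sample rows copied from the mobile app event stream shown earlier.
RAW_EVENTS = [
    ("1/1/2020 13:56:12", 913188, "completed signup"),
    ("1/1/2020 13:56:21", 599766, "login"),
    ("1/1/2020 13:56:29", 34603, "login"),
    ("1/1/2020 13:56:38", 944377, "article view"),
    ("1/1/2020 13:56:47", 734124, "onboarding screen 1 view"),
    ("1/1/2020 13:56:55", 33712, "share click"),
    ("1/1/2020 13:57:04", 858815, "article view"),
]


def parse_timestamp(text):
    """Parse a timestamp, assuming month/day/year order."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S")


def events_in_order(events):
    """Return the events sorted by when they happened."""
    return sorted(events, key=lambda event: parse_timestamp(event[0]))


def actions_per_type(events):
    """Count how many times each action occurred."""
    return Counter(action for _, _, action in events)


ordered = events_in_order(RAW_EVENTS)
counts = actions_per_type(RAW_EVENTS)
```

Because every event is kept, any summary (counts per action, activity per user, funnels) can be derived later without deciding on it up front.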

When getting raw data, always go directly to the system of record. The more hands data passes through, the more chances there are for errors to occur. If you need to get product inventory data, get that directly from your inventory system. If you need to get web analytics data, go to your canonical web analytics system. Avoid analyzing data that you get from coworkers in spreadsheets and other ad-hoc sources if at all possible, as the data they contain almost always comes from somewhere else. Better to get your data straight from the source.

Getting data from your source data systems

Your organization’s raw data sits in the systems that you trust to run your business: your website, your app, your CRM, customer support, email marketing, and more. Each of these systems is the source of truth for the part of your business that it’s responsible for, and as a result, this is exactly where you want to turn when looking for answers about organizational performance.

There are three primary types of technologies that will house your organization’s data:

  1. internal databases,
  2. software-as-a-service tools, and
  3. instrumentation you install within your website and software.

Internal databases

Pulling data from an internal database is a matter of knowing who has the credentials to get you access. Once you can connect to the database, just plug it into your data pipeline and your data will keep flowing.

The primary data collection consideration when working with internal databases is dead simple, but often overlooked: don’t delete data. Ever. Deleting data creates black holes that prevent analysts from truly understanding an area of the business. Updating the values in a given record is just as bad: updating a record is, by definition, deleting the prior version of that record. Updates happen all too frequently in production applications.

If you discover that your internal applications are deleting data that’s important for analysis, you have two options: either ask your software engineers to modify the application code to avoid deletions, or implement a data pipeline that includes Change Data Capture (CDC). CDC records every insert, update, and delete as it happens, so the state of the database at any point in its history can be reconstructed. Even if data is deleted from the production schema, it is still available for analysis. This solution is often far less invasive than re-architecting an application to avoid deletions.
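The idea behind CDC can be sketched as an append-only log of changes that is replayed to rebuild the table at any earlier point. The `Change` and `ChangeLog` names below are hypothetical and exist only for illustration; real CDC tools typically read the database's own transaction log rather than relying on application code like this.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    seq: int                        # position in the change log
    op: str                         # "insert", "update", or "delete"
    key: int                        # primary key of the affected row
    row: Optional[Dict[str, Any]]   # new row contents; None for a delete


class ChangeLog:
    """Hypothetical append-only log of changes to one production table."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def record(self, op: str, key: int,
               row: Optional[Dict[str, Any]] = None) -> Change:
        """Append a change; nothing already in the log is ever modified."""
        change = Change(len(self._changes) + 1, op, key, row)
        self._changes.append(change)
        return change

    def state_at(self, seq: int) -> Dict[int, Dict[str, Any]]:
        """Rebuild the table as it looked after change number `seq`."""
        table: Dict[int, Dict[str, Any]] = {}
        for change in self._changes:
            if change.seq > seq:
                break
            if change.op == "delete":
                table.pop(change.key, None)
            else:
                table[change.key] = dict(change.row or {})
        return table


log = ChangeLog()
log.record("insert", 155819, {"amount": 96.14})
log.record("update", 155819, {"amount": 80.00})
log.record("delete", 155819)
```

In this sketch the production table would be empty after the third change, yet `log.state_at(1)` and `log.state_at(2)` still return the order as it was first written and as it was later updated.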

SaaS applications

Whereas you have complete control over and root access to your internal databases, your ability to access data within SaaS applications is limited to the APIs that they provide. Some SaaS applications are walled gardens, providing limited or no access to the data that they house. Increasingly, SaaS applications have APIs that allow you to programmatically access the data they contain.

We believe that API access is a critical feature for any SaaS application. Using a SaaS application that doesn’t provide API endpoints to extract data significantly limits the flexibility of the platform. A review of API endpoints should be part of the review of any major SaaS purchase made within your business.
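As a rough illustration of what programmatic extraction looks like, the sketch below pulls every record from an API one page at a time. It assumes a cursor-based pagination scheme, which is common but not universal; `extract_all`, `fake_fetch_page`, and the sample records are hypothetical stand-ins rather than any vendor's actual API.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, object]
# A page fetcher takes a cursor (None for the first page) and returns the
# records on that page plus the cursor for the next page (None at the end).
PageFetcher = Callable[[Optional[str]], Tuple[List[Record], Optional[str]]]


def extract_all(fetch_page: PageFetcher) -> Iterator[Record]:
    """Yield every record by following cursors until none remain."""
    cursor: Optional[str] = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            return


# Stand-in for a real API so the sketch runs without network access.
FAKE_PAGES = {
    None: ([{"id": 1}, {"id": 2}], "page-2"),
    "page-2": ([{"id": 3}], None),
}


def fake_fetch_page(cursor: Optional[str]) -> Tuple[List[Record], Optional[str]]:
    return FAKE_PAGES[cursor]


all_records = list(extract_all(fake_fetch_page))
```

Against a real SaaS product, `fake_fetch_page` would be replaced by a function that calls the vendor's documented endpoint and handles authentication and rate limits.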

But having an API is not binary. Certain API features make a product more or less suitable for analytical data ingestion:

If you’re evaluating a SaaS product that checks all of these boxes, then great—you’re working with a company that empowers its users to interact with their data however they want. If not, you’re giving up too much control of vital organizational data that you own. Choose a competitor who is more open.

Instrumentation you install

Your internal database and your SaaS platforms collect a lot of data: orders, customers, inventory and more. But they don’t collect all of your data. What about the log of every page a visitor views on your website? What about the progress of new signups through your app’s onboarding funnel? What about the application errors that your product users run into? Unless you specifically design a strategy for collecting all of this data, it simply won’t exist.

You’re probably familiar with instrumenting systems with analytics, although you may not have thought about it before. If you’ve ever installed a Google Analytics code snippet on a web page, you’ve done exactly this. Websites don’t have an underlying mechanism to track visitors, and as a result GA asks you to install a bit of code on your website to let their system know every time someone loads a page. This code fires every time the page is loaded, and the GA servers register the event. That stream of events is what is analyzed to give you the reports you know and love.
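The same pattern applies to instrumentation you write yourself: a small piece of code fires whenever something happens and appends an event to a log. The sketch below is a minimal, hypothetical version; the `track` function and its field names are assumptions for illustration, not part of Google Analytics or any other product.

```python
import io
import json
from datetime import datetime, timezone


def track(sink, user_id, action, now=None):
    """Append one event to `sink` as a line of JSON and return the event."""
    event = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "user_id": user_id,
        "action": action,
    }
    sink.write(json.dumps(event) + "\n")
    return event


# In a real system the sink would be a log file or a collection endpoint;
# an in-memory buffer keeps this sketch self-contained.
sink = io.StringIO()
track(sink, 913188, "completed signup")
track(sink, 913188, "onboarding screen 1 view")

# Reading the log back yields the raw event stream for analysis.
events = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Each call adds a line and never rewrites an earlier one, which keeps the resulting stream in the raw, replayable form described earlier.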

So, the question becomes: where are you missing data? Page load data on your website is an obvious one, but there are other types of data you may want to collect as well as other contexts you should consider. Here are some ideas:

The last 1%

While most of your data comes from these sources, some data may reside in spreadsheets and other files. Some of the most common examples of this include mapping data, data on your organizational goals, and ad-hoc systems that support new and evolving business processes. These less formal data sources can provide a level of richness on top of what's stored in your more formal systems. Inventory the data that lives in sources like this, even if only a few people know of its existence.

The end of digital parsimony: never delete a thing

Do you know anyone who lived through the Great Depression? Many people who survived the Great Depression are extremely conservative with their finances: living through the Depression left an indelible imprint on the way they think about the world.

Those of us who were alive in 1990, when a pricey hard drive held 10 megabytes, have a similar proclivity toward digital parsimony. We are reluctant to “waste space” by saving digital relics that don’t seem particularly necessary. But the world has changed. Hard drives are massive. Bandwidth is plentiful. Cheap cloud services allow for virtually unlimited scalability.

Here's some quick math: Redshift, Amazon's cloud-based analytic database, allows users to store up to two petabytes of data per cluster. That's 2,000,000,000 megabytes. If you stored 20 MB of data every minute, it would take you 190 years to fill that up.
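The arithmetic can be checked in a few lines. This sketch assumes decimal units (one petabyte is one billion megabytes) and a 365-day year.

```python
MEGABYTES_PER_PETABYTE = 1_000_000_000  # decimal units: 1 PB = 10**9 MB
MINUTES_PER_YEAR = 60 * 24 * 365        # 525,600 minutes in a 365-day year

capacity_mb = 2 * MEGABYTES_PER_PETABYTE  # two petabytes per cluster
write_rate_mb_per_minute = 20

minutes_to_fill = capacity_mb / write_rate_mb_per_minute
years_to_fill = minutes_to_fill / MINUTES_PER_YEAR

print(f"{years_to_fill:.0f} years")  # prints "190 years"
```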

What does that mean? As you set out to collect data on your business, don’t spend much time worrying about whether you’ll need a particular datapoint in the future. Just keep everything. Build your systems to not delete data, only work with SaaS applications that will allow you to access the data they store via API, and instrument every important area of your business. It’s better to have the data and not need it than to need the data and not have it.
