Setting the Data Strategy for Your Growing Organization

CHAPTER 4

How do we collect the data we will need to analyze?

Data about your business is ripe for the picking. Every action taken by your customers, your employees, your suppliers, and your partners is a potentially relevant datapoint. Every click, every phone call, every product purchase (or return), and every shipment notification is an opportunity to learn more about your business.

But to learn from that data, you first need to capture it. If an event happens but never gets written to a table, it’s like the tree falling in the forest with no one around to hear: can we really be sure it happened at all? As you’re making decisions with data, your visibility will be limited by the data you capture.

In this section we’ll dive into how to make sure you’re capturing all the data you will need to analyze.

What is raw data and why do you want it?

But first, let’s get one thing straight: when you set about to instrument your business and gather data for analysis, you want raw data. Raw data is the unprocessed, recorded output of internal and external processes. It often looks like a stream of events. For example, the raw data about customer usage of a mobile app looks much like this:

timestamp           user_id   action
1/1/2016 13:56:12   913188    completed signup
1/1/2016 13:56:21   599766    login
1/1/2016 13:56:29   34603     login
1/1/2016 13:56:38   944377    article view
1/1/2016 13:56:47   734124    onboarding screen 1 view
1/1/2016 13:56:55   33712     share click
1/1/2016 13:57:04   858815    article view

This really isn’t so different from the raw data you get from Salesforce:

timestamp           record_id         task_type        rep_id
1/1/2016 13:56:12   00QE000000neXUz   email            salesrep5@software.com
1/1/2016 13:59:05   00QE000000needA   discovery call   salesrep2@software.com
1/1/2016 14:01:58   00QE000000nelns   email            salesrep4@software.com
1/1/2016 14:04:50   00QE000000nepxG   demo             salesrep5@software.com
1/1/2016 14:07:43   00QE000000nf04N   email            salesrep1@software.com
1/1/2016 14:10:36   00QE000000nf8gf   email            salesrep1@software.com
1/1/2016 14:13:29   00QE000000nfGM0   discovery call   salesrep2@software.com

The raw data from Magento (a popular ecommerce platform) is much the same:

timestamp           customer_id   order_id   product_id   amount
1/1/2016 13:56:12   10903         155819     76           96.14
1/1/2016 13:59:05   4190          155820     4            137.64
1/1/2016 14:01:58   12859         155821     30           92.06
1/1/2016 14:04:50   30437         155822     63           71.02
1/1/2016 14:07:43   62404         155823     86           72.66
1/1/2016 14:10:36   77757         155824     97           119.55
1/1/2016 14:13:29   17251         155825     83           52.11

Raw data is the gold standard when data is being used for analysis. An event stream like this is “lossless” in much the same way that CD-quality audio is lossless—it doesn’t strip out anything along the way, leaving the data in pristine condition. With lossless data capture, analysts can come back at any point in the future and recreate exactly what happened, in what order.
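
To make "lossless" concrete, here is a minimal Python sketch that rolls the mobile app events above up into per-minute counts. The function names (`parse`, `actions_per_minute`) are made up for illustration; only the rows come from the sample table.

```python
from collections import Counter
from datetime import datetime

# Rows copied from the sample mobile app event stream above.
RAW_EVENTS = [
    ("1/1/2016 13:56:12", 913188, "completed signup"),
    ("1/1/2016 13:56:21", 599766, "login"),
    ("1/1/2016 13:56:29", 34603, "login"),
    ("1/1/2016 13:56:38", 944377, "article view"),
    ("1/1/2016 13:56:47", 734124, "onboarding screen 1 view"),
    ("1/1/2016 13:56:55", 33712, "share click"),
    ("1/1/2016 13:57:04", 858815, "article view"),
]


def parse(ts):
    """Turn a timestamp string from the table into a datetime."""
    return datetime.strptime(ts, "%m/%d/%Y %H:%M:%S")


def actions_per_minute(events):
    """Aggregate raw events into (minute, action) counts."""
    counts = Counter()
    for ts, _user_id, action in events:
        minute = parse(ts).replace(second=0)
        counts[(minute, action)] += 1
    return counts


summary = actions_per_minute(RAW_EVENTS)
```

The raw events can always be rolled up into a summary like this one, but the summary cannot be turned back into the events: it no longer says which user did what, or in what order.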

When getting raw data, always go directly to the system of record. The more hands data passes through, the more chances there are for errors to occur. If you need to get product inventory data, get that directly from your inventory system. If you need to get web analytics data, go to your canonical web analytics system. Avoid analyzing data that you get from coworkers in spreadsheets and other ad-hoc sources if at all possible, as the data they contain almost always comes from somewhere else. Better to get your data straight from the source.

Getting data from your source data systems

Your organization’s raw data sits in the systems that you trust to run your business: your website, your app, your CRM, your customer support and email marketing tools, and more. Each of these systems is the source of truth for the part of your business that it’s responsible for, and as a result, this is exactly where you want to turn when looking for answers about organizational performance.

There are three primary types of technologies that will house your organization’s data: internal databases, third-party software-as-a-service tools, and instrumentation you install within your website and software.

Internal databases

Pulling data from an internal database is a matter of knowing who has the credentials to get you access. Once you can connect to the database, just plug it into your data pipeline and your data will keep flowing.

The primary data collection consideration when working with internal databases is dead simple, but often overlooked: don’t delete data. Ever. Deleting data creates black holes that prevent analysts from truly understanding an area of the business. Updating the values in a given record is just as bad: updating a record is, by definition, deleting the prior version of that record. Updates happen all too frequently in production applications.

If you discover that your internal applications are deleting or overwriting data that’s important for analysis, you have two options: either ask your software engineers to modify the application code to avoid deletions, or implement a data pipeline that includes Change Data Capture (CDC). CDC records every insert, update, and delete made to a database so that, even if data is deleted from the production schema, its prior versions are still available for analysis. This solution is often far less invasive than re-architecting an application to avoid deletions.
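
Here is a rough sketch of the idea using Python's built-in `sqlite3` module. It uses triggers to copy every change into a history table. Production CDC tools usually read the database's transaction log rather than relying on triggers, so treat this only as an illustration; the `orders` and `orders_history` tables and their columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);

-- Append-only log of every change made to orders (hypothetical schema).
CREATE TABLE orders_history (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id   INTEGER,
    status     TEXT,
    operation  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_history (order_id, status, operation)
    VALUES (NEW.order_id, NEW.status, 'insert');
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_history (order_id, status, operation)
    VALUES (NEW.order_id, NEW.status, 'update');
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_history (order_id, status, operation)
    VALUES (OLD.order_id, OLD.status, 'delete');
END;
""")

# The application updates and then deletes the order...
conn.execute("INSERT INTO orders VALUES (155819, 'placed')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 155819")
conn.execute("DELETE FROM orders WHERE order_id = 155819")

# ...but the history table still holds every version of it.
history = conn.execute(
    "SELECT order_id, status, operation FROM orders_history ORDER BY change_id"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After this runs, the production table is empty, while `history` still holds the insert, the update, and the delete in order.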

SaaS applications

Whereas you have complete control over and root access to your internal databases, your ability to access data within SaaS applications is limited to the APIs that they provide. Some SaaS applications are walled gardens, providing limited or no access to the data that they house. Increasingly, SaaS applications have APIs that allow you to programmatically access the data they contain.

We believe that API access is a critical feature for any SaaS application. Using a SaaS application that doesn’t provide API endpoints to extract data significantly limits the flexibility of the platform. A review of API endpoints should be part of the review of any major SaaS purchase made within your business.

But having an API is not a yes-or-no question. Certain API features make a SaaS product more or less suitable for analytical data ingestion:

  • What data does the API expose?
    Most SaaS platforms capture a wide range of data. Make sure before making a purchase decision that all of the data that you care about can be extracted from the available endpoints.

  • Does the API support webhooks?
    Webhooks are a near-real-time method of informing other systems of events that occur within a SaaS platform. If a SaaS product you’re evaluating supports webhooks for the events you care about, that’s a best-case scenario.

  • Does the API allow you to query records by the time they were last modified?
    One of the most important features of an API for analytical purposes is the ability for records to be pulled based on when they were most recently modified. This allows developers to pull changes in a dataset rather than attempting to sync the entire thing with every update.

  • Does the API allow you to pull historical data?
    If you have months or years of usage data within a SaaS tool, it’s important to know whether you can go back and pull the data for your entire usage history.

  • What is the API rate limit and how much do calls cost?
    Many SaaS APIs are rate limited by their providers; this makes sense, so as to protect the vendor from an unanticipated burden on its infrastructure. But if you need to pull extremely large volumes of data for analytics, you need to ensure that this is possible. You’ll also need to anticipate how much these calls will cost you if that is a pricing dimension for that particular vendor.

If you’re evaluating a SaaS product that checks all of these boxes, then great—you’re working with a company that empowers its users to interact with their data however they want. If not, you’re giving up too much control of vital organizational data that you own. Choose a competitor who is more open.
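
To show why querying by last-modified time matters, here is a minimal Python sketch of an incremental sync. Everything in it is a stand-in: `VENDOR_RECORDS` plays the role of the vendor's data, and `fetch_modified_since` is a hypothetical endpoint, not any real vendor's API. A real client would make paginated HTTP requests and would also need to respect the vendor's rate limit.

```python
from datetime import datetime

# Stand-in for the vendor's data. A real client would make HTTP requests instead.
VENDOR_RECORDS = [
    {"id": "00QE000000neXUz", "task_type": "email",
     "modified_at": datetime(2016, 1, 1, 13, 56, 12)},
    {"id": "00QE000000needA", "task_type": "discovery call",
     "modified_at": datetime(2016, 1, 1, 13, 59, 5)},
    {"id": "00QE000000nelns", "task_type": "email",
     "modified_at": datetime(2016, 1, 1, 14, 1, 58)},
]


def fetch_modified_since(cursor, offset=0, page_size=2):
    """Hypothetical endpoint: one page of records modified after `cursor`, oldest first."""
    matches = [r for r in VENDOR_RECORDS
               if cursor is None or r["modified_at"] > cursor]
    matches.sort(key=lambda r: r["modified_at"])
    return matches[offset:offset + page_size]


def incremental_sync(warehouse, cursor):
    """Append every record changed since `cursor` and return the new cursor."""
    newest = cursor
    offset = 0
    while True:
        page = fetch_modified_since(cursor, offset=offset)
        if not page:
            return newest
        warehouse.extend(page)
        offset += len(page)
        newest = page[-1]["modified_at"]  # pages arrive oldest first


warehouse = []

# First run: no cursor yet, so this is the full historical pull.
cursor = incremental_sync(warehouse, cursor=None)
first_run_count = len(warehouse)

# Later a new record appears in the vendor's system...
VENDOR_RECORDS.append({"id": "00QE000000nepxG", "task_type": "demo",
                       "modified_at": datetime(2016, 1, 1, 14, 4, 50)})

# ...and the next run pulls only that change, not the whole dataset.
cursor = incremental_sync(warehouse, cursor)
```

Without a last-modified filter, the second run would have to pull all four records again and work out for itself which ones had changed.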

Instrumentation you install

Your internal database and your SaaS platforms collect a lot of data: orders, customers, inventory and more. But they don’t collect all of your data. What about the log of every page a visitor views on your website? What about the progress of new signups through your app’s onboarding funnel? What about the application errors that your product users run into? Unless you specifically design a strategy for collecting all of this data, it simply won’t exist.

You’re probably familiar with instrumenting systems with analytics, although you may not have thought about it before. If you’ve ever installed a Google Analytics code snippet on a web page, you’ve done exactly this. Websites don’t have an underlying mechanism to track visitors, and as a result GA asks you to install a bit of code on your website to let their system know every time someone loads a page. This code fires every time the page is loaded, and the GA servers register the event. That stream of events is what is analyzed to give you the reports you know and love.

So, the question becomes: where are you missing data? Page load data on your website is an obvious one, but there are other types of data you may want to collect as well as other contexts you should consider. Here are some ideas:

  • Website events
    Page views aren’t the only thing that happens on a website. Things like video plays, scrolling behavior, element clicks, and logins offer a much more complete picture of user interactions.

  • Mobile app events
    Event tracking really took off with the rise of mobile. Every mobile app is a different animal, but make sure that you’re tracking the events that matter to measure behavior within yours.

  • Sensors
    If you make a physical product with integrated sensors, sending a stream of data from these sensors can be exactly what you need to keep on top of the performance and usage of your devices.

  • So much more!
    In a world where your iPhone tracks your daily step count, it’s possible to instrument almost anything. Don’t get lost in the possibilities though—focus on collecting the data you need to make critical business decisions.
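
In code, instrumentation usually boils down to a small function that your website or app calls whenever something interesting happens. The sketch below is one way that could look in Python. The `track` function, its parameters, and the event fields are assumptions made for illustration, and an in-memory buffer stands in for wherever you would actually send events (a file, a queue, or a collection endpoint).

```python
import io
import json
from datetime import datetime, timezone


def track(sink, user_id, action, properties=None, now=None):
    """Write one event to `sink` as a line of JSON and return it."""
    event = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "user_id": user_id,
        "action": action,
        "properties": properties or {},
    }
    sink.write(json.dumps(event) + "\n")
    return event


# In-memory stand-in for an event log.
event_log = io.StringIO()

track(event_log, 913188, "completed signup")
track(event_log, 944377, "video play",
      {"video_id": "intro", "seconds_watched": 42})

lines = event_log.getvalue().splitlines()
```

Each call adds one line to the log, which produces the same kind of raw event stream shown earlier in this chapter.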

The last 1%

While most of your data comes from one of the above sources, it’s still common for data to reside in spreadsheets and other files. Some of the most common use cases for this include mapping data, data on your organizational goals, and ad-hoc systems that support new and evolving business processes. It’s important to do an inventory of data that lives in sources like this, because often there are only a few people who know of its existence.

These less formal data sources can provide a much-needed level of richness on top of what’s stored in your more formal systems, so don’t leave them out of your data collection process.
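
As a small illustration of mapping data, the Python sketch below joins a spreadsheet exported as CSV onto order records. The product categories are invented for the example, and the CSV text stands in for a file someone on your team maintains by hand.

```python
import csv
import io

# Stand-in for a spreadsheet exported to CSV. The categories are invented.
MAPPING_CSV = """product_id,category
76,apparel
4,footwear
"""

# A few orders in the shape of the sample ecommerce data.
orders = [
    {"order_id": 155819, "product_id": 76, "amount": 96.14},
    {"order_id": 155820, "product_id": 4, "amount": 137.64},
    {"order_id": 155821, "product_id": 30, "amount": 92.06},
]

# Build a lookup from the spreadsheet rows.
category_by_product = {
    int(row["product_id"]): row["category"]
    for row in csv.DictReader(io.StringIO(MAPPING_CSV))
}

# Attach a category to each order, flagging products the spreadsheet misses.
enriched = [
    {**order, "category": category_by_product.get(order["product_id"], "unmapped")}
    for order in orders
]
```

Flagging unmapped products instead of dropping them makes gaps in the spreadsheet visible, so whoever owns it can fill them in.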

Final thoughts on data collection

Do you know anyone who lived through the Great Depression? Many people who survived the Great Depression are extremely conservative with their finances: living through the Depression left an indelible imprint on the way they think about the world.

Those of us who were alive in 1990, when a hard drive was 10 megabytes, have a similar proclivity towards digital parsimony. We have a tendency to not want to “waste space,” to not save digital relics that don’t seem particularly necessary. But the world has changed. Hard drives are massive. Bandwidth is plentiful. Cheap cloud services allow for virtually unlimited scalability.

Here’s some quick math: Redshift, Amazon’s cloud-based analytic database, allows users to store up to a petabyte of data as of this writing. That’s 1,000,000,000 megabytes. If you stored 10 MB of data every minute, it would take you 190 years to fill that up.
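
If you want to check that arithmetic yourself, it looks like this in Python. It assumes decimal units (1 petabyte = 10^9 megabytes) and 365-day years.

```python
MEGABYTES_PER_PETABYTE = 1_000_000_000  # decimal units: 1 PB = 10**9 MB
MB_PER_MINUTE = 10
MINUTES_PER_YEAR = 60 * 24 * 365

minutes_to_fill = MEGABYTES_PER_PETABYTE / MB_PER_MINUTE
years_to_fill = minutes_to_fill / MINUTES_PER_YEAR

print(f"{years_to_fill:.1f} years")  # prints "190.3 years"
```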

What does that mean? As you set out to collect data on your business, don’t spend much time worrying about whether you’ll need a particular datapoint in the future. Just keep everything. Build your systems to not delete data, only work with SaaS applications that will allow you to access the data they store via API, and instrument every important area of your business. It’s better to have the data and not need it than to need the data and not have it.
