CHAPTER 3

How are you going to explore your data?

Data exploration is the iterative process of uncovering hidden insights in data. Good data exploration unfolds almost as though it were an interview between the analyst and her data. The analyst sets out to answer a broad question of interest, asking successively more specific questions as she goes. Here’s an example “conversation” in which an analyst uses data to try to answer this question: “Why is our churn from last month so much higher than anticipated, and what should we do about it?”

It turns out that the CEO of this company had asked that push notifications be turned off because one of the company’s investors found them annoying. He asked the product team to remove the feature without studying the impact that decision would have on churn rate and ended up causing the company to miss its quarterly churn goal as a result!

This is a very typical data exploration process. The analyst is asked a broad question about the business and needs both to come up with theories and to test those theories against the data. This process requires a significant amount of creativity, context, and rigor on the part of the analyst.

This is different from KPI reporting, which is a broad look at the health of a business’s KPIs and how they are trending over time. There’s more to know about KPI reporting, but for now, let’s dig deeper into the tools used for data exploration.

Choosing the right tools

If you’ve made it this far, congratulations: you have the keys to the city. Your data has been aggregated into a single central repository, updates automatically as your business evolves, and is housed in a platform that can be queried using SQL, the dominant language of data.

Amazingly, however, all of your labor thus far has just been table stakes. You haven’t actually derived any real value from your data yet; you’ve just been doing the plumbing! Now it’s time to reap the fruits of your labor. In this section, we explore the many types of tools that you can attach to your data warehouse in order to run a more data-driven business. If used correctly, these tools can influence product development, customer acquisition, cost management, hiring, fundraising, and much, much more. Let’s dig in.

Statistical programming

Statistical programming is where analytics all began. SQL queries are great, but when you’re getting your hands dirty with a serious analysis, there is no shortage of scenarios where they fall short:

Running multiple queries in sequence, behaving differently based on the output of each
Using the output of one query as the input to another
Running complex statistical analyses like a time series analysis, regression, or correlation that aren't natively supported by SQL
Incorporating open source libraries to avoid building out special functionality that serves your use case
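
The scenarios above can be sketched in a few lines of a scripting language. The following is a minimal illustration in Python, not a recommended implementation: the usage table, its columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse. It runs one query, feeds the result into a second query, branches on what comes back, and then computes a correlation in the script rather than in SQL.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one row per user per month (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage (user_id INTEGER, month TEXT, sessions INTEGER, spend REAL)"
)
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01", 4, 20.0), (2, "2024-01", 9, 45.0), (3, "2024-01", 2, 10.0),
        (1, "2024-02", 6, 30.0), (2, "2024-02", 12, 62.0), (3, "2024-02", 1, 5.0),
    ],
)

# Query 1: find the most recent month in the data.
(latest_month,) = conn.execute("SELECT MAX(month) FROM usage").fetchone()

# Query 2: uses the output of query 1 as its input.
rows = conn.execute(
    "SELECT sessions, spend FROM usage WHERE month = ? ORDER BY user_id",
    (latest_month,),
).fetchall()

# Behave differently based on the output, then run a statistic in the script.
if len(rows) < 3:
    correlation = None  # too few points for a meaningful correlation
else:
    sessions = [row[0] for row in rows]
    spend = [row[1] for row in rows]
    correlation = pearson(sessions, spend)

print(latest_month, correlation)
```

In practice an open source statistics library would usually replace the hand-written pearson helper; it is written out here only to keep the sketch self-contained.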

Statistical scripting languages like R, SAS, and MATLAB provide all these advantages and more. No self-respecting data scientist made it through school without using them extensively, and for many they are still the fastest, most flexible, and most powerful way to answer data-driven questions.

That said, these tools are for serious analysts only. You need to learn syntax, understand how to correctly interpret the statistical models at work, and work within the best practices of programming to even attempt collaboration or the interpretation of your work by others. And don’t count on the output winning any design awards: these tools can provide functional visualizations, but the appeal of the output stops at functional.

These tools are powerful weapons for powerful people, but they don’t solve everyone’s problems when it comes to data. That’s where the rest of the field comes in.

Querying and visualization tools

These tools take data analysis a step further by providing a more user-friendly interface for exploring, querying, and visualizing data:

Tools like Tableau and Microsoft Power BI specialize in stunning, powerful visualizations
Tools like Mode, Periscope Data, and Chartio help analysts collaborate around queries or statistical programming scripts
Tools like Looker help analysts model the relationships in their data and collaborate around queries and visualizations used in business analysis

Some of these tools also include basic dashboarding functionality so that you can present multiple analyses within a single view.

Tools like these are the most common complement to a robust data pipeline like the one described in earlier sections of this document. However, unlike full-stack business intelligence tools (described next), these querying and visualization tools are only useful if you’ve already solved the problem of data consolidation and optimization. As a result, they tend to be better suited for analysts at organizations with a mature analytics function.

Full-stack business intelligence (BI) platforms

In today’s leading companies, data isn’t just for analysts anymore. A truly data-driven organization will find ways to make data accessible and explorable for every member of its team. This extends beyond reporting: BI should empower business users to ask and answer their own questions, even if they don’t have experience with data modeling or writing SQL. The analytical and organizational benefits of such a system include:

Creating a single source of truth around business data: The most successful BI deployments end up adding nearly every employee at a company as a user. If multiple departments are provided with a centralized set of business rules that govern how data is used (e.g. how we define revenue or calculate churn), it gives data-driven arguments credibility and provides transparency into the KPIs that exist across the organization.

Data model abstraction: Chances are, your data model has some funky rules. How do you run a query to calculate sales revenue? If you’re like most companies, you do something like this: run a query that calculates the aggregate sum of order_total minus order_discounts in the transactions table where status = 2, a join on the refunds table yields no records, and the customer’s email (joined in from yet another table) doesn’t match the regular expression /.*testaccount.*/. Yikes.

Your business users don’t need to know all this, and they especially shouldn’t have to worry about updating all their analyses when those rules inevitably change. A BI system that delivers value to business users will abstract away such details, providing a clean starting point for analyses on key business metrics.

An intuitive user experience: Building a report, creating a visualization, organizing a dashboard, inviting colleagues to collaborate, and commenting on data are all key to the business user’s data workflow. If any of these tasks is unintuitive, a BI tool’s chances of success will plummet. In a world where users are accustomed to the simplicity of Dropbox, the convenience of Uber, and the addictiveness of Slack, nothing less will keep users coming back. A good BI system will integrate itself into a user’s daily workflow by providing important data and making it a pleasure to consume.
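
To make the data model abstraction point above concrete, here is a rough sketch in Python of what hiding those revenue rules behind a single definition could look like. Everything beyond the rules quoted in the text is an assumption: the customers table, the id, customer_id, and transaction_id columns, the sample rows, and the sales_revenue function name are hypothetical, and an in-memory SQLite database stands in for the warehouse. A real BI platform would express this in its own modeling layer rather than in application code.

```python
import re
import sqlite3

# The business rules live in one place, so analyses built on top do not repeat them.
COMPLETED_STATUS = 2  # from the text: only rows where status = 2 count
TEST_ACCOUNT_PATTERN = r".*testaccount.*"  # from the text: exclude test accounts

SALES_REVENUE_SQL = """
    SELECT COALESCE(SUM(t.order_total - t.order_discounts), 0)
    FROM transactions AS t
    JOIN customers AS c ON c.id = t.customer_id
    LEFT JOIN refunds AS r ON r.transaction_id = t.id
    WHERE t.status = ?
      AND r.transaction_id IS NULL      -- the join on refunds yields no records
      AND NOT (c.email REGEXP ?)        -- the email is not a test account
"""


def connect(path=":memory:"):
    """Open a connection with a REGEXP function, which SQLite lacks by default."""
    conn = sqlite3.connect(path)
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, then value.
    conn.create_function(
        "REGEXP", 2,
        lambda pattern, value: 1 if re.search(pattern, value or "") else 0,
    )
    return conn


def sales_revenue(conn):
    """The one definition of sales revenue that business users call."""
    (total,) = conn.execute(
        SALES_REVENUE_SQL, (COMPLETED_STATUS, TEST_ACCOUNT_PATTERN)
    ).fetchone()
    return total


# Hypothetical sample data to exercise each rule.
conn = connect()
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY, customer_id INTEGER,
        order_total REAL, order_discounts REAL, status INTEGER
    );
    CREATE TABLE refunds (transaction_id INTEGER);

    INSERT INTO customers VALUES (1, 'ann@example.com'), (2, 'qa.testaccount@example.com');
    INSERT INTO transactions VALUES
        (1, 1, 100.0, 10.0, 2),  -- counts toward revenue
        (2, 1,  50.0,  0.0, 2),  -- refunded, excluded
        (3, 1,  70.0,  5.0, 1),  -- status is not 2, excluded
        (4, 2, 200.0,  0.0, 2);  -- test account, excluded
    INSERT INTO refunds VALUES (2);
""")

revenue = sales_revenue(conn)
print(revenue)
```

When a rule changes, for example a new status code for completed orders, only the shared definition is edited, and every analysis that calls sales_revenue picks up the change.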

Machine learning

Most of the tools we’ve explored so far fit into a discrete workflow: user has a question, tool helps user answer that question, new questions emerge, repeat. The tool in such a framework delivers value by minimizing the delay between asking and answering. But what if there were a tool that told you what question you should be asking in the first place? Enter machine learning.

Machine learning tools base their offerings on statistical methods like cluster analysis, correlation analysis, and time series analysis to identify noteworthy anomalies in your data, classify entities into related groups, recommend actions to take, and more.

For the most part, these methods aren’t new; they’ve been available for decades. But modern SaaS machine learning tools like IBM Watson, Mintigo, and Spinnakr are productizing these techniques to make them available to companies without teams of data science PhDs.

Here’s what makes these tools stand out:

Automatic Detection: In an era of high-volume, low-cost compute cycles, machine learning tools analyze enormous volumes of data at near real-time speeds. Such tools can study thousands of metrics within your data and compare them to each other looking for correlations. They can analyze every data point in each of those metrics looking for unexpected events. And, best of all, they can do this over and over again, automatically, every time you get new data. This technology removes the highly fallible human “discovery” process from data exploration. New user clusters, correlations between key metrics, and suspicious purchasing behaviors can all be surfaced automatically.

Automatic Reaction: Once you’ve surfaced these interesting pockets of activity in your data, what do you do next? Some of these tools take it a step further and allow you to react to such events. Spinnakr, for example, allows you to customize the experience of your users based on what’s most effective with other users in their “cluster.” This functionality can have a big impact on strategies like personalization, testing, and security.

Productization: Developing effective machine learning algorithms is a non-trivial task. The process typically involves a high degree of data engineering, statistical knowledge, and lots and lots of training data. Rather than asking individual companies to build machine learning solutions internally, SaaS vendors have developed and trained these models and then sold the insights as a product. This productization has taken shape over the past few years, and it’s only just getting started.
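
As a toy illustration of the automatic detection idea described above, the sketch below flags unexpected events in a single metric using a rolling z-score. It is a deliberately simple stand-in, not how any of the products named here work: the find_anomalies function, the window and threshold values, and the daily_signups numbers are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, threshold=3.0):
    """Return the indexes of values that sit more than `threshold` standard
    deviations away from the mean of the `window` values just before them."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat histories, where the z-score is undefined.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical metric: daily signups with one obvious spike at index 9.
daily_signups = [100, 104, 98, 101, 99, 103, 97, 102, 100, 160, 101, 99]

flagged = find_anomalies(daily_signups)
print(flagged)
```

Running the same function over every metric each time new data arrives is the kind of repetitive checking these tools automate, at far larger scale and with more robust statistical methods.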

Choosing a tool

People have a profusion of preferences when it comes to analytical tools. Some people prefer command line interfaces like R and Python. Some people prefer to lay data out visually in Excel. Some people like manipulating data in a GUI customized for the type of analysis they do most.

Sometimes these choices are optimal, but sometimes they’re just based on what a user happens to know. It’s important to respect these choices in either case, because the most important thing is that people use data to make decisions at all. Limiting their flexibility in terms of how exactly they go about doing that will only serve to prevent adoption—the worst possible outcome.

The good news is that these tools aren’t mutually exclusive: each one is a different way of exploring and leveraging your company’s data to learn new things and make decisions. Which tool is best for you is simply a function of the problems you want to solve, the people you have available, and the extent to which you want to get your hands dirty.

Today, visualization tools and full-stack BI solutions lead the way from an adoption perspective, but that is largely driven by the existing knowledge base and the products that exist in the market. In a world where the data scientist is the highest-growth job description on the planet, it wouldn’t be surprising to see existing and new tools built for this group gain market share.
