How do you begin combining data from the cloud applications your organization uses with data in your internal databases to gain insight into how to improve your business? Maybe your organization has already standardized on Tableau as your BI tool, but you're still learning about using it with data from multiple sources.
We've created a video that walks you through the process of using Stitch to get your data from a SaaS application into a data warehouse, where you can analyze it with Tableau.The rest of this post takes a look at the process in detail.
To analyze data from multiple sources, you need a data warehouse that consolidates all of your data in a single location. Most businesses take advantage of cloud data warehouses such as Amazon Redshift, Google BigQuery, or Snowflake.
You can extract data that you have stored in SaaS applications and databases and load it into the data warehouse using an ETL (extract, transform, load) tool. Once the data is available, your analysts can use it to create reports. In this post, we'll look at how to start from scratch and create a report using Tableau.
Our data analytics architecture
We'll start with Stitch, a simple, powerful ETL service for businesses of all sizes, up to and including the enterprise. It can move data from dozens of data sources. Sign up and you can be moving data to a data warehouse in five minutes. Choose at least one of Stitch's integrations as a data source. Most businesses get more value out of their data as they integrate more data sources; for our example, we'll look at data from two separate sources.
If you don't already have a data warehouse, one of the easiest ways to get one is by using Panoply, a platform that provides a managed version of Redshift. You can set up a Panoply destination easily from within Stitch. At the Select a Destination screen, click on Panoply, click on Create a New Account, and follow the prompts. Stitch will use the email address you logged in with, autogenerate a secure password, and set up a database destination. Save your login credentials so you can access your new data warehouse in the future.
For illustration purposes, I'll use Stitch's own corporate data, which we store in an Amazon Redshift data warehouse. Redshift was one of the first entrants in the cloud data warehouse market, and it remains a popular choice.
We'll use Tableau for the BI layer of our data analytics stack. You're probably already familiar with Tableau, but if not, you can download Tableau Desktop for a 14-day free trial.
A word of advice: Tableau is a complex application. You can walk through the Tableau Desktop tutorial when you get started, but chances are you're still going to need a lot of hand-holding. I watched Tableau's Getting Started video, and about three-quarters of the way through it, my head started spinning with the number and complexity of the options Tableau offers. If you don't have an experienced, helpful Tableau user nearby, I recommend getting a good book. I found "Practical Tableau," which is one of the most recently published books on the topic, valuable.
Building a sample report
I got approval from our CEO to share some of Stitch's real data for this exercise. I'll use data from a table in one of our back-end product databases to select all of our active clients, and join that list against another table that tracks the number of Google Analytics connections our clients have established. I'll create a report that shows the number of connections to Google Analytics over time, filtered to include only active clients.
On Tableau Desktop's Connect pane, which displays when you start the application, the left column shows the data sources Tableau can connect to.
If you click More under the To a Server section, you'll see an impressive list of databases, cloud services, and connectors. Click on Amazon Redshift and Tableau opens a window that lets you specify the server and database you want to connect to, along with your authorization information. If you're using a Panoply data warehouse, you should still select Amazon Redshift, but use
db.panoply.io as the server you're connecting to.
I'll select the
connections table in our
I dragged the
connections table to the workspace. At this point you can specify a live connection to the table data or an extract, or a snapshot of the table at this point in time. I chose Extract, because I'm fine with a snapshot that I can update as needed, and the table is small enough to easily store in my local environment. If you're working with a large number of records, or data that needs to be accurate in near real time, you'd be better served by using a live connection.
I also added a filter to limit the amount of data I was working with by clicking on Add in the upper right corner of the pane. I set it to give me only records in which the
created_at field was from October 2015 or later. The filter creation process requires clicking through a number of windows: first add, then select a field. Tableau recognizes that this is a date field and gives you a number of ways to filter. I chose Month /Year, then Condition, then specified that the value had to be greater than or equal to the date I specified.
Similarly, I filtered Type, specifying that the field must contain
I needed only three columns from the
connections table, so I removed the others by clicking on the drop-down next to their names in the Dimensions and Measures area of the Data pane and choosing Hide.
Dimensions and measures are two important concepts in Tableau. If you're data-savvy, you can think of them as independent and dependent variables. If you're not, think of dimensions as categories and measures as quantitative data – the numbers that fall into those categories. Tableau automatically tries to categorize fields as one or the other, but you can move fields from one area of a workspace to another by dragging them, or by clicking on the drop-down menu associated with each field and choosing Convert to Measure or Convert to Dimension.
Next, I returned to the Connections pane and dragged into the workspace the
rjm_clients table from our
platform schema, which holds all the information about our clients. Immediately, Tableau tries to join the two tables, proposing a join on what it thinks are key fields. Clicking on the Join icon between the two tables brings up a window that lets you specify the type of join and the key fields. In this case the keys Tableau suggested were wrong, but it was easy to specify the correct ones through drop-down selectors.
As I did with the
connections table, I hid the columns I didn't need, preserving only the client ID and a Boolean indicator of whether the client was active. This time I hid each field by using a drop-down next to their names in the workspace. You'll notice that Tableau generally gives you more than one way to do things, which makes it really flexible and sometimes really confusing.
On the worksheet tab, I dragged the
Active_Boolean field to the Filters area and ticked True to specify that I wanted to see data on active accounts only.
At this point I'd done all of my prep work on the data. I had only the tables and fields I wanted, joined and filtered appropriately. From this point, creating a visualization was fairly easy.
First, I dragged the
created_at field to the Columns shelf in the workspace. Tableau guesses that I wanted to look at our data by year, but I actually want to see it by month and year. That's simple to change by clicking on the green "pill" on the Columns shelf and choosing the appropriate function from the drop-down list. So far my visualization shows only months and years – not too useful yet.
Now I need to drag a measure –
Number of Records, which Tableau creates automatically – to the Rows shelf.
Notice that the pills in both the Columns and Rows shelves are colored green. In Tableau, green indicates continuous data, while blue indicates discrete data. If we change the column type from continuous to discrete, we'll get labels under each data point.
Either way, I've accomplished what I set out to do – I have a useful visualization of our data. In it we see a couple of different data sets: an older version of our GA connector that was based on an old Google API, whose usage is trending down, and a newer version that we introduced more recently that uses the Google Analytics Reporting API, for which usage is growing. A report like this would be useful to our product team as it watches new, improved versions of our technology take over from old ones we're phasing out.
Of course you can customize the visualization. The Marks Cards in the Marks Shelf let you alter the look of graph elements, and double-clicking on axes and titles lets you change or hide them.
Tableau also offers a Show Me area that you can toggle on or off at the right side of the workspace. It provides thumbnails of 24 chart types; any that aren't available because your data won't support them are grayed out.
When you're satisfied with your report, you can publish it to Tableau Server. Once it's there, you can get an embed code to publish it on the web, or get a link you can email to people who should view your visualization.
Tableau is incredibly powerful. It gives you a way to visualize just about anything you can think of – but its drag-and-drop interface is correspondingly complex. Tableau lets you do amazing things, but you'll need training and experience before you can be productive.
Take a look, then sign up for a free trial of Stitch and start creating your own business intelligence.