In this guest post, Looker Senior Sales Engineer Bruce Sandell introduces three kinds of derived tables and talks about use cases for each. While some of the syntax is Looker-specific, the concepts apply to virtually all business intelligence (BI) software.
If your job title is data analyst, data engineer, or business analyst, you carry a great deal of responsibility. People in your organization, from the CEO to the summer interns, rely on you to deliver accurate results in a timely manner so they can make informed decisions.
If you're responsible for delivering business analytics to the users who need data, derived tables can be a valuable tool for you. A derived table is a SQL query that defines a set of business logic, returns reduced amounts of data, and can include complex calculations and data transformations. You can use derived tables to improve the performance of analytic queries, encapsulate business logic, ease the burden of maintaining complex queries, and facilitate logic that can't be easily achieved in a single query.
However, as helpful as derived tables can be in allowing you to develop fast and accurate analytics, they aren't always the right tool for the job. As we look at the benefits of derived tables, we'll also suggest when they may not be the best option.
Types of derived tables
Let's start by discussing the types of derived tables that are available, then talk about when it's appropriate to use each type.
The types of derived tables you have available depend on the functionality provided by the database you're using. Generally speaking, there are three classes of derived tables. The main difference between them is how long the data associated with the derived table is retained, which we refer to as persistence.
- Regular derived tables, also called ephemeral derived tables, are created using a common table expression (CTE) with the SQL WITH clause. A CTE defines a temporary result set, stored in memory, that persists only for the duration of the query in which it runs.
- Session-level derived tables, often referred to as temp tables, are available to any query that's run within the same session. They're not available to queries run outside of the session in which they are created, and they're automatically dropped as soon as the session is terminated. These tables are typically stored in memory, but can be written to disk as well. Session-level derived tables are not as commonly available as CTEs. At Looker, we typically don't use them unless the database we're working with doesn't support CTEs, as was the case with MySQL prior to version 8.0.
- Persistent derived tables are physical tables that are stored on disk, so they persist beyond a single query or user session. How long the data remains in a persistent derived table depends on the type of data, how often it's updated, and the types of analysis you want to perform with it.
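To make the first two persistence levels concrete, here is a minimal Python sketch using SQLite as a stand-in for a real warehouse (the sales table and its contents are invented for illustration). The CTE exists only for its one query, while the temp table remains available for the rest of the session:

```python
import sqlite3

# Hypothetical in-memory database standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# Ephemeral derived table: the CTE exists only for this single query.
rows = conn.execute("""
    WITH user_totals AS (
        SELECT user_id, SUM(amount) AS total FROM sales GROUP BY user_id
    )
    SELECT user_id, total FROM user_totals ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]

# Session-level derived table: visible to any later query on this
# connection, dropped automatically when the session ends.
conn.execute("""
    CREATE TEMP TABLE user_totals_tmp AS
    SELECT user_id, SUM(amount) AS total FROM sales GROUP BY user_id
""")
tmp_count = conn.execute("SELECT COUNT(*) FROM user_totals_tmp").fetchone()[0]
print(tmp_count)  # 2
```

The same WITH and CREATE TEMP TABLE syntax carries over to most warehouses that support these features, though the exact temp-table semantics vary by database.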
Looker users typically make liberal use of derived tables. Our LookML modeling layer allows data analysts to define derived tables that can be shared and reused by many different queries. Although Looker's architecture is built to make this type of reusable code particularly simple, you can use the same principles regardless of what BI tool you use.
How to use derived tables
Now that we know a little more about derived tables, let's discuss why we might want to use them.
One of the most common uses of derived tables is pre-aggregating data. Suppose, for example, you have a table that contains frequent sales transactions, and you want to see a report of sales summarized by user. If the data in your data warehouse is updated in near real time, it would make sense to calculate the appropriate aggregation using a CTE, because the results could change with each query execution, as new data is added. On the other hand, if the sales data is loaded into the data warehouse every four hours by an ETL process, you might create more efficient queries if you persist the summarized data and trigger a recalculation based on the completion of an ETL cycle.
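As a sketch of the real-time case above (again using SQLite with an invented sales table), a CTE-based aggregation is recomputed on every run, so newly loaded rows show up in the very next query:

```python
import sqlite3

# Toy sales table; contents are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER, sale_price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (2, 40.0)])

# Pre-aggregation expressed as a CTE: sales summarized by user.
BY_USER = """
    WITH sales_by_user AS (
        SELECT user_id, SUM(sale_price) AS total_sales
        FROM sales GROUP BY user_id
    )
    SELECT user_id, total_sales FROM sales_by_user ORDER BY user_id
"""

totals_before = conn.execute(BY_USER).fetchall()
print(totals_before)  # [(1, 100.0), (2, 40.0)]

# Because the CTE is rebuilt each time, a freshly loaded row is
# reflected immediately in the next query.
conn.execute("INSERT INTO sales VALUES (1, 25.0)")
totals_after = conn.execute(BY_USER).fetchall()
print(totals_after)  # [(1, 125.0), (2, 40.0)]
```

With a persistent table, by contrast, the second query would still show the stale totals until the table was rebuilt, e.g. after the next ETL cycle.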
Our general rule of thumb at Looker is to use an ephemeral table until performance becomes an issue. Because CTEs are built in memory and don't require a physical table for storage, they provide a lot more agility than persistent tables. A data engineer can design a CTE, add or remove columns, and test and redesign as required without having to involve a DBA to create physical database structures.
To be a little more specific, ephemeral derived tables (CTEs) are a great fit when a view:
- is quick to run
- should include real-time data
- should be dynamically built based on user filter inputs
- needs to be dynamic, but the number of permutations is manageable and likely to be reused
By contrast, persisting the table makes more sense when:
- the performance provided by a CTE is no longer acceptable
- the data is not updated in real time, or when having up-to-the-minute accuracy is not a requirement
- database resources are constrained, and the amount of resources required to persist the data is measurably less than that required to regenerate it with each query invocation
- you're prototyping – laying the groundwork for views, business logic, and future ETL processes
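A minimal sketch of the persistent case, assuming a file-backed SQLite database standing in for a warehouse: the derived table is materialized once with CREATE TABLE AS, and it outlives the session that created it.

```python
import os
import sqlite3
import tempfile

# Throwaway file-backed database; path and schema are invented for
# illustration.
path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE sales (user_id INTEGER, sale_price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (2, 40.0)])

# A persistent derived table: materialized once, e.g. when an ETL
# cycle completes, rather than recomputed on every query.
conn.execute("""
    CREATE TABLE sales_by_user AS
    SELECT user_id, SUM(sale_price) AS total_sales
    FROM sales GROUP BY user_id
""")
conn.commit()
conn.close()

# The table persists on disk and is available to a brand-new session.
conn2 = sqlite3.connect(path)
rows = conn2.execute(
    "SELECT user_id, total_sales FROM sales_by_user ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 100.0), (2, 40.0)]
```

In a production warehouse you would typically schedule the rebuild (for example, triggered by the completion of an ETL cycle, as described above) rather than create the table by hand.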
Another great use for derived tables, whether they are CTEs or persistent derived tables, is for encapsulating business logic. A library of derived table definitions can provide a number of benefits. First, it allows you to create actionable definitions of how you calculate frequently used business metrics. Any time a user needs a report that includes a specific metric, you can refer to your library of CTE code to ensure that you are consistent in your definitions, and these definitions can be shared across your organization. It can also help you to modularize your code, making your SQL easier to understand and access. The combination of consistent definitions and modular code makes things easier to maintain, particularly when the person who has to maintain the work is not the same person who originally wrote it.
To illustrate this, here's a definition for a CTE that provides summarized details for orders in a database. We'll probably refer to values like order_cost and order_gross_margin frequently, so we can reuse this CTE definition any time we want to calculate those values. In Looker, these definitions are stored within the LookML model so that they can be easily shared and reused. If your business intelligence platform does not provide an easy way for collecting and organizing them, you might collect these CTE definitions in a document that's shared among your data analysts.
WITH order_facts AS (
    SELECT
        order_items.order_id AS order_id,
        order_items.user_id AS user_id,
        order_items.created_at AS created_at,
        COUNT(DISTINCT order_items.id) AS items_in_order,
        COALESCE(SUM(order_items.sale_price), 0) AS order_amount,
        COALESCE(SUM(inventory_items.cost), 0) AS order_cost,
        COALESCE(SUM(order_items.sale_price - inventory_items.cost), 0) AS order_gross_margin
    FROM order_items AS order_items
    FULL OUTER JOIN inventory_items AS inventory_items
        ON inventory_items.id = order_items.inventory_item_id
    GROUP BY 1, 2, 3
)
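For readers who want to try the CTE out, here is a runnable sketch against toy data. The table contents are invented for illustration, the FULL OUTER JOIN is swapped for a LEFT JOIN so the query also runs on SQLite versions older than 3.39, and the ordinal GROUP BY is spelled out as column names for portability:

```python
import sqlite3

# Invented toy schema and rows mirroring the order_items and
# inventory_items tables referenced in the CTE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_items (
        id INTEGER, order_id INTEGER, user_id INTEGER,
        created_at TEXT, inventory_item_id INTEGER, sale_price REAL);
    CREATE TABLE inventory_items (id INTEGER, cost REAL);
    INSERT INTO inventory_items VALUES (10, 6.0), (11, 4.0);
    INSERT INTO order_items VALUES
        (1, 100, 1, '2024-01-01', 10, 10.0),
        (2, 100, 1, '2024-01-01', 11, 7.0);
""")

rows = conn.execute("""
    WITH order_facts AS (
        SELECT
            order_items.order_id AS order_id,
            order_items.user_id AS user_id,
            order_items.created_at AS created_at,
            COUNT(DISTINCT order_items.id) AS items_in_order,
            COALESCE(SUM(order_items.sale_price), 0) AS order_amount,
            COALESCE(SUM(inventory_items.cost), 0) AS order_cost,
            COALESCE(SUM(order_items.sale_price - inventory_items.cost), 0)
                AS order_gross_margin
        FROM order_items
        LEFT JOIN inventory_items
            ON inventory_items.id = order_items.inventory_item_id
        GROUP BY order_items.order_id, order_items.user_id,
                 order_items.created_at
    )
    SELECT order_id, items_in_order, order_amount, order_cost,
           order_gross_margin
    FROM order_facts
""").fetchall()
print(rows)  # [(100, 2, 17.0, 10.0, 7.0)]
```

Order 100 contains two items totaling 17.0 in sales against 10.0 in inventory cost, yielding a gross margin of 7.0.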
ELT or ETL?
If you're using a highly performant, highly scalable MPP cloud data warehouse, such as Amazon Redshift, Google BigQuery, or Snowflake, you should take advantage of the power of those databases and perform most of your transformations and aggregations in SQL. As we've already discussed, the use of regular and persistent derived tables allows you to maintain the most flexibility over your data, because you have access to both your raw data and aggregated or transformed data, and you can quickly redefine your derived table logic as your business needs change.
However, if your transformations are too complex to be easily handled by derived tables, you can use dbt (data build tool), a powerful command-line tool for data professionals that allows you to develop sophisticated transformations within your data warehouse. Dbt uses a templating language along with SQL commands to develop arbitrarily complex models that incorporate IF statements, FOR loops, and other more complex programming structures to give you control over your transformations. As you might expect from that description, using dbt is a bit more complicated than creating derived tables based on a single SQL command.
Even though we have these database-resident tools available to us, there are still times when using a purpose-built ETL tool is the best choice – though we recommend modeling and testing as much of your transformation logic as possible using derived tables, so that you don't have to go through the trouble of reprocessing all of your data through your ETL tool if your transformation logic changes.
Some of the use cases for which we consider the use of an ETL tool to be the right tool for the job include:
- When a powerful ETL/transformation tool can be leveraged, and the underlying database does not have the performance or resources available to support the other techniques previously discussed
- When a persistent derived table's logic is well understood, stable, and consistently used, making it a good candidate for migration to an ETL tool
- When the raw data needs to be processed only once, or processed incrementally as it is ingested
- For Looker users, we recommend using an ETL tool for loading the transformed data if the transformed table is going to be used outside of Looker, as this can help avoid some table locking issues.
Deriving value from derived tables
I hope you now have a better understanding of derived tables and the value they offer you as a data professional, as well as the potential benefits that they offer your users and customers. They're a great tool for helping you write performant, easy-to-manage code. They also give you a great deal of agility to quickly modify your code as your users' needs change.
To learn more about how we use derived tables at Looker, or how Looker's data platform can help you to deliver the data your customers need, where and when they need it, please contact us at firstname.lastname@example.org.