Best practices for data modeling

The data in your data warehouse are only valuable if they are actually used. To make your data usable, you need to consider how the data are presented to end users and how quickly users can answer their questions. In this post I cover some guidelines on how to build better data models that are more maintainable, more useful, and more performant.

Data modeling has become a topic of growing importance in the data and analytics space. Recent technology and tools have unlocked the ability for data analysts who lack a data engineering background to contribute to designing, defining, and developing data models for use in business intelligence and analytics tasks.

The term "data modeling" can carry a lot of meanings. For our purposes we'll refer to data modeling as the process of designing data tables for use by users, BI tools, and applications.

Much ink has been spilled over the years by opposing and pedantic data-modeling zealots, but with the development of the modern data warehouse and ELT pipeline, many of the old rules and sacred cows of data modeling are no longer relevant, and can at times even be detrimental.

In this post we'll take a dogma-free look at the current best practices for data modeling for the data analysts, software engineers, and analytics engineers developing these models.

Evolution of the business analytics stack

The business analytics stack has evolved a lot in the last five years. With current technologies it's possible for small startups to access the kind of data that used to be available only to the largest and most sophisticated tech companies. The modern analytics stack for most use cases is a straightforward ELT (extract, load, transform) pipeline. Thanks to providers like Stitch, the extract and load components of this pipeline have become commoditized, so organizations are able to prioritize adding value by developing domain-specific business logic in the transform component.

The transform component, in this design, takes place inside the data warehouse. Data are extracted and loaded from upstream sources (e.g., Facebook's reporting platform, MailChimp, Shopify, a PostgreSQL application database, etc.) and directly copied into a data warehouse (Snowflake, Google BigQuery, and Amazon Redshift are today's standard options). Once the data are in the warehouse, the transformations are defined in SQL and computed by the warehouse in the format of a CREATE TABLE AS SELECT … statement. Throughout this post I'll be giving examples that assume you're using something like an ELT pipeline context, but the general lessons and recommendations can be used in any context.

What do data modelers think about?

When you sit down at your SQL development environment[1] what should you be thinking about when it comes to designing a functioning data model?

A data model-developer often wears multiple hats — they're the product owner of a piece of software that will be used by downstream applications and users as well as the software engineer striving to deliver that value.

As when you're writing any software, you should be thinking about how your product will fit at the intersection of your users' needs and the limitations of the available technology. In the case of a data model in a data warehouse, you should primarily be thinking about users and technology:

  • Users
    • Data analysts and data scientists who want to write ad-hoc queries to perform a single analysis
    • Business users using BI tools to build and read reports
  • Technology
    • How does the data model affect query times and expense?
    • How does the data model affect transformation speed and data latency?

You want to design a data model that:

  • Is comprehensible by data analysts and data scientists (so they make fewer mistakes when writing queries).
  • Works well with the BI tool you're using.
  • Minimizes transform time (time-to-build).
  • Minimizes response time to both the BI tool and ad-hoc queries.
  • Minimizes costs[2].
    Often these goals are in conflict with each other, so the design of a data model requires tradeoffs along these dimensions. That's expected and totally fine! As data modelers, our job is to recognize these tradeoffs, make informed decisions, then make adjustments in the future as it's required.

Since every organization is different, you'll have to weigh these tradeoffs in the context of your business, the strengths and weaknesses of the personnel on staff, and the technologies you're using. What might work well for your counterpart at another company may not be appropriate in yours!

The Kimball legacy

Many data modelers are familiar with the Kimball Lifecycle methodology of dimensional modeling originally developed by Ralph Kimball in the 1990s. I recommend that every data modeler be familiar with the techniques outlined by Kimball. Terms such as "facts," "dimensions," and "slowly changing dimensions" are critical vocabulary for any practitioner, and having a working knowledge of those techniques is a baseline requirement for a professional data modeler.

That being said, it's important to remember that the techniques Kimball developed were designed for a world in which the modern data warehouses most organizations use today did not exist. Understanding the underlying data warehousing technologies and making wise decisions about the relevant tradeoffs will get you further than pure adherence to Kimball's guidelines. Dogmatically following those rules can result in a data model and warehouse that are both less comprehensible and less performant than what can be achieved by selectively bending them.

For example, in the most common data warehouses used today a Kimball-style star schema with facts and dimensions is less performant (sometimes dramatically so) than using one pre-aggregated really wide table. This extra-wide table would violate Kimball's facts-and-dimensions star schema but is a good technique to have in your toolbox to improve performance!

Data model design considerations

When it comes to designing data models, there are four considerations that you should keep in mind while you're developing in order to help you maximize the effectiveness of your data warehouse:

  • Grain
  • Naming
  • Materialization
  • Permissioning and governance

Grain

The most important data modeling concept is the grain of a relation. (I'm using the abstract term "relation" to refer generically to tables or views.) The grain of the relation defines what a single row represents in the relation. In a table like orders, the grain might be single order, so every order is on its own row and there is exactly one row per order. Or in users, the grain might be a single user.

At other times you may have a grain of a table that is more complicated — imagine an order_states table that has one row per order per state of that order. In this relation each order could have multiple rows reflecting the different states of that order (placed, paid, canceled, delivered, refunded, etc.).

When designing a new relation, you should:

  1. Determine the grain of the relation.
  2. Name the relation such that the grain is clear.
  3. Ensure that all of the columns in the relation apply to the appropriate grain (i.e., don't have a users_age column in the orders relation)

By ensuring that your relations have clear, consistent, and distinct grains your users will be able to better reason about how to combine the relations to solve the problem they're trying to solve.

Naming

Naming things remains a challenge in data modeling. Rule number one when it comes to naming your data models is to choose a naming scheme and stick with it. Here are some naming rules that I tend to use for my projects, but using my exact rules is much less important than having rules that you use consistently.

Relation naming

  • Use schemas to name-space relations that are similar in terms of data source, business unit, or abstraction level. For example, you might use the marketing schema to include all of the tables most relevant to the marketing team and the analytics schema to house higher-level concepts like long-term value.
  • Always use snake_case.
  • Use the pluralized grain as the table name: orders, users, subscriptions, order_item_names.

Column naming

  • IDs should get an _id suffix, and primary keys should be called $OBJECT_id (e.g., order_id, user_id, subscription_id, order_item_name_id).
  • Timestamps should get an _at suffix and dates should get an _date suffix (e.g., ordered_at or user_creation_date).
  • Booleans should begin with is_ or has_ (e.g., is_canceled or has_discounted_item).

In general you want to promote human-readability and -interpretability for these column names. Since the users of these column and relation names will be humans, you should ensure that the names are easy to use and interpret.

In addition to just thinking about the naming conventions that will be shown to others, you should probably also be making use of a SQL style guide. There are lots of great ones that have been published, or you can always just write your own.

Materialization

As a data modeler one of the most important tools you have for building a top-notch data model is materialization. By "materialization" I mean (roughly) whether or not a given relation is created as a table or as a view. If you create the relation as a table, you precompute any required calculations, which means that your users will see faster query response times. If you leave the relation as a view, your users will get more up-to-date data when they query, but response times will be slower. Folks from the software engineering world also refer to this concept as "caching."

Depending on what data warehousing technology you're using (and how you're billed for those resources) you might make different tradeoffs with respect to materialization. In general, when building a data model for end users you're going to want to materialize as much as possible. This often means denormalizing as much as possible so that, instead of having a star schema where joins are performed on the fly, you have a few really wide tables (many many columns) with all of the relevant information for a given object available.

In addition to denormalizing your data so that querying is faster (because the database doesn't have to execute the joins on the fly) you also get the added benefit of making queries simpler for end users to write. In my experience, most non-experts can adeptly write a query that selects from a single table, but once they need to include joins the chance of errors climbs dramatically.

To ensure that my end users have a good querying experience, I like to review database logs for slow queries to see if I could find other precomputing that could be done to make it faster. If an expensive CTE (common table expression) is being used frequently, or there's an expensive join happening somewhere, those are good candidates for materialization.

Permissioning and governance

In addition to determining the content of the data models and how the relations are materialized, data modelers should be aware of the permissioning and governance requirements of the business, which can vary substantially in how cumbersome they are. You should work with your security team to make sure that your data warehouse obeys the relevant policies. For example, businesses that deal with health care data are often subject to HIPAA regulations about data access and privacy. Any customer-facing internet business should be worried about GDPR, and SaaS businesses are often limited in how they can use their customers' data based on what is stipulated in the contract.

As a data modeler, you should be mindful of where personally identifying customer information is stored. Often, it's good practice to keep potentially identifying information separate from the rest of the warehouse relations so that you can control who has access to that potentially sensitive information. You should be aware of the data access policies that are in place, and ideally you should be working hand-in-hand with your security team to make sure that the data models you're constructing are compatible with the policies that the security team wants to put in place.

Conclusion

These are the most important high-level principles to consider when you're building data models. The most important piece of advice I can give is to always think about how to build a better product for users — think about users' needs and experience and try to build the data model that will best serve those considerations. While having a large toolbox of techniques and styles of data modeling is useful, servile adherence to any one set of principles or system is generally inferior to a flexible approach based on the unique needs of your organization.

As long as you put your users first, you'll be all right.

Notes


  1. Vim + TMUX is the one true development environment don't @ me ↩︎

  2. For some warehouses, like Amazon Redshift, the cost of the warehouse is (relatively) fixed over most time horizons since you pay a flat rate by the hour. However, for warehouses like Google BigQuery and Snowflake, costs are based on compute resources used and can be much more dynamic, so data modelers should be thinking about the tradeoffs between the cost of using more resources versus whatever improvements might otherwise be obtainable. ↩︎