The foundation of any organization's data analytics stack is its data warehouse. Chances are you put a lot of thought into the structure and contents of yours, but have you given any thought to data warehouse maintenance — that is, making sure the data warehouse columns, tables, views, and schemas are accurate and up to date?
As your company’s data warehouse ages, you'll need to:

- Add new metrics and the tables, views, or columns that track them
- Deprecate metrics that have become stale or misleading
- Manage user permissions as employees join, leave, and change roles
- Remodel data warehouse objects to maintain performance
Let's look at how a data engineer can address these routine maintenance issues.
The way you need to measure your business changes over time. Your organization will launch new products, look into different user behaviors, or try to create new predictive models, and you'll need to track metrics for these efforts in new tables, views, or columns.
You need to make sure that new data structures not only track the data you need but also make their purpose clear to anyone in the organization who does reporting. If, for example, you create a view for your support team but do a poor job of expressing what the view is for, other users looking for a similar view may not know one already exists or understand its purpose, so they may recreate it, and nobody likes duplicate resources.
When adding new metrics you need to consider:

- Where the new data belongs: an existing table, an existing view, or a new view
- Whether historical values can be backfilled
- How the new objects fit your naming conventions
How to add new metrics to a schema
You need to track new metrics, but should you add the data to an existing table, add it to an existing view, or create a whole new view? Let’s review the best case for doing each:
- Add to an existing table
- Add to an existing table and view*
- Create a new view*

*Typically the new metric would be added to an existing table but queried from the view.
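As a sketch of the most common case, adding the metric to an existing table and exposing it through the view users already query, assuming a hypothetical `orders` table and `orders_reporting` view:

```sql
-- Add the new metric column to the underlying table.
-- (Hypothetical names: orders, discount_amount, orders_reporting.)
ALTER TABLE orders
  ADD COLUMN discount_amount numeric(10,2);

-- Rebuild the reporting view so analysts query the metric from the
-- view rather than the raw table. Note that in PostgreSQL,
-- CREATE OR REPLACE VIEW only allows new columns appended at the end.
CREATE OR REPLACE VIEW orders_reporting AS
SELECT
  order_id,
  customer_id,
  order_total,
  created_at,
  discount_amount
FROM orders;
```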
Once you've added data structures for your new metrics, you should backfill data whenever you can determine what the values should be.
In a dimension table you might be able to determine values based on values of other columns, or use default values. If you can't, leave the value as null — but consider the impact of nulls for reporting and aggregation. If working with null values would have a negative impact on your analytics, consider using a stand-in value to indicate you couldn't backfill it.
In a measures table the same principles hold: if you can determine a value, backfill it. Since measures are more often aggregated, the impact of nulls can be even greater. If a measures table holds time-based data, nulls have another downside: they may limit your analysis to the period after you added the new column. Sometimes you can work around this issue by bringing in data from other data sources, or by imputing values based on overall statistics or the dimensions of each row.
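A minimal backfill sketch, continuing the hypothetical `orders` example (the derivation logic and stand-in value are illustrative, not prescriptive):

```sql
-- Backfill the new column where a value can be derived from
-- existing columns.
UPDATE orders
SET discount_amount = order_total - amount_paid
WHERE discount_amount IS NULL
  AND amount_paid IS NOT NULL;

-- Optionally mark rows that cannot be backfilled with a stand-in
-- value agreed on with analysts, so reports can distinguish
-- "unknown" from a true zero discount.
UPDATE orders
SET discount_amount = -1
WHERE discount_amount IS NULL;
```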
Naming conventions create a standard language for everyone who queries the database, and make it easy to understand the purpose of every column and table. Any time you add tables, views, or columns to a data warehouse you should ensure these objects follow your organization’s naming conventions. At Chartio, we have a naming convention and style guide that we follow when adding or updating metrics. Your organization should have your own style guide, and employees should follow it.
So much for new metrics — what about old ones? Metrics may become inaccurate or no longer worth analyzing. As companies grow, the tools they use to work with data change, which can lead to having multiple places from which the same type of data can be queried. An analyst might not be aware of which source they're using and could query the wrong data.
Business objectives change too, and that can affect what data is appropriate for analysis. Features or products may be deprecated, which can make their associated metrics misleading.
You want to prevent stale or misleading data from being queried and yielding false conclusions.
First, let users know when metrics are no longer useful, via email or through your BI tool, so they're not caught off guard. Then rename these objects with a suffix like _deprecated or _do_not_use. If you're still concerned about people using the data incorrectly, you can create new tables or views that no longer contain the deprecated columns.
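Both approaches can be sketched in PostgreSQL syntax (table and column names are hypothetical):

```sql
-- Rename the stale column so its status is obvious to anyone
-- browsing the schema or writing a query against it.
ALTER TABLE orders
  RENAME COLUMN legacy_score TO legacy_score_deprecated;

-- Or publish a view that simply omits the deprecated column,
-- and point analysts at the view instead of the table.
CREATE VIEW orders_current AS
SELECT order_id, customer_id, order_total, created_at
FROM orders;
```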
Naming conventions for old metrics should be incorporated into your company’s style guide. In this case, naming plays an integral role in keeping users from querying data warehouse objects incorrectly.
Not having access to the data you need halts analysis — but giving everyone access may create a legal liability, especially in the case of personally identifiable information. The trick is to walk the line between too restrictive and too permissive. Permissions also control things like the ability to modify schemas and drop tables — things you don't want most users to do. With some data warehouses, permissions can even save money.
Consider a scenario where you have analysts working on BigQuery projects. Google charges for BigQuery by data volume processed, and you’re worried about query costs. To keep users from running too many queries and raising your costs, you can create a custom quota that limits the amount of query data processed per day, at either the project or the user level.
Permission issues happen when data warehouse access doesn't mirror employee status. Here's what you should do in four common situations.
| Situation | What to do |
| --- | --- |
| New employees need to query the data warehouse, or existing employees change roles and need access | Make sure they have the appropriate permissions in the warehouse |
| Employees leave | Deactivate their accounts, just as you would turn off their email |
| Employees are temporarily granted special permission to access more data | Revoke the permission when the temporary period ends |
| Employees share accounts | Keep this from happening. It's a bad practice, as it doesn't leave an audit trail that shows who looked at what data |
Generally speaking, make sure access to your data warehouse mirrors the employment status of your employees. Grant permission to allow users access to only the data they need. Don't give them permission to delete data that can’t be recovered.
We recommend programmatically adding and removing users to keep employees from being blocked and to ensure they don't maintain access to data they shouldn't have. In large or growing companies you may not know all of the people you work with, and you may not know when their roles or employment status changes. Programmatic permissions will help address this.
If instead you manually add, remove, and change privileges for users, you need to be especially vigilant about removing or updating permissions. Set user permissions at the team level rather than the individual level: as you scale up your usage and add more users, it's easier to track teams.
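In PostgreSQL, team-level permissions map naturally onto roles: grant privileges to a group role once, then manage individuals as one-line membership changes. A sketch with hypothetical role, schema, and user names:

```sql
-- Create a role per team and grant privileges to the role,
-- never to individual users.
CREATE ROLE support_team;
GRANT USAGE ON SCHEMA reporting TO support_team;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO support_team;

-- Adding a user is then a single membership grant.
GRANT support_team TO alice;

-- When Alice leaves or changes roles, revoke the membership.
REVOKE support_team FROM alice;
```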
Sharing accounts is an unfortunate practice that you need to keep an eye on. Sharing accounts makes it impossible to hold users responsible for actions they’ve carried out. If you need to find out why a user dropped a table or updated a column, or accessed sensitive data, you might have to ask multiple people to figure it out, and you may never get a straight answer. Shared accounts may also be a breach of agreements you have with customers and their expectations of how you handle their data. Give each user a separate account to ensure security compliance and accountability.
The amount of data you have in your data warehouse will grow over time as your objectives change and you begin to track new metrics. As the amount of data grows, and as new analysts and new business objectives shift what data is being queried, you'll need to consider whether the way you designed your data warehouse objects — schemas, tables, views, and columns — still makes sense based on the way users query it. The data warehouse objects may need to be reconfigured to optimize usage and performance.
The clearest sign that you need to remodel these objects is a drop in performance. If simple queries take too long to run, users will stop querying the data or start filing support tickets.
Different data warehouses will have options to check performance, but most offer ways to identify both slow and common queries. To improve performance for slow queries you may be able to add indexes. For common queries you can create views.
Let's take PostgreSQL as an example. PostgreSQL has a “slow query” log that lets you set a time threshold. If a query takes longer than this threshold, PostgreSQL writes a line to the slow query log. A data engineer can look at a query in the log, examine the query plan to see how a query is executing, and adjust the query to be more efficient.
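The threshold is controlled by the `log_min_duration_statement` setting. One hedged way to set it (requires superuser privileges; pick a threshold that suits your workload):

```sql
-- Log any statement that runs longer than 500 ms to the server log.
ALTER SYSTEM SET log_min_duration_statement = '500ms';

-- Reload the configuration so the change takes effect.
SELECT pg_reload_conf();
```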
Specifically in PostgreSQL, you can use the EXPLAIN command and the EXPLAIN ANALYZE command. The EXPLAIN command shows the generated query plan but does not run the query. To see the results of actually executing the query, use the EXPLAIN ANALYZE command. Based on the output, you can decide whether to create an index to speed up the query time.
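For example, against a hypothetical `orders` table:

```sql
-- EXPLAIN shows the plan without running the query;
-- EXPLAIN ANALYZE runs it and reports actual row counts and timings.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42;

-- If the plan shows a sequential scan over a large table for this
-- filter, an index on the filter column may speed it up.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```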
In PostgreSQL you can use pg_stat_statements to group identical queries and find optimization opportunities. The pg_stat_statements extension records the queries run against your PostgreSQL instance, saving each query's text, execution time, underlying reads and writes, and parameters. This information lets you determine what kind of data users want so you can optimize frequently run queries.
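A typical way to surface the most expensive query shapes (the extension must also be listed in `shared_preload_libraries`; the timing columns shown are the PostgreSQL 13+ names):

```sql
-- Enable the extension in the current database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Rank query shapes by total time spent executing them.
SELECT query,
       calls,
       total_exec_time,   -- total_time on PostgreSQL 12 and earlier
       mean_exec_time     -- mean_time on PostgreSQL 12 and earlier
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```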
Creating a view can help users unfamiliar with the structure of the data warehouse by consolidating what they need to query to a single place. For example, you can grant users from a specific department access to a view that reflects all the departmental information they need to query.
In addition, you can get performance benefits if you materialize the view or create a new table. Most of the improvements here will be seen if the query is heavily filtered or if it is aggregated. Users can then query the materialized view or table. You can get an even bigger bump in performance if you add an index to this new materialized view or table.
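A PostgreSQL sketch of this pattern, using hypothetical names for an aggregated revenue rollup:

```sql
-- Materialize a heavily filtered, aggregated query so analysts hit
-- precomputed results instead of scanning the base table.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT created_at::date AS day,
       sum(order_total) AS revenue
FROM orders
GROUP BY 1;

-- Index the materialized view for fast lookups by day.
CREATE INDEX idx_daily_revenue_day ON daily_revenue (day);

-- Refresh on a schedule to keep the precomputed data current.
REFRESH MATERIALIZED VIEW daily_revenue;
```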
A data warehouse that provides a single source of truth is a worthwhile investment, but without maintenance it will fall into disarray and lose its value. To keep that from happening, follow these best practices:

- Add new metrics to the right tables and views, backfill values where you can, and follow your naming conventions
- Deprecate stale or misleading metrics, and rename them so no one queries them by mistake
- Keep data warehouse permissions in sync with employee status and roles, ideally programmatically
- Monitor query performance and remodel objects with indexes, views, and materialized views as usage changes
Got any additional suggestions? Please share them via comments below.
This post was adapted from Chartio's Data Governance web book.