Your central database for all things ETL: advice, suggestions, and best practices
ETL Challenges and Issues
You’ve heard it before: Tech projects have a tendency to go over time and over budget. Specifically, McKinsey has found that, on average, large IT projects run 45% over budget, 7% over time, and deliver 56% less value than predicted. Your ETL project will probably not be flawless. Here are the most common mistakes that derail ETL projects.
Forgetting about long-term maintenance
Some of the future maintenance costs that companies forget about:
- Data formats changing over time
- An increase in data velocity
- The time cost of adding new data connections
- The time cost of fixing broken data connections
- Requests for new features, including new columns, dimensions, and derivatives
ETL best practice #1: Make the ETL buy-vs.-build decision with care. ETL can be a fun challenge at the beginning, but it gets old fast. Jeff Magnusson, director of data platform at Stitch Fix, takes a strong stance on this topic, declaring, “Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.” Even if your developers are excited about the challenge, choose the build approach with caution.
Underestimating data transformation requirements
Raw data from disparate sources almost always needs to be cleaned and normalized before it gets to your data warehouse. Data from different systems typically don’t play well together, and it requires work to get them to cooperate. Here are some common examples:
- Establishing key relationships across data sources, many of which might not exist in the raw data
- Updating new values on existing records without sacrificing performance
- Time zone consistency
- Lookup value normalization (US = United States = USA)
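Two of the cleanup tasks above, lookup value normalization and time zone consistency, can be sketched in a few lines. This is a minimal illustration with hypothetical column names and alias tables, not a production cleansing layer:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical lookup table: map the variants seen across source systems
# (US = United States = USA) onto one canonical value.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}

def normalize_country(raw: str) -> str:
    """Collapse lookup-value variants onto a single canonical form."""
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key, raw.strip())

def to_utc(ts: datetime) -> datetime:
    """Enforce time zone consistency: every timestamp lands in UTC.

    Naive timestamps are treated as already-UTC here -- a policy each
    pipeline has to decide on explicitly for its own sources.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)

# Example: a record from a source system running five hours behind UTC.
record = {
    "country": "USA",
    "created_at": datetime(2024, 1, 1, 9, 30,
                           tzinfo=timezone(timedelta(hours=-5))),
}
record["country"] = normalize_country(record["country"])
record["created_at"] = to_utc(record["created_at"])
```

Every source system gets its own quirks, so in practice these mappings grow into per-source cleansing rules, which is exactly the maintenance cost that is easy to underestimate.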
If you ignore the transformation step, the data in your warehouse will be impossible to work with and full of inconsistencies, causing your decision-makers to lose faith in its accuracy.
ETL best practice #2: While most data warehouses today have the processing capabilities to manage data modeling in the warehouse, you’ll still want cleansing mechanisms to get the data to a place where it plays nicely with data from other sources.
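The "updating new values on existing records" problem above is commonly handled with an upsert: insert the row, or update the existing one on a key collision, rather than deleting and reloading the table. A minimal sketch using SQLite's `ON CONFLICT` clause (the table and columns are hypothetical; your warehouse's bulk `MERGE`/upsert syntax will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
)

def upsert_customer(conn, row):
    # On a primary-key collision, update the existing record in place
    # instead of failing the insert or reloading the whole table.
    conn.execute(
        """INSERT INTO customers (id, email, plan)
           VALUES (:id, :email, :plan)
           ON CONFLICT(id) DO UPDATE SET
               email = excluded.email,
               plan  = excluded.plan""",
        row,
    )

upsert_customer(conn, {"id": 1, "email": "a@example.com", "plan": "free"})
upsert_customer(conn, {"id": 1, "email": "a@example.com", "plan": "pro"})
```

The second call updates the existing row rather than creating a duplicate, which is what preserves both correctness and performance as the table grows.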
Forgoing the customer development process
Before you begin, remember that business units are the customers for this project. Who wants the data you are ETLing? What do they want to do with it? Run through the customer development process: do interviews (and not just with managers) and get your hands on current analyses. Without this information, you risk failing to:
- Bring in the information your users need most
- Support mission-critical reporting workflows
- Anticipate future data needs
- Extract from the sources that run your business
ETL best practice #3: Regard your business units as customers and involve them in the development process to help ensure you build the system they need.
Tightly coupling different elements of your data pipeline
Any pipeline that’s responsible for shipping data into a warehouse has more components than you would initially anticipate – data integrations for each raw data source, transformation logic, job scheduling – basically all the components outlined in the ETL Architecture section.
Each of these components and their subcomponents may come with independent technical decisions that will impact the scalability, functionality, and maintenance burden down the road. Frequently, first-time data engineers attempt to solve all of these problems with the same technology, but often, each component requires a specialized solution.
ETL best practice #4: Choose the right tool for each component of your stack. This allows you to update components of your ETL process when technologies and business needs change without having to rebuild from scratch.
Building your ETL process based on your current data scale
The volume and variety of your data are going to increase. The more successful your company and the more data-reliant your team becomes, the more quickly you’ll go from generating thousands of rows per hour to millions of rows per second.
Is your business about to expand globally? Does your product team want to do analysis on in-app events? Is your business on the cusp of rapid growth? Any of these scenarios could dramatically change the scale of data you’re transporting through your ETL system.
ETL best practice #5: Your ETL system needs to be able to scale; otherwise, you’ll outgrow your infrastructure and be stuck rebuilding.
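One concrete example of scale-aware design is loading data in bounded batches rather than row by row or all at once: memory use stays flat no matter how large the source grows. This is a hypothetical sketch (the batch size and row source are assumptions), not a prescription for any particular tool:

```python
from itertools import islice

def batched(rows, size=10_000):
    """Yield fixed-size lists from any row iterator, so memory use
    stays constant regardless of how many rows the source produces."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# A million-row generator is never materialized in memory at once.
rows = ({"n": i} for i in range(1_000_000))
total = 0
for batch in batched(rows, size=50_000):
    total += len(batch)  # in a real pipeline: bulk-load this batch
```

The same pipeline then handles thousands of rows per hour or millions per day without a rewrite; only the batch size and the loader behind it need tuning.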
Not recognizing the warning signs
It can be easy to fall for the sunk-cost fallacy and plow ahead as a project gets more and more off track. The warning signs to look for are similar to those of any failing tech project:
- Are you missing deadlines?
- Are your engineers spending more time supporting existing processes than building new ones?
- How do your engineers feel about the project? How about your end users?
ETL best practice #6: If you’re already deep in the building process and realize you made some mistakes, don’t be afraid to admit it and take a few steps back. You won’t be the first person to get in over their head on an ETL project.
Focusing on tools and technologies rather than fundamental best practices
The data engineering space is incredibly hot right now; new languages and technologies pop up constantly, and it’s tempting to get swept up in the latest fad. Let’s use Spark! Let’s use Kafka! Everyone is using Hadoop! But regardless of what technology or language you choose, the fundamentals of ETL will remain the same.
ETL best practice #7: Focus on getting the basics right and solving the problems of your own unique organization. Tools should be secondary.