ETL Database

Your central database for all things ETL: advice, suggestions, and best practices.

ETL Architecture

The most common mistake people make when building an ETL system – or arguably any technology project – is that they jump into buying technology and writing code before thinking through the needs of their organization.

Before you start building your ETL architecture, consider some fundamental questions. What sources do you need to extract data from? What warehousing technology are you using? What’s the analytical skill level within your organization? Why is this project important to your overarching company goals? The answers to these questions should help you determine your overall data architecture, a necessary outline to establish prior to starting your project.

Enterprise architect and consultant Chris Lockhart describes the importance of architecture-first thinking like this:

“A comprehensive architecture that is made up of things like mission, vision, goals, objectives, requirements, roles, activities, capabilities, services, technical components and yes, physical components is a holistic thing. It is something we bake, not something served up in piecemeal ingredients. We don’t eat flour and call it bread just because everyone around us happens to be an expert in flour. An architecture is not a single one of those ingredients. It is the result of an intentional effort to bring those parts together so that they positively impact the matter at hand. Talking about servers and middleware and network all day, no matter how appealing to some (including myself) will never go far in solving business problems. Frankly it will just make you look like the stereotype Dilbert character. Your organization will lose all credibility with the folks who pay the bills.”

To Lockhart, architecture is the vision and the planning toward that vision.

Here are five things you should do when designing your ETL architecture:

Understand your organizational requirements

There comes a time in every business's life when joining data across disparate data sets on a one-off basis is no longer tenable: You’re suddenly overloaded with spreadsheets, or you start to calculate how much time you’re spending on perpetually-breaking ETL scripts. Before you jump into a solution, make sure you understand why ETL is a priority right now for your organization.

  1. Data sources: Which data sources does your team most want to work with? How much time are you willing to dedicate to building a new integration when the situation arises? For many startups and growing businesses, data sources are fairly standardized. You’ll likely be extracting data from transactional databases such as MongoDB and MySQL and SaaS tools like Salesforce and Google Analytics.
  2. Usage: Where will the data be loaded, who will be using it, and how will they be using it? If you’re provisioning data for analytics, that’s going to look different than if you’re provisioning data to power your product, as businesses like Uber or Airbnb do.
  3. Latency: How often do your end users need fresh data? For systems alerting, you need your data to be updated every second. If you’re delivering data for financial reporting, once a week might be enough. If your data scientists want to work on a data model, you may only need to deliver data once.

Answering these foundational questions will help you make the first critical decision about your ETL architecture: Should you build it or buy it? If you’re extracting data from standard sources and using it primarily for business analytics, buying it is often the right choice. There's no need to reinvent the wheel. On the other hand, if you're Uber and you need to make real-time decisions based on location data from hundreds of thousands of cars, you may need a custom solution.

Audit your data sources

The next stage of the architecture process requires that you build a profile of the data sources your business uses. You’ll want to consider the production databases that power your business, as well as the disparate data sets that live in your SaaS tools. Here are some common areas:


Production databases Sales and marketing sources Customer support sources Operational sources
MongoDB
MySQL
PostgreSQL
MariaDB
Microsoft SQL Server
Oracle
Web Analytics
- Google Analytics
- Mixpanel
Marketing Automation
- HubSpot
- Marketo
Email Service Providers
- MailChimp
- Constant Contact
Ad data
- Google Adwords
- Twitter Ads
- Facebook Ads
CRM
- Salesforce
- SugarCRM
- Insightly
Help desks
- Zendesk
- Desk.com
Customer chat
- Intercom
- Zendesk Chat
Payments
- Zuora
- Recurly
- Stripe
Accounting
- Quickbooks
- Xero
Shopping cart platforms
- Shopify
- Magento


It’s important during this process to understand what data sources are a top priority. For most companies, consistent financial data around things like revenue, MRR, and unit economics will be the top priority. Identifying these priorities provides you with information you need to make tradeoffs around data completeness vs. speed of project completion.

Determine your approach to data extraction

Two questions should guide your approach to data extraction:

  1. What sources are you extracting data from?
  2. What methods of data extraction will you use?

The Extraction section of this website covers these points in detail.

Build your cleansing machinery

In the Transform section of this site, we explore how and why improved data warehouse processing has impacted the transformation stage of ETL. Modern ETL systems leave the bulk of transformations for the analytics stage, and focus instead on loading minimally processed data into the warehouse. This increases the flexibility of your ETL process and gives your analysts greater control over data modeling.

That said, you may need to do some basic data cleansing during the ETL process. Certain commonly occurring scenarios should be handled prior to loading data into your warehouse:

Manage the ETL process

The final stage of designing an ETL architecture is figuring out the functionality you’ll need to manage the ETL process. Data being delivered to your organization needs to be reliably available and accurate; the processes you put in place here will ensure reliability and build organizational trust around your data accuracy.

  1. Job scheduling: How often should the ETL process run? Should it be triggered or run automatically? Should this vary by data source? For example, your CEO will likely want daily revenue numbers, but your customer support team probably needs NPS data only every week or two. Getting these schedules right will reduce load on your internal systems. Even in environments where aspects of your ETL pipeline are built for streaming, there will likely be components that run in batch.
  2. Monitoring: If any of your systems fail, or your testing rules reveal a data inaccuracy, you’ll want to know about it. Your monitoring system should escalate problems, alerting your team so they can take action immediately. You’ll likely want to integrate this with the other monitoring and alerting systems you use for other aspects of your technology infrastructure.
  3. Recovery: It’s not a question of if, but when, your ETL process will fail. You need to be prepared for when it happens. Your data warehouse might go down, an extraction job might fail, a SaaS API might temporarily go down or start sending you nonconforming data. Your architecture needs to plan for failure and have recovery mechanisms in place for when it happens.
  4. ETL testing: In addition to recovery options, you want checks and balances in place to test data accuracy. Bilal Mahmood wrote a great piece outlining the seven most common data inaccuracies and the rules to test for them.
  5. Security Your security measures need to consider things like how to transfer data securely, as well as what data gets transferred. For example, unless you’re PayPal (or a company like it) there’s likely no benefit to storing customer payments information in your analytics warehouse.

Try Stitch Today

Keep Learning about ETL Architecture