ETL Database

Your central database for all things ETL: advice, suggestions, and best practices

ETL Load

The load stage of the ETL process depends largely on what you intend to do with the data once it’s loaded into the data warehouse.

Regardless of your end goal, one of the key considerations during the load process is understanding the work you’re requiring of the target environment. Depending on your data volume, structure, target, and load type, you could negatively impact the host system when you load data.

For example, loading data into Amazon Redshift is best done infrequently in large batches. If you’re loading data into Redshift, you should avoid small, frequent batches or you’ll have angry analysts beating down your door when they notice that your jobs are consuming all of their cluster resources.
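The "large, infrequent batches" pattern can be sketched as a small buffer that only flushes when it has accumulated enough rows. This is a minimal illustration, not Redshift-specific code: the `flush` callback is a hypothetical stand-in for the real load step (for example, staging a file to S3 and issuing a Redshift COPY).

```python
from typing import Callable, Dict, List


class BatchLoader:
    """Accumulate rows and flush them in large batches.

    `flush` is a hypothetical callback standing in for the real load step
    (e.g. staging a file to S3 and running a Redshift COPY).
    """

    def __init__(self, flush: Callable[[List[Dict]], None], batch_size: int = 10_000):
        self.flush = flush
        self.batch_size = batch_size
        self.buffer: List[Dict] = []

    def add(self, row: Dict) -> None:
        self.buffer.append(row)
        # Flush only when the batch is large enough -- one big load is
        # far cheaper for a warehouse like Redshift than many small inserts.
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self) -> None:
        # Flush whatever remains at the end of the run.
        if self.buffer:
            self._flush()

    def _flush(self) -> None:
        self.flush(self.buffer)
        self.buffer = []
```

Tuning `batch_size` is the whole game here: too small and you hammer the cluster with load operations; too large and data arrives late and retries get expensive.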

Bottom line: The load process needs to be specific to what you’re loading data into. We’re going to move forward with the assumption that you’re loading data into an analytics warehouse.

There are two primary methods to load data into a warehouse:

             Full load                  Incremental load
Rows synced  All rows in source data    New and updated records only
Time         More time                  Less time
Difficulty   Low                        High. The ETL process must check for new and updated rows, and recovering from an issue is harder
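An incremental load is usually driven by a high-water mark: the pipeline remembers the latest change timestamp it has loaded and pulls only rows past it on the next run. Here is a minimal sketch; the `SOURCE` rows and the `updated_at` field name are hypothetical.

```python
from datetime import datetime

# Hypothetical source table: each record carries an updated_at timestamp.
SOURCE = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "updated_at": datetime(2023, 1, 9)},
]


def incremental_extract(rows, high_water_mark):
    """Return only rows changed since the last successful load, plus the
    new high-water mark to persist for the next run."""
    fresh = [r for r in rows if r["updated_at"] > high_water_mark]
    # If nothing changed, keep the old mark rather than resetting it.
    new_mark = max((r["updated_at"] for r in fresh), default=high_water_mark)
    return fresh, new_mark
```

Note that the new mark must be persisted only after the batch loads successfully; otherwise a failed run silently skips those rows forever, which is exactly the kind of recovery problem the table above alludes to.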

The initial full load is relatively straightforward. When you start taking on incremental loads, things get more complex. Here are three of the most common problem areas:

  1. Ordering: To handle massive scale with high availability, data pipelines are often distributed systems. Arriving data points can take different paths through the system, so they may be processed in a different order than they were received. If data is being updated or deleted, processing events in the wrong order leads to bad data. Maintaining and auditing ordering is critical for keeping data accurate.
  2. Schema evolution: What happens to your existing data when a new property is added, or an existing property is changed? Some of these changes can be destructive or leave data in an inconsistent state. For example, what happens if your data warehouse starts receiving string values for a field that is expected to be an integer datatype?
  3. Monitorability: With data coming from a large number of sources, failures are inevitable. How long will it take you to catch them? Failure scenarios include:
    • An API is down for maintenance
    • API credentials expire
    • API calls are returning successfully, but do not contain any data
    • Network congestion prevents communication with an API
    • The pipeline destination (e.g. a data warehouse) is offline
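The ordering problem above is commonly handled by attaching a version to every change event (a source log position or an updated-at timestamp) and applying last-write-wins: a stale event that arrives late is dropped instead of overwriting newer data. A minimal sketch, with hypothetical event fields:

```python
def apply_event(table, event):
    """Apply an insert/update/delete event with last-write-wins semantics.

    `table` maps record ids to {"version": ..., "data": ...}. Each event
    carries a monotonically increasing `version` (e.g. a source log
    sequence number or updated_at timestamp).
    """
    key = event["id"]
    current = table.get(key)
    # Events can arrive out of order; drop anything older than what we have.
    if current is not None and current["version"] >= event["version"]:
        return
    # Keep a tombstone (data=None) for deletes so a late-arriving update
    # can't resurrect a deleted row.
    table[key] = {
        "version": event["version"],
        "data": None if event.get("deleted") else event["data"],
    }
```

The tombstone detail matters: if deletes simply removed the key, an out-of-order update arriving afterward would look like a brand-new record.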

Any of these problems will likely result in data that is either incomplete or wrong. Recovering from these issues can be a massive headache.
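Many of the schema-evolution and monitoring failures above can be caught with a cheap validation pass before loading, flagging empty payloads (the "200 OK but no data" case) and type drift (a string arriving where an integer is expected). A sketch, assuming a hypothetical expected schema:

```python
EXPECTED_SCHEMA = {"id": int, "amount": float}  # hypothetical target schema


def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable issues found in a batch.

    An empty list means the batch looks safe to load; a real pipeline
    would alert on any issue instead of loading silently.
    """
    issues = []
    if not rows:
        # e.g. an API call that returns successfully but contains no data
        issues.append("batch is empty")
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            value = row.get(field)
            if value is None:
                issues.append(f"row {i}: missing field {field!r}")
            elif not isinstance(value, expected_type):
                issues.append(
                    f"row {i}: {field!r} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return issues
```

Running a check like this at the pipeline boundary shortens the time to catch failures from "when an analyst complains" to "before the bad batch ever lands in the warehouse."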
