The complexity of your data landscape grows with each data source, each set of business requirements, each process change, and each new regulation. Finding the most suitable ETL process for your business can make the difference between working on your data pipeline or making your data pipeline work for you. The best approach takes into consideration your data sources, your data warehouse, and your business requirements.

How ETL works

ETL is a three-step process: extract data from databases or other data sources, transform the data in various ways, and load that data into a destination.

In the AWS environment, data sources include S3, Aurora, Relational Database Service (RDS), DynamoDB, and EC2. Amazon Redshift is a data warehouse and S3 can be used as a data lake.

Cloud-native data warehouses like Redshift can scale elastically to handle just about any processing load, which enables data engineers to run transformations on data after loading. This changes the data pipeline process for cloud data warehouses from ETL to ELT.

DIY data pipeline — big challenge, bad business

ETL is part of the process of replicating data from one system to another — a process with many steps. For instance, you first have to identify all of your data sources. Then, unless you plan to replicate all of your data on a regular basis, you need to identify when source data has changed. You also need to select a data warehouse destination that provides an architecture that's appropriate for the types of data analysis you plan to run, fits within your budget, and is compatible with your software ecosystem.

You then can task a data engineer on your internal team with manually coding a reusable data pipeline. However, writing ETL code is not simple — among other things, data engineers need to:

Learn how to use the data sources' APIs
Write the logic for the extraction process
Code in security, logging, and alerting capabilities
Test their work
Review pipeline performance regularly
Repeat many of these steps as they maintain the code over time

Given all that, many organizations choose to avoid manually coding data pipelines. As Jeff Magnusson, vice president of Stitch Fix, says, "Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume."

Fortunately, there's a smart alternative to writing and maintaining your own ETL code. If you want to follow Magnusson's advice, you can turn to a SaaS service to handle ETL tasks. In the AWS world, AWS Glue can handle ETL jobs, or you can consider a third-party service like Stitch.

Should you stick with AWS Glue for ETL?

AWS Glue is a managed ETL service that you control from the AWS Management Console. Glue may be a good choice if you're moving data from an Amazon data source to an Amazon data warehouse.

Glue's ETL process is similar to that of a manually coded data pipeline:

Set up a schedule or identify events to trigger an ETL job.
Extract data from AWS data sources.
Transform data based on code generated automatically by AWS Glue.
Load data into either Redshift or S3.
Write metadata pertaining to the ETL job into the AWS Glue Data Catalog.

However, Glue supports only services running on AWS:

Amazon Aurora
Amazon RDS for MySQL
Amazon RDS for Oracle
Amazon RDS for PostgreSQL
Amazon RDS for SQL Server
Amazon Redshift
Amazon S3
MySQL in Amazon VPC running on EC2
Oracle in Amazon VPC running on EC2
Microsoft SQL Server in Amazon VPC running on EC2
PostgreSQL in Amazon VPC running on EC2

If you need to include other sources in your ETL plan, a third-party ETL tool is a better choice.

Using a third-party AWS ETL tool

Third-party AWS ETL tools often have advantages over AWS Glue and internal pipelines. They support integrations with non-AWS data sources through graphical interfaces, and offer attractive pricing models.

How do you pick the most suitable ETL tool for your business? Start by asking questions specific to your requirements:

Does an ETL tool integrate with every data source you use today?
Will it integrate with new data sources — even those that are less popular or custom-built? To be prepared to take advantage of new data sources, consider an extensible solution like Stitch, whose Singer open source framework lets you integrate new data sources.
What kind of reliability do you need? It's tough to beat the fault tolerance of a cloud-native service.
What are your requirements for performance and scalability?
Does it meet your security requirements? Do you need to meet HIPAA, PCI, or GDPR compliance requirements?
Pricing models vary among ETL services. Does the solution charge by the hour, number of data sources, or a combination of factors? Make sure the solution is priced in a way that makes sense for your business's usage.
What replication scheduling options are available and how is replication handled? What happens if a replication job is still running when the next is supposed to start? Can you specify a start time to establish a predictable schedule?
What types of support are available? Is quality documentation available online? Is in-app support available? Is there an online community to share knowledge with?

Stitch any data source to AWS

Stitch supports Amazon Redshift and S3 destinations and more than 90 data sources, and provides businesses with the power to replicate data easily and cost-effectively while maintaining SOC 2, HIPAA, and GDPR compliance. As a cloud-first, extensible platform, Stitch lets you reliably scale your ecosystem and integrate with new data sources — all while providing the support and resources to answer any of your questions. Try Stitch for free today to create an AWS data pipeline that works for your organization.

Give Stitch a try, on us

Stitch streams all of your data directly to your analytics warehouse.

Set up in minutesUnlimited data volume during trial