The complexity of your data landscape grows with each data source, each set of business requirements, each process change, and each new regulation. Finding the most suitable ETL process for your business can make the difference between working on your data pipeline or making your data pipeline work for you. The best approach takes into consideration your data sources, your data warehouse, and your business requirements.
ETL is a three-step process: extract data from databases or other data sources, transform the data in various ways, and load that data into a destination.
In the AWS environment, data sources include S3, Aurora, Relational Database Service (RDS), DynamoDB, and EC2. Amazon Redshift is a data warehouse and S3 can be used as a data lake.
Cloud-native data warehouses like Redshift can scale elastically to handle just about any processing load, which enables data engineers to run transformations on data after loading. This changes the data pipeline process for cloud data warehouses from ETL to ELT.
ETL is part of the process of replicating data from one system to another — a process with many steps. For instance, you first have to identify all of your data sources. Then, unless you plan to replicate all of your data on a regular basis, you need to identify when source data has changed. You also need to select a data warehouse destination that provides an architecture that's appropriate for the types of data analysis you plan to run, fits within your budget, and is compatible with your software ecosystem.
You then can task a data engineer on your internal team with manually coding a reusable data pipeline. However, writing ETL code is not simple — among other things, data engineers need to:
Given all that, many organizations choose to avoid manually coding data pipelines. As Jeff Magnusson, vice president of Stitch Fix, says, "Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume."
Fortunately, there's a smart alternative to writing and maintaining your own ETL code. If you want to follow Magnusson's advice, you can turn to a SaaS service to handle ETL tasks. In the AWS world, AWS Glue can handle ETL jobs, or you can consider a third-party service like Stitch.
AWS Glue is a managed ETL service that you control from the AWS Management Console. Glue may be a good choice if you're moving data from an Amazon data source to an Amazon data warehouse.
Glue's ETL process is similar to that of a manually coded data pipeline:
However, Glue supports only services running on AWS:
If you need to include other sources in your ETL plan, a third-party ETL tool is a better choice.
Third-party AWS ETL tools often have advantages over AWS Glue and internal pipelines. They support integrations with non-AWS data sources through graphical interfaces, and offer attractive pricing models.
How do you pick the most suitable ETL tool for your business? Start by asking questions specific to your requirements:
Stitch supports Amazon Redshift and S3 destinations and more than 90 data sources, and provides businesses with the power to replicate data easily and cost-effectively while maintaining SOC 2, HIPAA, and GDPR compliance. As a cloud-first, extensible platform, Stitch lets you reliably scale your ecosystem and integrate with new data sources — all while providing the support and resources to answer any of your questions. Try Stitch for free today to create an AWS data pipeline that works for your organization.