Data extraction is the process of obtaining data from a database or SaaS platform so that it can be replicated to a destination — such as a data warehouse — designed to support online analytical processing (OLAP).
Data extraction is the first step in a data ingestion process called ETL — extract, transform, and load. The goal of ETL is to prepare data for analysis or business intelligence (BI).
Suppose an organization wants to monitor its reputation in the marketplace. It may have data from many sources, including online reviews, social media mentions, and online transactions. An ETL tool can extract data from these sources and load it into a data warehouse where it can be analyzed and mined for insights into brand perception.
Types of data extraction
Extraction jobs may be scheduled, or analysts may extract data on demand as dictated by business needs and analysis goals. Data can be extracted in three primary ways:
The easiest way to extract data from a source system is to have that system issue a notification when a record has been changed. Most databases provide a mechanism for this so that they can support database replication (change data capture or binary logs), and many SaaS applications provide webhooks, which offer conceptually similar functionality.
Some data sources are unable to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of those records. During subsequent ETL steps, the data extraction code needs to identify and propagate changes. One drawback of incremental extraction is that it may not be able to detect deleted records in source data, because there’s no way to see a record that’s no longer there.
The first time you replicate any source you have to do a full extraction, and some data sources have no way to identify data that has been changed, so reloading a whole table may be the only way to get data from that source. Because full extraction involves high data transfer volumes, which can put a load on the network, it’s not the best option if you can avoid it.
The data extraction process
Whether the source is a database or a SaaS platform, the data extraction process involves the following steps:
- Check for changes to the structure of the data, including the addition of new tables and columns. Changed data structures have to be dealt with programmatically.
- Retrieve the target tables and fields from the records specified by the integration’s replication scheme.
- Extract the appropriate data, if any.
Extracted data is loaded into a destination that serves as a platform for BI reporting, such as a cloud data warehouse like Amazon Redshift, Microsoft Azure SQL Data Warehouse, Snowflake, or Google BigQuery. The load process needs to be specific to the destination.
While it may be possible to extract data from a database using SQL, the extraction process for SaaS products relies on each platform’s application programming interface (API). Working with APIs can be challenging:
- APIs are different for every application.
- Many APIs are not well documented. Even APIs from reputable, developer-friendly companies sometimes have poor documentation.
- APIs change over time. For example, Facebook’s “move fast and break things” approach means the company frequently updates its reporting APIs – and Facebook doesn’t always notify API users in advance.
ETL: Build-your-own vs. cloud-first
In the past, developers would write their own ETL tools to extract and replicate data. This works fine when there is a single, or only a few, data sources.
However, when sources are more numerous or complex, this approach does not scale well. The more sources there are, the more likelihood that something will require maintenance. How does one deal with changing APIs? What happens when a source or destination changes its format? What if the script has an error that goes unnoticed, leading to decisions being made on bad data? It doesn’t take long for a simple script to become a maintenance headache.
Cloud-based ETL tools allow users to connect sources and destinations quickly without writing or maintaining code, and without worrying about other pitfalls that can compromise data extraction and loading. That in turn makes it easy to provide access to data to anyone who needs it for analytics, including executives, managers, and individual business units.
Data extraction drives business intelligence
To reap the benefits of analytics and BI programs, you must understand the context of your data sources and destinations, and use the right tools. For popular data sources, there’s no reason to build a data extraction tool.
Stitch offers an easy-to-use ETL tool to replicate data from sources to destinations; it makes the job of getting data for analysis faster, easier, and more reliable, so that businesses can get the most out of their data analysis and BI programs.
Stitch makes it simple to extract data from more than 90 sources and move it to a target destination. Sign up for a free trial and get your data to its destination in minutes.