Data pipelines transport raw data from software-as-a-service (SaaS) platforms and database sources to data warehouses for use by analytics and business intelligence (BI) tools. Developers can build pipelines themselves by writing code and manually interfacing with source databases — or they can avoid reinventing the wheel and use a SaaS data pipeline instead.
To understand how much of a revolution data pipeline-as-a-service is, and how much work goes into assembling an old-school data pipeline, let’s review the fundamental components and stages of data pipelines, as well as the technologies available for replicating data.
Data pipeline architecture
Data pipeline architecture is the design and structure of code and systems that copy, cleanse or transform as needed, and route source data to destination systems such as data warehouses and data lakes.
Three factors shape how well data moves through a data pipeline:
- Rate, or throughput, is how much data a pipeline can process within a set amount of time.
- Data pipeline reliability requires individual systems within a data pipeline to be fault-tolerant. A reliable data pipeline with built-in auditing, logging, and validation mechanisms helps ensure data quality.
- Latency is the time needed for a single unit of data to travel through the pipeline. Latency relates more to response time than to volume or throughput. Low latency can be expensive to maintain in terms of both price and processing resources, and an enterprise should strike a balance to maximize the value it gets from analytics.
Data engineers should seek to optimize these aspects of the pipeline to suit the organization’s needs. An enterprise must consider business objectives, cost, and the type and availability of computational resources when designing its pipeline.
Designing a data pipeline
Data pipeline architecture is layered. Each subsystem feeds into the next, until data reaches its destination.
In terms of plumbing — we are talking about pipelines, after all — data sources are the wells, lakes, and streams where organizations first gather data. SaaS vendors support thousands of potential data sources, and every organization hosts dozens more on its own systems. As the first layer in a data pipeline, data sources are key to its design. Without quality data, there’s nothing to ingest and move through the pipeline.
The ingestion components of a data pipeline are the processes that read data from data sources — the pumps and aqueducts in our plumbing analogy. An extraction process reads from each data source using application programming interfaces (APIs) provided by the data source. Before you can write code that calls the APIs, though, you have to figure out what data you want to extract through a process called data profiling — examining data for its characteristics and structure, and evaluating how well it fits a business purpose.
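Profiling usually starts with a small sample of records pulled from the source API. As a minimal sketch (the field names and sample data here are hypothetical), a profiler can walk the sample and report each field's inferred types and null counts:

```python
from collections import Counter

def profile_records(records):
    """Summarize field names, inferred value types, and null counts
    for a sample of records pulled from a source."""
    profile = {}
    for record in records:
        for field, value in record.items():
            stats = profile.setdefault(field, {"types": Counter(), "nulls": 0})
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"][type(value).__name__] += 1
    return profile

# Profile a small sample before deciding what to extract.
sample = [
    {"id": 1, "email": "a@example.com", "plan": "pro"},
    {"id": 2, "email": None, "plan": "free"},
]
report = profile_records(sample)
```

A report like this — which fields exist, how they're typed, how often they're empty — is what tells you whether the source data fits the business purpose before you invest in extraction code.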
After the data is profiled, it’s ingested, either as batches or through streaming.
Batch ingestion and streaming ingestion
Batch processing is when sets of records are extracted and operated on as a group. Batch processing is sequential, and the ingestion mechanism reads, processes, and outputs groups of records according to criteria set by developers and analysts beforehand. The process does not watch for new records and move them along in real time, but instead runs on a schedule or acts based on external triggers.
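The batch pattern described above — read a group of records, process it, move to the next group until the source is exhausted — can be sketched as a simple loop. The page-based source here is a stand-in, not any particular API:

```python
def batch_ingest(read_page, process_batch, batch_size=500):
    """Pull records from a source in fixed-size batches and hand
    each batch to a processing step; stop when the source is empty."""
    offset = 0
    while True:
        batch = read_page(offset, batch_size)
        if not batch:
            break
        process_batch(batch)
        offset += len(batch)
    return offset  # total records ingested in this run

# A stand-in source: 1,234 records exposed page by page.
SOURCE = [{"id": i} for i in range(1234)]

def read_page(offset, limit):
    return SOURCE[offset:offset + limit]

batches = []
total = batch_ingest(read_page, batches.append)
```

A scheduler or external trigger would invoke a run like this; the loop itself never waits for new records to arrive.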
Streaming is an alternative data ingestion paradigm where data sources automatically pass along individual records or units of information one by one. Most organizations rely on batch ingestion for the bulk of their data, and reserve streaming ingestion for cases where applications or analytics demand near-real-time data with the minimum possible latency.
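The contrast with batch is that a streaming consumer handles each record as it arrives rather than accumulating a group first. A minimal sketch, using an in-process queue as a stand-in for a real streaming source:

```python
import queue

SENTINEL = object()  # marks the end of the stream in this sketch

def stream(source_queue):
    """Yield records one at a time as they arrive on the queue,
    until the producer signals the end of the stream."""
    while True:
        record = source_queue.get()
        if record is SENTINEL:
            break
        yield record

# Each record is handled individually, minimizing latency.
q = queue.Queue()
for i in range(3):
    q.put({"event_id": i})
q.put(SENTINEL)

processed = [r["event_id"] for r in stream(q)]
```

In production the queue would be a durable event stream (for example, a Kafka topic), but the consumption pattern — one record at a time, as soon as it's available — is the same.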
Depending on an enterprise’s data transformation needs, the data is either moved into a staging area or sent directly along its flow.
Once data is extracted from source systems, its structure or format may need to be adjusted. Processes that transform data are the desalination stations, treatment plants, and personal water filters of the data pipeline.
Transformations include mapping coded values to more descriptive ones, filtering, and aggregation. Combination is a particularly important type of transformation. It includes database joins, where relationships encoded in relational data models can be leveraged to bring together related data from multiple tables, columns, and records.
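These four operations — mapping coded values, filtering, joining, and aggregating — can be sketched together in one small transformation step. The status codes and record shapes here are hypothetical:

```python
STATUS_CODES = {1: "active", 2: "churned"}  # hypothetical code table

def transform(orders, customers):
    """Decode status codes, join orders to customers, drop orphaned
    orders, and aggregate order amounts per customer."""
    by_id = {
        c["id"]: {**c, "status": STATUS_CODES[c["status"]]}  # map coded values
        for c in customers
    }
    totals = {}
    for order in orders:
        customer = by_id.get(order["customer_id"])  # join on customer id
        if customer is None:  # filter: no matching customer
            continue
        name = customer["name"]
        totals[name] = totals.get(name, 0) + order["amount"]  # aggregate
    return by_id, totals
```

In a real pipeline the join would typically run in SQL against source tables, but the logic is the same: use the relationship encoded in the key to combine related records.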
The timing of any transformations depends on what data replication process an enterprise decides to use in its data pipeline: ETL (extract, transform, load) or ELT (extract, load, transform). ETL, an older technology used with on-premises data warehouses, can transform data before it’s loaded to its destination. ELT, used with modern cloud-based data warehouses, loads data without applying any transformations. Data consumers can then apply their own transformations on data within a data warehouse or data lake.
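The difference between the two approaches is purely one of ordering, which a short sketch makes concrete (the extract and transform functions here are illustrative stand-ins):

```python
def run_etl(extract, transform, load):
    """ETL: apply transformations in the pipeline, before loading."""
    load(transform(extract()))

def run_elt(extract, load):
    """ELT: load raw data as-is; consumers transform it later,
    inside the warehouse."""
    load(extract())

# Stand-in source and transformation.
extract = lambda: [{"amount_cents": 1500}, {"amount_cents": 250}]
to_dollars = lambda rows: [{"amount": r["amount_cents"] / 100} for r in rows]

warehouse = []
run_etl(extract, to_dollars, warehouse.extend)
```

With ELT, the warehouse receives the raw `amount_cents` records instead, and each data consumer decides how (and whether) to convert them.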
Destinations are the water towers and holding tanks of the data pipeline. A data warehouse is the main destination for data replicated through the pipeline. These specialized databases contain all of an enterprise’s cleaned, mastered data in a centralized location for use in analytics, reporting, and business intelligence by analysts and executives.
Less-structured data can flow into data lakes, where data analysts and data scientists can access the large quantities of rich and minable information.
Finally, an enterprise may feed data into an analytics tool or service that directly accepts data feeds.
Data pipelines are complex systems that consist of software, hardware, and networking components, all of which are subject to failures. To keep the pipeline operational and capable of extracting and loading data, developers must write monitoring, logging, and alerting code to help data engineers manage performance and resolve any problems that arise.
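A common shape for that operational code is a wrapper that logs each failure, retries transient errors with backoff, and escalates when retries are exhausted. A minimal sketch (the alerting here is just an error log plus a raised exception):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_retries(step, attempts=3, backoff=0.1):
    """Run a pipeline step, logging failures and retrying with
    increasing backoff; escalate when retries are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("step failed (attempt %d/%d): %s", attempt, attempts, exc)
            if attempt == attempts:
                log.error("step exhausted retries; alerting on-call")
                raise
            time.sleep(backoff * attempt)
```

In practice the final branch would page an engineer through an alerting service rather than simply re-raise, but the structure — attempt, log, retry, escalate — is the core of keeping a pipeline operational.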
Data pipeline technologies and techniques
When it comes to using data pipelines, businesses have two choices: write their own or use a SaaS pipeline.
Organizations can task their developers with writing, testing, and maintaining the code required for a data pipeline. In the process they may use several toolkits and frameworks:
- Workflow management tools can reduce the difficulty of creating a data pipeline. Open source tools like Airflow and Luigi structure the processes that make up the pipeline, automatically resolve dependencies, and give developers a way to visualize and organize data workflows.
- Event and messaging frameworks like Apache Kafka and RabbitMQ allow businesses to generate faster, better data from their existing applications. These frameworks capture events from business applications, making them available as high-throughput streams, and enable communication between different systems using their own protocols.
- Timely scheduling of processes is also critical in any data pipeline. Many tools allow users to create detailed schedules governing data ingestion, transformation, and loading to destinations, from the simple cron utility to entire dedicated workload automation platforms.
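Whatever the tool — cron or a full workload automation platform — the core scheduling decision is the same check: has the job's interval elapsed since its last run? A minimal sketch of that check:

```python
from datetime import datetime, timedelta

def is_due(last_run, interval, now=None):
    """Return True when a scheduled job's interval has elapsed since
    its last run, or when the job has never run at all."""
    now = now or datetime.now()
    return last_run is None or now - last_run >= interval
```

A scheduler loop would evaluate this for each registered job, run the ones that are due, and record the new `last_run` timestamps.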
However, there are problems with the do-it-yourself approach. Your developers could be working on projects that provide direct business value, and your data engineers have better things to do than babysit complex systems.
Thanks to SaaS data pipelines, enterprises don’t need to write their own ETL code and build data pipelines from scratch. Stitch, for example, provides a data pipeline that’s quick to set up and easy to manage. Take a trip through Stitch’s data pipeline for detail on the technology that Stitch uses to make sure every record gets to its destination.
Save yourself the headache of assembling your own data pipeline — try Stitch today.