"A data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database" (Alan Marazzi).

"The purpose of a data pipeline is to move data from a point of origin to a specific destination. At a high level, a data pipeline consists of eight types of components:

  • Origin – The initial point at which data enters the pipeline.
  • Destination – The termination point to which data is delivered.
  • Dataflow – The sequence of processes and data stores through which data moves to get from origin to destination.
  • Storage – The datasets where data is persisted at various stages as it moves through the pipeline.
  • Processing – The steps and activities that are performed to ingest, persist, transform, and deliver data.
  • Workflow – Sequencing and dependency management of processes.
  • Monitoring – Observing to ensure a healthy and efficient pipeline.
  • Technology – The infrastructure and tools that enable dataflow, storage, processing, workflow, and monitoring." – Dave Wells
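Several of Wells's component types can be made concrete in a few lines of code. The sketch below is a hypothetical mapping, not a real framework: the origin is a callable that produces data, processing steps run in order (workflow), a print statement stands in for monitoring, and the destination receives the result. Storage and technology are left implicit, as they would be supplied by the surrounding infrastructure.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Pipeline:
    # Origin: the initial point at which data enters the pipeline.
    origin: Callable[[], List[Any]]
    # Processing + Workflow: named steps, run in sequence.
    steps: List[Tuple[str, Callable[[List[Any]], List[Any]]]] = field(
        default_factory=list)
    # Destination: the termination point to which data is delivered.
    destination: Callable[[List[Any]], Any] = print

    def run(self) -> List[Any]:
        data = self.origin()
        for name, step in self.steps:  # Workflow: sequencing/dependencies
            data = step(data)
            # Monitoring: observe each step to keep the pipeline healthy.
            print(f"[monitor] {name}: {len(data)} rows")
        self.destination(data)
        return data


# Usage: drop nulls, sort, and deliver into a list acting as the sink.
store: List[Any] = []
pipe = Pipeline(
    origin=lambda: [3, 1, None, 7],
    steps=[("drop_nulls", lambda xs: [x for x in xs if x is not None]),
           ("sort", sorted)],
    destination=store.extend,
)
result = pipe.run()
```

The point of the taxonomy, and of the sketch, is that these concerns are separable: swapping the origin or destination should not require rewriting the processing steps.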


A definitive guide to data definitions and trends, from the team at Stitch.
