Extract, transform, load (ETL) is the main process through which enterprises gather information from data sources and replicate it to destinations like data warehouses for use with business intelligence (BI) tools. ETL tools and services allow enterprises to quickly set up a data pipeline and begin ingesting data.
Analysts and engineers can alternatively use programming languages like Python to build their own ETL pipelines. This allows them to customize and control every aspect of the pipeline, but a handmade pipeline also requires more time and effort to create and maintain. Let's take a look at how to use Python for ETL, and why you may not need to.
Python is an elegant, versatile language with an ecosystem of powerful modules and code libraries. Writing Python for ETL starts with knowledge of the relevant frameworks and libraries, such as workflow management utilities, libraries for accessing and extracting data, and fully-featured ETL toolkits.
Workflow management is the process of designing, modifying, and monitoring workflow applications, which perform business tasks in sequence automatically. In the context of ETL, workflow management organizes engineering and maintenance activities, and workflow applications can also automate ETL tasks themselves. Two of the most popular workflow management tools are Airflow and Luigi.
Apache Airflow uses directed acyclic graphs (DAG) to describe relationships between tasks. In a DAG, individual tasks have both dependencies and dependents — they are directed — but following any sequence never results in looping back or revisiting a previous task — they are not cyclic.
Airflow provides a command-line interface (CLI) for sophisticated task graph operations and a graphical user interface (GUI) for monitoring and visualizing workflows.
Original developer Spotify used Luigi to automate or simplify internal tasks such as those generating weekly and recommended playlists. Now it's built to support a variety of workflows. Prospective Luigi users should keep in mind that it isn't intended to scale beyond tens of thousands of scheduled jobs.
Beyond overall workflow management and scheduling, Python can access libraries that extract, process, and transport data, such as pandas, Beautiful Soup, and Odo.
pandas is an accessible, convenient, and high-performance data manipulation and analysis library. It's useful for data wrangling, as well as general data work that intersects with other processes, from manually prototyping and sharing a machine learning algorithm within a research group to setting up automatic scripts that process data for a real-time interactive dashboard. pandas is often used alongside mathematical, scientific, and statistical libraries such as NumPy, SciPy, and scikit-learn.
On the data extraction front, Beautiful Soup is a popular web scraping and parsing utility. It provides tools for parsing hierarchical data formats, including those found on the web, such as HTML pages or JSON records. Programmers can use Beautiful Soup to grab structured information from the messiest of websites and online applications.
Odo is a lightweight utility with a single, eponymous function that automatically migrates data between formats. Programmers can call odo(source, target) on native Python data structures or external file and framework formats, and the data is immediately converted and ready for use by other ETL code.
Finally, a whole class of Python libraries are actually complete, fully-featured ETL frameworks, including Bonobo, petl, and pygrametl.
Bonobo is a lightweight framework, using native Python features like functions and iterators to perform ETL tasks. These are linked together in DAGs and can be executed in parallel. Bonobo is designed for writing simple, atomic, but diverse transformations that are easy to test and monitor.
petl is a general-purpose ETL package designed for ease of use and convenience. Though it's quick to pick up and get working, this package is not designed for large or memory-intensive data sets and pipelines. It's more appropriate as a portable ETL toolkit for small, simple projects, or for prototyping and testing.
pygrametl also provides ETL functionality in code that's easy to integrate into other Python applications. pygrametl includes integrations with Jython and CPython libraries, allowing programmers to work with other tools and providing flexibility in ETL performance and throughput.
Coding ETL processes in Python can take many forms, depending on technical requirements, business objectives, which libraries existing tools are compatible with, and how much developers feel they need to work from scratch. Python's strengths lie in working with indexed data structures and dictionaries, which are important in ETL operations.
Python is versatile enough that users can code almost any ETL process with native data structures. For example, filtering null values out of a list is easy with some help from the built-in Python math module:
import math data = [1.0, 3.0, 6.5, float('NaN'), 40.0, float('NaN')] filtered =  for value in data: if not math.isnan(value): filtered.append(value)
Users can also take advantage of list comprehensions for the same purpose:
filtered = [value for value in data if not math.isnan(value)]
Coding the entire ETL process from scratch isn't particularly efficient, so most ETL code ends up being a mix of pure Python code and externally defined functions or objects, such as those from libraries mentioned above. For instance, users can employ pandas to filter an entire DataFrame of rows containing nulls:
filtered = data.dropna()
Python software development kits (SDK), application programming interfaces (API), and other utilities are available for many platforms, some of which may be useful in coding for ETL. For example, the Anaconda platform is a Python distribution of modules and libraries relevant for working with data. It includes its own package manager and cloud hosting for sharing code notebooks and Python environments.
Much of the advice relevant for generally coding in Python also applies to programming for ETL. For example, the code should be "Pythonic" — which means programmers should follow some language-specific guidelines that make scripts concise and legible and represent the programmer's intentions. Documentation is also important, as well as good package management and watching out for dependencies.
Although Python is a viable choice for coding ETL tasks, developers do use other programming languages for data ingestion and loading.
Java is one of the most popular programming languages, especially for building client-server web applications. Java has influenced other programming languages — including Python — and spawned several spinoffs, such as Scala. Java forms the backbone of a slew of big data tools, such as Hadoop and Spark. The Java ecosystem also features a collection of libraries comparable to Python's.
Ruby is a scripting language like Python that allows developers to build ETL pipelines, but few ETL-specific Ruby frameworks exist to simplify the task. However, several libraries are currently undergoing development, including projects like Kiba, Nokogiri, and Square's ETL package.
Go, or Golang, is a programming language similar to C that's designed for data analysis and big data applications. Go features several machine learning libraries, support for Google's TensorFlow, some data pipeline libraries, like Apache Beam, and a couple of ETL toolkits — Crunch and Pachyderm.
Beyond alternative programming languages for manually building ETL processes, a wide set of platforms and tools can now perform ETL for enterprises.
There are benefits to using existing ETL tools over trying to build a data pipeline from scratch. ETL tools can compartmentalize and simplify data pipelines, leading to cost and resource savings, increased employee efficiency, and more performant data ingestion. ETL tools include connectors for many popular data sources and destinations, and can ingest data quickly. Organizations can add or change source or target systems without waiting for programmers to work on the pipeline first. ETL tools keep pace with SaaS platforms' updates to their APIs as well, allowing data ingestion to continue uninterrupted.
Although manual coding provides the highest level of control and customization, outsourcing ETL design, implementation, and management to expert third parties rarely represents a sacrifice in features or functionality. ETL has been a critical part of IT infrastructure for years, so ETL service providers now cover most use cases and technical requirements.
Stitch is a robust tool for replicating data to a data warehouse. It allows anyone to set up a data pipeline with a few clicks instead of thousands of lines of Python code. Try it for free.