Python and Pandas for ETL

ETL (Extract, Transform, Load) is a fundamental process in data engineering that involves gathering data from multiple sources, transforming it into a usable format, and loading it into a target system like a data warehouse or data lake. Python, with its rich ecosystem of libraries, is one of the most popular tools for building ETL pipelines. And at the heart of Python-based ETL lies Pandas, a powerful data manipulation library.

In this section, we’ll dive into how you can use Python and Pandas to perform ETL efficiently, covering each stage of the process with examples and best practices.


 

1. Extracting Data Using Python and Pandas

Python can extract data from a variety of sources:

  • Flat files (CSV, Excel, JSON)

  • Databases (PostgreSQL, MySQL, SQLite)

  • APIs and Web Services

  • Cloud storage (AWS S3, GCP, Azure)

Example: Reading CSV and JSON data using Pandas

import pandas as pd

# Read CSV file
csv_data = pd.read_csv("data/sales_data.csv")

# Read JSON file
json_data = pd.read_json("data/config.json")
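
Both readers accept useful parsing options; for example, read_csv supports dtype and parse_dates for enforcing column types up front, and chunksize for reading a large file in pieces.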

Example: Reading from a SQL database using SQLAlchemy

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/dbname')
df = pd.read_sql("SELECT * FROM users", engine)
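
Example: Extracting data from a REST API

The following is only a minimal sketch: it assumes a hypothetical endpoint (https://api.example.com/v1/orders) that returns a JSON array of records, which pandas.json_normalize flattens into a DataFrame.

import requests
import pandas as pd

# Call a hypothetical REST endpoint that returns a JSON array of order records
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()

# Flatten the JSON payload into a tabular DataFrame
api_data = pd.json_normalize(response.json())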

2. Transforming Data with Pandas

Pandas shines in the transformation phase, allowing you to:

  • Clean data (handle missing values, duplicates)

  • Rename, filter, and format columns

  • Merge and join datasets

  • Perform aggregations and groupings

  • Apply custom functions to data

Example: Cleaning and transforming a dataset

# Drop rows with missing values
data = csv_data.dropna()

# Rename columns for clarity
data = data.rename(columns={"cust_id": "customer_id"})

# Filter out rows where sales < 100
data = data[data["sales"] >= 100]

# Create a new derived column
data["sales_tax"] = data["sales"] * 0.1

3. Loading Data into Target Systems

Once the data is transformed, the final step is to load it into a destination system. This can be another file, a database, or a cloud storage service.

Example: Save data to a new CSV file

data.to_csv("output/cleaned_sales_data.csv", index=False)

Example: Load data into a PostgreSQL database

data.to_sql("cleaned_sales", engine, if_exists='replace', index=False)
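
Besides 'replace', the if_exists parameter also accepts 'fail' and 'append'; 'append' is the usual choice for incremental loads that add new rows on each run.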

Example: Uploading to AWS S3 using boto3

import boto3

s3 = boto3.client('s3')
s3.upload_file("output/cleaned_sales_data.csv", "my-bucket", "cleaned_sales_data.csv")
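
Note that boto3 resolves AWS credentials from environment variables, the shared AWS configuration files, or an attached IAM role, so they never need to appear in the script itself.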

4. Best Practices for Building ETL Pipelines with Pandas

  • Always validate the schema before processing data

  • Use logging to track the ETL process

  • Break the pipeline into modular functions (see the sketch after this list)

  • Use environment variables for sensitive credentials

  • Test transformations on a sample before full-scale processing
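
Example: A modular pipeline skeleton with logging and environment-based credentials

This is only a minimal sketch of how several of these practices fit together; the DATABASE_URL environment variable, file path, and table name are hypothetical placeholders.

import logging
import os

import pandas as pd
from sqlalchemy import create_engine

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

def extract(path: str) -> pd.DataFrame:
    logger.info("Extracting %s", path)
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    logger.info("Transforming %d rows", len(df))
    df = df.dropna().rename(columns={"cust_id": "customer_id"})
    return df[df["sales"] >= 100]

def load(df: pd.DataFrame, table: str) -> None:
    # Credentials come from an environment variable, not from the source code
    engine = create_engine(os.environ["DATABASE_URL"])
    df.to_sql(table, engine, if_exists="append", index=False)
    logger.info("Loaded %d rows into %s", len(df), table)

if __name__ == "__main__":
    load(transform(extract("data/sales_data.csv")), "cleaned_sales")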


5. Scaling Pandas for Larger Workloads

While Pandas works well for datasets that fit comfortably in memory, it can struggle with very large files. In such cases, consider using:

  • Dask: a parallel computing library whose DataFrame API mirrors much of Pandas (see the sketch after this list)

  • PySpark: the Python API for Apache Spark, for distributed processing in big data pipelines
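
Example: Porting a Pandas workflow to Dask

Because Dask's DataFrame API mirrors much of Pandas, a pipeline can often be ported with few changes. The sketch below is illustrative only and assumes a hypothetical directory of CSV files too large to fit in memory.

import dask.dataframe as dd

# Lazily read a (hypothetical) set of CSV files as one partitioned DataFrame
ddf = dd.read_csv("data/sales_*.csv")

# Familiar Pandas-style operations run in parallel when .compute() is called
sales_by_customer = (
    ddf[ddf["sales"] >= 100]
    .groupby("customer_id")["sales"]
    .sum()
    .compute()
)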


Conclusion

Pandas simplifies the ETL process, making it easy to manipulate and clean data in Python. Its intuitive syntax, wide functionality, and strong integration with data sources make it a preferred tool for data engineers working on small to mid-sized pipelines. With good practices and an understanding of its capabilities, you can build robust, scalable ETL workflows that feed your analytics and reporting systems efficiently.