Data Cleaning with Pandas: A Complete Guide for Data Engineers
When working with raw data, especially in the world of data engineering, data cleaning is the first and one of the most essential steps. No matter how good your algorithms or dashboards are, if your data isn’t clean, your output will be flawed. Enter Pandas, Python’s powerful data manipulation library that makes cleaning, transforming, and preparing data seamless and intuitive.
This guide is written to help beginners and intermediate users deeply understand how to perform data cleaning tasks using Pandas in a practical, hands-on way. Whether you’re cleaning CSV files, transforming JSON responses, or preparing data for machine learning or analytics, Pandas is a must-have tool.
Why Data Cleaning Matters
Before jumping into code, let’s understand why data cleaning is vital:
Ensure Accuracy: Garbage in, garbage out. Clean data leads to more accurate insights.
Prevent Errors: Dirty data leads to broken ETL jobs, wrong reports, and confused stakeholders.
Improve Performance: Clean data helps your pipelines run more efficiently.
Getting Started with Pandas
First, install and import Pandas:
pip install pandas
import pandas as pd
Load your dataset (from CSV for example):
df = pd.read_csv("sales_data.csv")
df.head()
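If you already know the schema, read_csv can do some of the cleaning at load time. A small sketch, where the "Date" and "Units" column names are assumptions for illustration:
# Parse dates and use the nullable Int64 dtype up front
df = pd.read_csv("sales_data.csv", parse_dates=["Date"], dtype={"Units": "Int64"})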
Common Data Cleaning Tasks with Pandas
1. Handling Missing Values
Real-world data is messy. You’ll often run into missing or null values.
# Check for missing values
df.isnull().sum()
# Drop rows with any missing values
df.dropna(inplace=True)
# Fill missing values with defaults or strategies
df['Revenue'] = df['Revenue'].fillna(0)
df['Region'] = df['Region'].ffill()
To fill missing values, you can use:
A fixed value like 0 or "unknown" via .fillna()
.ffill(): forward fill (carry the last valid value forward)
.bfill(): backward fill (use the next valid value)
Note that fillna(method='ffill') is deprecated in recent pandas releases, so prefer the dedicated .ffill() and .bfill() methods.
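For numeric columns, a computed statistic is often a better default than 0. A minimal sketch, assuming a numeric Revenue column and a categorical Region column:
# Impute the numeric column with its median, the categorical one with a label
df['Revenue'] = df['Revenue'].fillna(df['Revenue'].median())
df['Region'] = df['Region'].fillna("unknown")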
2. Fixing Data Types
Sometimes numeric values are read as strings or dates are read incorrectly.
# Convert string to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Convert column to integer
df['Units'] = df['Units'].astype(int)
Use df.dtypes to inspect your data types.
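If a column mixes valid numbers with junk strings, astype(int) will raise an error. A safer sketch is to coerce bad entries into missing values first (column names are illustrative):
# Coerce unparseable values to NaN/NaT instead of raising,
# then cast to the nullable Int64 dtype so missing values survive
df['Units'] = pd.to_numeric(df['Units'], errors='coerce').astype('Int64')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')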
3. Removing Duplicates
Duplicate data skews analysis and leads to double counting.
# Check for duplicates
df.duplicated().sum()
# Remove duplicates
df.drop_duplicates(inplace=True)
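Often only some columns define a duplicate. A sketch assuming an OrderID key, with the Date column deciding which copy survives:
# Keep only the most recent row per OrderID
df = df.sort_values('Date').drop_duplicates(subset=['OrderID'], keep='last')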
4. Standardizing Text and Casing
Text data is often inconsistently formatted.
# Strip leading/trailing whitespace
df['Customer'] = df['Customer'].str.strip()
# Convert to lowercase
df['Customer'] = df['Customer'].str.lower()
Standardization helps when grouping or filtering by categories.
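Beyond casing, the same category often hides behind several spellings. A small sketch, where the mapping is a made-up example:
# Collapse inconsistent spellings into one canonical label
df['Region'] = df['Region'].str.strip().str.lower().replace({
    "n. america": "north america",
    "us": "north america",
})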
5. Handling Outliers
Outliers can throw off your averages or visualizations.
# Example: Revenue column
q1 = df['Revenue'].quantile(0.25)
q3 = df['Revenue'].quantile(0.75)
iqr = q3 - q1
# Filter out outliers
filtered_df = df[(df['Revenue'] >= (q1 - 1.5 * iqr)) & (df['Revenue'] <= (q3 + 1.5 * iqr))]
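If dropping rows loses too much data, you can instead cap extreme values at the IQR fences. A sketch reusing q1, q3, and iqr from above:
# Winsorize: clamp outliers to the fences instead of removing rows
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['Revenue'] = df['Revenue'].clip(lower=lower, upper=upper)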
6. Renaming Columns for Clarity
Clear column names help downstream tasks.
df.rename(columns={
"Cust_Name": "CustomerName",
"Rev": "Revenue"
}, inplace=True)
Use .columns to inspect and update all column names at once:
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
7. Combining or Splitting Columns
Sometimes data needs to be restructured:
# Split full name into first and last (n=1 keeps multi-part last names intact)
name_split = df['FullName'].str.split(" ", n=1, expand=True)
df['FirstName'] = name_split[0]
df['LastName'] = name_split[1]
# Combine year and month into a date string
df['Period'] = df['Year'].astype(str) + "-" + df['Month'].astype(str)
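If downstream steps need real dates rather than strings, the combined column can be parsed. A sketch assuming Year and Month hold valid calendar values:
# Zero-pad the month and parse the result into a proper datetime
df['Period'] = pd.to_datetime(
    df['Year'].astype(str) + "-" + df['Month'].astype(str).str.zfill(2),
    format="%Y-%m"
)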
8. Filtering Rows
Clean data also means filtering unnecessary rows:
# Filter only rows from 2023
df = df[df['Date'].dt.year == 2023]
# Keep rows where Revenue > 1000
df = df[df['Revenue'] > 1000]
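Conditions can be combined with & and |, or written with .query() for readability. The column names below are illustrative:
# Boolean-mask version of a combined filter
df = df[(df['Revenue'] > 1000) & (df['Units'] >= 5)]
# The same filter with .query()
df = df.query("Revenue > 1000 and Units >= 5")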
Exploratory Data Cleaning
While cleaning, always use exploratory techniques to check your data:
# Summary statistics
df.describe()
# Count of unique values
df['Region'].value_counts()
# Correlation heatmap (optional with seaborn)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
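df.info() is another quick check worth running, since it summarizes the whole frame in one view:
# Column dtypes, non-null counts, and memory usage at a glance
df.info()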
Sample Workflow: Sales Data ETL Clean-up
import pandas as pd
df = pd.read_csv("sales.csv")
# Drop rows where Revenue is missing
df = df[df['Revenue'].notnull()]
# Standardize column names
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
# Clean customer names
df['customer'] = df['customer'].str.strip().str.title()
# Convert dates
df['sale_date'] = pd.to_datetime(df['sale_date'])
# Remove duplicates
df.drop_duplicates(inplace=True)
# Save cleaned file
df.to_csv("sales_cleaned.csv", index=False)
Best Practices for Data Cleaning with Pandas
Always keep a backup before making destructive changes like .dropna() or .drop_duplicates().
Automate cleaning tasks using functions or scripts.
Document transformations clearly using comments or notebooks.
Visualize data to better understand anomalies and patterns.
Use pipelines or classes for modular, reusable cleaning logic (see the sketch below).
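As a sketch of that last point, DataFrame.pipe chains named cleaning steps into a readable pipeline. The function names and the revenue column here are illustrative assumptions:
import pandas as pd

def standardize_columns(df):
    # Normalize column names to snake_case
    df = df.copy()
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
    return df

def drop_missing_revenue(df):
    # Remove rows without a revenue figure
    return df[df['revenue'].notnull()]

cleaned = (
    pd.read_csv("sales.csv")
    .pipe(standardize_columns)
    .pipe(drop_missing_revenue)
    .drop_duplicates()
)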
Bonus: Cleaning Data from S3 or an API
Pandas can handle data from various sources:
import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket="my-bucket", Key="raw/sales.csv")
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
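If the s3fs package is installed, pandas can also read the object directly from an S3 URL:
# Requires s3fs; credentials are picked up from the usual AWS sources
df = pd.read_csv("s3://my-bucket/raw/sales.csv")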
Or from an API:
import requests
url = "https://api.example.com/sales"
data = requests.get(url).json()
df = pd.json_normalize(data)
Summary
Data cleaning is a critical skill for any data engineer, and Pandas offers a complete suite of tools to make it efficient and powerful. With just a few lines of code, you can transform messy raw data into structured, clean, and ready-to-use datasets.
By following this guide, you’ve now learned:
How to identify and fix common data issues
Practical Pandas code for cleaning tasks
How to handle missing values, duplicates, outliers, and more
Best practices to maintain clean, trustworthy data
If you’re building data pipelines or preparing data for analysis or ML, this step can make or break your entire project.