Processing Large Files with Pandas
Working with large datasets is a core responsibility in modern data engineering. Whether you’re ingesting logs, aggregating transactions, or transforming raw files into insights, chances are you’ll hit performance or memory bottlenecks if you’re not careful—especially when using Pandas, the default data wrangling tool in Python.
This guide dives deep into efficient methods to process large files with Pandas, and how to do it in cloud-native environments like AWS Lambda and AWS Glue.
Why Is Processing Large Files Challenging?
Pandas loads data into memory (RAM). So, if your CSV, JSON, or log file is larger than the available memory on your machine—or your Lambda function’s max memory (10GB)—you’ll likely see performance issues or crashes.
Here are a few reasons why large file processing is tricky:
- Memory Overflow: Pandas tries to load everything into memory by default.
- Slow Disk I/O: Reading/writing CSVs repeatedly can bottleneck processing.
- Data Types: Incorrect or unoptimized dtypes can double memory usage.
- Cloud Constraints: Environments like Lambda are ephemeral and limited in memory/runtime.
Common Large File Scenarios
| Format | Memory-Friendly? | Notes |
|---|---|---|
| CSV | No | Text-heavy, slow I/O |
| JSON | No | Nested, verbose |
| Parquet | Yes | Columnar, compressed |
| Logs (.txt) | No | Semi-structured |
| Excel | No | Poor for scale |
Best Practices for Local Processing with Pandas
1. Use `dtype` to Minimize Memory
Assign appropriate data types to avoid unnecessary memory usage.
```python
import pandas as pd

dtypes = {
    'user_id': 'int32',
    'amount': 'float32',
    'status': 'category'
}
df = pd.read_csv('transactions.csv', dtype=dtypes)
```
Reduces memory footprint by 30–70%
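To verify the savings on your own data, you can compare the deep memory usage of an untyped load against a typed one. A minimal sketch, assuming the same hypothetical transactions.csv and reading only a sample of rows to keep the check cheap:

```python
import pandas as pd

dtypes = {'user_id': 'int32', 'amount': 'float32', 'status': 'category'}

# Untyped sample load: Pandas infers int64/float64/object, usually the widest option
raw = pd.read_csv('transactions.csv', nrows=500_000)
# Typed sample load: same rows, explicit dtypes
typed = pd.read_csv('transactions.csv', dtype=dtypes, nrows=500_000)

# memory_usage(deep=True) counts the real bytes held by object/string columns
print('inferred dtypes:', raw.memory_usage(deep=True).sum() / 1e6, 'MB')
print('explicit dtypes:', typed.memory_usage(deep=True).sum() / 1e6, 'MB')
```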
2. Read Data in Chunks
Use `chunksize` to load portions of the file into memory.
```python
chunksize = 100_000
first = True
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    filtered = chunk[chunk['status'] == 'complete']
    # Write the header only once, then append the remaining chunks without it
    filtered.to_csv('filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False
```
Makes it possible to process multi-GB files on laptops
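Chunking works for aggregations too, not just filters: compute a small partial result per chunk and fold it into a running total. A minimal sketch, assuming (for illustration) that large_file.csv also has region and amount columns:

```python
import pandas as pd

totals = {}
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Aggregate within the chunk first, so only a tiny Series crosses iterations
    partial = chunk.groupby('region')['amount'].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0.0) + amount

result = pd.Series(totals, name='amount').sort_values(ascending=False)
print(result.head())
```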
3. Use `usecols` to Skip Unused Columns
```python
df = pd.read_csv('large.csv', usecols=['id', 'amount', 'region'])
```
Reduces I/O and speeds up processing
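If you are unsure which columns a file contains, you can parse just the header first; nrows=0 reads only the column names, so it stays cheap even for multi-GB files. A small sketch using the same hypothetical large.csv:

```python
import pandas as pd

# nrows=0 parses only the header row
all_columns = pd.read_csv('large.csv', nrows=0).columns.tolist()
print(all_columns)

# Keep only the columns the job actually needs
wanted = [c for c in ['id', 'amount', 'region'] if c in all_columns]
df = pd.read_csv('large.csv', usecols=wanted)
```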
4. Convert Repetitive Strings to Categories
```python
df['region'] = df['region'].astype('category')
```
Huge memory savings for low-cardinality string fields
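Before converting, it can help to confirm the column really is low-cardinality and to measure the effect. A quick sketch, assuming the df and region column from above:

```python
# 'category' pays off when there are few unique values relative to the row count
n_unique = df['region'].nunique()
print(f"{n_unique} unique values ({n_unique / len(df):.2%} of rows)")

# Compare the column's memory before and after the conversion
before = df['region'].memory_usage(deep=True)
after = df['region'].astype('category').memory_usage(deep=True)
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```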
5. Switch to Binary Formats
CSV and JSON are verbose. Convert to Parquet for faster, compressed processing.
```python
df.to_parquet('optimized.parquet')
df = pd.read_parquet('optimized.parquet')
```
Up to 10x faster and 80% smaller on disk
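Because Parquet is columnar, you can also read back only the columns you need, which combines the usecols idea with the faster format. A short sketch, assuming pyarrow is installed (the column names are placeholders):

```python
import pandas as pd

# Only the requested columns are read from disk thanks to the columnar layout
subset = pd.read_parquet('optimized.parquet', columns=['user_id', 'amount'])
print(subset.dtypes)
```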
Processing Large Files in AWS Lambda with Pandas
AWS Lambda is serverless and perfect for automation, but comes with limits:
- Max memory: 10 GB
- Max runtime: 15 minutes
- No persistent local disk (only `/tmp` with a 512MB–10GB limit)
Use Pandas with Chunking Inside Lambda
Here’s how to handle large CSVs from S3 in a memory-safe way:
```python
import io

import boto3
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = 'my-bucket'
    key = 'data/large.csv'

    # read() pulls the raw bytes into memory; chunked parsing keeps each DataFrame small
    obj = s3.get_object(Bucket=bucket, Key=key)
    for i, chunk in enumerate(pd.read_csv(io.BytesIO(obj['Body'].read()), chunksize=500_000)):
        chunk['processed'] = True
        buffer = io.StringIO()
        chunk.to_csv(buffer, index=False)
        # Write each chunk to its own key so earlier parts are not overwritten
        s3.put_object(Bucket=bucket, Key=f'data/output/part-{i}.csv', Body=buffer.getvalue())
    return {'status': 'ETL done'}
```
Works well for medium-sized files (up to ~1GB)
Avoid writing to `/tmp` if your file is large; stream directly to S3.
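One way to follow that advice is to serialize into an in-memory buffer and hand it straight to boto3. A minimal sketch with a hypothetical helper (write_df_to_s3, bucket, and key are placeholders):

```python
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

def write_df_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Serialize in memory so nothing touches the limited /tmp storage
    buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
    # upload_fileobj streams the buffer to S3 (multipart for larger payloads)
    s3.upload_fileobj(buffer, bucket, key)
```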
Processing Large Files with AWS Glue and Pandas
AWS Glue is a fully managed ETL service built for scale. While Glue runs Spark under the hood, you can also use Pandas inside Glue Python Shell jobs or via Pandas-on-Spark (formerly Koalas) in Glue 3.0+.
Option 1: Pandas in AWS Glue Python Shell
Use for lightweight jobs or custom logic:
```python
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='large.csv')

for i, chunk in enumerate(pd.read_csv(io.BytesIO(obj['Body'].read()), chunksize=1_000_000)):
    # Transform
    chunk['processed'] = True
    buffer = io.StringIO()
    chunk.to_csv(buffer, index=False)
    # Write each chunk to its own key so previous parts are not overwritten
    s3.put_object(Bucket='my-bucket', Key=f'processed/output-{i}.csv', Body=buffer.getvalue())
```
Easy transition for Pandas users
Not as scalable as Spark for true big data
Option 2: Use Pandas-on-Spark (formerly Koalas) in Glue
```python
import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")
filtered = psdf[psdf["amount"] > 1000]
filtered.to_parquet("s3://my-bucket/processed/output/")
```
Blends Pandas syntax with Spark scalability
Efficient for 10GB+ datasets
Requires understanding of distributed systems and Spark
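The same Pandas-style API extends to grouping and aggregation, executed as distributed Spark jobs. A brief sketch, assuming the same hypothetical file and a region/amount schema:

```python
import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")

# Looks like Pandas, runs as Spark: the aggregation happens across the cluster
per_region = psdf.groupby("region")["amount"].sum().sort_values(ascending=False)

# Only the small, already-reduced result is brought back for display
print(per_region.head(10))
```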
Local vs Lambda vs Glue: What to Use When?
| Feature | Local Pandas | AWS Lambda | AWS Glue (Python/Spark) |
|---|---|---|---|
| Memory Limit | Your system RAM | 10 GB | Scales to terabytes |
| Scalability | Low | Medium | High |
| Runtime Limit | None | 15 minutes | Up to 48 hours |
| Cost Model | Your machine | Pay-per-invoke | Pay-per-job |
| Ideal Use Case | Dev, quick jobs | Medium files, automation | Big ETL jobs, scheduled pipelines |
Real Example: Daily Transaction ETL (Lambda vs Glue)
In Lambda:
- File: 500MB CSV in S3
- Task: Filter completed rows and save to S3
- Runtime: ~3 minutes
In Glue:
- File: 25GB Parquet
- Task: Join with customer metadata, aggregate per region
- Runtime: ~6 minutes with 5 workers
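A rough sketch of what that Glue job could look like with Pandas-on-Spark; the S3 paths, join key, and column names are placeholders, not the actual pipeline:

```python
import pyspark.pandas as ps

transactions = ps.read_parquet("s3://my-bucket/transactions/")
customers = ps.read_parquet("s3://my-bucket/customers/")

# Join transaction facts with customer metadata, then aggregate per region
enriched = transactions.merge(customers, on="customer_id", how="left")
per_region = enriched.groupby("region")["amount"].sum().reset_index()

per_region.to_parquet("s3://my-bucket/reports/amount_per_region/")
```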
Tools That Help
- `dask` — Parallelized Pandas for larger-than-RAM data (see the sketch below)
- `awswrangler` — AWS-optimized Pandas wrapper (Glue + Pandas)
- `polars` — Lightning-fast DataFrame library, better than Pandas for many use cases
- `pyarrow` — Needed for Parquet read/write
- `boto3` — Essential for S3 interaction in Lambda or local scripts
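As an example of the first tool on the list, Dask keeps the familiar DataFrame API while splitting work across partitions and cores. A minimal sketch, assuming hypothetical daily CSV exports:

```python
import dask.dataframe as dd

# Each matching file becomes one or more lazily evaluated partitions
ddf = dd.read_csv("exports/transactions-*.csv", dtype={"amount": "float64"})

# Nothing runs until .compute(), so intermediate data never has to fit in RAM at once
per_region = ddf.groupby("region")["amount"].sum()
print(per_region.compute().sort_values(ascending=False).head())
```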
Final Best Practices
- Profile memory usage before/after loading (`memory_usage(deep=True)`)
- Always define `dtype` manually
- Use `chunksize` and `usecols` for reads
- Store in Parquet format for repeated reads
- In Lambda: avoid `/tmp` if not needed; stream with `io`
- In Glue: prefer Spark engine for large joins or aggregates
Conclusion
Pandas can scale to handle large files, if used wisely. When you’re operating in the cloud, combining Pandas with AWS Lambda or Glue opens up flexible, cost-efficient ETL possibilities.
Choose the right tool:
- Use Lambda for lightweight, event-driven automation.
- Use Glue when data volume and complexity go beyond what a single machine can handle.
Mastering these approaches ensures you're not just writing ETL pipelines, but building production-grade, scalable, and cost-efficient data workflows.