Processing Large Files with Pandas

Working with large datasets is a core responsibility in modern data engineering. Whether you’re ingesting logs, aggregating transactions, or transforming raw files into insights, chances are you’ll hit performance or memory bottlenecks if you’re not careful—especially when using Pandas, the default data wrangling tool in Python.

This guide dives deep into efficient methods to process large files with Pandas, and how to do it in cloud-native environments like AWS Lambda and AWS Glue.


Why Is Processing Large Files Challenging?

Pandas loads data into memory (RAM). So, if your CSV, JSON, or log file is larger than the available memory on your machine—or your Lambda function’s max memory (10GB)—you’ll likely see performance issues or crashes.

Here are a few reasons why large file processing is tricky:

  • Memory Overflow: Pandas tries to load everything into memory by default.
  • Slow Disk I/O: Reading/writing CSVs repeatedly can bottleneck processing.
  • Data Types: Incorrect or unoptimized dtypes can double memory usage.
  • Cloud Constraints: Environments like Lambda are ephemeral and limited in memory/runtime.

Common Large File Scenarios
Format        Memory-Friendly?   Notes
CSV           No                 Text-heavy, slow I/O
JSON          No                 Nested, verbose
Parquet       Yes                Columnar, compressed
Logs (.txt)   No                 Semi-structured
Excel         No                 Poor for scale

Best Practices for Local Processing with Pandas
1. Use dtype to Minimize Memory

Assign appropriate data types to avoid unnecessary memory usage.

dtypes = {
    'user_id': 'int32',
    'amount': 'float32',
    'status': 'category'
}
df = pd.read_csv('transactions.csv', dtype=dtypes)

Reduces memory footprint by 30–70%
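
To check the effect on your own files, compare deep memory usage before and after applying explicit dtypes. A minimal sketch, reusing the columns from the example above:

import pandas as pd

baseline = pd.read_csv('transactions.csv')                 # pandas infers int64/float64/object
optimized = pd.read_csv('transactions.csv', dtype=dtypes)  # explicit smaller dtypes from above

# deep=True counts the real size of Python string objects, not just pointers
print('baseline :', baseline.memory_usage(deep=True).sum() / 1e6, 'MB')
print('optimized:', optimized.memory_usage(deep=True).sum() / 1e6, 'MB')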


2. Read Data in Chunks

Use chunksize to load portions of the file into memory.

chunksize = 100_000
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    filtered = chunk[chunk['status'] == 'complete']
    filtered.to_csv('filtered.csv', mode='a', index=False)

Makes it possible to process multi-GB files on laptops
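
Chunking also works for aggregations: reduce each chunk to a small partial result and combine the partials at the end. A sketch, assuming the same status column as above plus a hypothetical amount column:

import pandas as pd

partials = []
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Reduce each chunk to a tiny Series before moving on
    partials.append(chunk.groupby('status')['amount'].sum())

# Combine the per-chunk partial sums into the final totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)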


3. Use usecols to Skip Unused Columns
df = pd.read_csv('large.csv', usecols=['id', 'amount', 'region'])

Reduces I/O and speeds up processing
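
These read options compose: a single read_csv call can combine usecols, dtype, and chunksize. A sketch with illustrative column names:

import pandas as pd

reader = pd.read_csv(
    'large.csv',
    usecols=['id', 'amount', 'region'],                                # parse only what you need
    dtype={'id': 'int32', 'amount': 'float32', 'region': 'category'},  # keep it small
    chunksize=100_000,                                                 # and stream it in chunks
)
for chunk in reader:
    pass  # process each slim, well-typed chunk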


4. Convert Repetitive Strings to Categories
df['region'] = df['region'].astype('category')

Huge memory savings for low-cardinality string fields
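
Categories pay off when the number of distinct values is small relative to the number of rows. A quick sketch for checking that before converting:

# Rough rule of thumb: convert only low-cardinality columns
before = df['region'].memory_usage(deep=True)
if df['region'].nunique() / len(df) < 0.5:
    df['region'] = df['region'].astype('category')
after = df['region'].memory_usage(deep=True)
print(f'{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB')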


5. Switch to Binary Formats

CSV and JSON are verbose. Convert to Parquet for faster, compressed processing.

df.to_parquet('optimized.parquet')
df = pd.read_parquet('optimized.parquet')

Up to 10x faster and 80% smaller on disk
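
Because Parquet is columnar, you can also read only the columns you need, which is far cheaper than usecols on a CSV. A sketch, with illustrative column names:

import pandas as pd

# Only the requested columns are read from disk; the rest are never touched
df = pd.read_parquet('optimized.parquet', columns=['id', 'amount'])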

 


Processing Large Files in AWS Lambda with Pandas

AWS Lambda is serverless and perfect for automation, but comes with limits:

  • Max memory: 10 GB
  • Max runtime: 15 minutes
  • No persistent local disk (only /tmp with 512MB–10GB limit)

Use Pandas with Chunking Inside Lambda

Here’s how to handle large CSVs from S3 in a memory-safe way:

import boto3, pandas as pd, io

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = 'my-bucket'
    key = 'data/large.csv'
    obj = s3.get_object(Bucket=bucket, Key=key)

    # Stream the S3 body directly instead of reading the whole file into memory
    for i, chunk in enumerate(pd.read_csv(obj['Body'], chunksize=500_000)):
        chunk['processed'] = True
        buffer = io.StringIO()
        chunk.to_csv(buffer, index=False)
        # Write each chunk to its own key so earlier parts are not overwritten
        s3.put_object(Bucket=bucket, Key=f'data/output/part-{i}.csv', Body=buffer.getvalue())

    return {'status': 'ETL done'}

Works well for medium-sized files (up to ~1GB)

⚠️ Avoid writing to /tmp if your file is large; stream directly to S3.
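
An alternative inside Lambda is awswrangler (the AWS SDK for pandas, listed under Tools below), which streams from and to S3 and handles chunked reads for you. A sketch with hypothetical bucket and key names:

import awswrangler as wr

def lambda_handler(event, context):
    # chunked=500_000 yields DataFrames of up to 500k rows each
    chunks = wr.s3.read_csv('s3://my-bucket/data/large.csv', chunked=500_000)
    for i, chunk in enumerate(chunks):
        chunk['processed'] = True
        wr.s3.to_csv(chunk, f's3://my-bucket/data/output/part-{i}.csv', index=False)
    return {'status': 'ETL done'}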


Processing Large Files with AWS Glue and Pandas

AWS Glue is a fully managed ETL service built for scale. While Glue ETL jobs run Spark under the hood, you can use Pandas inside Glue Python Shell jobs or through Pandas-on-Spark (the successor to Koalas, available as pyspark.pandas in Spark 3.2+, i.e. Glue 4.0 and later).

Option 1: Pandas in AWS Glue Python Shell

Use for lightweight jobs or custom logic:

import pandas as pd
import boto3
import io

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='large.csv')

# Stream the S3 body instead of loading the entire file into memory at once
for i, chunk in enumerate(pd.read_csv(obj['Body'], chunksize=1_000_000)):
    # Transform
    chunk['processed'] = True
    buffer = io.StringIO()
    chunk.to_csv(buffer, index=False)
    # Write each chunk to its own key so earlier parts are not overwritten
    s3.put_object(Bucket='my-bucket', Key=f'processed/output/part-{i}.csv', Body=buffer.getvalue())

Easy transition for Pandas users

⚠️ Not as scalable as Spark for true big data


Option 2: Use Pandas-on-Spark (pyspark.pandas) in Glue
import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")
filtered = psdf[psdf["amount"] > 1000]
filtered.to_parquet("s3://my-bucket/processed/output/")

✅ Blends Pandas syntax with Spark scalability
✅ Efficient for 10GB+ datasets
⚠️ Requires understanding of distributed systems and Spark
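
You can also move between the two APIs: run the heavy transformations distributed, then pull a small aggregated result back into plain Pandas for final formatting or export. A sketch, with illustrative column names:

import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")

# The heavy aggregation runs distributed across the Spark executors
summary = psdf.groupby("region")["amount"].sum()

# The aggregated result is small, so it is safe to bring into plain Pandas
pdf = summary.to_pandas()
print(pdf.head())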


Local vs Lambda vs Glue: What to Use When?
Feature          Local Pandas        AWS Lambda                  AWS Glue (Python/Spark)
Memory Limit     Your system RAM     10 GB                       Scales to terabytes
Scalability      Low                 Medium                      High
Runtime Limit    None                15 minutes                  Up to 48 hours
Cost Model       Your machine        Pay-per-invoke              Pay-per-job
Ideal Use Case   Dev, quick jobs     Medium files, automation    Big ETL jobs, scheduled pipelines

Real Example: Daily Transaction ETL (Lambda vs Glue)

In Lambda:

  • File: 500MB CSV in S3
  • Task: Filter completed rows and save to S3
  • Runtime: ~3 minutes

In Glue:

  • File: 25GB Parquet
  • Task: Join with customer metadata, aggregate per region (sketched below)
  • Runtime: ~6 minutes with 5 workers
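
A minimal pandas-on-Spark sketch of the Glue job above (bucket names, paths, and column names are hypothetical):

import pyspark.pandas as ps

transactions = ps.read_parquet("s3://my-bucket/transactions/")      # the ~25GB Parquet input
customers = ps.read_csv("s3://my-bucket/metadata/customers.csv")    # small customer lookup

# Join transactions with customer metadata, then aggregate per region
enriched = transactions.merge(customers, on="customer_id", how="left")
per_region = enriched.groupby("region")["amount"].sum().reset_index()

per_region.to_parquet("s3://my-bucket/processed/per_region/")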

Tools That Help
  • dask — Parallelized Pandas for larger-than-RAM data (see the sketch after this list)
  • awswrangler — AWS-optimized Pandas wrapper (Glue + Pandas)
  • polars — Lightning-fast DataFrame library, better than Pandas for many use cases
  • pyarrow — Needed for Parquet read/write
  • boto3 — Essential for S3 interaction in Lambda or local scripts
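
As an illustration of the dask entry above, its API mirrors Pandas while splitting work across partitions and cores; a sketch with hypothetical file paths:

import dask.dataframe as dd

# Lazily build a plan over many CSVs; nothing is loaded yet
ddf = dd.read_csv('data/transactions-*.csv', dtype={'amount': 'float32'})

# compute() triggers execution, partition by partition, in parallel
totals = ddf.groupby('region')['amount'].sum().compute()
print(totals)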

Final Best Practices
  • Profile memory usage before/after loading (memory_usage(deep=True))
  • Always define dtype manually
  • Use chunksize and usecols for reads
  • Store in Parquet format for repeated reads
  • In Lambda: avoid /tmp if not needed—stream with io
  • In Glue: prefer Spark engine for large joins or aggregates

Conclusion

Pandas can scale to handle large files if used wisely. When you’re operating in the cloud, combining Pandas with AWS Lambda or Glue opens up flexible, cost-efficient ETL possibilities.

Choose the right tool:

  • Use Lambda for lightweight, event-driven automation.
  • Use Glue when data volume and complexity go beyond what a single machine can handle.

Mastering these approaches ensures you’re not just writing ETL pipelines—but building production-grade, scalable, and cost-efficient data workflows.