Processing Large Files with Pandas
Working with large datasets is a core responsibility in modern data engineering. Whether you’re ingesting logs, aggregating transactions, or transforming raw files into insights, chances are you’ll hit performance or memory bottlenecks if you’re not careful—especially when using Pandas, the default data wrangling tool in Python.
This guide dives deep into efficient methods to process large files with Pandas, and how to do it in cloud-native environments like AWS Lambda and AWS Glue.
Why Is Processing Large Files Challenging?
Pandas loads data into memory (RAM). So, if your CSV, JSON, or log file is larger than the available memory on your machine—or your Lambda function’s max memory (10GB)—you’ll likely see performance issues or crashes.
Here are a few reasons why large file processing is tricky:
- Memory Overflow: Pandas tries to load everything into memory by default.
- Slow Disk I/O: Reading/writing CSVs repeatedly can bottleneck processing.
- Data Types: Incorrect or unoptimized dtypes can double memory usage.
- Cloud Constraints: Environments like Lambda are ephemeral and limited in memory/runtime.
Common Large File Scenarios
| Format | Memory-Friendly? | Notes |
|---|---|---|
| CSV | No | Text-heavy, slow I/O |
| JSON | No | Nested, verbose |
| Parquet | Yes | Columnar, compressed |
| Logs (.txt) | No | Semi-structured |
| Excel | No | Poor for scale |
Best Practices for Local Processing with Pandas
1. Use `dtype` to Minimize Memory
Assign appropriate data types to avoid unnecessary memory usage.
```python
import pandas as pd

dtypes = {
    'user_id': 'int32',
    'amount': 'float32',
    'status': 'category'
}
df = pd.read_csv('transactions.csv', dtype=dtypes)
```
Reduces memory footprint by 30–70%
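To verify the savings on your own data, you can compare the deep memory usage of an untyped load against a typed one. A minimal sketch, assuming the same hypothetical transactions.csv and reading only a sample of rows to keep the check cheap:

```python
import pandas as pd

dtypes = {'user_id': 'int32', 'amount': 'float32', 'status': 'category'}

# Untyped sample load: Pandas infers int64/float64/object, usually the widest option
raw = pd.read_csv('transactions.csv', nrows=500_000)
# Typed sample load: same rows, explicit dtypes
typed = pd.read_csv('transactions.csv', dtype=dtypes, nrows=500_000)

# memory_usage(deep=True) counts the real bytes held by object/string columns
print('inferred dtypes:', raw.memory_usage(deep=True).sum() / 1e6, 'MB')
print('explicit dtypes:', typed.memory_usage(deep=True).sum() / 1e6, 'MB')
```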
2. Read Data in Chunks
Use `chunksize` to load portions of the file into memory.
```python
chunksize = 100_000
first = True
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    filtered = chunk[chunk['status'] == 'complete']
    # Write the header only once, then append the remaining chunks without it
    filtered.to_csv('filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False
```
Makes it possible to process multi-GB files on laptops
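Chunking works for aggregations too, not just filters: compute a small partial result per chunk and fold it into a running total. A minimal sketch, assuming (for illustration) that large_file.csv also has region and amount columns:

```python
import pandas as pd

totals = {}
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Aggregate within the chunk first, so only a tiny Series crosses iterations
    partial = chunk.groupby('region')['amount'].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0.0) + amount

result = pd.Series(totals, name='amount').sort_values(ascending=False)
print(result.head())
```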
3. Use `usecols` to Skip Unused Columns
```python
df = pd.read_csv('large.csv', usecols=['id', 'amount', 'region'])
```
Reduces I/O and speeds up processing
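If you are unsure which columns a file contains, you can parse just the header first; nrows=0 reads only the column names, so it stays cheap even for multi-GB files. A small sketch using the same hypothetical large.csv:

```python
import pandas as pd

# nrows=0 parses only the header row
all_columns = pd.read_csv('large.csv', nrows=0).columns.tolist()
print(all_columns)

# Keep only the columns the job actually needs
wanted = [c for c in ['id', 'amount', 'region'] if c in all_columns]
df = pd.read_csv('large.csv', usecols=wanted)
```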
4. Convert Repetitive Strings to Categories
```python
df['region'] = df['region'].astype('category')
```
Huge memory savings for low-cardinality string fields
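Before converting, it can help to confirm the column really is low-cardinality and to measure the effect. A quick sketch, assuming the df and region column from above:

```python
# 'category' pays off when there are few unique values relative to the row count
n_unique = df['region'].nunique()
print(f"{n_unique} unique values ({n_unique / len(df):.2%} of rows)")

# Compare the column's memory before and after the conversion
before = df['region'].memory_usage(deep=True)
after = df['region'].astype('category').memory_usage(deep=True)
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```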
5. Switch to Binary Formats
CSV and JSON are verbose. Convert to Parquet for faster, compressed processing.
```python
df.to_parquet('optimized.parquet')
df = pd.read_parquet('optimized.parquet')
```
Up to 10x faster and 80% smaller on disk
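Because Parquet is columnar, you can also read back only the columns you need, which combines the usecols idea with the faster format. A short sketch, assuming pyarrow is installed (the column names are placeholders):

```python
import pandas as pd

# Only the requested columns are read from disk thanks to the columnar layout
subset = pd.read_parquet('optimized.parquet', columns=['user_id', 'amount'])
print(subset.dtypes)
```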
Processing Large Files in AWS Lambda with Pandas
AWS Lambda is serverless and perfect for automation, but comes with limits:
- Max memory: 10 GB
- Max runtime: 15 minutes
- No persistent local disk (only `/tmp` with a 512MB–10GB limit)
Use Pandas with Chunking Inside Lambda
Here’s how to handle large CSVs from S3 in a memory-safe way:
```python
import io

import boto3
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = 'my-bucket'
    key = 'data/large.csv'

    # read() pulls the raw bytes into memory; chunked parsing keeps each DataFrame small
    obj = s3.get_object(Bucket=bucket, Key=key)
    for i, chunk in enumerate(pd.read_csv(io.BytesIO(obj['Body'].read()), chunksize=500_000)):
        chunk['processed'] = True
        buffer = io.StringIO()
        chunk.to_csv(buffer, index=False)
        # Write each chunk to its own key so earlier parts are not overwritten
        s3.put_object(Bucket=bucket, Key=f'data/output/part-{i}.csv', Body=buffer.getvalue())
    return {'status': 'ETL done'}
```
Works well for medium-sized files (up to ~1GB)
Avoid writing to `/tmp` if your file is large; stream directly to S3.
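One way to follow that advice is to serialize into an in-memory buffer and hand it straight to boto3. A minimal sketch with a hypothetical helper (write_df_to_s3, bucket, and key are placeholders):

```python
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

def write_df_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Serialize in memory so nothing touches the limited /tmp storage
    buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
    # upload_fileobj streams the buffer to S3 (multipart for larger payloads)
    s3.upload_fileobj(buffer, bucket, key)
```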
Processing Large Files with AWS Glue and Pandas
AWS Glue is a fully managed ETL service built for scale. While Glue runs Spark under the hood, you can also use Pandas inside Glue Python Shell jobs or via Pandas-on-Spark (formerly Koalas) in Glue 3.0+.
Option 1: Pandas in AWS Glue Python Shell
Use for lightweight jobs or custom logic:
```python
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='large.csv')

for i, chunk in enumerate(pd.read_csv(io.BytesIO(obj['Body'].read()), chunksize=1_000_000)):
    # Transform
    chunk['processed'] = True
    buffer = io.StringIO()
    chunk.to_csv(buffer, index=False)
    # Write each chunk to its own key so previous parts are not overwritten
    s3.put_object(Bucket='my-bucket', Key=f'processed/output-{i}.csv', Body=buffer.getvalue())
```
Easy transition for Pandas users
Not as scalable as Spark for true big data
Option 2: Use Pandas-on-Spark (formerly Koalas) in Glue
```python
import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")
filtered = psdf[psdf["amount"] > 1000]
filtered.to_parquet("s3://my-bucket/processed/output/")
```
Blends Pandas syntax with Spark scalability
Efficient for 10GB+ datasets
Requires understanding of distributed systems and Spark
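The same Pandas-style API extends to grouping and aggregation, executed as distributed Spark jobs. A brief sketch, assuming the same hypothetical file and a region/amount schema:

```python
import pyspark.pandas as ps

psdf = ps.read_csv("s3://my-bucket/large.csv")

# Looks like Pandas, runs as Spark: the aggregation happens across the cluster
per_region = psdf.groupby("region")["amount"].sum().sort_values(ascending=False)

# Only the small, already-reduced result is brought back for display
print(per_region.head(10))
```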
Local vs Lambda vs Glue: What to Use When?
| Feature | Local Pandas | AWS Lambda | AWS Glue (Python/Spark) |
|---|---|---|---|
| Memory Limit | Your system RAM | 10 GB | Scales to terabytes |
| Scalability | Low | Medium | High |
| Runtime Limit | None | 15 minutes | Up to 48 hours |
| Cost Model | Your machine | Pay-per-invoke | Pay-per-job |
| Ideal Use Case | Dev, quick jobs | Medium files, automation | Big ETL jobs, scheduled pipelines |
Real Example: Daily Transaction ETL (Lambda vs Glue)
In Lambda:
- File: 500MB CSV in S3
- Task: Filter completed rows and save to S3
- Runtime: ~3 minutes
In Glue:
- File: 25GB Parquet
- Task: Join with customer metadata, aggregate per region
- Runtime: ~6 minutes with 5 workers
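A rough sketch of what that Glue job could look like with Pandas-on-Spark; the S3 paths, join key, and column names are placeholders, not the actual pipeline:

```python
import pyspark.pandas as ps

transactions = ps.read_parquet("s3://my-bucket/transactions/")
customers = ps.read_parquet("s3://my-bucket/customers/")

# Join transaction facts with customer metadata, then aggregate per region
enriched = transactions.merge(customers, on="customer_id", how="left")
per_region = enriched.groupby("region")["amount"].sum().reset_index()

per_region.to_parquet("s3://my-bucket/reports/amount_per_region/")
```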
Tools That Help
- `dask` — Parallelized Pandas for larger-than-RAM data (see the sketch below)
- `awswrangler` — AWS-optimized Pandas wrapper (Glue + Pandas)
- `polars` — Lightning-fast DataFrame library, better than Pandas for many use cases
- `pyarrow` — Needed for Parquet read/write
- `boto3` — Essential for S3 interaction in Lambda or local scripts
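As an example of the first tool on the list, Dask keeps the familiar DataFrame API while splitting work across partitions and cores. A minimal sketch, assuming hypothetical daily CSV exports:

```python
import dask.dataframe as dd

# Each matching file becomes one or more lazily evaluated partitions
ddf = dd.read_csv("exports/transactions-*.csv", dtype={"amount": "float64"})

# Nothing runs until .compute(), so intermediate data never has to fit in RAM at once
per_region = ddf.groupby("region")["amount"].sum()
print(per_region.compute().sort_values(ascending=False).head())
```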
Final Best Practices
- Profile memory usage before/after loading (`memory_usage(deep=True)`)
- Always define `dtype` manually
- Use `chunksize` and `usecols` for reads
- Store in Parquet format for repeated reads
- In Lambda: avoid `/tmp` if not needed; stream with `io`
- In Glue: prefer Spark engine for large joins or aggregates
Conclusion
Pandas can scale to handle large files, if used wisely. When you’re operating in the cloud, combining Pandas with AWS Lambda or Glue opens up flexible, cost-efficient ETL possibilities.
Choose the right tool:
- Use Lambda for lightweight, event-driven automation.
- Use Glue when data volume and complexity go beyond what a single machine can handle.
Mastering these approaches ensures you're not just writing ETL pipelines, but building production-grade, scalable, and cost-efficient data workflows.