Handling JSON and CSV Files in Python for Data Engineering
In the world of data engineering, file formats like JSON and CSV are used almost everywhere—from storing structured records and logging user events to configuring APIs and streaming real-time data. If you’re working with data, chances are you’ll encounter these formats almost daily. Whether you’re building a data pipeline, cleaning raw datasets, or integrating with third-party services, having a solid grasp of how to handle JSON and CSV files in Python is crucial.
This comprehensive guide is designed to help you learn how to read, write, transform, and work with JSON and CSV files using Python in a way that’s not just technically sound but also easy to understand—even if you’re relatively new to data engineering. We’ll walk through each concept with clear explanations, real-world examples, and Python code snippets. We’ll also discuss best practices, common pitfalls, and how to prepare your file-handling logic for production environments.
What are CSV Files?
CSV stands for Comma-Separated Values. It’s one of the simplest ways to store structured data in plain text form. Each line in a CSV file corresponds to a row in a table, and each field (column) is separated by a comma.
Why CSV is Important in Data Engineering
- It's easy to generate and read.
- Almost every analytics tool or database supports it.
- Ideal for simple tabular data and reports.
- Lightweight and fast to process.
Common Use Cases
- Exporting reports from databases or tools like Excel.
- Loading customer data into CRMs or marketing platforms.
- Ingesting flat files from partners or internal teams.
Reading CSV Files in Python
Python provides two main ways to read CSV files:
1. Using Pandas (Recommended for Dataframes)
import pandas as pd
df = pd.read_csv('data/sales.csv')
print(df.head())
This method is fast and efficient, and gives you access to powerful dataframe operations right away.
2. Using the Built-in csv Module
import csv

with open('data/sales.csv', mode='r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
The built-in module gives you more control and is memory-efficient for large files.
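If you prefer each row as a dictionary keyed by column name rather than a plain list, the csv module also provides DictReader. A minimal sketch, using an in-memory file and made-up column names purely for illustration:

```python
import csv
from io import StringIO

# In-memory stand-in for a real CSV file (hypothetical sales data)
raw = StringIO("name,region,amount\nAlice,EU,120\nBob,US,95\n")

reader = csv.DictReader(raw)  # maps each row to a dict keyed by the header
totals = {}
for row in reader:
    # CSV fields always arrive as strings; convert explicitly before doing math
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

print(totals)  # {'EU': 120, 'US': 95}
```

Because rows are keyed by header name, this version keeps working even if the column order in the source file changes.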
Writing to CSV Files
Using Pandas:
df.to_csv('data/output.csv', index=False)
Simple and fast: index=False omits the dataframe index column, and headers are written automatically.
Using csv.writer:
import csv

with open('data/output.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Age'])
    writer.writerow(['John', 30])
This gives more flexibility if you’re writing data row-by-row.
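When your rows already live in dictionaries, csv.DictWriter pairs naturally with DictReader. A small sketch writing to an in-memory buffer (the field names are hypothetical):

```python
import csv
from io import StringIO

rows = [{"Name": "John", "Age": 30}, {"Name": "Jane", "Age": 28}]

buf = StringIO()  # stands in for a real file opened with newline=''
writer = csv.DictWriter(buf, fieldnames=["Name", "Age"])
writer.writeheader()   # emits the header row from fieldnames
writer.writerows(rows) # each dict becomes one CSV row

print(buf.getvalue())
```

The fieldnames list controls both the header and the column order, so the output stays stable even if the dictionaries are built in a different order.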
What is JSON?
JSON (JavaScript Object Notation) is a lightweight data format used to represent hierarchical or nested data. It’s commonly used in APIs, logs, configuration files, and NoSQL databases like MongoDB.
Why JSON is Important in Data Engineering
- Human-readable and language-independent.
- Supports nested and complex structures.
- Ideal for configuration and messaging systems.
- Frequently used in REST APIs and real-time data exchanges.
Common Use Cases
- Parsing API responses
- Storing user activity logs
- Configuring applications (settings.json)
- Event-driven architecture
Reading JSON Files in Python
Using Pandas (for flat structures):
import pandas as pd
df = pd.read_json('data/data.json')
print(df.head())
Use this for simple JSON arrays or flat records.
Using json module (for nested JSON):
import json
with open('data/config.json') as file:
    data = json.load(file)

print(data['settings'])
This gives you full control over how you access and manipulate keys.
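For example, once the file is loaded you can walk nested keys directly, or use .get() with a default when a key might be missing. A minimal sketch with a made-up config payload:

```python
import json

# Hypothetical config payload, parsed from a string for illustration
raw = '{"settings": {"retries": 3, "backoff": {"base_ms": 100}}}'
data = json.loads(raw)

retries = data["settings"]["retries"]            # direct nested access
base_ms = data["settings"]["backoff"]["base_ms"] # two levels deep
# .get() with a default avoids a KeyError when a key may be absent
region = data["settings"].get("region", "us-east-1")

print(retries, base_ms, region)  # 3 100 us-east-1
```

Reaching into nested structures with bracket access fails loudly on missing keys, while .get() degrades gracefully; pick whichever behaviour your pipeline needs.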
Writing to JSON Files
import json
output = {'name': 'Alice', 'role': 'Data Engineer'}
with open('data/output.json', 'w') as file:
    json.dump(output, file, indent=4)
The indent=4 argument makes the JSON human-readable and properly formatted.
Transforming Between JSON and CSV
Very often, you’ll need to switch between formats, especially when dealing with APIs (JSON) and analytics tools (CSV).
CSV to JSON Example:
import pandas as pd
df = pd.read_csv('data/sales.csv')
df.to_json('data/sales.json', orient='records', lines=True)
This converts rows into individual JSON records, ideal for line-by-line processing.
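With lines=True, each dataframe row becomes one JSON object per line (JSON Lines), which you can read straight back with read_json. A round-trip sketch using a small in-memory frame with hypothetical column names:

```python
import io
import pandas as pd

df = pd.DataFrame({"sku": ["A1", "B2"], "qty": [3, 5]})

# One JSON object per line: ideal for streaming and line-by-line processing
jsonl = df.to_json(orient="records", lines=True)
print(jsonl)

# Read the JSON Lines text back into a dataframe
df2 = pd.read_json(io.StringIO(jsonl), orient="records", lines=True)
```

The same orient='records', lines=True pair works on both ends, which makes JSON Lines a convenient interchange format between pipeline stages.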
Flattening Nested JSON:
import json
import pandas as pd
with open('data/nested.json') as file:
    data = json.load(file)

flat_data = pd.json_normalize(data['employees'])
flat_data.to_csv('data/employees.csv', index=False)
Use json_normalize() to flatten nested fields like addresses, contacts, or preferences.
Real-World Tips and Best Practices
- Always validate your input files – check encoding (UTF-8 preferred) and file format.
- Use try-except blocks to handle unexpected file errors.
- Validate JSON schema to ensure consistency in structure.
- Avoid hardcoding paths – use os.path or environment variables.
- Compress large files – use gzip for JSON/CSV to save space.
- Stream large files – don't load multi-GB files into memory; read line-by-line.
- Log your steps – always log file operations in production environments.
- Backup before overwrite – always keep a copy before transforming or rewriting files.
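Several of these practices combine naturally. The sketch below writes a small gzip-compressed CSV (so the example is self-contained, with a hypothetical filename), then streams it back row-by-row inside a try-except block rather than loading it all into memory:

```python
import csv
import gzip
import os
import tempfile

# Hypothetical compressed export, created here so the sketch is runnable
path = os.path.join(tempfile.gettempdir(), "big_sales.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerows([[1, 10], [2, 20]])

# Stream the file row-by-row instead of reading it all at once,
# and guard the I/O with try-except as suggested above
try:
    with gzip.open(path, "rt", newline="") as f:
        total = sum(int(row["amount"]) for row in csv.DictReader(f))
    print(total)  # 30
except (OSError, csv.Error) as exc:
    print(f"Failed to process {path}: {exc}")
```

Because gzip.open supports text mode, the same csv machinery works on compressed files with no extra decompression step, and memory use stays constant regardless of file size.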
Cloud and Big Data Considerations
In a cloud environment like AWS or GCP, CSV and JSON files are often stored in object storage such as S3 buckets and processed using services like Lambda, AWS Glue, or Apache Spark.
Read CSV from AWS S3 using Pandas:
import boto3
import pandas as pd
from io import StringIO
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='sales.csv')
df = pd.read_csv(StringIO(obj['Body'].read().decode('utf-8')))
Upload transformed JSON to S3:
with open("sales.json", "rb") as data:
    s3.upload_fileobj(data, "my-bucket", "processed/sales.json")
This flexibility allows you to integrate cloud-native workflows into your local Python code.
Conclusion
Working with JSON and CSV files is at the core of every data engineer’s workflow. These file formats are not just essential—they’re foundational. Whether you’re ingesting data from third-party services, configuring internal systems, or exporting clean datasets for analytics, the ability to confidently read, write, and manipulate JSON and CSV data using Python will significantly boost your productivity.
Pandas, along with Python's native json and csv libraries, provides everything you need to work with these formats reliably. Once you master these basics, you'll be well on your way to tackling more complex tasks like streaming ingestion, schema evolution, and cloud-scale processing.