Setting Up Your Environment
Python and Pandas for ETL
Handling JSON and CSV Files
Python with AWS SDK (Boto3)
Python & SQL (with SQLite/MySQL)
Data Cleaning with Pandas
Working with APIs in Python
Building Batch Jobs in Python
Real-Time Data Pipelines with Python
Logging & Error Handling in Python
ETL Jobs with Cron and AWS Lambda
Setting Up Your Environment for Advanced Python in Data Engineering
Setting up a reliable, scalable, and efficient environment is the first and most important step when working on advanced Python projects as a data engineer. Whether you’re working locally, on cloud platforms like AWS, or containerizing your apps for portability, this guide will walk you through everything you need to know.
We’ll cover everything from basic setup on your machine to configuring cloud environments like AWS Lambda and AWS Glue for Python-based data workflows. By the end, you’ll be able to start any project with confidence.
Who Is This For?
This guide is perfect for:
- Data Engineers and aspiring Data Engineers
- Python developers looking to level up for real-world data projects
- Anyone working with cloud-based data pipelines (e.g., AWS Glue, Lambda, S3)
If you’ve ever asked yourself:
- “How do I isolate my project’s dependencies?”
- “Which Python libraries should I install for data engineering?”
- “How can I run my Python code in AWS Glue or Lambda?”
Then this setup guide is for you.
Step 1: Install Python (If You Haven’t Already)
First, ensure you have Python 3.8 or higher installed on your system. Why 3.8+? Because many cloud platforms and libraries now require or recommend it.
Check your current version:
python3 --version
If Python isn’t installed, download it from python.org.
Step 2: Set Up a Virtual Environment
Virtual environments help you isolate project-specific dependencies so they don’t interfere with other Python projects.
Commands:
# Create a virtual environment
python3 -m venv venv
# Activate it (Mac/Linux)
source venv/bin/activate
# Activate it (Windows)
venv\Scripts\activate
Once activated, any packages you install with pip will be scoped only to this environment.
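If you ever want to confirm from inside Python that you are really running in the virtual environment, a quick sanity check is to compare `sys.prefix` with `sys.base_prefix` — inside a venv they differ. This is just a convenience helper, not part of the standard workflow:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
```

Run it before and after `source venv/bin/activate` to see the difference.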
Step 3: Install Essential Python Libraries
For data engineering, you’ll frequently use:
Core Libraries:
pip install pandas numpy sqlalchemy requests pyarrow boto3
Testing & Code Quality:
pip install black isort flake8 pytest
Optional (for performance and scaling):
pip install dask fastparquet polars
Save your environment:
pip freeze > requirements.txt
This helps you or your team recreate the environment later with pip install -r requirements.txt.
Step 4: Organize Your Project Structure
Keeping your files clean and organized makes your code easier to scale, test, and maintain.
Recommended Structure:
project-name/
├── data/ # Sample/raw input data
├── src/ # Python source code
│ ├── etl/ # ETL logic scripts
│ └── utils/ # Helper functions
├── notebooks/ # Jupyter Notebooks
├── tests/ # Unit tests
├── venv/ # Your virtual environment
├── requirements.txt # List of installed packages
└── README.md # Project overview
This structure supports scalable development, testing, and deployment.
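To avoid creating these folders by hand every time, you can scaffold the layout with a short script. This is a sketch using only the standard library; the directory names simply mirror the structure above:

```python
from pathlib import Path

def scaffold(root: str) -> None:
    """Create the recommended project skeleton shown above."""
    base = Path(root)
    for d in ["data", "src/etl", "src/utils", "notebooks", "tests"]:
        (base / d).mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; fill them in as the project grows.
    (base / "requirements.txt").touch()
    (base / "README.md").touch()

scaffold("project-name")
```

Running it twice is safe because of `exist_ok=True`.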
Step 5: Set Up Your IDE (Visual Studio Code Recommended)
Using a proper IDE can drastically improve your productivity. VS Code is lightweight, powerful, and perfect for Python projects.
Extensions to Install:
- Python
- Pylance (for IntelliSense and autocompletion)
- Jupyter
- Docker (if you’re containerizing)
- GitLens (for version control)
Set your interpreter to point to your virtual environment.
Step 6: Test Your Setup
Create a test Python file in the src/ folder:
# src/test_env.py
import pandas as pd
print("Pandas version:", pd.__version__)
Run it to ensure everything is working:
python src/test_env.py
If you see the pandas version printed, you’re all set locally!
Step 7: Setting Up a Cloud Environment (AWS Lambda, Glue, and EC2)
Cloud platforms like AWS are widely used in data engineering. Here’s how to configure Python environments for common AWS services.
AWS Lambda (for serverless jobs)
Lambda supports Python, but you need to bundle dependencies correctly.
- Create a lambda_function.py file
- Install your libraries locally:
pip install requests -t ./package
- Zip everything:
cd package
zip -r ../function.zip .
cd ..
zip -g function.zip lambda_function.py
- Upload function.zip in the AWS Lambda Console
- Set the handler to lambda_function.lambda_handler
Note: Avoid heavy libraries like pandas in Lambda unless absolutely necessary.
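For reference, here is a minimal lambda_function.py sketch. The event shape is a placeholder (Lambda passes whatever your trigger sends), and the statusCode/body return shape follows the common API Gateway convention — adapt it to your actual trigger:

```python
# lambda_function.py — minimal handler sketch (event fields are placeholders)
import json

def lambda_handler(event, context):
    # Lambda invokes this with the triggering event (a dict) and a
    # runtime context object; the function name must match the
    # configured handler: lambda_function.lambda_handler
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

You can exercise the handler locally by calling it with a sample event dict before zipping and uploading.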
AWS Glue (for large-scale ETL)
Glue supports both Python (PySpark) and Scala. For Python:
- Use Glue Studio or the AWS Console
- Set your script as a job
- Choose Python 3.x
- For libraries not included in Glue:
  - Create a wheel file (.whl)
  - Upload it to S3
  - Add the S3 path in the Glue job’s “Python library path”
- Commonly included libraries: pandas, numpy, pyarrow, boto3
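If you configure Glue jobs programmatically rather than through the console, the extra library path maps onto Glue’s special job parameters. The helper below only builds the arguments dict — the bucket and wheel name are placeholders for your own artifacts:

```python
# Sketch: Glue job default arguments that attach an extra Python wheel from S3.
def glue_job_args(wheel_s3_path: str) -> dict:
    return {
        "--extra-py-files": wheel_s3_path,  # Glue installs wheels/zips listed here
        "--job-language": "python",         # as opposed to "scala"
    }

args = glue_job_args("s3://example-bucket/libs/helpers-0.1.0-py3-none-any.whl")
print(args)
```

You would pass this dict as DefaultArguments when creating the job (e.g., via boto3’s Glue client).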
AWS EC2 (for full control)
Use EC2 when you need custom environments:
- Launch an EC2 instance (Amazon Linux 2 preferred)
- SSH into it:
ssh -i your-key.pem ec2-user@your-ec2-ip
- Install Python and create a virtual environment:
sudo yum update -y
sudo yum install python3 -y
python3 -m venv venv
source venv/bin/activate
- Clone your repo, install dependencies, and run your job!
Bonus: Dockerizing Your Setup (Optional but Recommended)
For reproducibility and portability:
Sample Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "src/main.py"]
Then build and run:
docker build -t data-eng-app .
docker run data-eng-app
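If you adopt the Dockerfile above, a .dockerignore keeps the build context small and stops the `COPY . .` step from baking your local environment into the image. A minimal sketch, assuming the project layout from Step 4:

```
# .dockerignore — exclude local-only files from the build context
venv/
data/
notebooks/
__pycache__/
*.pyc
.git/
```

Without this, the venv/ folder alone can bloat the image and slow every build.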
Final Thoughts
Setting up a Python environment properly isn’t just a technical step: it’s a productivity boost and a time saver. With your tools, libraries, and workflows in place, you’re now ready to build real-world data pipelines, automate processes, and scale your projects.
If you’re planning to work on AWS or any cloud platform, getting comfortable with Lambda, Glue, or Docker early will pay off in the long run.