Setting Up Your Environment
Python and Pandas for ETL
Handling JSON and CSV Files
Python with AWS SDK (Boto3)
Python & SQL (with SQLite/MySQL)
Data Cleaning with Pandas
Working with APIs in Python
Building Batch Jobs in Python
Real-Time Data Pipelines with Python
Logging & Error Handling in Python
ETL Jobs with Cron and AWS Lambda
Setting Up Your Environment for Advanced Python in Data Engineering
Setting up a reliable, scalable, and efficient environment is the first and most important step when working on advanced Python projects as a data engineer. Whether you’re working locally, on cloud platforms like AWS, or containerizing your apps for portability, this guide will walk you through everything you need to know.
We’ll cover everything from basic setup on your machine to configuring cloud environments like AWS Lambda and AWS Glue for Python-based data workflows. By the end, you’ll be able to start any project with confidence.
Who Is This For?
This guide is perfect for:
- Data Engineers and aspiring Data Engineers
- Python developers looking to level up for real-world data projects
- Anyone working with cloud-based data pipelines (e.g., AWS Glue, Lambda, S3)
If you’ve ever asked yourself:
- “How do I isolate my project’s dependencies?”
- “Which Python libraries should I install for data engineering?”
- “How can I run my Python code in AWS Glue or Lambda?”
Then this setup guide is for you.
Step 1: Install Python (If You Haven’t Already)
First, ensure you have Python 3.8 or higher installed on your system. Why 3.8+? Because many cloud platforms and libraries now require or recommend it.
Check your current version:
python3 --version
If Python isn’t installed, download it from python.org.
Step 2: Set Up a Virtual Environment
Virtual environments help you isolate project-specific dependencies so they don’t interfere with other Python projects.
Commands:
# Create a virtual environment
python3 -m venv venv
# Activate it (Mac/Linux)
source venv/bin/activate
# Activate it (Windows)
venv\Scripts\activate
Once activated, any packages you install with pip will be scoped only to this environment.
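If you ever want to confirm from inside Python that you are really running in the virtual environment, a quick sanity check is to compare `sys.prefix` with `sys.base_prefix` — inside a venv they differ. This is just a convenience helper, not part of the standard workflow:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
```

Run it before and after `source venv/bin/activate` to see the difference.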
Step 3: Install Essential Python Libraries
For data engineering, you’ll frequently use:
Core Libraries:
pip install pandas numpy sqlalchemy requests pyarrow boto3
Testing & Code Quality:
pip install black isort flake8 pytest
Optional (for performance and scaling):
pip install dask fastparquet polars
Save your environment:
pip freeze > requirements.txt
This helps you or your team recreate the environment later with pip install -r requirements.txt.
Step 4: Organize Your Project Structure
Keeping your files clean and organized makes your code easier to scale, test, and maintain.
Recommended Structure:
project-name/
├── data/ # Sample/raw input data
├── src/ # Python source code
│ ├── etl/ # ETL logic scripts
│ └── utils/ # Helper functions
├── notebooks/ # Jupyter Notebooks
├── tests/ # Unit tests
├── venv/ # Your virtual environment
├── requirements.txt # List of installed packages
└── README.md # Project overview
This structure supports scalable development, testing, and deployment.
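To avoid creating these folders by hand every time, you can scaffold the layout with a short script. This is a sketch using only the standard library; the directory names simply mirror the structure above:

```python
from pathlib import Path

def scaffold(root: str) -> None:
    """Create the recommended project skeleton shown above."""
    base = Path(root)
    for d in ["data", "src/etl", "src/utils", "notebooks", "tests"]:
        (base / d).mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; fill them in as the project grows.
    (base / "requirements.txt").touch()
    (base / "README.md").touch()

scaffold("project-name")
```

Running it twice is safe because of `exist_ok=True`.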
Step 5: Set Up Your IDE (Visual Studio Code Recommended)
Using a proper IDE can drastically improve your productivity. VS Code is lightweight, powerful, and perfect for Python projects.
Extensions to Install:
- Python
- Pylance (for IntelliSense and autocompletion)
- Jupyter
- Docker (if you’re containerizing)
- GitLens (for version control)
Set your interpreter to point to your virtual environment.
Step 6: Test Your Setup
Create a test Python file in the src/ folder:
# src/test_env.py
import pandas as pd
print("Pandas version:", pd.__version__)
Run it to ensure everything is working:
python src/test_env.py
If you see the pandas version printed, you’re all set locally!
Step 7: Setting Up a Cloud Environment (AWS Lambda, Glue, and EC2)
Cloud platforms like AWS are widely used in data engineering. Here’s how to configure Python environments for common AWS services.
AWS Lambda (for serverless jobs)
Lambda supports Python, but you need to bundle dependencies correctly.
- Create a lambda_function.py file
- Install your libraries locally:
pip install requests -t ./package
- Zip everything:
cd package
zip -r ../function.zip .
cd ..
zip -g function.zip lambda_function.py
- Upload function.zip in the AWS Lambda Console
- Set the handler to lambda_function.lambda_handler
Note: Avoid heavy libraries like pandas in Lambda unless absolutely necessary.
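For reference, here is a minimal lambda_function.py sketch. The event shape is a placeholder (Lambda passes whatever your trigger sends), and the statusCode/body return shape follows the common API Gateway convention — adapt it to your actual trigger:

```python
# lambda_function.py — minimal handler sketch (event fields are placeholders)
import json

def lambda_handler(event, context):
    # Lambda invokes this with the triggering event (a dict) and a
    # runtime context object; the function name must match the
    # configured handler: lambda_function.lambda_handler
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

You can exercise the handler locally by calling it with a sample event dict before zipping and uploading.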
AWS Glue (for large-scale ETL)
Glue supports both Python (PySpark) and Scala. For Python:
- Use Glue Studio or the AWS Console
- Set your script as a job
- Choose Python 3.x
- For libraries not included in Glue:
  - Create a wheel file (.whl)
  - Upload it to S3
  - Add the S3 path in the Glue job’s “Python library path”
- Commonly included libraries: pandas, numpy, pyarrow, boto3
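If you configure Glue jobs programmatically rather than through the console, the extra library path maps onto Glue’s special job parameters. The helper below only builds the arguments dict — the bucket and wheel name are placeholders for your own artifacts:

```python
# Sketch: Glue job default arguments that attach an extra Python wheel from S3.
def glue_job_args(wheel_s3_path: str) -> dict:
    return {
        "--extra-py-files": wheel_s3_path,  # Glue installs wheels/zips listed here
        "--job-language": "python",         # as opposed to "scala"
    }

args = glue_job_args("s3://example-bucket/libs/helpers-0.1.0-py3-none-any.whl")
print(args)
```

You would pass this dict as DefaultArguments when creating the job (e.g., via boto3’s Glue client).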
AWS EC2 (for full control)
Use EC2 when you need custom environments:
- Launch an EC2 instance (Amazon Linux 2 preferred)
- SSH into it:
ssh -i your-key.pem ec2-user@your-ec2-ip
- Install Python and create a virtual environment:
sudo yum update -y
sudo yum install python3 -y
python3 -m venv venv
source venv/bin/activate
- Clone your repo, install dependencies, and run your job!
Bonus: Dockerizing Your Setup (Optional but Recommended)
For reproducibility and portability:
Sample Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "src/main.py"]
Then build and run:
docker build -t data-eng-app .
docker run data-eng-app
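If you adopt the Dockerfile above, a .dockerignore keeps the build context small and stops the `COPY . .` step from baking your local environment into the image. A minimal sketch, assuming the project layout from Step 4:

```
# .dockerignore — exclude local-only files from the build context
venv/
data/
notebooks/
__pycache__/
*.pyc
.git/
```

Without this, the venv/ folder alone can bloat the image and slow every build.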
Final Thoughts
Setting up a Python environment properly isn’t just a technical step: it’s a productivity boost and a time saver. With your tools, libraries, and workflows in place, you’re now ready to build real-world data pipelines, automate processes, and scale your projects.
If you’re planning to work on AWS or any cloud platform, getting comfortable with Lambda, Glue, or Docker early will pay off in the long run.