Python with AWS SDK (Boto3): A Complete Guide for Data Engineers
When you’re working in the cloud as a data engineer, Amazon Web Services (AWS) becomes an essential part of your toolkit. From managing data in S3 to running serverless jobs with Lambda or orchestrating ETL pipelines using AWS Glue, automation is key. That’s where Boto3, AWS’s SDK (Software Development Kit) for Python, comes into play.
This guide is for data engineers who want to automate AWS services using Python in a scalable, reliable, and production-grade manner. We’ll cover what Boto3 is, how to install and configure it, and dive into practical, real-world use cases—like uploading files to S3, launching Glue jobs, querying Athena, working with Lambda, and managing EC2 resources. You’ll also find best practices, security tips, and code snippets to help you hit the ground running.
What is Boto3?
Boto3 is the official Python SDK for AWS that allows developers and data engineers to write software that makes use of Amazon services like S3, EC2, Lambda, DynamoDB, and many others.
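Boto3 exposes each service through two interfaces: a low-level client whose methods map one-to-one onto AWS API operations, and a higher-level resource interface with object-oriented wrappers. A minimal sketch, assuming credentials are already configured (setup is covered below):
import boto3

# Low-level client: methods correspond directly to AWS API calls
s3_client = boto3.client('s3')

# Higher-level resource: object-oriented access to the same service
s3_resource = boto3.resource('s3')

for bucket in s3_resource.buckets.all():
    print(bucket.name)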
Why is Boto3 Important for Data Engineering?
Automate cloud infrastructure and workflows
Trigger and monitor ETL jobs
Manage data pipelines and resources
Integrate Python-based tools with AWS services
Reduce manual effort and errors
Installing and Setting Up Boto3
Before diving into examples, you need to set up your environment.
1. Install Boto3
pip install boto3
2. Configure AWS Credentials
You can use the AWS CLI to set credentials globally:
aws configure
Enter your:
AWS Access Key ID
AWS Secret Access Key
Default region (e.g., us-east-1)
These credentials are saved in ~/.aws/credentials and are used by Boto3 automatically.
Alternatively, you can configure them inside your Python script using:
import boto3
session = boto3.Session(
    aws_access_key_id='YOUR_KEY',
    aws_secret_access_key='YOUR_SECRET',
    region_name='us-east-1'
)
✅ Tip: Never hardcode credentials in production scripts. Use IAM roles or environment variables.
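Boto3 also reads credentials from the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION), so no keys need to appear in code. A minimal sketch that checks the environment before creating a client:
import os
import boto3

# Fail fast if the credentials are not present in the environment
required = ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
missing = [name for name in required if name not in os.environ]
if missing:
    raise RuntimeError(f"Missing AWS credentials in environment: {missing}")

# No keys in code: Boto3 resolves them from the environment (or an IAM role)
s3 = boto3.client('s3')
On EC2, Lambda, or Glue, you can drop the check entirely and rely on the attached IAM role.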
Working with AWS S3 (Simple Storage Service)
S3 is a widely used service in data engineering for storing raw, intermediate, and final data.
Upload a File to S3
import boto3
s3 = boto3.client('s3')
s3.upload_file('localfile.csv', 'my-bucket', 'data/localfile.csv')
Download a File from S3
s3.download_file('my-bucket', 'data/localfile.csv', 'downloaded.csv')
List Files in a Bucket
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
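Keep in mind that list_objects_v2 returns at most 1,000 keys per call. For larger buckets, a paginator handles the continuation tokens for you; the bucket and prefix below are placeholders:
paginator = s3.get_paginator('list_objects_v2')

# Iterate over every page of results; Boto3 follows ContinuationToken internally
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])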
Running AWS Lambda Functions with Boto3
Lambda lets you run code without provisioning or managing servers—great for lightweight ETL steps.
Invoke a Lambda Function
lambda_client = boto3.client('lambda')
response = lambda_client.invoke(
    FunctionName='my_lambda_function',
    InvocationType='RequestResponse',
    Payload=b'{"key": "value"}'
)
print(response['Payload'].read())
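If you don’t need the result immediately (for example, a fire-and-forget ETL step), you can invoke the function asynchronously instead. A minimal sketch using the same placeholder function name:
import json

# 'Event' invocation returns right away; the function runs asynchronously
# and its return value is discarded.
response = lambda_client.invoke(
    FunctionName='my_lambda_function',
    InvocationType='Event',
    Payload=json.dumps({'key': 'value'}).encode('utf-8')
)
print("Accepted with status:", response['StatusCode'])  # 202 on success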
Lambda Function Example: Reading from and Writing to S3 using Boto3
Here’s a simple AWS Lambda function in Python that demonstrates reading data from an S3 object and writing a processed version back into another S3 location.
import boto3
import json
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'your-source-bucket'
    object_key = 'input/sample.json'

    # Read file from S3
    response = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = json.loads(response['Body'].read())

    # Modify or process data
    data['processed'] = True

    # Write back to a different location
    result_key = 'output/processed_sample.json'
    s3.put_object(
        Bucket=bucket_name,
        Key=result_key,
        Body=json.dumps(data),
        ContentType='application/json'
    )

    return {
        'statusCode': 200,
        'body': json.dumps('File processed and saved!')
    }
This showcases how Boto3 in Lambda can be used to:
Read objects from S3
Modify or process the content
Write updated data back to S3
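In practice you would usually wire this function to an S3 event notification rather than hardcoding the bucket and key. A minimal sketch of how the handler could read them from the incoming event (the structure shown is the standard S3 notification format; the output prefix is a placeholder):
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')

    # For S3-triggered Lambdas, the bucket and key arrive in the event payload
    record = event['Records'][0]
    bucket_name = record['s3']['bucket']['name']
    object_key = record['s3']['object']['key']

    response = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = json.loads(response['Body'].read())
    data['processed'] = True

    s3.put_object(
        Bucket=bucket_name,
        Key='output/' + object_key.split('/')[-1],
        Body=json.dumps(data),
        ContentType='application/json'
    )
    return {'statusCode': 200, 'body': json.dumps('File processed and saved!')}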
Starting AWS Glue Jobs
Glue is AWS’s managed ETL service. You can run scripts, transform data, and load it to destinations.
Start a Glue Job
glue = boto3.client('glue')
response = glue.start_job_run(JobName='my_glue_job')
print("Started Glue job with run ID:", response['JobRunId'])
Check Job Status
status = glue.get_job_run(JobName='my_glue_job', RunId=response['JobRunId'])
print("Status:", status['JobRun']['JobRunState'])
Querying Data with AWS Athena
Athena is a serverless query engine you can use to analyze data directly in S3 using SQL.
Run a Query
athena = boto3.client('athena')
query = "SELECT * FROM my_table LIMIT 10;"
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={
        'Database': 'my_database'
    },
    ResultConfiguration={
        'OutputLocation': 's3://my-bucket/query-results/'
    }
)
print("Query Execution ID:", response['QueryExecutionId'])
EC2 – Manage Cloud Compute Resources
Sometimes you may need to spin up EC2 instances for custom ETL scripts, Spark jobs, or file processing.
Start an EC2 Instance
ec2 = boto3.client('ec2')
response = ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
print("Started EC2 Instance")
Stop an EC2 Instance
response = ec2.stop_instances(InstanceIds=['i-1234567890abcdef0'])
print("Stopped EC2 Instance")
Best Practices for Using Boto3
Use IAM roles when running on EC2, Lambda, or Glue to avoid hardcoding credentials.
Enable retries using botocore.config.Config to handle transient errors (see the sketch after this list).
Paginate API calls for large datasets (e.g., list_objects_v2 in S3).
Use environment-based configuration for production environments.
Follow the principle of least privilege in IAM policies.
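As a minimal sketch of the retry configuration mentioned above, a botocore Config object can be passed to any client; the retry count and region are placeholders:
import boto3
from botocore.config import Config

# Retry transient failures up to 5 times using the standard retry mode
retry_config = Config(
    region_name='us-east-1',
    retries={'max_attempts': 5, 'mode': 'standard'}
)

s3 = boto3.client('s3', config=retry_config)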
Security Tips
Use AWS Secrets Manager to manage secrets and credentials (see the sketch after this list).
Rotate credentials frequently.
Monitor usage using CloudTrail.
Enable MFA (Multi-Factor Authentication) for sensitive actions.
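As a minimal sketch of pulling a credential from Secrets Manager instead of hardcoding it (the secret name and its JSON structure are placeholders):
import json
import boto3

secrets = boto3.client('secretsmanager')

# Fetch the secret at runtime; nothing sensitive lives in the codebase
secret_value = secrets.get_secret_value(SecretId='prod/my-database')
credentials = json.loads(secret_value['SecretString'])

print("Loaded secret for user:", credentials.get('username'))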
Real-World Example: Automating a Daily ETL Pipeline
Imagine you need to automate a pipeline that:
Pulls raw data files from S3
Triggers a Glue job for transformation
Invokes a Lambda function to validate data
Stores final output back in S3
Logs activity to CloudWatch
All of this can be orchestrated using Boto3 in Python with robust error handling and retries. This reduces manual oversight and ensures a production-grade workflow.
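Here is a condensed sketch of what such an orchestration script could look like. All names (buckets, job, function) are placeholders, and error handling is kept deliberately simple:
import json
import logging
import time

import boto3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('daily_etl')

RAW_BUCKET = 'my-raw-bucket'
GLUE_JOB = 'my_glue_job'
VALIDATE_FN = 'validate_output'

def run_daily_pipeline():
    s3 = boto3.client('s3')
    glue = boto3.client('glue')
    lam = boto3.client('lambda')

    # 1. Confirm the raw files landed in S3
    listing = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix='raw/')
    if listing.get('KeyCount', 0) == 0:
        raise RuntimeError('No raw files found; aborting pipeline')
    logger.info('Found %d raw files', listing['KeyCount'])

    # 2. Trigger the Glue transformation and wait for it to finish
    run_id = glue.start_job_run(JobName=GLUE_JOB)['JobRunId']
    while True:
        state = glue.get_job_run(JobName=GLUE_JOB, RunId=run_id)['JobRun']['JobRunState']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
            break
        time.sleep(60)
    if state != 'SUCCEEDED':
        raise RuntimeError(f'Glue job ended in state {state}')

    # 3. Invoke the validation Lambda on the transformed output
    result = lam.invoke(
        FunctionName=VALIDATE_FN,
        InvocationType='RequestResponse',
        Payload=json.dumps({'run_id': run_id}).encode('utf-8')
    )
    logger.info('Validation response: %s', result['Payload'].read())

if __name__ == '__main__':
    run_daily_pipeline()
When this runs inside Lambda, its log output lands in CloudWatch automatically; on a scheduled EC2 task you would ship logs via the CloudWatch agent, covering the final step of the pipeline.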
Conclusion
Boto3 is a powerful SDK that lets Python developers and data engineers take full advantage of AWS services. Whether you’re working with S3, Lambda, Glue, Athena, or EC2, Boto3 helps you automate, scale, and simplify your cloud workflows.
Once you master the basics and understand how to securely and efficiently use Boto3 in your projects, you’ll be well-equipped to handle cloud-native data engineering tasks and build scalable ETL systems with confidence.