Python with AWS SDK (Boto3): A Complete Guide for Data Engineers
When you’re working in the cloud as a data engineer, Amazon Web Services (AWS) becomes an essential part of your toolkit. From managing data in S3 to running serverless jobs with Lambda or orchestrating ETL pipelines using AWS Glue, automation is key. That’s where Boto3, AWS’s SDK (Software Development Kit) for Python, comes into play.
This guide is for data engineers who want to automate AWS services using Python in a scalable, reliable, and production-grade manner. We’ll cover what Boto3 is, how to install and configure it, and dive into practical, real-world use cases—like uploading files to S3, launching Glue jobs, querying Athena, working with Lambda, and managing EC2 resources. You’ll also find best practices, security tips, and code snippets to help you hit the ground running.
What is Boto3?
Boto3 is the official Python SDK for AWS that allows developers and data engineers to write software that makes use of Amazon services like S3, EC2, Lambda, DynamoDB, and many others.
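Boto3 exposes each service through two interfaces: a low-level client whose methods map one-to-one onto AWS API operations, and a higher-level resource interface with object-oriented wrappers. A minimal sketch, assuming credentials are already configured (setup is covered below):
import boto3

# Low-level client: methods correspond directly to AWS API calls
s3_client = boto3.client('s3')

# Higher-level resource: object-oriented access to the same service
s3_resource = boto3.resource('s3')

for bucket in s3_resource.buckets.all():
    print(bucket.name)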
Why is Boto3 Important for Data Engineering?
Automate cloud infrastructure and workflows
Trigger and monitor ETL jobs
Manage data pipelines and resources
Integrate Python-based tools with AWS services
Reduce manual effort and errors
Installing and Setting Up Boto3
Before diving into examples, you need to set up your environment.
1. Install Boto3
pip install boto3
2. Configure AWS Credentials
You can use the AWS CLI to set credentials globally:
aws configure
Enter your:
AWS Access Key ID
AWS Secret Access Key
Default region (e.g., us-east-1)
These credentials are saved in ~/.aws/credentials and are used by Boto3 automatically.
Alternatively, you can configure them inside your Python script using:
import boto3
session = boto3.Session(
    aws_access_key_id='YOUR_KEY',
    aws_secret_access_key='YOUR_SECRET',
    region_name='us-east-1'
)
✅ Tip: Never hardcode credentials in production scripts. Use IAM roles or environment variables.
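Boto3 also reads credentials from the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION), so no keys need to appear in code. A minimal sketch that checks the environment before creating a client:
import os
import boto3

# Fail fast if the credentials are not present in the environment
required = ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
missing = [name for name in required if name not in os.environ]
if missing:
    raise RuntimeError(f"Missing AWS credentials in environment: {missing}")

# No keys in code: Boto3 resolves them from the environment (or an IAM role)
s3 = boto3.client('s3')
On EC2, Lambda, or Glue, you can drop the check entirely and rely on the attached IAM role.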
Working with AWS S3 (Simple Storage Service)
S3 is a widely used service in data engineering for storing raw, intermediate, and final data.
Upload a File to S3
import boto3
s3 = boto3.client('s3')
s3.upload_file('localfile.csv', 'my-bucket', 'data/localfile.csv')
Download a File from S3
s3.download_file('my-bucket', 'data/localfile.csv', 'downloaded.csv')
List Files in a Bucket
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
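Keep in mind that list_objects_v2 returns at most 1,000 keys per call. For larger buckets, a paginator handles the continuation tokens for you; the bucket and prefix below are placeholders:
paginator = s3.get_paginator('list_objects_v2')

# Iterate over every page of results; Boto3 follows ContinuationToken internally
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])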
Running AWS Lambda Functions with Boto3
Lambda lets you run code without provisioning or managing servers—great for lightweight ETL steps.
Invoke a Lambda Function
lambda_client = boto3.client('lambda')
response = lambda_client.invoke(
    FunctionName='my_lambda_function',
    InvocationType='RequestResponse',
    Payload=b'{"key": "value"}'
)
print(response['Payload'].read())
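If you don’t need the result immediately (for example, a fire-and-forget ETL step), you can invoke the function asynchronously instead. A minimal sketch using the same placeholder function name:
import json

# 'Event' invocation returns right away; the function runs asynchronously
# and its return value is discarded.
response = lambda_client.invoke(
    FunctionName='my_lambda_function',
    InvocationType='Event',
    Payload=json.dumps({'key': 'value'}).encode('utf-8')
)
print("Accepted with status:", response['StatusCode'])  # 202 on success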
Lambda Function Example: Reading from and Writing to S3 using Boto3
Here’s a simple AWS Lambda function in Python that demonstrates reading data from an S3 object and writing a processed version back into another S3 location.
import boto3
import json
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'your-source-bucket'
    object_key = 'input/sample.json'

    # Read file from S3
    response = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = json.loads(response['Body'].read())

    # Modify or process data
    data['processed'] = True

    # Write back to a different location
    result_key = 'output/processed_sample.json'
    s3.put_object(
        Bucket=bucket_name,
        Key=result_key,
        Body=json.dumps(data),
        ContentType='application/json'
    )

    return {
        'statusCode': 200,
        'body': json.dumps('File processed and saved!')
    }
This showcases how Boto3 in Lambda can be used to:
Read objects from S3
Modify or process the content
Write updated data back to S3
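In practice you would usually wire this function to an S3 event notification rather than hardcoding the bucket and key. A minimal sketch of how the handler could read them from the incoming event (the structure shown is the standard S3 notification format; the output prefix is a placeholder):
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')

    # For S3-triggered Lambdas, the bucket and key arrive in the event payload
    record = event['Records'][0]
    bucket_name = record['s3']['bucket']['name']
    object_key = record['s3']['object']['key']

    response = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = json.loads(response['Body'].read())
    data['processed'] = True

    s3.put_object(
        Bucket=bucket_name,
        Key='output/' + object_key.split('/')[-1],
        Body=json.dumps(data),
        ContentType='application/json'
    )
    return {'statusCode': 200, 'body': json.dumps('File processed and saved!')}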
Starting AWS Glue Jobs
Glue is AWS’s managed ETL service. You can run scripts, transform data, and load it to destinations.
Start a Glue Job
glue = boto3.client('glue')
response = glue.start_job_run(JobName='my_glue_job')
print("Started Glue job with run ID:", response['JobRunId'])
Check Job Status
status = glue.get_job_run(JobName='my_glue_job', RunId=response['JobRunId'])
print("Status:", status['JobRun']['JobRunState'])
Querying Data with AWS Athena
Athena is a serverless query engine you can use to analyze data directly in S3 using SQL.
Run a Query
athena = boto3.client('athena')
query = "SELECT * FROM my_table LIMIT 10;"
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={
        'Database': 'my_database'
    },
    ResultConfiguration={
        'OutputLocation': 's3://my-bucket/query-results/'
    }
)
print("Query Execution ID:", response['QueryExecutionId'])
EC2 – Manage Cloud Compute Resources
Sometimes you may need to spin up EC2 instances for custom ETL scripts, Spark jobs, or file processing.
Start an EC2 Instance
ec2 = boto3.client('ec2')
response = ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
print("Started EC2 Instance")
Stop an EC2 Instance
response = ec2.stop_instances(InstanceIds=['i-1234567890abcdef0'])
print("Stopped EC2 Instance")
Best Practices for Using Boto3
Use IAM roles when running on EC2, Lambda, or Glue to avoid hardcoding credentials.
Enable retries using botocore.config.Config to handle transient errors (see the sketch after this list).
Paginate API calls for large datasets (e.g., list_objects_v2 in S3).
Use environment-based configuration for production environments.
Follow the principle of least privilege in IAM policies.
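As a minimal sketch of the retry configuration mentioned above, a botocore Config object can be passed to any client; the retry count and region are placeholders:
import boto3
from botocore.config import Config

# Retry transient failures up to 5 times using the standard retry mode
retry_config = Config(
    region_name='us-east-1',
    retries={'max_attempts': 5, 'mode': 'standard'}
)

s3 = boto3.client('s3', config=retry_config)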
Security Tips
Use AWS Secrets Manager to manage secrets and credentials (see the sketch after this list).
Rotate credentials frequently.
Monitor usage using CloudTrail.
Enable MFA (Multi-Factor Authentication) for sensitive actions.
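As a minimal sketch of pulling a credential from Secrets Manager instead of hardcoding it (the secret name and its JSON structure are placeholders):
import json
import boto3

secrets = boto3.client('secretsmanager')

# Fetch the secret at runtime; nothing sensitive lives in the codebase
secret_value = secrets.get_secret_value(SecretId='prod/my-database')
credentials = json.loads(secret_value['SecretString'])

print("Loaded secret for user:", credentials.get('username'))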
Real-World Example: Automating a Daily ETL Pipeline
Imagine you need to automate a pipeline that:
Pulls raw data files from S3
Triggers a Glue job for transformation
Invokes a Lambda function to validate data
Stores final output back in S3
Logs activity to CloudWatch
All of this can be orchestrated using Boto3 in Python with robust error handling and retries. This reduces manual oversight and ensures a production-grade workflow.
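Here is a condensed sketch of what such an orchestration script could look like. All names (buckets, job, function) are placeholders, and error handling is kept deliberately simple:
import json
import logging
import time

import boto3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('daily_etl')

RAW_BUCKET = 'my-raw-bucket'
GLUE_JOB = 'my_glue_job'
VALIDATE_FN = 'validate_output'

def run_daily_pipeline():
    s3 = boto3.client('s3')
    glue = boto3.client('glue')
    lam = boto3.client('lambda')

    # 1. Confirm the raw files landed in S3
    listing = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix='raw/')
    if listing.get('KeyCount', 0) == 0:
        raise RuntimeError('No raw files found; aborting pipeline')
    logger.info('Found %d raw files', listing['KeyCount'])

    # 2. Trigger the Glue transformation and wait for it to finish
    run_id = glue.start_job_run(JobName=GLUE_JOB)['JobRunId']
    while True:
        state = glue.get_job_run(JobName=GLUE_JOB, RunId=run_id)['JobRun']['JobRunState']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
            break
        time.sleep(60)
    if state != 'SUCCEEDED':
        raise RuntimeError(f'Glue job ended in state {state}')

    # 3. Invoke the validation Lambda on the transformed output
    result = lam.invoke(
        FunctionName=VALIDATE_FN,
        InvocationType='RequestResponse',
        Payload=json.dumps({'run_id': run_id}).encode('utf-8')
    )
    logger.info('Validation response: %s', result['Payload'].read())

if __name__ == '__main__':
    run_daily_pipeline()
When this runs inside Lambda, its log output lands in CloudWatch automatically; on a scheduled EC2 task you would ship logs via the CloudWatch agent, covering the final step of the pipeline.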
Conclusion
Boto3 is a powerful SDK that lets Python developers and data engineers take full advantage of AWS services. Whether you’re working with S3, Lambda, Glue, Athena, or EC2, Boto3 helps you automate, scale, and simplify your cloud workflows.
Once you master the basics and understand how to securely and efficiently use Boto3 in your projects, you’ll be well-equipped to handle cloud-native data engineering tasks and build scalable ETL systems with confidence.