Python with AWS SDK (Boto3): A Complete Guide for Data Engineers

When you’re working in the cloud as a data engineer, Amazon Web Services (AWS) becomes an essential part of your toolkit. From managing data in S3 to running serverless jobs with Lambda or orchestrating ETL pipelines using AWS Glue, automation is key. That’s where Boto3, AWS’s SDK (Software Development Kit) for Python, comes into play.

This guide is for data engineers who want to automate AWS services using Python in a scalable, reliable, and production-grade manner. We’ll cover what Boto3 is, how to install and configure it, and dive into practical, real-world use cases—like uploading files to S3, launching Glue jobs, querying Athena, working with Lambda, and managing EC2 resources. You’ll also find best practices, security tips, and code snippets to help you hit the ground running.


What is Boto3?

Boto3 is the official Python SDK for AWS. It lets developers and data engineers write applications that interact with AWS services such as S3, EC2, Lambda, DynamoDB, and many others.

Why is Boto3 Important for Data Engineering?

With Boto3, you can:

  • Automate cloud infrastructure and workflows

  • Trigger and monitor ETL jobs

  • Manage data pipelines and resources

  • Integrate Python-based tools with AWS services

  • Reduce manual effort and errors


Installing and Setting Up Boto3

Before diving into examples, you need to set up your environment.

1. Install Boto3
pip install boto3
2. Configure AWS Credentials

You can use the AWS CLI to set credentials globally:

aws configure

Enter your:

  • AWS Access Key ID

  • AWS Secret Access Key

  • Default region (e.g., us-east-1)

These credentials are saved in ~/.aws/credentials (the default region is stored in ~/.aws/config) and are picked up by Boto3 automatically.

Alternatively, you can configure them inside your Python script using:

import boto3

session = boto3.Session(
    aws_access_key_id='YOUR_KEY',
    aws_secret_access_key='YOUR_SECRET',
    region_name='us-east-1'
)

Tip: Never hardcode credentials in production scripts. Use IAM roles or environment variables.
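For example, when the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) are set, or when the script runs under an IAM role, you can create clients without passing any keys at all. A minimal sketch:

import boto3

# No keys in the code: Boto3 falls back to environment variables,
# ~/.aws/credentials, or an attached IAM role automatically.
s3 = boto3.client('s3')

for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])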


Working with AWS S3 (Simple Storage Service)

S3 is a widely used service in data engineering for storing raw, intermediate, and final data.

Upload a File to S3
import boto3

s3 = boto3.client('s3')
s3.upload_file('localfile.csv', 'my-bucket', 'data/localfile.csv')
Download a File from S3
s3.download_file('my-bucket', 'data/localfile.csv', 'downloaded.csv')
List Files in a Bucket
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
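
Note that list_objects_v2 returns at most 1,000 keys per call. For larger buckets, a paginator handles the continuation tokens for you; here is a minimal sketch using the same example bucket:

paginator = s3.get_paginator('list_objects_v2')

# Iterate over every page of results instead of just the first 1,000 keys
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])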

Running AWS Lambda Functions with Boto3

Lambda lets you run code without provisioning or managing servers—great for lightweight ETL steps.

Invoke a Lambda Function
lambda_client = boto3.client('lambda')

response = lambda_client.invoke(
    FunctionName='my_lambda_function',
    InvocationType='RequestResponse',
    Payload=b'{"key": "value"}'
)

print(response['Payload'].read())
Lambda Function Example: Reading from and Writing to S3 using Boto3

Here’s a simple AWS Lambda function in Python that demonstrates reading data from an S3 object and writing a processed version back into another S3 location.

import boto3
import json

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket_name = 'your-source-bucket'
    object_key = 'input/sample.json'

    # Read file from S3
    response = s3.get_object(Bucket=bucket_name, Key=object_key)
    data = json.loads(response['Body'].read())

    # Modify or process data
    data['processed'] = True

    # Write back to a different location
    result_key = 'output/processed_sample.json'
    s3.put_object(
        Bucket=bucket_name,
        Key=result_key,
        Body=json.dumps(data),
        ContentType='application/json'
    )

    return {
        'statusCode': 200,
        'body': json.dumps('File processed and saved!')
    }

This showcases how Boto3 in Lambda can be used to:

  • Read objects from S3

  • Modify or process the content

  • Write updated data back to S3


Starting AWS Glue Jobs

Glue is AWS’s serverless, managed ETL service. You can use it to run transformation scripts and load the results into target destinations.

Start a Glue Job
glue = boto3.client('glue')

response = glue.start_job_run(JobName='my_glue_job')
print("Started Glue job with run ID:", response['JobRunId'])
Check Job Status
status = glue.get_job_run(JobName='my_glue_job', RunId=response['JobRunId'])
print("Status:", status['JobRun']['JobRunState'])

Querying Data with AWS Athena

Athena is a serverless query engine you can use to analyze data directly in S3 using SQL.

Run a Query
athena = boto3.client('athena')

query = "SELECT * FROM my_table LIMIT 10;"
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={
        'Database': 'my_database'
    },
    ResultConfiguration={
        'OutputLocation': 's3://my-bucket/query-results/'
    }
)
print("Query Execution ID:", response['QueryExecutionId'])

EC2 – Manage Cloud Compute Resources

Sometimes you may need to spin up EC2 instances for custom ETL scripts, Spark jobs, or file processing.

Start an EC2 Instance
ec2 = boto3.client('ec2')

response = ec2.start_instances(InstanceIds=['i-1234567890abcdef0'])
print("Started EC2 Instance")
Stop an EC2 Instance
response = ec2.stop_instances(InstanceIds=['i-1234567890abcdef0'])
print("Stopped EC2 Instance")

Best Practices for Using Boto3
  1. Use IAM roles when running on EC2, Lambda, or Glue to avoid hardcoding credentials.

  2. Enable retries using botocore.config.Config to handle transient errors (see the sketch after this list).

  3. Paginate API calls for large datasets (e.g., list_objects_v2 in S3, as shown in the S3 section above).

  4. Use environment-based configuration for production environments.

  5. Follow the principle of least privilege in IAM policies.
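
A minimal sketch of the retry configuration mentioned in item 2 (the retry counts shown are illustrative):

import boto3
from botocore.config import Config

# Retry transient failures (throttling, timeouts) up to 5 times
retry_config = Config(
    retries={'max_attempts': 5, 'mode': 'standard'}
)

s3 = boto3.client('s3', config=retry_config)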


Security Tips
  • Use AWS Secrets Manager to manage secrets and credentials (see the sketch after this list).

  • Rotate credentials frequently.

  • Monitor usage using CloudTrail.

  • Enable MFA (Multi-Factor Authentication) for sensitive actions.
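
As an example of the first tip, here is a minimal sketch of fetching database credentials from Secrets Manager instead of hardcoding them (the secret name and key are illustrative):

import json
import boto3

secrets = boto3.client('secretsmanager')

# Retrieve and parse a JSON-formatted secret
response = secrets.get_secret_value(SecretId='prod/etl/db-credentials')
secret = json.loads(response['SecretString'])
db_password = secret['password']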


Real-World Example: Automating a Daily ETL Pipeline

Imagine you need to automate a pipeline that:

  • Pulls raw data files from S3

  • Triggers a Glue job for transformation

  • Invokes a Lambda function to validate data

  • Stores final output back in S3

  • Logs activity to CloudWatch

All of this can be orchestrated with Boto3 in Python, with robust error handling and retries. This reduces manual intervention and gives you a production-grade workflow.
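
A minimal sketch of such an orchestration script is shown below. The bucket, Glue job, and Lambda function names are illustrative, and a production version would add structured logging (e.g., to CloudWatch), retries, and alerting:

import json
import time
from datetime import date

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')
lambda_client = boto3.client('lambda')

today = date.today().isoformat()

# 1. Check that today's raw files have landed in S3
raw = s3.list_objects_v2(Bucket='my-data-lake', Prefix=f'raw/{today}/')
if raw.get('KeyCount', 0) == 0:
    raise RuntimeError(f"No raw files found for {today}")

# 2. Trigger the Glue transformation job and wait for it to finish
run_id = glue.start_job_run(JobName='daily_transform')['JobRunId']
while True:
    state = glue.get_job_run(JobName='daily_transform', RunId=run_id)['JobRun']['JobRunState']
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    time.sleep(60)
if state != 'SUCCEEDED':
    raise RuntimeError(f"Glue job ended in state {state}")

# 3. Invoke a Lambda function to validate the transformed output
result = lambda_client.invoke(
    FunctionName='validate_daily_output',
    InvocationType='RequestResponse',
    Payload=json.dumps({'date': today}).encode()
)
print("Validation result:", result['Payload'].read().decode())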


Conclusion

Boto3 is a powerful SDK that lets Python developers and data engineers take full advantage of AWS services. Whether you’re working with S3, Lambda, Glue, Athena, or EC2, Boto3 helps you automate, scale, and simplify your cloud workflows.

Once you master the basics and understand how to securely and efficiently use Boto3 in your projects, you’ll be well-equipped to handle cloud-native data engineering tasks and build scalable ETL systems with confidence.