Working with APIs in Python: A Data Engineer's Complete Guide
APIs are at the heart of modern data engineering. Whether you’re pulling real-time stock data, querying a weather service, syncing with a CRM like Salesforce, or accessing cloud storage logs, APIs make it all possible. As a data engineer, learning how to work with APIs in Python is essential for automating workflows and building scalable data pipelines.
In this guide, we’ll take a deep dive into:
- What APIs are and how they work
- Understanding RESTful APIs
- Using Python’s requests library
- Making GET and POST requests
- Working with API authentication (API keys, tokens, etc.)
- Handling JSON responses
- Paginated APIs and rate limits
- Real-world use cases (e.g., public APIs, cloud services, internal APIs)
- Error handling and retries
- Best practices for data engineers
Let’s begin.
What is an API?
An API (Application Programming Interface) is a set of rules that lets one software application talk to another. APIs are commonly used to expose data or services to other systems.
Most web APIs use HTTP and follow the REST (Representational State Transfer) architectural style. This means you interact with them by sending an HTTP method to an endpoint, like:
GET https://api.example.com/users
POST https://api.example.com/login
Common HTTP Methods
- GET – retrieve data
- POST – send data
- PUT – update data
- DELETE – remove data
Python Requests: The Essential Tool
The requests library is the most commonly used Python package for interacting with APIs.
Install it:
pip install requests
Import it in your script:
import requests
Making Your First API Call
Here’s an example using the JSONPlaceholder test API:
import requests
url = 'https://jsonplaceholder.typicode.com/posts/1'
response = requests.get(url)
print(response.status_code) # 200 OK
print(response.json())
JSON Responses
Most APIs return data in JSON (JavaScript Object Notation) format. Python can handle JSON natively.
Example:
data = response.json()
print(data['title'])
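Indexing with data['title'] raises a KeyError if the field is missing, so it can help to check the shape of the payload first. The following is a minimal sketch of that idea, assuming a payload shaped like the JSONPlaceholder post above; the helper name extract_title and the sample values are hypothetical.

```python
import json


def extract_title(payload):
    """Return the 'title' field of a decoded JSON object, or None if absent."""
    # Decoded JSON is made of plain Python objects (dict, list, str, ...),
    # so ordinary type checks apply before indexing.
    if not isinstance(payload, dict):
        return None
    return payload.get("title")


# Stand-in for response.text; the field values are made up for illustration.
raw = '{"userId": 1, "id": 1, "title": "example title"}'
data = json.loads(raw)  # roughly the decoding step response.json() performs

print(extract_title(data))
print(extract_title([]))  # a list has no 'title', so the helper returns None
```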
POST Requests – Sending Data
You’ll use POST to send data, such as to register a user or upload info.
payload = {'name': 'Alice', 'email': 'alice@example.com'}
response = requests.post('https://api.example.com/users', json=payload)
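To see what the json= argument actually puts on the wire, one option is to build the request without sending it. This sketch reuses the placeholder URL and payload from the example above and only inspects the prepared request; nothing is sent over the network.

```python
import json

import requests

payload = {"name": "Alice", "email": "alice@example.com"}

# Build the request without sending it, so the wire format can be inspected.
request = requests.Request("POST", "https://api.example.com/users", json=payload)
prepared = request.prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/json

# The body is the JSON-encoded payload; decoding it gives back the original dict.
decoded = json.loads(prepared.body)
print(decoded)
```

Passing data=payload instead of json=payload would form-encode the same dictionary, which some older APIs expect.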
API Authentication
APIs often require some form of authentication:
1. API Key
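# The header name and scheme vary by provider (some expect a custom header
# such as X-API-Key), so check the API's documentation.
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
requests.get(url, headers=headers)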
2. Token-based Auth (OAuth2, JWT)
Tokens are usually passed in the header or query string.
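The two placements mentioned above can be sketched as follows. The endpoint, the token value, and the access_token parameter name are all placeholders; each API documents its own. The requests are prepared but not sent.

```python
import requests

TOKEN = "YOUR_ACCESS_TOKEN"           # placeholder, not a real credential
URL = "https://api.example.com/data"  # hypothetical endpoint

# Option 1: bearer token in the Authorization header (common for OAuth2 and JWT).
in_header = requests.Request(
    "GET", URL, headers={"Authorization": f"Bearer {TOKEN}"}
).prepare()

# Option 2: token as a query parameter. The parameter name is an assumption.
in_query = requests.Request("GET", URL, params={"access_token": TOKEN}).prepare()

print(in_header.headers["Authorization"])
print(in_query.url)
```

Query-string tokens tend to end up in server logs and browser history, so the header form is usually the safer default when the API supports both.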
Handling Pagination
APIs may limit the number of records per call. You’ll need to paginate through results.
Example (GitHub API):
url = 'https://api.github.com/users/octocat/repos?page=1&per_page=100'
while url:
    res = requests.get(url)
    data = res.json()
    process(data)  # Your logic here
    url = res.links.get('next', {}).get('url')  # GitHub-style pagination
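The same loop can be wrapped in a generator so that callers simply iterate over pages. This is a sketch, assuming the API advertises the next page through an HTTP Link header as GitHub does; the function name iter_pages and its parameters are hypothetical.

```python
import requests


def iter_pages(url, session=None, timeout=10):
    """Yield the decoded JSON body of each page, following rel="next" links."""
    if session is None:
        session = requests.Session()
    while url:
        res = session.get(url, timeout=timeout)
        res.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
        yield res.json()
        # requests parses the Link header into res.links; no "next" ends the loop.
        url = res.links.get("next", {}).get("url")


# Usage (makes real network calls, so it is left commented out):
# for page in iter_pages("https://api.github.com/users/octocat/repos?per_page=100"):
#     process(page)
```

Accepting an optional session keeps the function easy to test with a stand-in object and lets callers reuse authentication headers.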
Rate Limiting
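Some APIs limit how many calls you can make in a given period and answer with HTTP 429 once you exceed it. A simple way to respect the limit is to pause before making the next call:
import time
import requests
for i in range(10):
    response = requests.get('https://api.example.com/data')
    if response.status_code == 429:  # Too Many Requests
        time.sleep(60)  # wait before the next call
    else:
        process(response.json())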
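A fixed 60-second pause works, but many APIs also say how long to wait through a Retry-After header. The sketch below retries the same request and uses that header when it holds a number of seconds. It assumes the header may be missing or hold an HTTP date, in which case it falls back to a default; the function name get_with_rate_limit and its parameters are hypothetical.

```python
import time

import requests


def get_with_rate_limit(url, session=None, max_attempts=5, default_wait=60,
                        sleep=time.sleep):
    """GET a URL, waiting and retrying the same request while the server answers 429."""
    if session is None:
        session = requests.Session()
    for attempt in range(1, max_attempts + 1):
        res = session.get(url, timeout=10)
        if res.status_code != 429:
            return res
        if attempt < max_attempts:
            # Use Retry-After only when it is a plain number of seconds.
            header = res.headers.get("Retry-After", "")
            wait = int(header) if header.isdigit() else default_wait
            sleep(wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {url}")
```

The sleep parameter exists only so the wait can be replaced in tests; in a pipeline the default time.sleep is what you want.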
Error Handling
Always include error handling in production code:
try:
    res = requests.get(url)
    res.raise_for_status()
    data = res.json()
except requests.exceptions.RequestException as e:
    print("API error:", e)
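Catching RequestException covers everything requests can raise, but a pipeline often wants to react differently to a timeout than to a 404. Here is one way that might look, with the more specific exceptions listed before the general one; the function name fetch_json and the choice to return None on failure are assumptions made for illustration.

```python
import requests


def fetch_json(url, session=None, timeout=10):
    """Return decoded JSON from url, or None if the request fails."""
    if session is None:
        session = requests.Session()
    try:
        res = session.get(url, timeout=timeout)
        res.raise_for_status()
        return res.json()
    except requests.exceptions.Timeout:
        # No answer within the timeout: often worth retrying later.
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.HTTPError as e:
        # raise_for_status() raises this for 4xx and 5xx responses.
        print(f"HTTP error for {url}: {e}")
    except requests.exceptions.RequestException as e:
        # Catch-all for connection failures, invalid URLs, and similar.
        print(f"Request failed for {url}: {e}")
    return None
```

Without an explicit timeout, requests.get can wait indefinitely on a stalled server, which is why the sketch always passes one.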
Working with Public APIs
There are many free APIs available:
- OpenWeatherMap (weather data)
- CoinGecko (cryptocurrency prices)
- COVID19 API
- NewsAPI
Example (using CoinDesk's public Bitcoin price endpoint):
import requests
url = 'https://api.coindesk.com/v1/bpi/currentprice.json'
res = requests.get(url)
data = res.json()
print("Bitcoin Price:", data['bpi']['USD']['rate'])
Real-World API Use Cases for Data Engineers
- Data Ingestion Pipelines – Pull stock prices, weather, e-commerce transactions.
- Data Enrichment – Append IP geolocation, user metadata.
- Monitoring – Get logs or metrics from AWS CloudWatch, Datadog, etc.
- Cloud Operations – Manage AWS EC2 instances or S3 buckets via the API.
Advanced: Using Session and Retry
To avoid repeating headers, use requests.Session():
session = requests.Session()
session.headers.update({"Authorization": "Bearer MY_API_KEY"})
res = session.get("https://api.example.com/data")
Add retry logic:
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # import urllib3 directly rather than via the legacy requests.packages alias
retry = Retry(total=5, backoff_factor=0.3)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
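With only total and backoff_factor set, the retry setup above covers connection-level failures. To also retry on specific HTTP status codes, Retry accepts a status_forcelist. The sketch below wraps the setup in a helper; the function name make_retrying_session and the chosen status codes are assumptions, not a recommendation for every API.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=5, backoff_factor=0.3):
    """Build a Session that retries failed connections and selected status codes."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        # Also retry when the server answers with one of these codes.
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_retrying_session()
# Confirm which retry policy applies to a given URL (no request is sent).
policy = session.get_adapter("https://api.example.com/data").max_retries
print(policy.total, policy.status_forcelist)
```

By default urllib3 applies status-based retries only to idempotent methods such as GET, so POST requests are not retried this way unless you configure that explicitly.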
Saving API Data to File or Database
import json
res = requests.get("https://api.example.com/data")
data = res.json()
with open("data.json", "w") as f:
    json.dump(data, f)
Or load into a Pandas DataFrame:
import pandas as pd
df = pd.json_normalize(data)
df.to_csv("api_data.csv", index=False)
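For the database side, here is a minimal sketch using the standard library's sqlite3 module. It assumes the API returned a list of flat records with id and title fields; the table name, the column names, and the sample records are made up for illustration.

```python
import json
import sqlite3

# Stand-in for res.json(); the records are made up for illustration.
records = [
    {"id": 1, "title": "first post", "tags": ["a", "b"]},
    {"id": 2, "title": "second post", "tags": []},
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT, raw TEXT)"
)

# Parameterized inserts; the full record is kept as JSON text for later reprocessing.
# INSERT OR REPLACE makes the load safe to re-run for the same ids.
conn.executemany(
    "INSERT OR REPLACE INTO posts (id, title, raw) VALUES (?, ?, ?)",
    [(r["id"], r["title"], json.dumps(r)) for r in records],
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(row_count)
```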
Best Practices
- Keep API keys secure using environment variables or AWS Secrets Manager
- Respect rate limits
- Log your requests and responses
- Use retry logic for unstable networks
- Validate JSON responses before accessing fields
- Modularize API logic using functions or classes
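Two of these practices, keeping keys out of the code and logging what happens, can be sketched together. The environment variable name MY_API_KEY, the logger name, and the helper build_headers are hypothetical, and the Bearer scheme is an assumption that depends on the API.

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("api_client")


def build_headers(env_var="MY_API_KEY"):
    """Read the API key from the environment and build request headers."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # Log that a key was found, never the key itself.
    logger.info("Loaded API key from %s", env_var)
    return {"Authorization": f"Bearer {api_key}"}
```

The returned dictionary can be passed to session.headers.update() so that every request on the session carries the key without it ever appearing in source control.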
Conclusion
APIs open up endless possibilities in data engineering. From collecting data in real time to automating backend operations, knowing how to work with APIs in Python gives you a serious edge.
You now know how to use the requests library, handle JSON, authenticate securely, manage pagination and errors, and build real-world API workflows.