If you’re preparing for a Data Engineering career or looking to level up your Python skills with real-world cloud and ETL examples, don’t miss out on the full guide!


3. How can you read a CSV file in Python?

We can read a CSV file in Python using the pandas library. It’s a quick and easy way to load your data into a DataFrame for analysis.

import pandas as pd
df = pd.read_csv("data.csv")
7. How do you handle missing data using Pandas?

We can handle missing data in Pandas using methods like isnull(), dropna(), and fillna(). These return new DataFrames by default, which keeps the data clean and ready for analysis.

df.dropna()
df.fillna(0)
8. What’s the use of the with statement in file handling?

The with statement opens a file and automatically closes it when the block exits, even if an error occurs. It manages the resource for you, so there's no need to call file.close() manually.
Example:

with open("data.txt") as file:
    content = file.read()
9. How do you convert a Python list into a string?

We can use str.join(), which concatenates the list elements with a separator:

"-".join(["2023", "05", "13"])  # '2023-05-13'
10. How to import libraries in Python?
import pandas as pd
import json
11. What is a function in Python?

A function in Python is a reusable block of code that performs a specific task. It helps make your code cleaner and more organized. You define a function using the def keyword:

def clean_data(df):
    return df.dropna()
12. How can you create a simple ETL job?
def etl():
    df = pd.read_csv("raw.csv")            # Extract
    df = df.drop_duplicates()              # Transform
    df.to_csv("cleaned.csv", index=False)  # Load
13. What are Python data types commonly used in data engineering?
  • Built-ins: str, int, float, list, dict, plus the pandas DataFrame for tabular data
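A quick sketch of the built-in types (DataFrame comes from the pandas library rather than the language itself):

```python
name = "orders"                       # str
count = 42                            # int
ratio = 0.75                          # float
files = ["a.csv", "b.csv"]            # list
row = {"id": 1, "status": "active"}   # dict
```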
14. What is type casting in Python?

Type casting in Python means converting one data type into another, like turning a string into an integer. It’s useful when handling user input or reading data from files.

age = int("25")  # Converts string "25" to integer 25
15. How do you filter rows in a DataFrame?
df[df["status"] == "active"]
16. What is indentation in Python and why is it important?

Answer: Indentation in Python refers to the spaces at the beginning of a line of code. It defines the blocks of code, like inside loops, functions, or conditionals.

It’s super important because Python uses indentation (not braces {}) to understand code structure. Incorrect indentation can lead to errors.
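A small sketch makes this concrete: the indented lines form the body of the if block.

```python
def label(status):
    if status == "active":
        # Indented: this line runs only when the condition is True
        return "processing"
    # Back at the outer level: runs when the condition is False
    return "skipped"

print(label("active"))   # processing
print(label("paused"))   # skipped
```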

17. What’s the difference between append() and extend()?

The difference between append() and extend() in Python is:

  • append() adds a single item (here, a whole list) to the end:

nums = [1, 2]
nums.append([3, 4])  # Result: [1, 2, [3, 4]]

  • extend() adds each element from another list to the end:

nums = [1, 2]
nums.extend([3, 4])  # Result: [1, 2, 3, 4]

So, append() adds the whole object, while extend() adds each item individually.

18. What is a module in Python?

Answer: A module in Python is a file that contains Python code—like functions, classes, or variables—that you can reuse in other programs.

You can import a module using the import keyword:

import math
print(math.sqrt(16))  # Output: 4.0

Modules help organize code and avoid repetition.

19. What is exception handling?

Exception handling in Python is the process of managing errors that occur during program execution. It allows us to handle unexpected situations without crashing the program. We use try, except, and optionally finally blocks:

try:
    num = int(input("Enter a number: "))
except ValueError:
    print("That's not a valid number!")
finally:
    print("Execution completed.")

This way, we can catch specific errors and take appropriate actions.

20. How do you write a loop to read multiple files?
for file in ["a.csv", "b.csv"]:
    df = pd.read_csv(file)

Intermediate Python Questions for Data Engineers

Targeting developers with working Python experience. These interview questions assess production-readiness and familiarity with Python data tools.

1. What is Pandas and why is it used in data engineering?

Answer: Pandas is a versatile library for cleaning, transforming, and analyzing structured data, and it's used heavily in ETL pipelines.
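As a minimal sketch (the data here is made up), a typical clean-up step looks like:

```python
import pandas as pd

# Hypothetical raw data with a duplicated row and a missing value
df = pd.DataFrame({"id": [1, 2, 2, 3], "amount": [10.0, None, None, 5.0]})

cleaned = df.drop_duplicates().dropna()  # drop the repeated row, then rows with NaN
```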

2. How do you handle large files with Pandas?
for chunk in pd.read_csv("big.csv", chunksize=10000):
    process(chunk)  # handle each 10,000-row chunk; process() is your own function
3. What's the difference between a DataFrame and a Series?
  • Series: 1D labeled array
  • DataFrame: 2D table
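A quick sketch of both (the values are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], name="amount")                    # 1D labeled array
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})  # 2D table

col = df["amount"]  # selecting one column of a DataFrame yields a Series
```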
4. What is a lambda and how is it used?

A lambda is a small anonymous function written as a single expression, handy for quick transformations:

clean = lambda x: x.strip().lower()
clean("  Hello ")  # 'hello'
5. How do you write a reusable cleaning function?
def clean_df(df):
    df.columns = df.columns.str.lower()
    return df.dropna()
6. What is a virtual environment?

A virtual environment isolates a project's Python packages from the system installation:

python -m venv env
source env/bin/activate
7. How do you add logging?
import logging
logging.basicConfig(level=logging.INFO)
logging.info("Starting ETL")
8. What's the purpose of *args and **kwargs?

Answer: They let a function accept a variable number of arguments: *args collects extra positional arguments into a tuple, and **kwargs collects extra keyword arguments into a dict.
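A minimal sketch (the function name is just for illustration): inside the function, args is a tuple and kwargs is a dict.

```python
def describe(*args, **kwargs):
    # args collects positional arguments; kwargs collects keyword arguments
    return len(args), sorted(kwargs)

counts = describe(1, 2, 3, source="csv", dedupe=True)
# counts == (3, ['dedupe', 'source'])
```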

9. What are os and pathlib used for?
import os
from pathlib import Path
os.listdir("./data")
list(Path("./data").glob("*.csv"))
10. How do you handle datetime columns?
df['created_at'] = pd.to_datetime(df['created_at'])
11. How do you parse JSON?
import json
data = json.loads('{"name": "AI"}')
12. How do you interact with APIs?
import requests
r = requests.get("https://api.test.com")
13. How do you write a batch processing job?
for file in os.listdir("/data"):
    process_file(file)
14. How do you connect to a database?
import sqlite3
conn = sqlite3.connect('db.sqlite3')
15. How do you handle exceptions in an ETL job?
import logging
try:
    read()
except Exception as e:
    logging.error(str(e))
16. What's the purpose of enumerate()?

Answer: It pairs each item of an iterable with its index, avoiding a manually managed counter.
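For example:

```python
rows = ["header", "alice", "bob"]

indexed = list(enumerate(rows))            # [(0, 'header'), (1, 'alice'), (2, 'bob')]
from_one = list(enumerate(rows, start=1))  # [(1, 'header'), (2, 'alice'), (3, 'bob')]
```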

17. Give an example of a generator.
def stream():
    yield from range(5)
18. Give an example using map().
list(map(str.upper, ["a", "b"]))  # ['A', 'B']
19. What is pickling?

Pickling serializes a Python object to bytes so it can be saved and loaded later:

import pickle
with open("obj.pkl", "wb") as f:
    pickle.dump(obj, f)
20. How do you read environment variables in Python?
import os
os.getenv("DB_PASS")

Advanced Python Interview Questions for Data Engineers

Explore topics like optimization, architecture, and cloud integration.

1. What is Python’s GIL?

Answer: The Global Interpreter Lock lets only one thread execute Python bytecode at a time, so threads don't speed up CPU-bound code. Prefer multiprocessing for CPU-bound tasks.

2. How do you make batch jobs scalable?
  • Modular code
  • Parallel processing
  • Retry & logging
3. When do you use threading vs multiprocessing?

Answer:

  • Threading = I/O-bound work (threads can wait on network or disk concurrently)
  • Multiprocessing = CPU-bound work (each process gets its own interpreter and GIL)
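An I/O-bound sketch using the standard library's ThreadPoolExecutor; fetch here is a stand-in for a real network call.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    time.sleep(0.1)  # simulates waiting on the network
    return f"done: {url}"

urls = ["a", "b", "c"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))  # threads overlap their waits
```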
4. What are the basics of setting up AWS Lambda?
  • Use layers
  • Keep packages light
  • Use boto3
5. How do you implement retry logic?
from time import sleep
for attempt in range(3):
    try:
        run()
        break
    except Exception:
        sleep(2)  # wait before retrying
6. How do you monitor a Python ETL job?
  • Use CloudWatch
  • Add logs & alerts
7. What are Python decorators?

A decorator wraps a function to add behavior, such as logging, without changing its code:
def logger(func):
    def wrapper(*args, **kwargs):
        print("Calling function")
        return func(*args, **kwargs)
    return wrapper
8. When do you use PySpark vs Pandas?

Answer: Use PySpark for datasets too large for one machine's memory; use Pandas for in-memory work on a single node.

9. How do you handle 100GB+ files?
  • Dask, PySpark, or chunked reads
10. What is async programming?
import asyncio
async def main():
    await asyncio.sleep(1)
asyncio.run(main())
11. How do you connect to S3?
import boto3
s3 = boto3.client("s3")
s3.download_file("bucket", "key", "file.csv")
12. How do you handle schema drift?
  • Use Avro
  • Add validation step
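A validation step can be sketched in plain Python: compare each incoming record's fields against the expected schema (field names here are hypothetical).

```python
EXPECTED_FIELDS = {"id", "name", "created_at"}

def validate(record):
    # Report fields that drifted in or dropped out relative to the schema
    fields = set(record)
    return {
        "missing": EXPECTED_FIELDS - fields,
        "unexpected": fields - EXPECTED_FIELDS,
    }

report = validate({"id": 1, "name": "x", "region": "eu"})
# report == {'missing': {'created_at'}, 'unexpected': {'region'}}
```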
13. What is a DAG?

Answer: A Directed Acyclic Graph: a set of tasks with directed dependencies and no cycles. Airflow uses DAGs to define the order in which pipeline tasks run.
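The idea can be sketched without Airflow using the standard library's graphlib (Python 3.9+); the task names are made up.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; no cycles allowed
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())  # ['extract', 'transform', 'load']
```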

14. How do you secure credentials?
  • Read them from the environment with os.getenv()
  • Store secrets in AWS Secrets Manager
15. What is functools.lru_cache() used for?
from functools import lru_cache

@lru_cache
def fetch():
    return slow_func()  # result is cached after the first call
16. How do you build an API with FastAPI?
from fastapi import FastAPI
app = FastAPI()

@app.get("/")
def read():
    return {"Hello": "World"}
17. How do you profile code?
import cProfile
cProfile.run('main()')
18. How do you write unit tests?
import unittest

def clean(text):
    return text.strip().lower()

class TestClean(unittest.TestCase):
    def test_clean(self):
        self.assertEqual(clean("  Hi "), "hi")
19. What is API pagination?

APIs often return results one page at a time; you keep requesting the URL given in each response's "next" field until it runs out:

url = "https://api.test.com/items"  # example endpoint
while url:
    r = requests.get(url)
    url = r.json().get('next')  # None on the last page ends the loop
20. How do you convert timezones?
df['ts'] = pd.to_datetime(df['ts']).dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')

These Python interview questions are a comprehensive guide to help you crack data engineering roles from entry-level to expert. Whether you’re building data pipelines, working on APIs, or optimizing cloud-based workflows—mastering these questions will give you an edge in technical interviews.