🐼 Pandas Study Guide
A complete interactive reference for the pandas library. Each topic includes
explanation, runnable code examples, and expected output.
Open pandas_study_guide.ipynb to run examples in Jupyter.
32 Topics · 100+ Code Examples · 1 Notebook File
01
Installation & Import
setup
Install pandas and import the standard data science stack.
pd is the conventional alias for pandas.
# Install (run in terminal)
# pip install pandas numpy matplotlib openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Output:
Pandas version: 2.x.x
NumPy version: 1.x.x
📍 Real-World Use Case — Setting Up a Data Analysis Project
A junior analyst at a retail company sets up their Python environment before analysing 12 months of sales data.
use-case example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Helpful display settings for wide DataFrames
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:,.2f}".format)
pd.set_option("display.width", 120)
print(f"pandas {pd.__version__} | numpy {np.__version__}")
print("Environment ready for sales analysis.")
02
Series
fundamentals
A Series is a 1-D labeled array – like a single spreadsheet column. Each element has an index label.
Creating a Series
# From a list
s1 = pd.Series([10, 20, 30, 40, 50])
# Custom index
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# From dict
s3 = pd.Series({'apples': 5, 'bananas': 3, 'cherries': 8})
print(s3)
Accessing & Attributes
print(s2['b']) # element by label
print(s2.iloc[1]) # element by position
print(s2.values) # numpy array
print(s2.index.tolist()) # ['a', 'b', 'c']
print(s2.dtype) # int64
Arithmetic
s = pd.Series([1, 2, 3, 4, 5])
print((s * 2).tolist()) # [2, 4, 6, 8, 10]
print((s + 10).tolist()) # [11, 12, 13, 14, 15]
print((s ** 2).tolist()) # [1, 4, 9, 16, 25]
print(s.sum(), s.mean()) # 15 3.0
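One behaviour worth seeing once: arithmetic between two Series aligns on index labels, not position. A tiny self-contained sketch with illustrative fruit counts:

```python
import pandas as pd

# Arithmetic aligns on index labels; unmatched labels become NaN
jan = pd.Series({"apples": 5, "bananas": 3})
feb = pd.Series({"bananas": 4, "cherries": 8})
total = jan + feb
print(total)  # apples NaN, bananas 7.0, cherries NaN
```

Only "bananas" appears in both Series, so it is the only label with a real sum.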
📍 Real-World Use Case — Daily Stock Price Tracker
Scenario: A quant stores 10 days of AAPL closing prices and needs daily returns, best/worst day, and annualised volatility.
use-case example
import pandas as pd
aapl = pd.Series(
[182.5, 184.1, 180.3, 185.7, 188.0,
186.4, 190.2, 192.5, 189.8, 194.0],
index=pd.date_range("2024-01-01", periods=10, freq="B"),
name="AAPL_Close"
)
daily_ret = aapl.pct_change()
total_ret = (aapl.iloc[-1] / aapl.iloc[0] - 1) * 100
volatility = daily_ret.std() * (252 ** 0.5) * 100 # annualised
best_day = daily_ret.idxmax()
print(f"Total return : {total_ret:.2f}%")
print(f"Best day : {best_day.date()} (+{daily_ret[best_day]*100:.2f}%)")
print(f"Worst day : {daily_ret.idxmin().date()}")
print(f"Ann. volatility: {volatility:.1f}%")
print(f"52-wk high so far: {aapl.max():.2f}")
03
DataFrame Basics
fundamentals
A DataFrame is a 2-D table with labeled rows and columns. Think of it as an Excel spreadsheet in Python.
From dict of lists
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, 30, 35, 28, 22],
'City': ['NYC', 'LA', 'Chicago', 'NYC', 'Houston'],
'Salary': [70000, 85000, 90000, 72000, 65000],
'Score': [88.5, 92.0, 78.3, 95.1, 85.6]
})
From list of dicts & NumPy array
# From list of dicts (records format)
records = [
{'product': 'Apple', 'price': 1.2, 'qty': 50},
{'product': 'Banana', 'price': 0.5, 'qty': 100},
]
pd.DataFrame(records)
# From 2D NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6]])
pd.DataFrame(arr, columns=['A', 'B', 'C'])
📍 Real-World Use Case — E-Commerce Product Catalog
Scenario: A developer loads a product catalogue to answer instant business questions: stock-outs, top-rated items, and total inventory value.
use-case example
import pandas as pd
products = pd.DataFrame({
"sku": ["WGT-001","GDG-042","DNT-007","WGT-002","GDG-099"],
"name": ["Blue Widget","Smart Gadget","Donut Maker","Red Widget","Super Gadget"],
"category": ["Widgets","Gadgets","Kitchen","Widgets","Gadgets"],
"price": [29.99, 149.99, 49.99, 34.99, 199.99],
"stock": [120, 45, 80, 0, 23],
"rating": [4.5, 4.8, 3.9, 4.2, 4.7],
})
out_of_stock = products[products["stock"] == 0][["sku","name"]]
top_rated = products.nlargest(3, "rating")[["name","rating"]]
inventory_val = (products["price"] * products["stock"]).sum()
print("Out of stock:\n", out_of_stock.to_string(index=False))
print("\nTop-rated products:\n", top_rated.to_string(index=False))
print(f"\nTotal inventory value: ${inventory_val:,.2f}")
04
Reading & Writing Data
I/O
Pandas supports CSV, JSON, Excel, Parquet, SQL and many more formats out of the box.
Write to files
df.to_csv('data.csv', index=False)
df.to_json('data.json', orient='records', indent=2)
df.to_excel('data.xlsx', index=False) # pip install openpyxl
df.to_parquet('data.parquet') # pip install pyarrow
Read from files
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Useful read_csv options
df = pd.read_csv(
'data.csv',
sep=';', # delimiter
header=0, # row to use as column names
index_col='ID', # set index column
usecols=['A', 'B'], # only load these columns
nrows=1000, # max rows to read
skiprows=2, # skip first N rows
dtype={'col': str}, # force dtype
parse_dates=['date'] # auto-parse dates
)
📍 Real-World Use Case — Loading a Messy Monthly Sales Report
Scenario: The finance team sends a CSV with a custom delimiter, 3 junk header rows, European date format, and an ID column that must stay as a string.
use-case example
import pandas as pd
from io import StringIO
# Simulates the messy file (semicolon-delimited, 3 junk header rows)
raw = (
"Report generated: 2024-04-01\n"
"Source: CRM Export v2.3\n"
"\n"
"order_id;customer;order_date;amount;status\n"
"1001;Alice;01/03/2024;250.00;completed\n"
"1002;Bob;15/03/2024;89.50;completed\n"
"1003;Charlie;22/03/2024;430.75;pending\n"
"1004;Diana;28/03/2024;120.00;cancelled"
)
df = pd.read_csv(
StringIO(raw),
sep=";",
skiprows=3, # skip the 3 junk lines
parse_dates=["order_date"],
dayfirst=True, # European DD/MM/YYYY
dtype={"order_id": str}, # keep leading zeros
)
print(df)
print("\nDtypes:\n", df.dtypes)
print(f"\nTotal revenue: ${df['amount'].sum():,.2f}")
05
Viewing & Inspecting Data
exploration
First things to do when you load a new dataset – understand its shape, types, and distribution.
Quick look
df.head(5) # first 5 rows
df.tail(3) # last 3 rows
df.sample(5) # 5 random rows
df.shape # (rows, cols)
df.columns.tolist() # column names
df.dtypes # data type per column
df.index # row labels
Info & Statistics
df.info() # dtype + non-null counts + memory
df.describe() # count, mean, std, min, quartiles, max
df.describe(include='all') # include categorical cols too
df['City'].value_counts() # frequency of each value
df['City'].value_counts(normalize=True) # proportions
df.nunique() # unique count per column
df['City'].unique() # array of unique values
📍 Real-World Use Case — First Look at a New Customer Dataset
Scenario: A data scientist receives a CRM extract for a churn prediction project. They run a structured exploration before any modelling.
use-case example
import pandas as pd, numpy as np
customers = pd.DataFrame({
"cust_id": range(1001, 1011),
"segment": ["Gold","Silver","Bronze","Gold","Silver",
"Bronze","Gold","Silver","Bronze","Gold"],
"age": [34, 28, 45, np.nan, 52, 29, 38, 41, np.nan, 55],
"monthly_spend": [450, 120, 85, 700, 60, np.nan, 390, 200, 95, 820],
"tenure_months": [24, 6, 48, 36, 12, 3, 60, 18, 9, 84],
"churned": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})
print(f"Shape: {customers.shape}")
print("\nMissing values:\n", customers.isnull().sum())
print(f"\nChurn rate: {customers['churned'].mean()*100:.0f}%")
print("\nSegment mix:\n", customers["segment"].value_counts())
print("\nNumeric summary:\n", customers.describe().round(1))
06
Selecting & Indexing
selection
loc selects by label; iloc selects by integer position. Master these two and you can access anything.
Column selection
df['Name'] # single column → Series
df[['Name', 'Salary']] # multiple columns → DataFrame
df.Name # dot notation (no spaces in name)
loc – label based
df.loc[0] # row by label
df.loc[0:2] # rows 0,1,2 (inclusive!)
df.loc[1, ['Name', 'Age']] # row 1, specific cols
df.loc[0:2, 'Name':'City'] # slice rows & cols by label
iloc – position based
df.iloc[0] # first row
df.iloc[:3, :2] # first 3 rows, first 2 cols
df.iloc[2, 3] # row 2, col 3 (scalar)
df.iloc[::2] # every other row
df.iloc[-1] # last row
Index management
df.set_index('Name') # use Name as row index
df.reset_index() # move index back to column
df.set_index('Name').loc['Alice'] # access by name
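The loc/iloc distinction bites hardest when integer index labels don't match positions. A minimal sketch with made-up labels:

```python
import pandas as pd

# Integer labels that don't match positions make the difference obvious
s = pd.Series(["first", "second", "third"], index=[2, 0, 1])
print(s.loc[0])   # by label 0    -> 'second'
print(s.iloc[0])  # by position 0 -> 'first'
```

After a sort or filter, a DataFrame often ends up with exactly this kind of shuffled integer index, which is why mixing up loc and iloc silently returns the wrong row.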
📍 Real-World Use Case — Extracting an Account Manager's Report
Scenario: A BI analyst pulls only Sarah's client accounts with specific columns for her weekly review, then flags contracts expiring within 3 months.
use-case example
import pandas as pd
accounts = pd.DataFrame({
"account_id": ["ACC001","ACC002","ACC003","ACC004","ACC005"],
"client": ["Acme Corp","Beta LLC","Gamma Inc","Delta Co","Epsilon Ltd"],
"manager": ["Sarah","John","Sarah","John","Sarah"],
"revenue": [45000, 32000, 78000, 12000, 56000],
"health_score": [82, 65, 91, 44, 77],
"contract_end": ["2025-06","2024-12","2025-03","2024-11","2025-09"],
})
# Select Sarah's accounts – specific columns only
sarahs = (
accounts
.loc[accounts["manager"] == "Sarah", ["client","revenue","health_score","contract_end"]]
.sort_values("revenue", ascending=False)
.reset_index(drop=True)
)
# Flag contracts ending by 2025-03 (YYYY-MM strings compare correctly)
sarahs["renew_soon"] = sarahs["contract_end"] <= "2025-03"
print(sarahs.to_string(index=False))
print(f"\nTotal revenue managed: ${sarahs['revenue'].sum():,}")
print(f"Contracts expiring soon: {sarahs['renew_soon'].sum()}")
07
Filtering Data
selection
Use boolean masks or query() to filter rows. Always wrap multiple conditions in parentheses.
Boolean masks
df[df['Age'] > 27] # single cond
df[(df['Age'] > 25) & (df['Salary'] > 70000)] # AND
df[(df['City'] == 'NYC') | (df['City'] == 'LA')] # OR
df[~(df['City'] == 'NYC')] # NOT
df[df['City'].isin(['NYC', 'LA'])] # isin
df[~df['City'].isin(['NYC', 'LA'])] # not isin
df[df['Name'].str.startswith('A')] # string filter
query() – readable syntax
df.query('Age > 25 and Salary > 70000')
df.query('City in ["NYC", "LA"]')
df.query('Age > 25 or City == "Houston"')
# Use @ to reference a Python variable
min_salary = 75000
df.query('Salary > @min_salary')
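The "always wrap conditions in parentheses" rule exists because & and | bind tighter than comparison operators. A small self-contained sketch (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Salary": [60000, 80000, 90000]})

# df[df["Age"] > 25 & df["Salary"] > 70000] raises, because & is evaluated
# before >. Each comparison needs its own parentheses:
mask = (df["Age"] > 25) & (df["Salary"] > 70000)
print(df[mask])

# query() sidesteps the precedence issue entirely
print(df.query("Age > 25 and Salary > 70000"))
```

Both forms select the same two rows; query() trades a little speed for readability.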
📍 Real-World Use Case — Finding High-Value Customers at Churn Risk
Scenario: A customer-success team needs to identify Premium/Enterprise users who have been inactive for 30+ days OR gave a low NPS score, so they can intervene.
use-case example
import pandas as pd
users = pd.DataFrame({
"user_id": range(101, 111),
"plan": ["Premium","Basic","Premium","Enterprise","Basic",
"Premium","Basic","Enterprise","Basic","Premium"],
"days_inactive": [35, 5, 12, 42, 8, 60, 3, 28, 15, 7],
"monthly_spend": [500, 50, 520, 2100, 45, 480, 55, 1950, 48, 510],
"nps_score": [3, 8, 9, 2, 7, 1, 8, 4, 6, 9],
})
# Churn risk: paid plan AND (inactive 30+ days OR low NPS)
at_risk = users[
users["plan"].isin(["Premium","Enterprise"]) &
((users["days_inactive"] >= 30) | (users["nps_score"] <= 3))
]
print("At-risk customers:")
print(at_risk[["user_id","plan","days_inactive","nps_score","monthly_spend"]])
print(f"\nAt-risk monthly revenue : ${at_risk['monthly_spend'].sum():,}")
print(f"At-risk annualised revenue: ${at_risk['monthly_spend'].sum()*12:,}")
08
Adding & Removing Columns / Rows
manipulation
Add computed columns directly, or use assign() for method chaining. Remove with drop().
Add & remove columns
df['Tax'] = df['Salary'] * 0.2 # add
df['NetSalary'] = df['Salary'] - df['Tax'] # derived
df.insert(1, 'ID', range(len(df))) # at position
df.drop(columns=['Tax'], inplace=True) # remove
df.rename(columns={'Name': 'FullName'}) # rename
Add & remove rows
new_row = pd.DataFrame([{'Name': 'Frank', 'Age': 40,
'City': 'Boston', 'Salary': 95000}])
df = pd.concat([df, new_row], ignore_index=True) # add row
df.drop(index=[0, 1]) # remove rows 0 and 1
df.drop(index=df[df['Age'] < 23].index) # remove by condition
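The intro mentions assign() for method chaining but the snippets above only show direct assignment. A minimal sketch of the chained style, with illustrative salary data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [70000, 85000]})

# assign() returns a new DataFrame, so derived columns chain cleanly
report = (
    df.assign(Tax=lambda d: d["Salary"] * 0.2)
      .assign(NetSalary=lambda d: d["Salary"] - d["Tax"])
)
print(report)
```

Because assign() never mutates the original, df stays untouched – handy inside longer pipelines.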
📍 Real-World Use Case — Building a Profit & Loss Report
Scenario: A finance analyst enriches raw sales figures with gross profit, margin %, operating profit, and YoY growth columns before exporting to the CFO.
use-case example
import pandas as pd
sales = pd.DataFrame({
"product": ["Widget A","Widget B","Gadget X","Gadget Y","Service Z"],
"revenue_2024": [120000, 85000, 220000, 95000, 180000],
"revenue_2023": [100000, 90000, 195000, 82000, 150000],
"cogs": [ 60000, 55000, 110000, 65000, 45000],
"opex": [ 20000, 15000, 40000, 20000, 25000],
})
sales["gross_profit"] = sales["revenue_2024"] - sales["cogs"]
sales["gross_margin_pct"] = (sales["gross_profit"] / sales["revenue_2024"] * 100).round(1)
sales["operating_profit"] = sales["gross_profit"] - sales["opex"]
sales["yoy_growth_pct"] = ((sales["revenue_2024"] - sales["revenue_2023"])
/ sales["revenue_2023"] * 100).round(1)
sales["is_growing"] = sales["yoy_growth_pct"] > 0
# Drop raw columns before sharing
report = sales.drop(columns=["cogs","opex","revenue_2023"])
print(report.to_string(index=False))
print(f"\nAvg gross margin: {sales['gross_margin_pct'].mean():.1f}%")
09
Handling Missing Data
cleaning
Missing values are represented as NaN (Not a Number). Pandas provides tools to detect, remove, and fill them.
Detect & count
df.isnull() # True where NaN
df.isnull().sum() # count NaN per column
df.isnull().sum().sum() # total NaN in whole df
df.notnull() # True where not NaN
df['col'].isna().any() # True if any NaN in column
Drop & fill
df.dropna() # drop rows with ANY NaN
df.dropna(how='all') # drop rows ALL NaN
df.dropna(subset=['A', 'B']) # NaN in specific cols
df.dropna(thresh=3) # keep rows with ≥3 non-NaN
df.fillna(0) # fill all with 0
df['A'].fillna(df['A'].mean()) # fill with column mean
df.fillna(method='ffill') # forward fill (method= is deprecated)
df.ffill() # same, newer API
df.bfill() # backward fill
df.fillna({'A': 0, 'B': 'Unknown'}) # per-column fill
📍 Real-World Use Case — Cleaning Employee Satisfaction Survey Data
Scenario: HR receives a quarterly survey with many blank responses. Numeric gaps are filled with the department median; categorical gaps become 'Unknown'.
use-case example
import pandas as pd, numpy as np
survey = pd.DataFrame({
"emp_id": [101,102,103,104,105,106,107,108],
"department": ["Eng","Mkt","Eng","HR","Mkt","Eng","HR","Mkt"],
"satisfaction": [8, np.nan, 6, 9, np.nan, 7, np.nan, 5],
"would_recommend": ["Yes","No",np.nan,"Yes","Yes",np.nan,"No","No"],
"years_at_co": [3, np.nan, 7, 1, 2, 5, np.nan, 4],
})
print("Missing before:\n", survey.isnull().sum())
# Fill numeric with department median (fairer than global mean)
survey["satisfaction"] = (
survey.groupby("department")["satisfaction"]
.transform(lambda x: x.fillna(x.median()))
)
survey["years_at_co"] = survey["years_at_co"].fillna(survey["years_at_co"].median())
# Fill categorical with explicit "Unknown"
survey["would_recommend"] = survey["would_recommend"].fillna("Unknown")
print("\nCleaned survey:\n", survey)
print(f"\nCompany satisfaction avg: {survey['satisfaction'].mean():.1f}/10")
10
Data Types & Conversion
cleaning
Wrong dtypes are a common source of bugs. Always verify types after loading data and convert early.
Type conversion
df['id'] = df['id'].astype(int)
df['price'] = df['price'].astype(float)
df['flag'] = df['flag'].astype(bool)
df['cat'] = df['cat'].astype('category') # memory efficient!
# Safe numeric conversion (bad values → NaN)
df['num'] = pd.to_numeric(df['num'], errors='coerce')
# Date parsing
df['date'] = pd.to_datetime(df['date_str'])
df['date'] = pd.to_datetime(df['date_str'], format='%d/%m/%Y')
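A quick self-contained sketch of what errors='coerce' actually does, using an illustrative Series with one bad value:

```python
import pandas as pd

raw = pd.Series(["12", "7.5", "N/A", "30"])
nums = pd.to_numeric(raw, errors="coerce")  # "N/A" -> NaN instead of an error
print(nums)
print(nums.sum())  # NaN is skipped by default: 49.5
```

With the default errors='raise', the same call would fail on "N/A"; coerce lets you load first and inspect the NaNs afterwards.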
📍 Real-World Use Case — Fixing a Bank Transaction CSV Import
Scenario: A fintech developer imports a bank statement where amounts have currency symbols, dates are US-format strings, and transaction type is high-cardinality text.
use-case example
import pandas as pd
from io import StringIO
raw = (
"date,description,amount,type\n"
"01/15/2024,AMAZON.COM,-$45.99,debit\n"
"01/16/2024,PAYROLL DEP,+$3500.00,credit\n"
"01/17/2024,NETFLIX,-$15.99,debit\n"
"01/18/2024,ATM WITHDR,-$200.00,debit\n"
"01/19/2024,WHOLE FOOD,-$87.30,debit"
)
df = pd.read_csv(StringIO(raw))
# Fix amount: strip $ and +, parse as float
df["amount"] = (
df["amount"]
.str.replace(r"[$+]", "", regex=True)
.astype(float)
)
# Parse dates
df["date"] = pd.to_datetime(df["date"])
# Categorical for low-cardinality column (saves memory)
df["type"] = df["type"].astype("category")
# Derived columns
df["abs_amount"] = df["amount"].abs()
df["month"] = df["date"].dt.month_name()
print(df)
print(f"\nTotal debits : ${df[df['type']=='debit']['abs_amount'].sum():.2f}")
print(f"Total credits: ${df[df['type']=='credit']['abs_amount'].sum():.2f}")
11
Sorting
manipulation
sort_values & sort_index
df.sort_values('Age') # ascending
df.sort_values('Salary', ascending=False) # descending
df.sort_values(['City', 'Salary'], # multi-col
ascending=[True, False])
df.sort_index() # by index
df.sort_index(ascending=False)
# nlargest / nsmallest – faster than sort + head
df.nlargest(3, 'Salary') # top 3 salaries
df.nsmallest(3, 'Age') # 3 youngest
📍 Real-World Use Case — Q4 Sales Leaderboard
Scenario: A sales manager builds a ranked leaderboard to share at the all-hands, highlighting top performers and biggest improvers.
use-case example
import pandas as pd
reps = pd.DataFrame({
"rep": ["Alice","Bob","Charlie","Diana","Eve","Frank","Grace"],
"region": ["East","West","East","North","West","North","East"],
"q4_deals": [42, 38, 55, 29, 47, 33, 61],
"q4_rev": [210000,185000,275000,140000,230000,160000,305000],
"q3_rev": [195000,200000,250000,130000,215000,145000,280000],
})
reps["yoy_growth"] = ((reps["q4_rev"] - reps["q3_rev"]) / reps["q3_rev"] * 100).round(1)
reps["rank"] = reps["q4_rev"].rank(ascending=False).astype(int)
# Leaderboard: sorted by revenue
leaderboard = reps.sort_values("q4_rev", ascending=False).reset_index(drop=True)
leaderboard.index += 1 # 1-based rank
print("=== Q4 Leaderboard ===")
print(leaderboard[["rep","region","q4_deals","q4_rev","yoy_growth"]].to_string())
print("\n=== Top 3 Improvers ===")
print(reps.nlargest(3, "yoy_growth")[["rep","q4_rev","yoy_growth"]].to_string(index=False))
12
GroupBy & Aggregation
aggregation
groupby splits the data, applies a function, and combines results (split-apply-combine pattern).
Basic groupby
df.groupby('Region')['Sales'].sum()
df.groupby('Region')['Sales'].mean()
df.groupby(['Region', 'Product'])['Sales'].sum()
Multiple aggregations with agg()
# Multiple functions on one column
df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
# Different functions per column
df.groupby('Region').agg({
'Sales': ['sum', 'mean'],
'Quantity': 'sum',
'Score': ['min', 'max']
})
# Named aggregations (pandas 0.25+)
df.groupby('Region').agg(
total_sales = ('Sales', 'sum'),
avg_sales = ('Sales', 'mean'),
num_orders = ('Sales', 'count')
)
transform & filter
# transform: returns same-shape (add group stat to original df)
df['RegionTotal'] = df.groupby('Region')['Sales'].transform('sum')
df['SalesShare'] = df['Sales'] / df['RegionTotal']
# filter: keep only groups meeting a condition
df.groupby('Region').filter(lambda g: g['Sales'].sum() > 500)
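The transform/filter snippets above assume an existing df; a minimal runnable version with illustrative numbers makes the shapes concrete:

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Sales":  [100, 200, 50, 60, 70],
})
# transform broadcasts each group's sum back onto every row
sales["RegionTotal"] = sales.groupby("Region")["Sales"].transform("sum")
sales["SalesShare"] = (sales["Sales"] / sales["RegionTotal"]).round(2)
print(sales)

# filter keeps only groups whose total exceeds 200 (East: 300, West: 180)
big = sales.groupby("Region").filter(lambda g: g["Sales"].sum() > 200)
print(big)
```

Note the difference: transform keeps all 5 rows (same shape), while filter drops the whole West group.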
📍 Real-World Use Case — Monthly Revenue Analysis by Department
Scenario: A business analyst produces a department-level P&L summary using named aggregations and then adds a computed profit column.
use-case example
import pandas as pd
orders = pd.DataFrame({
"month": ["Jan","Jan","Jan","Feb","Feb","Feb","Mar","Mar","Mar"],
"dept": ["Engineering","Marketing","Sales"] * 3,
"revenue": [45000,32000,78000, 48000,35000,82000, 51000,29000,90000],
"cost": [28000,18000,42000, 30000,20000,45000, 32000,16000,48000],
"headcount": [12, 8, 15, 12, 8, 16, 13, 7, 16],
})
summary = orders.groupby("dept").agg(
total_revenue = ("revenue", "sum"),
total_cost = ("cost", "sum"),
avg_headcount = ("headcount", "mean"),
months = ("revenue", "count"),
).round(0)
summary["gross_profit"] = summary["total_revenue"] - summary["total_cost"]
summary["margin_pct"] = (summary["gross_profit"] / summary["total_revenue"] * 100).round(1)
summary = summary.sort_values("gross_profit", ascending=False)
print(summary.to_string())
print(f"\nTotal company revenue: ${orders['revenue'].sum():,}")
13
Merge, Join & Concat
aggregation
pd.merge() works like SQL joins. pd.concat() stacks DataFrames vertically or horizontally.
merge (SQL-style joins)
pd.merge(left, right, on='id', how='inner') # matching only
pd.merge(left, right, on='id', how='left') # all from left
pd.merge(left, right, on='id', how='right') # all from right
pd.merge(left, right, on='id', how='outer') # all rows
# Different column names
pd.merge(emp, dept,
left_on='emp_id', right_on='dept_emp_id')
# Merge on index
pd.merge(left, right, left_index=True, right_index=True)
concat
pd.concat([df1, df2], ignore_index=True) # stack vertically
pd.concat([df1, df2], axis=1) # side by side
pd.concat([df1, df2], keys=['2023', '2024']) # add key labels
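When debugging joins it helps to see which side each row came from; merge's indicator flag does exactly that. A small sketch with illustrative tables:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# indicator=True adds a _merge column: left_only / right_only / both
merged = pd.merge(left, right, on="id", how="outer", indicator=True)
print(merged)
```

Counting the _merge values is a quick sanity check that a join matched as many rows as you expected.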
📍 Real-World Use Case — 360° Customer View – Profile + Purchase History
Scenario: A CRM analyst joins the customer table with orders to compute CLV, average order value, and days since last purchase.
use-case example
import pandas as pd
customers = pd.DataFrame({
"cust_id": [101,102,103,104,105],
"name": ["Alice","Bob","Charlie","Diana","Eve"],
"segment": ["Gold","Silver","Bronze","Gold","Silver"],
})
orders = pd.DataFrame({
"order_id": [1,2,3,4,5,6,7,8],
"cust_id": [101,102,101,103,104,101,105,102],
"amount": [250,180,320,90,450,280,75,220],
"date": pd.to_datetime(["2024-01-15","2024-01-20","2024-02-01",
"2024-02-10","2024-02-15","2024-03-01",
"2024-03-10","2024-03-15"]),
})
clv = (
orders.groupby("cust_id")
.agg(total_spend=("amount","sum"),
order_count=("order_id","count"),
last_order =("date","max"))
.reset_index()
)
full = pd.merge(customers, clv, on="cust_id", how="left").fillna(0)
full["avg_order_value"] = (full["total_spend"] / full["order_count"].replace(0,1)).round(2)
today = pd.Timestamp("2024-03-20")
full["days_since_purchase"] = (today - full["last_order"]).dt.days
print(full[["name","segment","total_spend","order_count","avg_order_value","days_since_purchase"]])
14
Pivot Tables & Crosstab
aggregation
pivot_table
pd.pivot_table(
df,
values='Sales', # what to aggregate
index='Region', # row grouping
columns='Product', # column grouping
aggfunc='sum', # how to aggregate
fill_value=0, # replace NaN
margins=True # add totals row/col
)
crosstab – frequency table
pd.crosstab(df['Region'], df['Product']) # counts
pd.crosstab(df['Region'], df['Product'],
normalize='index') # row %
pd.crosstab(df['Region'], df['Product'],
values=df['Sales'], aggfunc='sum') # sum
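A minimal self-contained crosstab, with made-up orders, to show the shape of the output:

```python
import pandas as pd

df = pd.DataFrame({
    "Region":  ["East", "East", "West", "East"],
    "Product": ["A", "B", "A", "A"],
})
counts = pd.crosstab(df["Region"], df["Product"])
print(counts)
```

Combinations that never occur (here West/B) are filled with 0, so the result is always a complete rectangle.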
📍 Real-World Use Case — Regional Revenue Matrix for the Board Meeting
Scenario: A VP of Sales needs a matrix showing quarterly revenue by region with row/column totals, formatted as a one-slide PowerPoint insert.
use-case example
import pandas as pd
sales = pd.DataFrame({
"region": ["East","West","East","North","West","East","North","West","North"],
"quarter": ["Q1","Q1","Q2","Q1","Q2","Q3","Q2","Q3","Q3"],
"revenue": [120000,95000,130000,88000,105000,140000,92000,110000,98000],
})
matrix = pd.pivot_table(
sales,
values="revenue",
index="region",
columns="quarter",
aggfunc="sum",
fill_value=0,
margins=True,
margins_name="TOTAL",
)
# Format as $ thousands (DataFrame.map in pandas ≥ 2.1; applymap in older versions)
fmt = matrix.map(lambda x: f"${x/1000:.0f}K")
print("=== Revenue by Region & Quarter ===")
print(fmt.to_string())
# Best quarter per region
best_q = matrix.drop("TOTAL").drop("TOTAL", axis=1).idxmax(axis=1)
print("\nBest quarter per region:\n", best_q.to_string())
15
String Operations
advanced
The .str accessor lets you apply string methods to a whole Series at once, including regex.
Common str methods
s.str.lower() # lowercase
s.str.upper() # uppercase
s.str.strip() # remove whitespace
s.str.len() # length of each string
s.str.contains('pattern', case=False) # regex or literal
s.str.startswith('prefix')
s.str.endswith('suffix')
s.str.replace('old', 'new', regex=False)
s.str.split(',', expand=True) # split β DataFrame
s.str.extract(r'(\d+)') # first regex group
s.str.extractall(r'(\w+)') # all matches
s.str.count('a') # count occurrences
s.str.zfill(5) # pad with zeros: '42' → '00042'
📍 Real-World Use Case — Standardising Messy CRM Contact Data
Scenario: A data engineer cleans names with random casing, emails with extra spaces, and phone numbers in mixed formats before loading into the company database.
use-case example
import pandas as pd
contacts = pd.DataFrame({
"full_name": [" alice johnson ","BOB SMITH","charlie BROWN","DIANA Prince "],
"email": ["Alice@Gmail.COM ", " bob.smith@YAHOO.com","charlie@outlook.com","diana@corp.IO"],
"phone": ["(555) 123-4567","555.987.6543","555 246 8101","+1-555-369-2580"],
"company": ["Acme Corp.","BETA LLC","gamma inc","Delta Co"],
})
contacts["full_name"] = contacts["full_name"].str.strip().str.title()
contacts["email"] = contacts["email"].str.strip().str.lower()
contacts["email_domain"]= contacts["email"].str.extract(r"@([\w.]+)")[0]
contacts["phone_clean"] = contacts["phone"].str.replace(r"[^\d]", "", regex=True)
contacts["company"] = contacts["company"].str.strip().str.title()
# Flag personal (non-corporate) email domains
personal_domains = ["gmail.com","yahoo.com","outlook.com","hotmail.com"]
contacts["is_personal_email"] = contacts["email_domain"].isin(personal_domains)
print(contacts[["full_name","email","phone_clean","email_domain","is_personal_email"]].to_string(index=False))
16
DateTime Operations
advanced
The .dt accessor exposes all datetime properties. Always parse dates on load with parse_dates.
dt accessor
ds = pd.to_datetime(pd.Series(['2024-01-15', '2024-07-04']))
ds.dt.year # [2024, 2024]
ds.dt.month # [1, 7]
ds.dt.day # [15, 4]
ds.dt.day_name() # ['Monday', 'Thursday']
ds.dt.quarter # [1, 3]
ds.dt.isocalendar().week # ISO week number (dt.week removed in pandas 2.0)
ds.dt.is_month_end # bool
ds.dt.normalize() # truncate to midnight
Date arithmetic & ranges
today = pd.Timestamp('2025-01-01')
today + pd.Timedelta(days=30)
today + pd.DateOffset(months=3)
pd.date_range('2024-01-01', '2024-12-31', freq='ME') # month end
pd.date_range('2024-01-01', periods=5, freq='B') # business days
Resample (time series)
ts = pd.Series(..., index=pd.date_range('2024', periods=365, freq='D'))
ts.resample('ME').sum() # daily → monthly total
ts.resample('W').mean() # daily → weekly average
ts.resample('QE').last() # daily → quarterly last value
📍 Real-World Use Case — Subscription Analytics – Tenure & Renewal Pipeline
Scenario: A SaaS analyst calculates customer tenure, flags subscriptions expiring in 30 days, and identifies loyal customers for a renewal discount campaign.
use-case example
import pandas as pd
today = pd.Timestamp("2025-03-01")
subs = pd.DataFrame({
"sub_id": ["S001","S002","S003","S004","S005"],
"customer": ["Acme","Beta","Gamma","Delta","Epsilon"],
"plan": ["Pro","Basic","Enterprise","Pro","Basic"],
"start_date":pd.to_datetime(["2022-01-15","2024-09-01","2020-06-01","2023-03-15","2025-01-10"]),
"end_date": pd.to_datetime(["2025-03-15","2025-06-01","2025-02-28","2025-04-01","2026-01-10"]),
"mrr": [500, 99, 2000, 500, 99],
})
subs["tenure_months"] = ((today - subs["start_date"]).dt.days / 30).astype(int)
subs["days_to_end"] = (subs["end_date"] - today).dt.days
subs["renewing_soon"] = subs["days_to_end"].between(0, 30)
subs["loyal"] = subs["tenure_months"] >= 24
subs["expired"] = subs["days_to_end"] < 0
print(subs[["customer","plan","tenure_months","days_to_end","renewing_soon","loyal"]].to_string(index=False))
print(f"\nAt-risk MRR (renewing β€30d): ${subs[subs['renewing_soon']]['mrr'].sum():,}")
print(f"Loyal customers: {subs['loyal'].sum()}")
17
Apply, Map & Lambda
advanced
Use apply() for custom logic, map() for element lookups, and np.where() for fast vectorized if-else.
apply on Series
def classify(age):
if age < 25: return 'Young'
elif age < 35: return 'Mid'
else: return 'Senior'
df['AgeGroup'] = df['Age'].apply(classify)
df['SalaryK'] = df['Salary'].apply(lambda x: f'${x/1000:.0f}K')
apply on DataFrame (row-wise)
df['Label'] = df.apply(
lambda row: f"{row['Name']} - {row['City']}", axis=1
)
map & np.where
# map: element-wise lookup
city_map = {'NYC': 'Northeast', 'LA': 'West', 'Chicago': 'Midwest'}
df['Region'] = df['City'].map(city_map)
# np.where: vectorized if-else (much faster than apply for simple conditions)
df['HighEarner'] = np.where(df['Salary'] >= 80000, 'Yes', 'No')
# pd.cut alternative to nested np.where
df['Band'] = pd.cut(df['Salary'],
bins=[0, 70000, 85000, float('inf')],
labels=['Low', 'Mid', 'High'])
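A self-contained pd.cut sketch (illustrative ages and bands) showing how binning replaces nested conditionals:

```python
import pandas as pd

ages = pd.Series([15, 28, 45, 62])
# Bins are right-inclusive by default: (0,18], (18,40], (40,120]
bands = pd.cut(ages, bins=[0, 18, 40, 120],
               labels=["Minor", "Adult", "Senior"])
print(bands.tolist())  # ['Minor', 'Adult', 'Senior', 'Senior']
```

One pd.cut call scales to any number of bands, whereas nested np.where grows unreadable past two or three conditions.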
📍 Real-World Use Case — Tiered Discount Engine for an Order System
Scenario: An e-commerce backend applies tiered discounts: $500+ = 15%, $200–$499 = 10%, under $200 = 5%. VIP customers get an extra 5% on top.
use-case example
import pandas as pd, numpy as np
orders = pd.DataFrame({
"order_id": [1001,1002,1003,1004,1005,1006],
"customer": ["Alice","Bob","Charlie","Diana","Eve","Frank"],
"subtotal": [650.00, 320.00, 95.00, 1200.00, 185.00, 480.00],
"vip": [True, False, False, True, False, True],
})
def discount_rate(row):
base = 0.15 if row["subtotal"] >= 500 else (0.10 if row["subtotal"] >= 200 else 0.05)
return base + (0.05 if row["vip"] else 0.0)
orders["discount_pct"] = orders.apply(discount_rate, axis=1)
orders["discount_amt"] = (orders["subtotal"] * orders["discount_pct"]).round(2)
orders["final_price"] = (orders["subtotal"] - orders["discount_amt"]).round(2)
# Vectorised alternative using np.where (faster for large DataFrames)
orders["tier"] = np.where(orders["subtotal"] >= 500, "Premium",
np.where(orders["subtotal"] >= 200, "Standard", "Basic"))
print(orders[["order_id","customer","subtotal","discount_pct","discount_amt","final_price","tier"]].to_string(index=False))
print(f"\nTotal discounts given: ${orders['discount_amt'].sum():.2f}")
18
Reshaping Data
advanced
Switch between wide and long formats. Most plotting libraries prefer long format; most people read wide format.
melt: wide → long
wide = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Math': [90, 75], 'English': [80, 88], 'Science': [95, 70]
})
long = pd.melt(wide, id_vars=['Name'],
var_name='Subject', value_name='Score')
# Name Subject Score
# 0 Alice Math 90
# 1 Bob Math 75 ...
pivot: long → wide
long.pivot(index='Name', columns='Subject', values='Score')
stack / unstack
df.stack() # columns → innermost index level (wide→long)
df.unstack() # innermost index level → columns (long→wide)
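A minimal stack/unstack round trip on an illustrative scores table:

```python
import pandas as pd

scores = pd.DataFrame({"Math": [90, 75], "English": [80, 88]},
                      index=["Alice", "Bob"])
stacked = scores.stack()   # wide -> long: columns become an inner index level
print(stacked)
wide = stacked.unstack()   # long -> wide again
print(wide)
```

The stacked result is a Series with a (Name, Subject) MultiIndex, which is why stack/unstack pair naturally with groupby and pivot.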
📍 Real-World Use Case — Reshaping Survey Data for a Heatmap
Scenario: A UX researcher melts wide-format feature-ratings into long format for seaborn, then pivots back to a per-feature summary.
use-case example
import pandas as pd
# Wide: each row = respondent, each column = feature score
survey = pd.DataFrame({
"user_id": [101,102,103,104,105],
"checkout_flow":[4, 5, 3, 4, 5],
"search_bar": [3, 4, 5, 2, 3],
"mobile_app": [5, 5, 4, 5, 4],
"payment_page": [2, 3, 2, 4, 3],
"help_center": [4, 2, 3, 3, 5],
})
# Melt to long (required by seaborn / most viz libraries)
long = pd.melt(survey, id_vars=["user_id"],
var_name="feature", value_name="rating")
print("Long format (first 6 rows):")
print(long.head(6))
# Per-feature averages
avg = long.groupby("feature")["rating"].agg(["mean","count","std"]).round(2)
avg = avg.sort_values("mean", ascending=False)
print("\nFeature ratings (ranked):")
print(avg)
# Back to wide for the heatmap matrix
heatmap_data = long.pivot(index="feature", columns="user_id", values="rating")
print("\nHeatmap matrix shape:", heatmap_data.shape)
19
Statistical Functions
advanced
Basic stats
df.mean() # mean per column
df.median() # median per column
df.std() # standard deviation
df.var() # variance
df.min() # minimum
df.max() # maximum
df.sum() # sum
df.corr() # correlation matrix
df.cov() # covariance matrix
df['col'].quantile([0.25, 0.5, 0.75])
df['col'].rank(pct=True) # percentile rank
Cumulative & rolling
s.cumsum() # running total
s.cumprod() # running product
s.cummax() # running maximum
s.cummin() # running minimum
s.pct_change() # % change from previous row
s.rolling(7).mean() # 7-period moving average
s.rolling(7).std() # 7-period moving std
s.rolling(7).sum()
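Cumulative and rolling statistics differ in which history they use: cumulative functions consume everything so far, rolling functions a fixed window. A self-contained sketch on hypothetical daily sales:

```python
import pandas as pd

# Hypothetical daily sales figures
sales = pd.Series([100, 120, 90, 150, 130, 170, 160])
summary = pd.DataFrame({
    "sales": sales,
    "running_total": sales.cumsum(),            # grows with every row
    "pct_change": (sales.pct_change() * 100).round(1),
    "ma3": sales.rolling(3).mean().round(1),    # NaN until the window fills
})
print(summary)
```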
📍 Real-World Use Case — A/B Test Analysis for a Landing Page
Scenario: A growth team ran a landing page A/B test. They analyse conversion rates, compute lift, and check correlations to decide which variant wins.
use-case example
import pandas as pd, numpy as np
np.random.seed(42)
n = 500 # visitors per variant
ab = pd.DataFrame({
"variant": ["A"] * n + ["B"] * n,
"converted": np.concatenate([
np.random.binomial(1, 0.12, n), # A: 12% CVR
np.random.binomial(1, 0.155, n), # B: 15.5% CVR
]),
"time_on_page":np.concatenate([
np.random.normal(45, 15, n),
np.random.normal(52, 18, n),
]).round(1),
})
summary = ab.groupby("variant").agg(
visitors = ("converted", "count"),
conversions = ("converted", "sum"),
cvr_pct = ("converted", lambda x: round(x.mean()*100, 2)),
avg_time_s = ("time_on_page", "mean"),
).round(2)
print(summary)
cvr_a = summary.loc["A","cvr_pct"]
cvr_b = summary.loc["B","cvr_pct"]
lift = (cvr_b - cvr_a) / cvr_a * 100
print(f"\nLift (B vs A): +{lift:.1f}%")
print(f"Revenue impact (1M visitors, $50 AOV): ${1e6 * (cvr_b-cvr_a)/100 * 50:,.0f}")
print(f"corr(time_on_page, converted): {ab['time_on_page'].corr(ab['converted']):.3f}")
20
Plotting with Pandas
visualization
Pandas wraps matplotlib to provide quick-and-easy plotting directly from DataFrames and Series.
Chart types
import matplotlib.pyplot as plt
df['Salary'].plot(kind='bar') # bar
df['Salary'].plot(kind='barh') # horizontal bar
df['Salary'].plot(kind='hist', bins=10) # histogram
df['Salary'].plot(kind='box') # box plot
df['Salary'].plot(kind='line', marker='o') # line
df['Salary'].plot(kind='kde') # density
# Scatter
df.plot(kind='scatter', x='Age', y='Salary', c='Score',
cmap='viridis', s=80)
# Pie
df.set_index('Name')['Salary'].plot(kind='pie', autopct='%1.0f%%')
plt.tight_layout()
plt.savefig('chart.png', dpi=120)
plt.show()
📍 Real-World Use Case — 6-Month Revenue Dashboard
Scenario: A data analyst builds a 3-panel matplotlib dashboard (revenue vs target, mix by category, and MoM growth) for the company all-hands.
use-case example
import pandas as pd
import matplotlib.pyplot as plt
monthly = pd.DataFrame({
"Month": pd.date_range("2024-01", periods=6, freq="ME"),  # "ME" needs pandas >= 2.2; use "M" on older versions
"Revenue": [185000,210000,195000,240000,228000,265000],
"Target": [200000,200000,200000,220000,220000,250000],
"SaaS": [ 80000, 95000, 88000,110000,105000,125000],
"Services": [ 70000, 80000, 75000, 90000, 85000, 95000],
"Hardware": [ 35000, 35000, 32000, 40000, 38000, 45000],
})
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
fig.suptitle("H1 2024 Revenue Dashboard", fontsize=14, fontweight="bold")
monthly.set_index("Month")[["Revenue","Target"]].plot(
ax=axes[0], marker="o", title="Revenue vs Target")
monthly.set_index("Month")[["SaaS","Services","Hardware"]].plot(
kind="bar", stacked=True, ax=axes[1], title="Revenue Mix")
axes[1].set_xticklabels([d.strftime("%b") for d in monthly["Month"]], rotation=30)
monthly["MoM"] = monthly["Revenue"].pct_change() * 100
colors = ["#2ecc71" if x > 0 else "#e74c3c" for x in monthly["MoM"].fillna(0)]
monthly.set_index("Month")["MoM"].plot(kind="bar", ax=axes[2], color=colors, title="MoM Growth %")
plt.tight_layout()
plt.savefig("revenue_dashboard.png", dpi=100, bbox_inches="tight")
plt.show()
print("Dashboard saved.")
21
Duplicate Handling
cleaning
Detect duplicates
df.duplicated() # mask: True for duplicate rows
df.duplicated(subset=['Name']) # only check these columns
df.duplicated(keep='last') # mark earlier occurrences
df[df.duplicated()] # show duplicate rows
df.duplicated().sum() # count duplicates
Drop duplicates
df.drop_duplicates() # keep first of each
df.drop_duplicates(subset=['Name']) # unique by Name
df.drop_duplicates(subset=['Name'],
keep='last') # keep last
df.drop_duplicates(subset=['Name'],
keep=False) # drop ALL duplicated rows
df.drop_duplicates(inplace=True) # modify in place
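keep=False is the one variant that drops every copy, including the first. A tiny sketch with made-up names:

```python
import pandas as pd

# Hypothetical roster: "Ann" appears twice, so keep=False removes both rows
df = pd.DataFrame({"Name": ["Ann", "Ben", "Ann", "Cy"], "Score": [1, 2, 3, 4]})
unique_only = df.drop_duplicates(subset=["Name"], keep=False)
print(unique_only["Name"].tolist())   # only names that occur exactly once
```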
📍 Real-World Use Case — Deduplicating a CRM Customer Export
Scenario: A data-ops team merges two CRM systems and finds duplicate customers caused by the migration. They keep the most recently updated record per email address.
use-case example
import pandas as pd
crm = pd.DataFrame({
"email": ["alice@co.com","bob@co.com","alice@co.com","charlie@co.com","bob@co.com"],
"name": ["Alice Smith","Bob Jones","Alice Smith","Charlie Brown","Bob Jones"],
"plan": ["Pro","Basic","Pro","Enterprise","Pro"], # plan changed in v2!
"last_updated": ["2024-03-01","2024-01-15","2024-03-20","2024-02-10","2024-03-18"],
"source": ["CRM_v1","CRM_v1","CRM_v2","CRM_v1","CRM_v2"],
})
print(f"Before: {len(crm)} rows, {crm['email'].nunique()} unique customers")
print(f"Duplicated emails: {crm.duplicated(subset=['email']).sum()}")
# Sort by date so latest is last, then keep last per email
crm_clean = (
crm
.sort_values("last_updated")
.drop_duplicates(subset=["email"], keep="last")
.reset_index(drop=True)
)
print(f"\nAfter: {len(crm_clean)} rows")
print(crm_clean[["email","name","plan","last_updated","source"]].to_string(index=False))
22
Binning with cut() and qcut()
manipulation
cut() creates equal-width bins. qcut() creates equal-frequency (quantile) bins.
pd.cut: equal width
ages = pd.Series([22, 25, 29, 31, 35, 40, 45, 52, 60, 70])
# Automatic bins
pd.cut(ages, bins=3)
# Custom bin edges with labels
pd.cut(ages,
bins=[0, 25, 40, 100],
labels=['Young', 'Middle', 'Senior'],
right=True) # intervals are (left, right]
# Get bin counts
pd.cut(ages, bins=3).value_counts()
pd.qcut: equal frequency
salaries = pd.Series([65000, 70000, 72000, 85000, 90000, 95000])
pd.qcut(salaries, q=4) # 4 quartiles
pd.qcut(salaries, q=4, labels=['Q1','Q2','Q3','Q4'])
# Attach to DataFrame
df['SalaryBand'] = pd.cut(df['Salary'],
bins=3, labels=['Low','Mid','High'])
df['SalaryQuartile'] = pd.qcut(df['Salary'], q=4,
labels=['Q1','Q2','Q3','Q4'])
📍 Real-World Use Case — Customer Lifetime Value (CLV) Segmentation
Scenario: A marketing analyst segments customers into spend tiers for campaign targeting: Low → re-engagement email, Mid → upsell offer, High → VIP white-glove treatment.
use-case example
import pandas as pd
customers = pd.DataFrame({
"customer_id": range(1001, 1011),
"total_spent": [1250, 320, 4500, 89, 780, 2100, 560, 3800, 150, 920],
"age": [34, 52, 28, 19, 45, 37, 61, 29, 24, 42],
})
# CLV tier by spend (equal-width bins)
customers["clv_tier"] = pd.cut(
customers["total_spent"],
bins=[0, 500, 1500, 5000],
labels=["Low", "Mid", "High"],
)
# Age group by quantile (equal-frequency: same number per bucket)
customers["age_group"] = pd.qcut(
customers["age"], q=3,
labels=["18-30", "31-45", "46+"],
)
# Revenue share per tier
tier_stats = customers.groupby("clv_tier", observed=True).agg(
count = ("customer_id", "count"),
avg_spend = ("total_spent", "mean"),
pct_rev = ("total_spent", lambda x: x.sum() / customers["total_spent"].sum() * 100),
).round(1)
print(customers[["customer_id","total_spent","clv_tier","age_group"]])
print("\nTier revenue breakdown:")
print(tier_stats)
23
assign(), pipe() & Method Chaining
advanced
assign() adds columns without mutating the original. pipe() integrates custom functions into chains. Chaining makes code readable and functional.
assign()
# assign returns a new DataFrame; the original is unchanged
result = df.assign(
Tax = lambda x: x['Salary'] * 0.2,
NetPay = lambda x: x['Salary'] - x['Salary'] * 0.2,
Grade = lambda x: pd.cut(x['Score'],
bins=[0, 75, 85, 100],
labels=['C', 'B', 'A'])
)
pipe() for clean chaining
def normalize(df, col):
df = df.copy()
df[col + '_z'] = (df[col] - df[col].mean()) / df[col].std()
return df
def flag_outliers(df, col, threshold=2):
df = df.copy()
df['is_outlier'] = df[col + '_z'].abs() > threshold
return df
result = (
df
.query('Age >= 25')
.assign(SalaryK = lambda x: x['Salary'] / 1000)
.pipe(normalize, col='Salary')
.pipe(flag_outliers, col='Salary')
.sort_values('Salary', ascending=False)
.reset_index(drop=True)
)
📍 Real-World Use Case — Reusable ETL Pipeline for Raw Sales Data
Scenario: A data engineer builds a composable, readable pipeline using pipe() that takes messy raw order exports and produces clean analysis-ready DataFrames.
use-case example
import pandas as pd, numpy as np
raw = pd.DataFrame({
"order_id": ["ORD-001","ORD-002","ORD-003","ORD-004","ORD-005"],
"customer": [" alice ","BOB","charlie"," DIANA ","eve"],
"amount": [250.0, -99.0, 430.0, np.nan, 185.0],
"quantity": [2, 1, 3, 1, 2],
"date": ["2024-01-15","2024-01-16","2024-01-16","2024-01-17","2024-01-18"],
"status": ["completed","cancelled","completed","pending","completed"],
})
# Composable pipeline functions
def clean_strings(df):
df = df.copy()
df["customer"] = df["customer"].str.strip().str.title()
return df
def remove_bad_orders(df):
return df[(df["amount"] > 0) & df["amount"].notna()]
def enrich(df):
return df.assign(
date = pd.to_datetime(df["date"]),
revenue = df["amount"] * df["quantity"],
day_of_week = lambda x: x["date"].dt.day_name(),
is_weekend = lambda x: x["date"].dt.dayofweek >= 5,
)
# Chain everything with .pipe()
clean = (
raw
.pipe(clean_strings)
.pipe(remove_bad_orders)
.pipe(enrich)
.query("status == 'completed'")
.sort_values("revenue", ascending=False)
.reset_index(drop=True)
)
print(clean[["order_id","customer","amount","quantity","revenue","day_of_week"]])
24
eval() & Advanced query()
advanced
eval() evaluates string expressions against DataFrame columns, often faster than plain Python for large DataFrames.
df.eval()
# Returns a Series
df.eval('A + B * 2')
# Add a column in-place
df.eval('D = A + B + C', inplace=True)
df.eval('E = A ** 2 + B ** 2', inplace=True)
# Boolean expression (returns mask)
mask = df.eval('A > 2 and B < 50')
df[mask]
query() with @variables
min_age = 25
max_salary = 90000
df.query('Age > @min_age and Salary < @max_salary')
df.query('City.str.startswith("N")', engine='python')
df.query('index > 2') # filter on index
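The @-variable syntax is worth a runnable example. A sketch on a hypothetical staff table:

```python
import pandas as pd

# Hypothetical staff table for demonstrating query() with @-variables
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dan"],
    "Age": [24, 31, 28, 41],
    "Salary": [52000, 78000, 95000, 88000],
    "City": ["NY", "LA", "NY", "SF"],
})
min_age, max_salary = 25, 90000

hits = df.query("Age > @min_age and Salary < @max_salary")
print(hits["Name"].tolist())   # Ben and Dan match both conditions

# .str methods require engine='python'
ny = df.query('City.str.startswith("N")', engine="python")
print(ny["Name"].tolist())
```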
📍 Real-World Use Case — Fast Financial Ratio Calculations
Scenario: A financial analyst computes multiple P&L ratios across 1,000 companies. eval() can be faster than chained column assignments and keeps the code readable.
use-case example
import pandas as pd, numpy as np
np.random.seed(0)
n = 1000
fin = pd.DataFrame({
"company": [f"CO_{i:04d}" for i in range(n)],
"revenue": np.random.uniform(1e6, 100e6, n),
"cogs_pct": np.random.uniform(0.30, 0.65, n),
"opex_pct": np.random.uniform(0.10, 0.25, n),
"interest_pct": np.random.uniform(0.01, 0.05, n),
"tax_rate": np.random.uniform(0.20, 0.30, n),
})
# Build the P&L with eval (uses the numexpr engine for speed when installed)
fin.eval("cogs = revenue * cogs_pct", inplace=True)
fin.eval("gross_profit = revenue - cogs", inplace=True)
fin.eval("opex = revenue * opex_pct", inplace=True)
fin.eval("ebit = gross_profit - opex", inplace=True)
fin.eval("interest_exp = revenue * interest_pct", inplace=True)
fin.eval("ebt = ebit - interest_exp", inplace=True)
fin.eval("net_income = ebt * (1 - tax_rate)", inplace=True)
fin.eval("gross_margin = gross_profit / revenue", inplace=True)
fin.eval("net_margin = net_income / revenue", inplace=True)
# Filter: profitable companies with gross margin > 50%
profitable = fin.query("net_income > 0 and gross_margin > 0.50")
print(f"Profitable high-margin companies: {len(profitable)}/{n}")
print(profitable[["company","revenue","gross_margin","net_margin"]].nlargest(5,"net_margin").round(3).to_string(index=False))
25
MultiIndex (Hierarchical Indexing)
advanced
MultiIndex allows multiple levels of row or column labels, which is great for panel data, time series by group, and hierarchical reporting.
Creating MultiIndex
# From arrays
arrays = [['East','East','West','West'], ['Q1','Q2','Q1','Q2']]
idx = pd.MultiIndex.from_arrays(arrays, names=['Region','Quarter'])
# From product (all combinations)
idx = pd.MultiIndex.from_product(
[['East','West'], ['Q1','Q2','Q3']],
names=['Region', 'Quarter']
)
Selecting from MultiIndex
s['East'] # all East rows
s['East']['Q1'] # East β Q1
s.loc[('East', 'Q1')] # tuple access
# xs: cross-section (select a level value)
s.xs('Q1', level='Quarter') # all Q1 regardless of Region
df.xs('East', level='Region')
# Unstack inner level → columns
df['Sales'].unstack(level='Quarter') # Region as rows, Q as cols
GroupBy on MultiIndex
df.groupby(level='Region').sum()
df.groupby(level='Quarter').mean()
df.swaplevel().sort_index() # swap the two levels
📍 Real-World Use Case — Multi-Store Quarterly Inventory Report
Scenario: A retail chain analyst uses MultiIndex to track inventory across 3 stores and 4 quarters, enabling instant cross-sectional slicing.
use-case example
import pandas as pd, numpy as np
stores = ["NYC", "LA", "Chicago"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
idx = pd.MultiIndex.from_product([stores, quarters], names=["store","quarter"])
np.random.seed(0)
inv = pd.DataFrame({
"units_sold": np.random.randint(200, 800, len(idx)),
"units_ordered":np.random.randint(300, 700, len(idx)),
"revenue": np.random.randint(10000, 60000, len(idx)),
}, index=idx)
inv["net_stock"] = inv["units_ordered"] - inv["units_sold"]
print("=== NYC performance all quarters ===")
print(inv.loc["NYC"])
print("\n=== All stores, Q4 only ===")
print(inv.xs("Q4", level="quarter"))
print("\n=== Annual revenue per store ===")
print(inv.groupby(level="store")["revenue"].sum().sort_values(ascending=False))
print("\n=== Revenue matrix (stores × quarters) ===")
print(inv["revenue"].unstack(level="quarter"))
26
Categorical Data
advanced
Categorical dtype stores low-cardinality string columns efficiently as integers internally. It also enables ordered comparisons.
Creating & comparing
sizes = pd.Categorical(
['M', 'L', 'S', 'XL', 'M'],
categories=['S', 'M', 'L', 'XL'],
ordered=True # enables < > comparisons
)
print(sizes.codes) # [1, 2, 0, 3, 1] (integer encoding)
print(sizes > 'M') # [False, True, False, True, False]
In a DataFrame
df['Size'] = pd.Categorical(df['Size'],
categories=['S', 'M', 'L', 'XL'], ordered=True)
# Sorting respects category order (not alphabet)
df.sort_values('Size')
# groupby includes all categories (even empty ones)
df.groupby('Size', observed=False)['Sales'].sum()
# Memory comparison
df['str_col'].memory_usage(deep=True)
df['str_col'].astype('category').memory_usage(deep=True) # much less!
📍 Real-World Use Case — NPS Survey Analysis with Ordered Ratings
Scenario: A product team classifies 0β10 NPS responses into Detractor/Passive/Promoter ordered categories so they sort and group correctly, and empty categories still appear.
use-case example
import pandas as pd
survey = pd.DataFrame({
"respondent": range(1, 11),
"product": ["App","App","Web","App","Web","App","Web","App","Web","Web"],
"nps_score": [9, 6, 8, 10, 4, 7, 9, 3, 8, 6],
"plan": ["Pro","Basic","Pro","Enterprise","Basic",
"Pro","Basic","Pro","Enterprise","Basic"],
})
def nps_label(score):
return "Promoter" if score >= 9 else ("Passive" if score >= 7 else "Detractor")
survey["nps_cat"] = pd.Categorical(
survey["nps_score"].apply(nps_label),
categories=["Detractor","Passive","Promoter"],
ordered=True,
)
# groupby with observed=False shows all categories even if empty
breakdown = survey.groupby("nps_cat", observed=False).agg(
count = ("respondent", "count"),
avg_score = ("nps_score", "mean"),
).round(2)
n = len(survey)
breakdown["pct"] = (breakdown["count"] / n * 100).round(1)
nps_idx = breakdown.loc["Promoter","pct"] - breakdown.loc["Detractor","pct"]
print(breakdown)
print(f"\nNPS Index: {nps_idx:.0f} (range -100 to +100)")
print("\nSorts correctly by category order:")
print(survey.sort_values("nps_cat")[["respondent","nps_score","nps_cat"]].to_string(index=False))
27
Window Functions: expanding & ewm
advanced
Beyond rolling(), pandas offers expanding() (growing window from start) and ewm() (exponentially weighted, recent data gets more weight).
expanding()
s = pd.Series([10, 12, 11, 14, 13, 16, 15, 18])
s.expanding().mean() # cumulative average
s.expanding().max() # running maximum
s.expanding().sum() # cumulative sum
s.expanding().std() # expanding std dev
s.expanding(min_periods=3).mean() # require at least 3 points
ewm(): Exponentially Weighted Moving Average
s.ewm(span=3).mean() # span=N behaves like rolling(N) but weighted
s.ewm(alpha=0.3).mean() # explicit smoothing factor (0 to 1)
s.ewm(halflife=2).mean() # weight halves every 2 periods
s.ewm(com=2).mean() # center of mass
# EWM vs Rolling: EWM never has NaN at the start
compare = pd.DataFrame({
'original': s,
'rolling3': s.rolling(3).mean(),
'ewm_span3': s.ewm(span=3).mean()
})
📍 Real-World Use Case — Stock Technical Analysis: SMA, EMA & Bollinger Bands
Scenario: A quant analyst computes SMA-20, EMA-12, MACD, and Bollinger Bands to generate buy/sell signals on a simulated stock.
use-case example
import pandas as pd, numpy as np
np.random.seed(42)
n = 60
price = pd.Series(
100 + np.cumsum(np.random.normal(0.2, 2.0, n)),
index=pd.date_range("2024-01-01", periods=n, freq="B"),
name="Close",
)
ta = pd.DataFrame({"Close": price})
# Simple & Exponential Moving Averages
ta["SMA_20"] = price.rolling(20).mean()
ta["EMA_12"] = price.ewm(span=12).mean()
ta["EMA_26"] = price.ewm(span=26).mean()
# MACD indicator
ta["MACD"] = ta["EMA_12"] - ta["EMA_26"]
ta["Signal"] = ta["MACD"].ewm(span=9).mean()
ta["Histogram"] = ta["MACD"] - ta["Signal"]
# Bollinger Bands (2 standard deviations)
rolling_std = price.rolling(20).std()
ta["BB_upper"] = ta["SMA_20"] + 2 * rolling_std
ta["BB_lower"] = ta["SMA_20"] - 2 * rolling_std
# Expanding all-time-high
ta["ATH"] = price.expanding().max()
# Buy signal: price below lower Bollinger Band
ta["buy_signal"] = price < ta["BB_lower"]
print(ta.tail(10).round(2))
print(f"\nBuy signals triggered: {ta['buy_signal'].sum()} days")
28
DataFrame Styling
visualization
The .style API renders styled HTML tables in Jupyter. Great for dashboards and reports.
Number formatting
df.style.format({
'Salary': '${:,.0f}',
'Score': '{:.1f}%',
'Date': '{:%Y-%m-%d}'
})
Highlighting & gradients
(df.style
.format({'Salary': '${:,.0f}'})
.highlight_max(subset=['Salary', 'Score'], color='#90EE90')
.highlight_min(subset=['Salary', 'Score'], color='#FFB6C1')
.background_gradient(subset=['Score'], cmap='RdYlGn')
.bar(subset=['Salary'], color='lightblue')
.set_caption('Performance Table')
.hide(axis='index') # hide row numbers
)
Export to HTML
html = df.style.to_html()
with open('table.html', 'w') as f:
f.write(html)
📍 Real-World Use Case — Color-Coded Weekly KPI Report for Management
Scenario: A BI analyst creates a styled HTML table with conditional formatting (green/red cells and data bars) for the weekly business review.
use-case example
import pandas as pd
kpis = pd.DataFrame({
"Metric": ["Revenue","New Customers","Churn Rate","NPS","Avg Order Value","Support CSAT"],
"This Week": [285000, 142, 2.1, 48, 210, 91],
"Last Week": [262000, 128, 2.8, 44, 198, 88],
"Target": [275000, 150, 2.0, 50, 205, 90],
})
kpis["vs Target %"] = ((kpis["This Week"] - kpis["Target"]) / kpis["Target"] * 100).round(1)
kpis["WoW %"] = ((kpis["This Week"] - kpis["Last Week"]) / kpis["Last Week"] * 100).round(1)
styled = (
kpis.style
.format({"This Week": "{:,.0f}", "Last Week": "{:,.0f}",
"Target": "{:,.0f}", "vs Target %": "{:+.1f}%", "WoW %": "{:+.1f}%"})
.background_gradient(subset=["vs Target %"], cmap="RdYlGn", vmin=-10, vmax=10)
.background_gradient(subset=["WoW %"], cmap="RdYlGn", vmin=-10, vmax=10)
.bar(subset=["This Week"], color="#4a90d9")
.set_caption("Weekly KPIs - w/e 7 Mar 2025")
.hide(axis="index")
)
# In Jupyter, `styled` renders the HTML table. Here we print insights:
print(kpis[["Metric","This Week","Target","vs Target %"]].to_string(index=False))
# Export to HTML for email
html_table = styled.to_html()
with open("weekly_kpis.html", "w") as f:
f.write(html_table)
print("\nHTML table saved to weekly_kpis.html")
29
Memory Optimization
performance
Large DataFrames can easily use gigabytes of RAM. A few type changes can reduce memory 60-90%.
Measure & optimize
# Check memory usage (deep=True for actual string sizes)
df.memory_usage(deep=True).sum() / 1e6 # in MB
# Downcast integer types
df['id'] = pd.to_numeric(df['id'], downcast='unsigned') # int64 → uint8/16/32
df['count'] = pd.to_numeric(df['count'], downcast='integer')
# Downcast float
df['value'] = df['value'].astype('float32') # 64 → 32 bit
# Categorical for low-cardinality strings
df['status'] = df['status'].astype('category')
df['country'] = df['country'].astype('category')
# Use int8 for flags/booleans
df['flag'] = df['flag'].astype('int8') # 8x smaller than int64
Read optimizations
# Only load columns you need
df = pd.read_csv('big.csv', usecols=['id', 'value', 'date'])
# Read in chunks for very large files (process() is your own per-chunk function)
chunks = pd.read_csv('huge.csv', chunksize=100_000)
result = pd.concat([process(chunk) for chunk in chunks])
# Parquet is much more memory-efficient than CSV
df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')
📍 Real-World Use Case — Optimising a 500K-Row Transaction Log
Scenario: A data engineer processes a large e-commerce transaction log that is consuming too much RAM. Type downcast + categorical columns reduce memory by ~75%.
use-case example
import pandas as pd, numpy as np
np.random.seed(0)
n = 500_000
txn = pd.DataFrame({
"txn_id": np.arange(n),
"user_id": np.random.randint(1, 100_000, n),
"product_id": np.random.randint(1, 10_000, n),
"category": np.random.choice(["Electronics","Clothing","Books","Food","Sports"], n),
"country": np.random.choice(["US","UK","CA","AU","DE"], n),
"amount": np.random.uniform(1.0, 500.0, n),
"quantity": np.random.randint(1, 10, n),
"returned": np.random.randint(0, 2, n),
})
before_mb = txn.memory_usage(deep=True).sum() / 1e6
print(f"Before: {before_mb:.1f} MB | dtypes: {txn.dtypes.value_counts().to_dict()}")
# Downcast integers
txn["txn_id"] = txn["txn_id"].astype("uint32")
txn["user_id"] = txn["user_id"].astype("uint32")
txn["product_id"] = txn["product_id"].astype("uint16")
txn["quantity"] = txn["quantity"].astype("int8")
txn["returned"] = txn["returned"].astype("int8")
# Float32 instead of float64
txn["amount"] = txn["amount"].astype("float32")
# Categorical for low-cardinality strings (5 unique values each!)
txn["category"] = txn["category"].astype("category")
txn["country"] = txn["country"].astype("category")
after_mb = txn.memory_usage(deep=True).sum() / 1e6
print(f"After: {after_mb:.1f} MB")
print(f"Saved: {(1 - after_mb/before_mb)*100:.0f}%")
30
Sampling & Random Operations
advanced
sample()
df.sample(n=5) # 5 random rows
df.sample(frac=0.2, random_state=42) # 20% sample (reproducible)
df.sample(frac=1).reset_index(drop=True) # shuffle entire df
# Weighted sampling (higher weight = more likely to appear)
df.sample(n=3, weights='Salary')
# Sample columns
df.sample(n=3, axis=1) # 3 random columns
Stratified sampling
# Same proportion from each group
stratified = df.groupby('Dept', group_keys=False).apply(
lambda x: x.sample(frac=0.5, random_state=42)
)
# Train/test split
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
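Weighted sampling and the train/test pattern above, as one runnable sketch on a toy frame (the heavy-Salary row dominates the weighted draw on average):

```python
import pandas as pd

# Toy frame: Salary weights make row "E" far more likely to be drawn
df = pd.DataFrame({"Name": list("ABCDE"), "Salary": [10, 20, 30, 40, 900]})
picks = df.sample(n=3, weights="Salary", random_state=0)
print(picks["Name"].tolist())

# Reproducible 80/20 train/test split without scikit-learn
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))   # 4 rows vs 1 row
```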
📍 Real-World Use Case — Stratified Train / Validation / Test Split for a Churn Model
Scenario: A ML engineer needs reproducible train/val/test splits that preserve the 20% churn rate (class imbalance) in every split.
use-case example
import pandas as pd, numpy as np
np.random.seed(42)
n = 1000
df = pd.DataFrame({
"cust_id": range(1, n+1),
"tenure": np.random.randint(1, 72, n),
"spend": np.random.uniform(20, 500, n).round(2),
"calls": np.random.randint(0, 15, n),
"churn": np.random.choice([0, 1], n, p=[0.80, 0.20]),
})
print(f"Full dataset churn rate: {df['churn'].mean():.1%} ({n} rows)")
def stratified_split(df, label, train=0.70, val=0.15, seed=42):
tr = df.groupby(label, group_keys=False).apply(
lambda x: x.sample(frac=train, random_state=seed))
rem = df.drop(tr.index)
va = rem.groupby(label, group_keys=False).apply(
lambda x: x.sample(frac=val/(1-train), random_state=seed))
te = rem.drop(va.index)
return tr.reset_index(drop=True), va.reset_index(drop=True), te.reset_index(drop=True)
train, val, test = stratified_split(df, "churn")
for name, split in [("Train", train), ("Val", val), ("Test", test)]:
print(f"{name:5s} rows={len(split):4d} churn={split['churn'].mean():.1%}")
31
shift() & diff() for Time Series
advanced
shift() creates lag/lead features. diff() computes differences. Essential for time series feature engineering.
shift: lag & lead
df['prev_close'] = df['Close'].shift(1) # lag 1 period
df['next_close'] = df['Close'].shift(-1) # lead 1 period
df['prev_7day'] = df['Close'].shift(7) # lag 7 periods
# Time-based shift with DateOffset
ts.shift(1, freq='D') # shift by 1 day (moves the index)
diff: differences
df['change'] = df['Close'].diff(1) # day-over-day
df['weekly_chg'] = df['Close'].diff(5) # 5-period difference
df['pct_change'] = df['Close'].pct_change() # % change
# Direction
df['direction'] = np.where(df['change'] > 0, 'UP',
np.where(df['change'] < 0, 'DOWN', 'FLAT'))
# Combine for a feature-rich time series DataFrame
df = df.assign(
lag1 = df['Close'].shift(1),
lag5 = df['Close'].shift(5),
diff1 = df['Close'].diff(1),
pct = df['Close'].pct_change(),
ma7 = df['Close'].rolling(7).mean(),
ewm7 = df['Close'].ewm(span=7).mean(),
)
📍 Real-World Use Case — Portfolio Daily Returns & Drawdown Analysis
Scenario: A portfolio manager calculates daily log-returns, cumulative NAV growth, maximum drawdown, and annualised Sharpe ratio.
use-case example
import pandas as pd, numpy as np
np.random.seed(1)
nav = pd.Series(
[10000] + list(10000 * np.cumprod(1 + np.random.normal(0.0005, 0.015, 59))),
index=pd.date_range("2024-01-02", periods=60, freq="B"),
name="NAV",
)
pf = pd.DataFrame({"NAV": nav})
pf["daily_return"] = pf["NAV"].pct_change()
pf["log_return"] = np.log(pf["NAV"] / pf["NAV"].shift(1))
pf["cum_return_pct"]= (pf["NAV"] / pf["NAV"].iloc[0] - 1) * 100
pf["rolling_max"] = pf["NAV"].expanding().max()
pf["drawdown_pct"] = (pf["NAV"] - pf["rolling_max"]) / pf["rolling_max"] * 100
pf["vol_10d_ann"] = pf["daily_return"].rolling(10).std() * np.sqrt(252) * 100
max_dd = pf["drawdown_pct"].min()
total_r = pf["cum_return_pct"].iloc[-1]
sharpe = (pf["daily_return"].mean() / pf["daily_return"].std()) * np.sqrt(252)
print(pf[["NAV","daily_return","cum_return_pct","drawdown_pct"]].tail(10).round(3).to_string())
print(f"\nTotal return : {total_r:+.2f}%")
print(f"Max drawdown : {max_dd:.2f}%")
print(f"Sharpe ratio : {sharpe:.2f}")
32
explode() for List Columns
manipulation
explode() transforms each list element into its own row, essential when a column contains lists or arrays.
Basic explode
df = pd.DataFrame({
'OrderID': [1, 2, 3],
'Items': [['Apple', 'Banana'], ['Cherry'], ['Apple', 'Cherry', 'Donut']]
})
df.explode('Items').reset_index(drop=True)
# OrderID Items
# 0 1 Apple
# 1 1 Banana
# 2 2 Cherry
# 3 3 Apple
# 4 3 Cherry
# 5 3 Donut
Split string → explode
df['TagList'] = df['Tags'].str.split(',')
df_exploded = df.explode('TagList').drop(columns='Tags')
# Count tag frequency
df_exploded['TagList'].value_counts()
# Explode multiple columns at once (pandas 1.3+)
df.explode(['col1', 'col2'])
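The split-then-explode pattern as a self-contained sketch, on a hypothetical posts table with comma-separated tags:

```python
import pandas as pd

# Hypothetical posts with comma-separated tag strings
df = pd.DataFrame({
    "post_id": [1, 2, 3],
    "Tags": ["python,pandas", "pandas", "python,numpy,pandas"],
})
df["TagList"] = df["Tags"].str.split(",")
exploded = df.explode("TagList").drop(columns="Tags")
print(exploded["TagList"].value_counts())   # pandas appears in every post
```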
📍 Real-World Use Case — E-Commerce: Expand Cart Items for Co-Purchase Analysis
Scenario: An analyst needs to find the most frequently purchased products. Raw orders store items as lists; explode() unpacks them for item-level analysis.
use-case example
import pandas as pd
orders = pd.DataFrame({
"order_id": [5001, 5002, 5003, 5004, 5005],
"customer": ["Alice","Bob","Alice","Charlie","Bob"],
"items": [["USB-Hub","Webcam","Mouse"],
["Keyboard","Mouse"],
["Monitor","USB-Hub","Webcam","Headphones"],
["Mouse","Keyboard","USB-Hub"],
["Webcam","Headphones"]],
"order_value": [89.50, 125.00, 349.99, 210.00, 75.50],
})
# Explode: one row per item (order_value is repeated per item)
items = orders.explode("items").rename(columns={"items":"product"}).reset_index(drop=True)
print("Exploded (first 8 rows):")
print(items.head(8))
print("\nTop products by frequency:")
print(items["product"].value_counts())
# Co-purchase: how often does Webcam appear with Mouse?
webcam_orders = items[items["product"]=="Webcam"]["order_id"]
print("\nProducts bought with Webcam:")
print(items[items["order_id"].isin(webcam_orders)]["product"].value_counts())
Complete Cheatsheet
reference
All 32 Topics: Quick Reference
| Task | Code |
|---|---|
| Create DataFrame | pd.DataFrame({'col': [...]}) |
| Read CSV | pd.read_csv('file.csv') |
| Select column | df['col'] or df[['a','b']] |
| Label select | df.loc[0:5, 'col1':'col3'] |
| Position select | df.iloc[:5, :3] |
| Filter rows | df[df['col'] > 10] or df.query('col > 10') |
| Add column | df['new'] = df['a'] + df['b'] |
| Remove column | df.drop(columns=['col']) |
| Rename | df.rename(columns={'old': 'new'}) |
| Missing count | df.isnull().sum() |
| Fill NaN | df.fillna(0) / df.ffill() |
| Drop NaN | df.dropna(subset=['col']) |
| Convert type | df['col'].astype('int32') |
| Parse dates | pd.to_datetime(df['col']) |
| Sort | df.sort_values('col', ascending=False) |
| GroupBy + agg | df.groupby('col').agg({'a': 'sum', 'b': 'mean'}) |
| GroupBy transform | df.groupby('g')['col'].transform('sum') |
| Merge | pd.merge(left, right, on='id', how='inner') |
| Concat | pd.concat([df1, df2], ignore_index=True) |
| Pivot table | pd.pivot_table(df, values='v', index='r', columns='c') |
| String ops | df['col'].str.lower() / .str.contains('pat') |
| Date parts | df['date'].dt.year / .dt.month_name() |
| Apply func | df['col'].apply(func) |
| Vectorized if | np.where(cond, 'yes', 'no') |
| Wide → Long | pd.melt(df, id_vars=['id'], var_name='k', value_name='v') |
| Long → Wide | df.pivot(index='i', columns='c', values='v') |
| Correlation | df.corr() |
| Rolling mean | df['col'].rolling(7).mean() |
| EWM | df['col'].ewm(span=7).mean() |
| Duplicates | df.drop_duplicates(subset=['col']) |
| Binning | pd.cut(s, bins=3) / pd.qcut(s, q=4) |
| Assign col | df.assign(Tax=lambda x: x.Salary * 0.2) |
| Pipe func | df.pipe(my_func).pipe(other_func) |
| eval expr | df.eval('C = A + B', inplace=True) |
| MultiIndex | df.xs('Q1', level='Quarter') |
| Categorical | pd.Categorical(s, categories=[...], ordered=True) |
| Style table | df.style.background_gradient().highlight_max() |
| Memory | df.memory_usage(deep=True) |
| Sample | df.sample(frac=0.2, random_state=42) |
| Lag / lead | df['col'].shift(1) / .shift(-1) |
| Diff | df['col'].diff(1) / .pct_change() |
| Explode list | df.explode('list_col').reset_index(drop=True) |
| Plot | df.plot(kind='bar') / df.plot(kind='scatter', x='a', y='b') |
📍 Real-World Use Case — End-to-End Data Pipeline in 30 Lines
Scenario: A data analyst takes raw data from ingestion to a final analytical summary, combining all major pandas techniques in one concise workflow.
use-case example
import pandas as pd, numpy as np
from io import StringIO
# 1. Simulate raw CSV data (replace with pd.read_csv("sales.csv") in real life)
raw = StringIO(
"order_id,customer,date,amount,product,region\n"
"001,Alice,2024-01-10,250,Widget A,East\n"
"002,Bob,2024-01-15,180,Gadget X,West\n"
"002,Bob,2024-01-15,180,Gadget X,West\n" # duplicate!
"003,Charlie,2024-02-01,430,Widget B,East\n"
"004,Diana,2024-02-20,,Service Z,North\n" # missing amount
"005,Alice,2024-03-05,520,Widget A,East\n"
)
# 2. Load
df = pd.read_csv(raw, dtype={"order_id": str}, parse_dates=["date"])
# 3. Clean
df = (df
.dropna(subset=["amount"])
.drop_duplicates(subset=["order_id"])
.assign(customer=lambda x: x["customer"].str.title()))
# 4. Enrich
df = df.assign(
month = df["date"].dt.to_period("M").astype(str),
quarter = "Q" + df["date"].dt.quarter.astype(str),
revenue_k = (df["amount"] / 1000).round(2),
value_band = pd.cut(df["amount"], bins=[0,200,400,9999], labels=["Low","Mid","High"]),
)
# 5. Analyse
summary = df.groupby(["region","quarter"]).agg(
orders = ("order_id", "count"),
total_rev = ("amount", "sum"),
avg_order = ("amount", "mean"),
).round(1)
print(summary)
# 6. Save
df.to_csv("clean_sales.csv", index=False)
print(f"\nClean data saved ({len(df)} orders).")