Topics covered: DataFrame creation, reading CSV/Excel, exploration methods, boolean filtering, .loc[]/.iloc[], groupby aggregations, .apply(), method chaining
Learning objectives: By the end of this week you will be able to apply data manipulation with pandas concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.
Pandas provides two data structures: Series (labelled 1D array) and DataFrame (labelled 2D table). Read data with pd.read_csv() or pd.read_excel(). Explore with: .shape (rows, cols), .info() (dtypes), .describe() (summary statistics), .head(n) (first n rows), .isnull().sum() (missing counts), .nunique() (unique values per column).
import pandas as pd
import numpy as np
# Create a DataFrame
data = {
'patient_id': ['P001','P002','P003','P004','P005'],
'age': [45, 52, 38, 61, 29],
'gender': ['F', 'M', 'F', 'M', 'F'],
'baseline_bp': [138, 145, 128, 162, 118],
'treatment': ['drug','placebo','drug','drug','placebo']
}
df = pd.DataFrame(data)
print(df.shape) # (5, 5)
print(df.dtypes) # data types
print(df.describe()) # summary statistics
print(df.head(3))
Select a column with df['col'] (returns Series) or df[['col1','col2']] (returns DataFrame). Filter rows with boolean conditions. Combine conditions with & (and) and | (or). Use ~ for NOT. Always use .loc[row_selector, col_selector] when both selecting and assigning to avoid the SettingWithCopyWarning.
import pandas as pd
df = pd.read_csv('customers.csv') # your dataset
# Boolean filtering
high_balance = df[df['balance'] > 500000]
female_prime = df[(df['gender'] == 'F') & (df['credit_score'] >= 750)]
not_defaulted = df[~df['is_defaulted']]
print(f'High balance: {len(high_balance)}')
print(f'Female prime: {len(female_prime)}')
# .loc[] - select rows and columns simultaneously
subset = df.loc[df['age'] > 35, ['customer_id', 'age', 'balance']]
# Safe assignment with .loc[]
df.loc[df['balance'] < 0, 'balance'] = 0
groupby() splits data into groups, applies a function to each group, and combines results. Chain .agg() for multiple aggregations. .apply() runs a function on each row (axis=1) or column (axis=0). Method chaining writes entire pipelines without intermediate variables - wrap in parentheses for multi-line readability.
import pandas as pd
df = pd.read_csv('sales.csv') # your dataset
# GroupBy with multiple aggregations
summary = df.groupby('region').agg(
total_sales = ('sale_amount', 'sum'),
avg_sale = ('sale_amount', 'mean'),
transaction_count = ('id', 'count')
).reset_index().sort_values('total_sales', ascending=False)
print(summary)
# .apply() with lambda - risk tier classification
df['risk_tier'] = df['credit_score'].apply(
lambda s: 'Low' if s >= 750 else ('Medium' if s >= 650 else 'High')
)
# Method chain
result = (
df
.query('date >= "2024-01-01"')
.assign(revenue_ngn = lambda x: x['revenue_usd'] * 1450)
.groupby('category')
.agg(total = ('revenue_ngn', 'sum'))
.sort_values('total', ascending=False)
.head(10)
)
Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.
Load the Titanic dataset. Find: (1) survival rate by passenger class, (2) mean age by gender among survivors, (3) the embarkation port with the highest average fare.
Load a sales dataset. Using groupby, produce a monthly revenue summary by product category with total revenue, transaction count, and average order value.