Topics covered: Missing data mechanisms (MCAR/MAR/MNAR), imputation strategies, IQR and z-score outlier detection, winsorisation, string cleaning, one-hot and ordinal encoding
Learning objectives: By the end of this week you will be able to apply data cleaning and preprocessing concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.
Missing data has three mechanisms (Rubin 1976): MCAR (missing completely at random), MAR (missing at random - depends on observed variables), and MNAR (missing not at random - depends on the missing value itself). Strategy depends on mechanism and proportion missing. Under 5% MCAR: row deletion is acceptable. Otherwise: mean/median for numerical, mode for categorical. For time series: forward fill (ffill()) or backward fill (bfill()).
import pandas as pd
import numpy as np
np.random.seed(42)
n = 200
df = pd.DataFrame({
'age': np.random.randint(18,70,n).astype(float),
'income': np.random.normal(250000,80000,n),
'edu': np.random.choice(['Primary','Secondary','Tertiary',None], n)
})
df.loc[df.sample(frac=0.08).index, 'age'] = np.nan
df.loc[df.sample(frac=0.12).index, 'income'] = np.nan
# Assess missingness
print(pd.DataFrame({
'missing_count': df.isnull().sum(),
'missing_pct': (df.isnull().mean()*100).round(2)
}).query('missing_count > 0'))
# Impute
df['income'].fillna(df['income'].median(), inplace=True)
df['age'].fillna(df['age'].mean(), inplace=True)
df['edu'].fillna(df['edu'].mode()[0], inplace=True)
print('Missing after imputation:', df.isnull().sum().sum())
The IQR method: outliers are below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (Tukey fences). This is robust because it uses median-based spread, resistant to extreme values. The z-score method flags values more than 3 standard deviations from the mean. Treatment options: delete rows, cap/winsorise (replace beyond fence with fence value), log transform, or retain if legitimate.
import pandas as pd
import numpy as np
def detect_outliers_iqr(series, multiplier=1.5):
Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
IQR = Q3 - Q1
low = Q1 - multiplier * IQR
high = Q3 + multiplier * IQR
return (series < low) | (series > high), low, high
def winsorise(series, lo=0.01, hi=0.99):
return series.clip(lower=series.quantile(lo), upper=series.quantile(hi))
np.random.seed(0)
incomes = np.append(np.random.normal(250000, 60000, 195), [2000000, 3000000, -100000, 1800000, 2500000])
df = pd.DataFrame({'income': incomes})
mask, low, high = detect_outliers_iqr(df['income'])
print(f'Outliers: {mask.sum()} rows')
df['income_clean'] = winsorise(df['income'])
print(f'Range before: {df["income"].min():.0f} to {df["income"].max():.0f}')
print(f'Range after: {df["income_clean"].min():.0f} to {df["income_clean"].max():.0f}')
Machine learning algorithms require numerical input. One-hot encoding converts a categorical column into k-1 binary columns. Use pd.get_dummies() with drop_first=True to avoid multicollinearity. For ordinal variables (natural order), use .map() with a dictionary. String cleaning: .str.strip() removes whitespace, .str.lower() standardises case, .str.replace() handles patterns.
import pandas as pd
df = pd.DataFrame({'edu': ['Primary','Secondary','Tertiary','Secondary','Primary'],
'city': [' Lagos ','KANO','abuja','Lagos','kano ']})
# One-hot encoding (baseline category dropped)
ohe = pd.get_dummies(df['edu'], prefix='edu', drop_first=True)
print(ohe) # columns: edu_Secondary, edu_Tertiary
# Ordinal encoding
ordinal_map = {'Primary': 1, 'Secondary': 2, 'Tertiary': 3}
df['edu_ord'] = df['edu'].map(ordinal_map)
# String cleaning
df['city_clean'] = df['city'].str.strip().str.title()
print(df['city_clean'].value_counts())
Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.
On a Kaggle housing dataset: assess and impute missing values with justification, detect outliers with IQR and z-score, standardise all text columns, encode categoricals. Document every decision.
Write a reusable data_cleaning_pipeline(df) function that imputes by column type, winsorises at 1st/99th percentile, strips/title-cases strings, and returns a cleaning_report dict.