W4
Intermediate 3 sessions · 6 hours · Python

Week 4: Data Cleaning and Preprocessing

Topics covered: Missing data mechanisms (MCAR/MAR/MNAR), imputation strategies, IQR and z-score outlier detection, winsorisation, string cleaning, one-hot and ordinal encoding

Learning objectives: By the end of this week you will be able to apply data cleaning and preprocessing concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: Missing Data Analysis and Imputation

Missing data has three mechanisms (Rubin 1976): MCAR (missing completely at random), MAR (missing at random - depends on observed variables), and MNAR (missing not at random - depends on the missing value itself). Strategy depends on mechanism and proportion missing. Under 5% MCAR: row deletion is acceptable. Otherwise: mean/median for numerical, mode for categorical. For time series: forward fill (ffill()) or backward fill (bfill()).

import pandas as pd
import numpy as np

np.random.seed(42)
n = 200
df = pd.DataFrame({
    'age':    np.random.randint(18,70,n).astype(float),
    'income': np.random.normal(250000,80000,n),
    'edu':    np.random.choice(['Primary','Secondary','Tertiary',None], n)
})
df.loc[df.sample(frac=0.08).index, 'age'] = np.nan
df.loc[df.sample(frac=0.12).index, 'income'] = np.nan

# Assess missingness
print(pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_pct': (df.isnull().mean()*100).round(2)
}).query('missing_count > 0'))

# Impute
df['income'].fillna(df['income'].median(), inplace=True)
df['age'].fillna(df['age'].mean(), inplace=True)
df['edu'].fillna(df['edu'].mode()[0], inplace=True)
print('Missing after imputation:', df.isnull().sum().sum())

Session 2: Outlier Detection and Treatment

The IQR method: outliers are below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (Tukey fences). This is robust because it uses median-based spread, resistant to extreme values. The z-score method flags values more than 3 standard deviations from the mean. Treatment options: delete rows, cap/winsorise (replace beyond fence with fence value), log transform, or retain if legitimate.

import pandas as pd
import numpy as np

def detect_outliers_iqr(series, multiplier=1.5):
    Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
    IQR = Q3 - Q1
    low  = Q1 - multiplier * IQR
    high = Q3 + multiplier * IQR
    return (series < low) | (series > high), low, high

def winsorise(series, lo=0.01, hi=0.99):
    return series.clip(lower=series.quantile(lo), upper=series.quantile(hi))

np.random.seed(0)
incomes = np.append(np.random.normal(250000, 60000, 195), [2000000, 3000000, -100000, 1800000, 2500000])
df = pd.DataFrame({'income': incomes})

mask, low, high = detect_outliers_iqr(df['income'])
print(f'Outliers: {mask.sum()} rows')
df['income_clean'] = winsorise(df['income'])
print(f'Range before: {df["income"].min():.0f} to {df["income"].max():.0f}')
print(f'Range after:  {df["income_clean"].min():.0f} to {df["income_clean"].max():.0f}')

Session 3: Encoding Categorical Variables

Machine learning algorithms require numerical input. One-hot encoding converts a categorical column into k-1 binary columns. Use pd.get_dummies() with drop_first=True to avoid multicollinearity. For ordinal variables (natural order), use .map() with a dictionary. String cleaning: .str.strip() removes whitespace, .str.lower() standardises case, .str.replace() handles patterns.

import pandas as pd

df = pd.DataFrame({'edu': ['Primary','Secondary','Tertiary','Secondary','Primary'],
                   'city': [' Lagos ','KANO','abuja','Lagos','kano ']})

# One-hot encoding (baseline category dropped)
ohe = pd.get_dummies(df['edu'], prefix='edu', drop_first=True)
print(ohe)  # columns: edu_Secondary, edu_Tertiary

# Ordinal encoding
ordinal_map = {'Primary': 1, 'Secondary': 2, 'Tertiary': 3}
df['edu_ord'] = df['edu'].map(ordinal_map)

# String cleaning
df['city_clean'] = df['city'].str.strip().str.title()
print(df['city_clean'].value_counts())

Week 4 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback within 48 hours.

On a Kaggle housing dataset: assess and impute missing values with justification, detect outliers with IQR and z-score, standardise all text columns, encode categoricals. Document every decision.

Write a reusable data_cleaning_pipeline(df) function that imputes by column type, winsorises at 1st/99th percentile, strips/title-cases strings, and returns a cleaning_report dict.

Previous WeekNext: Week 5