Week 8: Correlation, Regression and Causality | Python Data Science Tutorials

Learning objectives: By the end of this week you will be able to apply correlation, regression and causality concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.

Session 1: Correlation Analysis and Matrices

Pearson's r: measures the strength and direction of a linear relationship (-1 to 1). Sensitive to outliers. Assumes continuous variables; the standard significance test additionally assumes approximate normality.

Spearman's rho: rank-based, measures monotonic (not necessarily linear) relationships. Use it when the data is ordinal, the relationship is non-linear but monotonic, or the data contains outliers.

Partial correlation: measures the association between two variables after removing the linear influence of a third variable.

import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

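np.random.seed(42)
n = 200
# Simulated indicators. Each column is drawn independently of the others,
# so the true correlations are zero and the estimates below should be near 0.
df = pd.DataFrame({
    'gdp_per_capita': np.random.normal(5000, 1500, n),
    'literacy_rate':  np.random.uniform(0.55, 0.98, n),
    'infant_mortality': np.random.exponential(30, n)
})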

print('Pearson correlation matrix:')
print(df.corr(method='pearson').round(3))

r_sp, p_sp = stats.spearmanr(df['gdp_per_capita'], df['infant_mortality'])
print(f'Spearman rho={r_sp:.3f}, p={p_sp:.4f}')

fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(df.corr(), annot=True, cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Economic Indicator Correlation Matrix')
plt.tight_layout()
plt.show()
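
Partial correlation, defined at the start of this session, is not covered by the code above. One way to sketch it is to regress both variables on the third and correlate the residuals. The helper name `partial_corr` and the simulated variables `x`, `y`, `z` are illustrative assumptions, not part of the course material.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Pearson correlation between x and y after regressing z out of each."""
    # Design matrix with an intercept column so residuals are mean-centred
    Z = np.column_stack([np.ones(len(z)), z])
    # Residuals of x and y from least-squares fits on z
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r, _ = stats.pearsonr(res_x, res_y)
    return r

rng = np.random.default_rng(0)
z = rng.normal(size=500)              # hypothetical third variable
x = 2.0 * z + rng.normal(size=500)    # x depends on z
y = -1.5 * z + rng.normal(size=500)   # y depends on z, but not on x

r_raw, _ = stats.pearsonr(x, y)
r_partial = partial_corr(x, y, z)
print(f'raw r={r_raw:.3f}, partial r (controlling for z)={r_partial:.3f}')
```

In this simulation the raw correlation is strongly negative only because both variables depend on z; once z is removed, the partial correlation should fall close to zero.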

Session 2: Linear Regression with statsmodels

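Simple linear regression: Y = beta0 + beta1*X + epsilon, where beta0 is the intercept, beta1 is the expected change in Y per unit increase in X, and epsilon is the error term. OLS chooses the coefficients that minimise the sum of squared residuals. Multiple regression adds predictors; each coefficient is then interpreted with the other predictors held constant.

Check assumptions (LINE): Linearity (residuals vs fitted plot), Independence of errors, Normality of residuals (Q-Q plot), Equal variance/homoscedasticity (scale-location plot).

Multicollinearity: as a common rule of thumb, VIF > 10 indicates a problem.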

import pandas as pd
import numpy as np
import statsmodels.api as sm

np.random.seed(10)
n = 300
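# Predictors are simulated as standard normal, so each coefficient below is the
# salary change per one-unit (here, one standard deviation) increase.
X = np.random.normal(0, 1, (n, 3))
Y = 50000 + 8000*X[:,0] - 3000*X[:,1] + 12000*X[:,2] + np.random.normal(0, 5000, n)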

df = pd.DataFrame(X, columns=['years_exp', 'edu_score', 'location'])
df['salary'] = Y

# statsmodels provides p-values, confidence intervals, diagnostic info
X_const = sm.add_constant(df.drop('salary', axis=1))
model = sm.OLS(df['salary'], X_const).fit()
print(model.summary())

# Variance Inflation Factor (VIF) for multicollinearity.
# Column 0 of X_const is the intercept, so VIFs are computed for columns 1 onward.
from statsmodels.stats.outliers_influence import variance_inflation_factor
predictor_idx = range(1, X_const.shape[1])
vif = pd.DataFrame({'feature': X_const.columns[1:],
                    'VIF': [variance_inflation_factor(X_const.values, i) for i in predictor_idx]})
print(vif)

Session 3: Logistic Regression and Causal Inference

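Logistic regression models the log-odds of a binary outcome: log(p/(1-p)) = beta0 + beta1*X. Exponentiated coefficients are odds ratios: exp(beta1) is the multiplicative change in the odds per unit increase in X. For imbalanced classes, use AUC-ROC rather than accuracy as the primary evaluation metric, because a model that always predicts the majority class can still score high accuracy.

Correlation does not imply causation: a confounder Z can cause both X and Y, producing a spurious correlation between them. Simpson's Paradox: a trend that appears within each group can reverse when the groups are combined.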

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

np.random.seed(5)
n = 500
X = np.column_stack([np.random.normal(0,1,n), np.random.normal(0,1,n),
                     np.random.binomial(1, 0.4, n)])
log_odds = -1.5 + 1.2*X[:,0] + 0.8*X[:,1] + 0.6*X[:,2]
Y = np.random.binomial(1, 1/(1+np.exp(-log_odds)))

X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.25, random_state=42)
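# scikit-learn's LogisticRegression applies L2 regularisation by default (C=1.0),
# so the fitted coefficients are shrunk slightly towards zero.
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)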
y_prob = model.predict_proba(X_te)[:,1]

print(f'AUC-ROC: {roc_auc_score(y_te, y_prob):.4f}')
print(classification_report(y_te, model.predict(X_te)))

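Simpson's Paradox, mentioned in this session, is easier to see on simulated data. In the sketch below, three hypothetical groups each show a positive relationship between x and y, but the group baselines are arranged so that the pooled relationship is negative. The group offsets and the seed are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frames = []
# Groups with a higher baseline in x have a lower baseline in y
for g, (x0, y0) in enumerate([(0, 20), (5, 10), (10, 0)]):
    x = x0 + rng.normal(0, 1, 100)
    y = y0 + 1.0 * (x - x0) + rng.normal(0, 1, 100)  # slope +1 within each group
    frames.append(pd.DataFrame({'group': g, 'x': x, 'y': y}))
data = pd.concat(frames, ignore_index=True)

# Correlation ignoring group membership
pooled_r = data['x'].corr(data['y'])

# Correlation computed separately within each group
within_r = pd.Series({g: d['x'].corr(d['y']) for g, d in data.groupby('group')})

print(f'pooled r = {pooled_r:.3f}')
print('within-group r:')
print(within_r.round(3))
```

Here group membership plays the role of the confounder: it drives both x and y, so ignoring it reverses the sign of the apparent relationship.
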
Week 8 Assignments

Submit completed notebooks to your GitHub repository before the next session. Feedback is provided within 48 hours.

Assignment 1, multiple regression: check VIF, produce 4 diagnostic plots, run a Breusch-Pagan test for heteroscedasticity, and interpret every coefficient with its 95% CI.

Assignment 2, logistic regression classification: report accuracy, precision, recall, F1 and AUC-ROC. Plot the ROC curve. Find the optimal threshold by maximising F1.
