Topics covered: Skewness and kurtosis, bivariate analysis, matplotlib figure architecture, seaborn (histplot/boxplot/violinplot/heatmap), Plotly interactive charts, publication-quality styling
Learning objectives: By the end of this week you will be able to apply exploratory data analysis and visualisation concepts to real datasets, write executable Python code for each technique, and complete both graded assignments independently.
Univariate analysis examines one variable at a time. Compute: mean, median, standard deviation, skewness (positive = long right tail, negative = long left tail), and kurtosis (tail heaviness). Bivariate analysis examines pairs: correlation for two numerical variables, box plots for numerical vs categorical, cross-tabulation for two categorical variables.
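For reference, the two shape statistics above can be written in terms of central moments. The block below gives the moment-based definitions that scipy.stats.skew and scipy.stats.kurtosis use with their default arguments (biased moments, Fisher definition); the symbols m_k, g_1 and g_2 are notation introduced here for illustration, not taken from the course text.

```latex
m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k, \qquad
g_1 = \frac{m_3}{m_2^{3/2}} \quad \text{(skewness)}, \qquad
g_2 = \frac{m_4}{m_2^{2}} - 3 \quad \text{(excess kurtosis)}
```

Under this convention a normal distribution has g_1 = 0 and g_2 = 0, so a positive g_2 means heavier tails than normal.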
import pandas as pd
import numpy as np
from scipy import stats

# Synthetic applicant data; the fixed seed makes the output reproducible
np.random.seed(0)
df = pd.DataFrame({
    'income': np.random.lognormal(12.5, 0.8, 1000),
    'age': np.random.normal(38, 10, 1000).clip(18, 70),
    'education': np.random.choice(['Primary','Secondary','Tertiary'], 1000)
})

def eda_summary(series):
    """Return key univariate statistics for a numeric Series."""
    return pd.Series({
        'mean': round(series.mean(), 2),
        'median': round(series.median(), 2),
        'std': round(series.std(), 2),
        'skewness': round(stats.skew(series.dropna()), 3),
        # stats.kurtosis returns excess kurtosis by default (normal distribution = 0)
        'kurtosis': round(stats.kurtosis(series.dropna()), 3),
        'iqr': round(series.quantile(0.75) - series.quantile(0.25), 2)
    })

print(eda_summary(df['income']))
print('Interpretation: skewness > 1 indicates strong right skew (income is log-normally distributed)')
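The three bivariate techniques named earlier (correlation, numerical vs categorical comparison, cross-tabulation) can be sketched with pandas alone. This is a minimal illustration, not a prescribed method: the frame df_demo and the 'defaulted' column are hypothetical names invented for this example, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic data in the same shape as the example above, plus a
# hypothetical 'defaulted' flag so there are two categorical variables.
rng = np.random.default_rng(0)
n = 1000
df_demo = pd.DataFrame({
    'income': rng.lognormal(12.5, 0.8, n),
    'age': rng.normal(38, 10, n).clip(18, 70),
    'education': rng.choice(['Primary', 'Secondary', 'Tertiary'], n),
    'defaulted': rng.choice(['No', 'Yes'], n, p=[0.8, 0.2]),
})

# Numerical vs numerical: Pearson measures linear association;
# Spearman is rank-based and less sensitive to the skew in income.
pearson_r = df_demo['income'].corr(df_demo['age'])
spearman_r = df_demo['income'].corr(df_demo['age'], method='spearman')

# Numerical vs categorical: the quartiles that a box plot would draw.
income_by_edu = (
    df_demo.groupby('education')['income']
    .describe()[['25%', '50%', '75%']]
)

# Categorical vs categorical: row-normalised cross-tabulation,
# so each education level's shares sum to 1.
edu_by_default = pd.crosstab(
    df_demo['education'], df_demo['defaulted'], normalize='index'
)

print(f'Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_r:.3f}')
print(income_by_edu.round(0))
print(edu_by_default.round(3))
```

Because the columns here are generated independently, the correlations should be close to zero; on a real dataset these same three calls are where relationships would first show up.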
Create Figure and Axes explicitly with fig, ax = plt.subplots() for production code. Set DPI >= 150 for screen, 300 for print. Use seaborn's 'colorblind' palette for accessibility. Remove top and right spines. Always include axis labels with units, and a descriptive title. Save with plt.savefig('plot.png', dpi=300, bbox_inches='tight').
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Assumes df from the univariate example above is already defined

# Global style: screen DPI, readable font size, no top/right spines
plt.rcParams.update({
    'figure.dpi': 150, 'font.size': 11,
    'axes.spines.top': False, 'axes.spines.right': False
})
palette = sns.color_palette('colorblind')

fig, axes = plt.subplots(2, 2, figsize=(12, 9))
fig.suptitle('Credit Applicant EDA Dashboard', fontsize=14, fontweight='bold')

# Top left: income histogram with KDE overlay
sns.histplot(df['income'], kde=True, bins=40, ax=axes[0,0], color=palette[0])
axes[0,0].set_title('Income Distribution')
axes[0,0].set_xlabel('Annual Income (NGN)')

# Top right: income by education level
# Note: seaborn >= 0.13 warns when palette is passed without hue;
# adding hue='education', legend=False silences the warning there.
sns.boxplot(data=df, x='education', y='income', ax=axes[0,1], palette=palette[:3])
axes[0,1].set_title('Income by Education Level')
axes[0,1].set_ylabel('Annual Income (NGN)')

# Bottom left: correlation heatmap of the numerical columns
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=axes[1,0])
axes[1,0].set_title('Correlation Matrix')

# Bottom right: count of applicants per education level
df['education'].value_counts().plot(kind='bar', ax=axes[1,1], color=palette[:3])
axes[1,1].set_title('Education Distribution')

plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()
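The topic list for this week also names violinplot, which the dashboard above does not use. The following is a minimal sketch of one on a single explicit Axes, following the same styling advice. The frame violin_df, the log10 transform and the output filename are assumptions made for illustration, and the Agg backend line is only there so the script runs without a display (remove it in Jupyter).

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical), same shape as the income example
rng = np.random.default_rng(0)
violin_df = pd.DataFrame({
    'income': rng.lognormal(12.5, 0.8, 600),
    'education': rng.choice(['Primary', 'Secondary', 'Tertiary'], 600),
})
# Plot log10(income) so the long right tail does not flatten the violins
violin_df['log_income'] = np.log10(violin_df['income'])

fig_v, ax_v = plt.subplots(figsize=(7, 4.5))
sns.violinplot(
    data=violin_df, x='education', y='log_income',
    order=['Primary', 'Secondary', 'Tertiary'],
    inner='box',                               # mini box plot inside each violin
    color=sns.color_palette('colorblind')[0],  # single accessible colour
    ax=ax_v,
)
ax_v.set_title('Income Distribution by Education Level')
ax_v.set_xlabel('Education Level')
ax_v.set_ylabel('log10 Annual Income (NGN)')
sns.despine(ax=ax_v)  # drop top and right spines for this Axes only
fig_v.tight_layout()
fig_v.savefig('income_violin.png', dpi=300, bbox_inches='tight')
```

A violin shows the full density shape that a box plot hides, which is useful when a group might be bimodal; the inner box keeps the median and quartiles visible.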
plotly.express (import as px) creates interactive charts in one function call. Charts are JavaScript-based and render in Jupyter or as standalone HTML files. Use .show() in Jupyter, .write_html('chart.html') to save. Key functions: px.histogram(), px.box(), px.scatter(), px.bar(), px.line(). Pass hover_data to control tooltip content.
import plotly.express as px

# Interactive scatter plot
fig = px.scatter(
    df,
    x='age', y='income',
    color='education',
    hover_data=['income','age','education'],
    title='Income vs Age by Education Level',
    labels={'income': 'Annual Income (NGN)', 'age': 'Age (Years)'},
    color_discrete_sequence=px.colors.colorbrewer.Set1
)
fig.update_layout(height=500, template='plotly_white')
fig.show()

# Save as interactive HTML
# fig.write_html('income_scatter.html')
Submit your completed notebooks to your GitHub repository before the next session. Feedback will be provided within 48 hours.
Assignment 1: Using a financial or health dataset from Kaggle, produce: (1) a univariate analysis of all numerical variables with an interpretation of skewness, (2) a bivariate scatter matrix, (3) box plots grouped by the target variable, (4) a correlation heatmap, and (5) three written findings with their business implications.
Assignment 2: Create a Plotly HTML dashboard containing a distribution chart with a KDE overlay, a grouped bar chart, and a scatter plot with colour encoding.