Natural Language Processing | 2024 | github.com/kmexa

Multi-Class News Article Text Classification

A multi-class text classification pipeline on a 6-category subset of the 20 Newsgroups dataset, using TF-IDF vectorisation with unigrams and bigrams and a benchmark of three classifiers. The Linear SVM performs best, at roughly 94% accuracy. The pipeline mirrors workflows used in policy document analysis and automated report categorisation.


Methodology

01 Text preprocessing: lowercase normalisation, digit removal, punctuation stripping, and stopword filtering to reduce noise

02 TF-IDF vectorisation with unigrams and bigrams, capped at 15,000 features; bigrams capture compound terms such as "space shuttle" and "gun control", which carry topic signal that the individual words lose

03 Three-classifier benchmark: Multinomial Naive Bayes, Logistic Regression, Linear SVM

04 Linear SVM selected for its consistent advantage on high-dimensional, sparse TF-IDF feature spaces

05 Confusion matrix and per-class F1 evaluation reveal which category pairs are most frequently confused

06 Keyword analysis: coefficient extraction shows the 15 most discriminative terms per category
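
Steps 01 and 02 can be illustrated with a minimal standard-library sketch. This is not the notebook's code: the stopword list, the function names (`preprocess`, `ngrams`, `tfidf`), and the smoothed IDF formula with L2 normalisation are assumptions chosen for illustration, and a real pipeline would normally use a library vectoriser instead.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; the real list is not shown).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "was"}

# Map every punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, drop digits and punctuation, then filter stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(_PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens):
    """Return the unigrams followed by all adjacent-pair bigrams."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs, max_features=15000):
    """Return one L2-normalised {term: weight} dict per document.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and keeps only the
    max_features most frequent terms across the corpus.
    """
    term_lists = [ngrams(preprocess(doc)) for doc in docs]
    n_docs = len(term_lists)
    doc_freq = Counter(t for terms in term_lists for t in set(terms))
    corpus_freq = Counter(t for terms in term_lists for t in terms)
    vocab = {t for t, _ in corpus_freq.most_common(max_features)}

    vectors = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in vocab)
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


docs = ["The space shuttle launched in 1986!", "Gun control debate in the senate."]
vectors = tfidf(docs)
```

In this toy run, "space shuttle" and "gun control" each appear as a single feature alongside their component words, which is the effect the bigram setting is meant to have.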

Results

94% accuracy (Linear SVM)

Model	Accuracy	Weighted F1
Naive Bayes	~90%	~0.90
Logistic Regression	~93%	~0.93
Linear SVM (selected)	~94%	~0.94
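
For readers unfamiliar with the two metrics in the table, the sketch below shows how accuracy, weighted F1, and a confusion matrix can be computed from true and predicted labels. It is a plain-Python illustration rather than the notebook's evaluation code, and the label names and toy predictions are made up for the example.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical labels and predictions, for illustration only.
labels = ["space", "guns", "autos"]
y_true = ["space", "space", "space", "guns", "guns", "autos"]
y_pred = ["space", "space", "guns", "guns", "guns", "autos"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

Off-diagonal cells of `cm` count the confusions between category pairs; here the single error is a "space" article predicted as "guns".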
