Natural Language Processing · 2024 · github.com/kmexa

Multi-Class News Article Text Classification

A multi-class text classification pipeline on a 6-category subset of the 20 Newsgroups dataset, using TF-IDF vectorisation over unigrams and bigrams and a benchmark of three classifiers. The Linear SVM performs best, at roughly 94% accuracy. The pipeline mirrors workflows used in policy document analysis and automated report categorisation.


Methodology

01 Text preprocessing: lowercase normalisation, digit removal, punctuation stripping and stopword filtering to reduce noise
02 TF-IDF vectorisation with unigrams and bigrams, capped at 15,000 features; bigrams capture compound terms such as "space shuttle" and "gun control" whose meaning is lost when the words are counted separately
03 Three-classifier benchmark: Multinomial Naive Bayes, Logistic Regression, Linear SVM
04 Linear SVM selected for its consistent advantage on high-dimensional, sparse TF-IDF feature spaces
05 Confusion matrix and per-class F1 evaluation reveal which category pairs are most frequently confused
06 Keyword analysis: coefficient extraction shows the 15 most discriminative terms per category

Results

94% Accuracy
6 Categories
15,000 Features
Model                    Accuracy   Weighted F1
Naive Bayes              ~90%       ~0.90
Logistic Regression      ~93%       ~0.93
Linear SVM (selected)    ~94%       ~0.94
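
For reference, the two metrics in the table, together with the confused-pair counts used in the evaluation step, can be written out in plain Python as below. This is a sketch to make the definitions concrete; the label lists are invented for illustration and are not the project's predictions. scikit-learn's `accuracy_score` and `f1_score(average="weighted")` compute the same quantities.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(support[label] * f1[label] for label in support) / len(y_true)


def confused_pairs(y_true, y_pred):
    """Off-diagonal confusion-matrix cells, most frequent first."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common()


# Invented labels for illustration only.
y_true = ["space", "space", "space", "guns", "guns", "hockey"]
y_pred = ["space", "space", "guns", "guns", "guns", "hockey"]

print("accuracy:", accuracy(y_true, y_pred))
print("weighted F1:", weighted_f1(y_true, y_pred))
print("confused pairs:", confused_pairs(y_true, y_pred))
```

Weighting by true class count means that frequent categories dominate the average, which is why weighted F1 tends to track accuracy closely when categories are roughly balanced.
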
TF-IDF · LinearSVC · bigrams · scikit-learn · 20 Newsgroups · Python
