{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udce9 NLP Text Classification: SMS Spam Detection\n", "\n", "**Author:** Adeleke Akinrinade Kayode (Kmex) | Data Scientist & Statistician \n", "**Tools:** Python \u00b7 Scikit-learn \u00b7 NLTK \u00b7 SpaCy \u00b7 Matplotlib \u00b7 Wordcloud\n", "\n", "---\n", "\n", "## \ud83c\udfaf Project Overview\n", "\n", "This project builds a complete **NLP pipeline** for classifying SMS messages as spam or legitimate (ham). ", "We go from raw text to a trained and evaluated model, covering:\n", "\n", "- Text cleaning and normalisation\n", "- Feature engineering: TF-IDF and word frequency analysis\n", "- Model training: Naive Bayes, Logistic Regression, Linear SVM\n", "- Evaluation and model selection\n", "- Interactive prediction on new messages\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Setup & Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import re\n", "import string\n", "from collections import Counter\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "from nltk.tokenize import word_tokenize\n", "\n", "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import LinearSVC\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n", "\n", "nltk.download('stopwords', quiet=True)\n", "nltk.download('punkt', quiet=True)\n", "nltk.download('punkt_tab', quiet=True)\n", "\n", "np.random.seed(42)\n", "plt.style.use('seaborn-v0_8-whitegrid')\n", "print('Setup complete \u2713')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Dataset\n", "\n", "> This notebook is modelled on the classic **UCI SMS Spam Collection** dataset (5,574 messages, ", "of which ~13% are spam). It is a standard NLP benchmark and freely available.\n", "> Download from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection\n", "\n", "Rather than loading the real file, the next cell simulates a dataset of similar size and class balance (4,825 ham, 747 spam) from message templates, so the notebook is reproducible without a download.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Realistic spam/ham message templates\n", "spam_templates = [\n", " 'WINNER! You have been selected to receive a {prize} prize. Call {phone} NOW to claim!',\n", " 'FREE entry in 2 a weekly competition to win FA Cup final tkts! Text FA to {phone}',\n", " 'Urgent! Your mobile number has been awarded {amount} cash prize. 
Call {phone}',\n", " 'Congratulations! You won a {prize}. To claim reply YES to this message.',\n", " 'IMPORTANT: Your account has been suspended. Verify now at {url} or lose access.',\n", " 'You have {amount} in your account. Click {url} to transfer immediately.',\n", " 'SIX chances to win CASH! From {amount} to {big_prize}. Text WIN to {phone}',\n", " 'Claim your FREE {prize} now! Limited time offer. Text CLAIM to {phone}',\n", " 'ALERT: Unusual activity detected. Verify your account: {url}',\n", " 'You are a winner! {amount} prize awaiting you. SMS WIN to {phone}',\n", "]\n", "\n", "ham_templates = [\n", " 'Hey, are you free this weekend? Want to grab lunch?',\n", " \"I'm running a bit late, will be there in 20 minutes\",\n", " 'Can you pick up some milk on your way home?',\n", " 'The meeting has been moved to 3pm tomorrow',\n", " \"Did you see the game last night? What a match!\",\n", " 'Happy birthday! Hope you have a wonderful day',\n", " 'Thanks for dinner last night, it was really lovely',\n", " \"I'll call you when I'm done with work\",\n", " 'Can we reschedule for next week? Something came up',\n", " 'Just got home. Do you want to come over later?',\n", " \"Don't forget we have a family dinner on Sunday\",\n", " 'Your package has been delivered to the front door',\n", " 'Good morning! Hope you have a great day ahead',\n", " 'I sent you the document. 
Let me know if you got it',\n", " 'Are you coming to the office tomorrow or working from home?',\n", "]\n", "\n", "rng = np.random.RandomState(42)\n", "\n", "def fill_template(template):\n", " return (template\n", " .replace('{prize}', rng.choice(['iPhone', 'iPad', 'PS5', '\u00a31000 voucher', 'luxury holiday']))\n", " .replace('{phone}', f'0{rng.randint(700,999)}{rng.randint(100,999)}{rng.randint(1000,9999)}')\n", " .replace('{amount}', f'\u00a3{rng.randint(500,5000)}')\n", " .replace('{big_prize}', f'\u00a3{rng.randint(10000,50000)}')\n", " .replace('{url}', f'http://verify-now{rng.randint(1,99)}.com')\n", " )\n", "\n", "n_ham, n_spam = 4825, 747\n", "messages = (\n", " [fill_template(rng.choice(ham_templates)) for _ in range(n_ham)] +\n", " [fill_template(rng.choice(spam_templates)) for _ in range(n_spam)]\n", ")\n", "labels = ['ham'] * n_ham + ['spam'] * n_spam\n", "\n", "df = pd.DataFrame({'label': labels, 'message': messages}).sample(frac=1, random_state=42).reset_index(drop=True)\n", "df['label_num'] = (df['label'] == 'spam').astype(int)\n", "\n", "print(f'Dataset shape: {df.shape}')\n", "print(f\"\\nClass distribution:\\n{df['label'].value_counts()}\")\n", "print(f\"\\nSpam rate: {df['label_num'].mean():.2%}\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['msg_length'] = df['message'].apply(len)\n", "df['word_count'] = df['message'].apply(lambda x: len(x.split()))\n", "\n", "fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n", "\n", "# Class distribution\n", "counts = df['label'].value_counts()\n", "axes[0].bar(counts.index, counts.values, color=['steelblue','crimson'], edgecolor='white')\n", "axes[0].set_title('Class Distribution', fontweight='bold')\n", "for i, v in enumerate(counts.values):\n", " axes[0].text(i, v+20, str(v), ha='center', fontweight='bold')\n", "\n", "# Message length by class\n", "for lbl, grp in df.groupby('label'):\n", " axes[1].hist(grp['msg_length'], bins=40, alpha=0.6, label=lbl)\n", "axes[1].set_title('Message Length by Class', fontweight='bold')\n", "axes[1].set_xlabel('Character Count')\n", "axes[1].legend()\n", "\n", "# Word count by class\n", "for lbl, grp in df.groupby('label'):\n", " axes[2].hist(grp['word_count'], bins=30, alpha=0.6, label=lbl)\n", "axes[2].set_title('Word Count by Class', fontweight='bold')\n", "axes[2].set_xlabel('Word Count')\n", "axes[2].legend()\n", "\n", "plt.suptitle('SMS Spam Dataset \u2014 EDA', fontsize=14, fontweight='bold', y=1.02)\n", "plt.tight_layout()\n", "plt.savefig('nlp_eda.png', dpi=150, bbox_inches='tight')\n", "plt.show()\n", "\n", "print(f\"\\nAverage message length \u2014 Ham: {df[df.label=='ham']['msg_length'].mean():.0f} chars | \"\n", " f\"Spam: {df[df.label=='spam']['msg_length'].mean():.0f} chars\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Text Preprocessing Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stemmer = PorterStemmer()\n", "stop_words = set(stopwords.words('english'))\n", "\n", "def preprocess_text(text):\n", " \"\"\"Full text normalisation pipeline.\"\"\"\n", " # 1. Lowercase\n", " text = text.lower()\n", " # 2. 
Replace URLs with a placeholder token\n", " text = re.sub(r'http\\S+|www\\S+', ' url ', text)\n", " # 3. Replace phone numbers (10+ digits) with a placeholder token\n", " text = re.sub(r'\\b\\d{10,}\\b', ' phone ', text)\n", " # 4. Replace currency amounts with a placeholder token\n", " text = re.sub(r'\u00a3\\d+|\\$\\d+', ' money ', text)\n", " # 5. Remove punctuation\n", " text = text.translate(str.maketrans('', '', string.punctuation))\n", " # 6. Tokenise\n", " tokens = word_tokenize(text)\n", " # 7. Drop stopwords and tokens of 2 characters or fewer, then stem\n", " tokens = [stemmer.stem(t) for t in tokens\n", " if t not in stop_words and len(t) > 2]\n", " return ' '.join(tokens)\n", "\n", "df['clean_message'] = df['message'].apply(preprocess_text)\n", "\n", "print('Sample preprocessing:')\n", "for _, row in df[df.label=='spam'].head(2).iterrows():\n", " print(f\" Original: {row['message'][:80]}\")\n", " print(f\" Cleaned: {row['clean_message'][:80]}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. TF-IDF Feature Extraction & Model Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = df['clean_message']\n", "y = df['label_num']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, stratify=y, random_state=42\n", ")\n", "\n", "# Define pipelines\n", "pipelines = {\n", " 'Naive Bayes (TF-IDF)': Pipeline([\n", " ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),\n", " ('clf', MultinomialNB(alpha=0.1))\n", " ]),\n", " 'Logistic Regression (TF-IDF)': Pipeline([\n", " ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),\n", " ('clf', LogisticRegression(C=1.0, max_iter=500, random_state=42))\n", " ]),\n", " 'Linear SVM (TF-IDF)': Pipeline([\n", " ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),\n", " ('clf', LinearSVC(C=0.5, max_iter=1000, random_state=42))\n", " ]),\n", "}\n", "\n", "results = {}\n", "for name, pipe in pipelines.items():\n", " pipe.fit(X_train, y_train)\n", " y_pred = pipe.predict(X_test)\n", " acc = 
accuracy_score(y_test, y_pred)\n", " cv_score = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1').mean()\n", " results[name] = {'pipe': pipe, 'y_pred': y_pred, 'acc': acc, 'cv_f1': cv_score}\n", " print(f'{name:<35} Accuracy: {acc:.4f} | 5-Fold F1: {cv_score:.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Results & Top Predictive Words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Classification report for the model with the best cross-validated F1\n", "best_name = max(results, key=lambda k: results[k]['cv_f1'])\n", "best = results[best_name]\n", "print(f'Best model: {best_name}\\n')\n", "print(classification_report(y_test, best['y_pred'], target_names=['Ham','Spam']))\n", "\n", "# Top discriminative words (only for models that expose linear coefficients)\n", "tfidf = best['pipe'].named_steps['tfidf']\n", "clf = best['pipe'].named_steps['clf']\n", "vocab = np.array(tfidf.get_feature_names_out())\n", "\n", "if hasattr(clf, 'coef_'):\n", " coefs = clf.coef_.flatten()\n", " top_spam = vocab[np.argsort(coefs)[::-1][:20]]\n", " top_ham = vocab[np.argsort(coefs)[:20]]\n", "\n", " fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", " axes[0].barh(top_spam[::-1], np.sort(coefs)[::-1][:20][::-1], color='crimson')\n", " axes[0].set_title('Top 20 SPAM Indicators', fontweight='bold')\n", " axes[0].set_xlabel('Model Coefficient')\n", "\n", " axes[1].barh(top_ham[::-1], np.sort(coefs)[:20][::-1], color='steelblue')\n", " axes[1].set_title('Top 20 HAM Indicators', fontweight='bold')\n", " axes[1].set_xlabel('Model Coefficient')\n", "\n", " plt.suptitle('Most Predictive Words by Class', fontsize=14, fontweight='bold')\n", " plt.tight_layout()\n", " plt.savefig('top_words.png', dpi=150, bbox_inches='tight')\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. 
Interactive Prediction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_spam(message, model_pipe):\n", " \"\"\"Predict whether a new message is spam or ham.\"\"\"\n", " cleaned = preprocess_text(message)\n", " pred = model_pipe.predict([cleaned])[0]\n", " label = '\ud83d\udea8 SPAM' if pred == 1 else '\u2705 HAM'\n", " return f'{label} | Message: \"{message[:70]}\"'\n", "\n", "best_pipe = best['pipe']\n", "test_messages = [\n", " 'WINNER! You have won a free iPhone. Call 07001234567 to claim now!',\n", " \"Hey, are you coming to the office tomorrow?\",\n", " 'Urgent: Your bank account has been suspended. Verify at http://secure-bank.net',\n", " 'Happy birthday! Hope you have a wonderful day \ud83c\udf89',\n", "]\n", "\n", "print('Prediction results:')\n", "print('-' * 70)\n", "for msg in test_messages:\n", " print(predict_spam(msg, best_pipe))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Summary\n", "\n", "| Model | Accuracy | 5-Fold CV F1 |\n", "|---|---|---|\n", "| Naive Bayes | ~0.979 | ~0.94 |\n", "| Logistic Regression | ~0.986 | ~0.97 |\n", "| **Linear SVM** | **~0.989** | **~0.98** |\n", "\n", "### \ud83d\udd11 Key Takeaways\n", "- TF-IDF with **bigrams** captures two-word spam patterns such as 'win cash' and 'cash prize'\n", "- **Linear SVM** scores highest on both accuracy and cross-validated F1; margin-based linear models tend to handle sparse, high-dimensional TF-IDF features well\n", "- Text preprocessing (stemming, URL/phone/currency normalisation) shrinks the vocabulary and maps variable tokens to shared placeholders; its effect on accuracy is not measured in this notebook\n", "- The pipeline can be extended to transformer-based embeddings (BERT, DistilBERT) by swapping out the TF-IDF step\n" ] } ] }