Detect spam emails using AI and Support Vector Machines (SVM)

Smart Classification: Advanced SVM Techniques for Machine Learning

AI Generated image: Detect spam emails using AI and Support Vector Machines (SVM)

📧 Detect Spam Emails Using AI and Support Vector Machines (SVM)

This project explores the use of Support Vector Machines (SVM) to accurately classify emails as spam or not by leveraging natural language processing and text preprocessing techniques.

🔧 Tools

Python (NumPy, SciPy)
Scikit-learn for SVM model training and evaluation
NLP techniques: Tokenization, Stemming, and Feature Extraction
Text preprocessing and vocabulary creation modules

🎯 Goals

To preprocess and transform raw email text into numerical feature vectors.
To train a robust SVM model for spam detection.
To evaluate model accuracy and identify influential spam-related keywords.

🌟 Impacts

Improves understanding of how machine learning classifies textual data.
Builds a strong foundation in using SVMs for real-world NLP tasks.
Demonstrates end-to-end email classification using explainable AI practices.
Supports the development of secure and intelligent email filtering systems.

Introduction
Setting Up the Environment
Loading Sample Email
Text Preprocessing
Tokenization and Stemming
Vocabulary List
Feature Extraction from Emails
Training the SVM Model
Model Performance
Important Spam Words
Conclusion
Acknowledgment

1. Introduction

AI-Powered Spam Detection Using Support Vector Machines

AI Generated image: AI-Powered Spam Detection Using Support Vector Machines

In this project, we developed a machine learning model to classify email messages as spam or non-spam using a Support Vector Machine (SVM). The project involved preprocessing raw emails, extracting key features, and training a classifier to identify spam indicators effectively. The approach included tokenization, stemming, and mapping words to a predefined vocabulary before converting emails into feature vectors. The model was trained using a dataset of labeled emails and evaluated on a separate test set. The goal was to build a high-accuracy spam detection system that minimizes false positives while efficiently filtering unwanted emails.

2. Setting Up the Environment

We use the following Python libraries:

NumPy - For numerical operations.
Matplotlib - For visualization.
Scikit-learn (SVM) - Machine learning model.
Regular Expressions (re) - For email preprocessing.
NLTK (Natural Language Toolkit) - For text processing and stemming.

    
    Python:
    
    %matplotlib inline
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.io #Used to load the OCTAVE *.mat files
    from sklearn import svm #SVM software
    import re #regular expression for e-mail processing
    from stemming.porter2 import stem
    import nltk, nltk.stem.porter

3. Loading Sample Email

The email content is read from a text file and displayed as follows:

    Python:


    file_path = r'd:\mlprojects\data\emailSample1.txt'  # Use raw string (r'...') 
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    print(content)  # Display the contents

Output:

Anyone knows how much it costs to host a web portal? Well, it depends on how many visitors you're expecting. This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2. To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com

4. Text Preprocessing

AI Generated image: Email Text Pre-processing

Before analyzing the text, we apply preprocessing steps:

Convert to lowercase.
Remove HTML tags.
Replace numbers with 'number'.
Replace URLs with 'httpaddr'.
Replace email addresses with 'emailaddr'.
Replace '$' with 'dollar'.

    def preProcess( email ):
        email = email.lower()  # Make the entire e-mail lower case
        email = re.sub(r'<[^<>]+>', ' ', email)   # remove HTML tags
        email = re.sub(r'[0-9]+', 'number', email)  # Convert numbers to strings
        email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # Convert http to httpaddr
        email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)  # Strings with "@" --> 'emailaddr'
        email = re.sub(r'[$]+', 'dollar', email)  # '$' sign replaced with 'dollar'
        return email

5. Tokenization and Stemming

After preprocessing, the email is tokenized and stemmed.

    def email2TokenList( raw_email ):
        stemmer = nltk.stem.porter.PorterStemmer()
        email = preProcess( raw_email )
        tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
        tokenlist = []
        for token in tokens:
            token = re.sub(r'[^a-zA-Z0-9]', '', token);
            stemmed = stemmer.stem( token )
            if not len(token): continue
            tokenlist.append(stemmed)
        return tokenlist

Example:

    Original: "Running quickly! Buy $100 worth at Amazon.com"
    Processed Tokens: ['run', 'quickli', 'buy', 'dollar', 'worth', 'amazon']

6. Vocabulary List

The function getVocabDict() loads a predefined vocabulary list where each word is assigned a unique index.

def getVocabDict():
    vocab_dict = {}
    with open("vocab.txt", "r") as file:
        for line in file:
            index, word = line.split()
            vocab_dict[word] = int(index)
    return vocab_dict

7. Feature Extraction from Emails

The function email2FeatureVector() processes an email and converts it into a binary feature vector.

def email2FeatureVector(email, vocab_dict):
    feature_vector = np.zeros(len(vocab_dict))
    words = email.split()
    for word in words:
        if word in vocab_dict:
            feature_vector[vocab_dict[word] - 1] = 1
    return feature_vector

8. Training the SVM Model

The dataset consists of labeled spam and non-spam emails. We train an SVM with a linear kernel using a regularization parameter of C=0.1.

from sklearn.svm import SVC
classifier = SVC(C=0.1, kernel='linear')
classifier.fit(X_train, y_train)

9. Model Performance

After training, we evaluate the accuracy of the model on both training and test data:

train_accuracy = classifier.score(X_train, y_train) * 100
test_accuracy = classifier.score(X_test, y_test) * 100
print(f"Training Accuracy: {train_accuracy:.2f}%")
print(f"Test Accuracy: {test_accuracy:.2f}%")

10. Important Spam Words

AI Generated image: Words that are spam in email.

The trained SVM provides insight into which words contribute most to spam classification.

weights = classifier.coef_.flatten()
indices = np.argsort(weights)[-15:]
important_words = [word for word, index in vocab_dict.items() if index in indices]
print("Top spam-indicative words:", important_words)

After running our codes we got output as:

    The 15 most important words to classify a spam e-mail are:
    ['otherwis', 'clearli', 'remot', 'gt', 'visa', 'base', 'doesn', 'wife', 'previous', 'player', 'mortgag',
     'natur', 'll', 'futur', 'hot']

    The 15 least important words to classify a spam e-mail are:
    ['http', 'toll', 'xp', 'ratio', 'august', 'unsubscrib', 'useless', 'numberth', 'round', 'linux', 'datapow', 
    'wrong', 'urgent', 'that', 'spam']

Spam containing "otherwis" = 804/1277 = 62.96%

NON spam containing "otherwis" = 301/2723 = 11.05%

11. Conclusion

The completion of this project resulted in a highly accurate spam classification model, achieving 99.83% training accuracy and 98.90% test accuracy using an SVM with a linear kernel. The system successfully identified key words and patterns associated with spam, demonstrating its effectiveness in filtering unwanted emails. Furthermore, the project provided valuable insights into natural language processing (NLP) techniques and feature engineering for text classification. Future enhancements could involve integrating deep learning techniques or improving feature extraction methods to adapt to evolving spam strategies.

Acknowledgments

I sincerely thank Prof. Andrew NG (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.