Gaussian Naive Bayes
Data Preprocessing
The Abalone dataset consists of continuous numerical measurements.
How it was done
- Target Variable Transformation
  - The raw `Rings` target variable was grouped into three distinct classification categories based on ring count: `Young`, `Adult`, and `Old`.
  - This transformed the problem from a continuous regression task into a discrete multi-class classification task.
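The grouping step could be sketched as follows. The exact ring-count cutoffs are not stated above, so the `low` and `high` thresholds here are illustrative placeholders, not the project's actual values:

```python
import numpy as np

def bin_rings(rings, low=8, high=10):
    """Map raw ring counts to Young/Adult/Old labels.
    `low` and `high` are hypothetical cutoffs for illustration only."""
    rings = np.asarray(rings)
    labels = np.empty(len(rings), dtype=object)
    labels[rings <= low] = "Young"
    labels[(rings > low) & (rings <= high)] = "Adult"
    labels[rings > high] = "Old"
    return labels
```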
Classifier Implementation
Given an input matrix of continuous features (`X_train`) and their corresponding labels (`y_train`), the model assigns new samples to a class using Bayes’ theorem adapted for continuous variables via the Gaussian Probability Density Function (PDF).
How it was done
Training
(for every class c)
- Computing Class Prior
  - Calculated as the number of occurrences of class c divided by the total number of samples.
- Feature Mean and Variance
  - Because the features are continuous, we assume they follow a normal (Gaussian) distribution.
  - We computed the Gaussian PDF parameters, mean (μ) and variance (σ²), for every feature column within each class.
  - A small epsilon value (`1e-9`) was added to the variance to ensure numerical stability and prevent division-by-zero errors during prediction.
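The training steps above could be sketched as follows (a minimal sketch; function and variable names are illustrative, not the project's actual code):

```python
import numpy as np

def fit_gaussian_nb(X_train, y_train, eps=1e-9):
    """For every class: class prior, per-feature mean and variance."""
    priors, means, variances = {}, {}, {}
    for c in np.unique(y_train):
        X_c = X_train[y_train == c]          # rows belonging to class c
        priors[c] = len(X_c) / len(X_train)  # occurrences of c / total samples
        means[c] = X_c.mean(axis=0)          # per-feature mean
        variances[c] = X_c.var(axis=0) + eps # epsilon for numerical stability
    return priors, means, variances
```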
Predicting
The model was built to predict using either standard probabilities or log probabilities.
- Via Regular Probabilities (`num_data_rule`)
  - The likelihood of a feature value x belonging to class c is calculated using the standard Gaussian PDF: p(x | c) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²)).
  - For a given sample, the probabilities of all features are multiplied together (`np.prod`) and multiplied by the class prior.
- Via Log Probabilities (`log_data_rule`)
  - To prevent numerical underflow when multiplying many small fractions, the natural logarithm of the Gaussian PDF is used: log p(x | c) = −½ log(2πσ²) − (x − μ)² / (2σ²).
  - The log-likelihoods of all features are summed together (`np.sum`) and added to the log prior.
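A minimal sketch of the two decision rules described above, assuming the per-class `priors`, `means`, and `variances` dicts from training (names here are illustrative, not the project's exact `num_data_rule`/`log_data_rule` signatures):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Standard Gaussian PDF, evaluated element-wise per feature."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict_num_rule(x, priors, means, variances):
    """Regular-probability rule: product of per-feature likelihoods
    (np.prod) times the class prior; pick the argmax class."""
    scores = {c: priors[c] * np.prod(gaussian_pdf(x, means[c], variances[c]))
              for c in priors}
    return max(scores, key=scores.get)

def predict_log_rule(x, priors, means, variances):
    """Log rule: sum of per-feature log-likelihoods (np.sum)
    plus the log prior, avoiding underflow."""
    scores = {c: np.log(priors[c])
                 + np.sum(-0.5 * np.log(2 * np.pi * variances[c])
                          - (x - means[c]) ** 2 / (2 * variances[c]))
              for c in priors}
    return max(scores, key=scores.get)
```

Both rules pick the same class; the log version simply trades the product for a sum so that many small factors cannot underflow to zero.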
Visuals
Feature distributions for the first 4 features.
Multinomial Naive Bayes
Text Preprocessing and Bag-of-Words
The text processing pipeline was implemented from scratch to transform 50,000 raw IMDB movie reviews into a numerical format suitable for the Multinomial Naive Bayes classifier.
How it was done
- Loading a cleaned training dataset (`load_and_clean()`)
  - A review may contain HTML tags, mixed casing, and punctuation. Since we are only interested in word occurrences, we removed the tags, lowercased everything, and stripped punctuation.
- Creating a vocabulary that excludes stopwords (filler words)
  - Vocab = the set of unique words across all reviews, minus stopwords.
  - Stopwords do not add meaning to our sentiment analysis.
  - Each word in the vocabulary was then mapped to an index; indices are easier to work with, especially when indexing into a matrix.
- Bag-of-Words (BoW) Representation
  - We created a BoW matrix to store the frequency of each word appearing in each review: `BoW[review_idx, word_idx]` = the frequency of the word within the review.
  - Since each review contains only a tiny subset of our total vocabulary, we used a sparse matrix.
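The pipeline above could look roughly like this (a sketch under the stated assumptions: the real `load_and_clean()` and the project's exact cleaning rules may differ; the other function names are illustrative):

```python
import re
import string
import numpy as np
from scipy.sparse import lil_matrix

def load_and_clean(raw_reviews):
    """Strip HTML tags, lowercase, and drop punctuation."""
    cleaned = []
    for review in raw_reviews:
        text = re.sub(r"<[^>]+>", " ", review)  # remove HTML tags
        text = text.lower()                      # normalize casing
        text = text.translate(str.maketrans("", "", string.punctuation))
        cleaned.append(" ".join(text.split()))   # collapse whitespace
    return cleaned

def build_vocab(cleaned_reviews, stopwords):
    """Unique non-stopword tokens, each mapped to a column index."""
    vocab = set()
    for review in cleaned_reviews:
        vocab.update(w for w in review.split() if w not in stopwords)
    return {word: idx for idx, word in enumerate(sorted(vocab))}

def build_bow(cleaned_reviews, word_to_idx):
    """BoW[review_idx, word_idx] = count of that word in that review.
    Sparse, since each review uses only a tiny slice of the vocabulary."""
    bow = lil_matrix((len(cleaned_reviews), len(word_to_idx)), dtype=np.int64)
    for i, review in enumerate(cleaned_reviews):
        for word in review.split():
            j = word_to_idx.get(word)
            if j is not None:        # skip words outside the vocabulary
                bow[i, j] += 1
    return bow.tocsr()               # CSR works well with the dot products used later
```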
Classifier Implementation
Given an input `X_train` (BoW matrix) and `y_train` (a list of labels), the model assigns a new review to a label using Bayes’ theorem.
We first separate the reviews by label, since the per-class probabilities are built from those subsets.
How it was done
Training
(for every class c)
- Computing Class Prior
  - The number of reviews of class c divided by the total number of reviews.
  - Its log version was also computed.
- Word Probabilities & Laplace Smoothing
  - The count of each word across all reviews of class c is computed.
  - `p(w_i|c)` is then computed for each word as its count divided by the total word count for the class.
  - Laplace smoothing (+1 to the numerator and +`vocab_size` to the denominator) ensures that if a word appears in the test set but was never seen in the training set for a given class, its probability doesn’t drop to absolute zero and zero out the entire product.
  - Their log versions were also computed.
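The training steps above could be sketched as follows (a minimal sketch with illustrative names; `X_train` is the BoW matrix, dense here for simplicity, though the same code works on a sparse matrix):

```python
import numpy as np

def fit_multinomial_nb(X_train, y_train, vocab_size):
    """For each class: prior and Laplace-smoothed word probabilities
    p(w_i|c) = (count_i + 1) / (total_count + vocab_size), plus logs."""
    y = np.asarray(y_train)
    priors, word_probs = {}, {}
    for c in np.unique(y):
        X_c = X_train[y == c]                         # BoW rows for class c
        priors[c] = (y == c).sum() / len(y)           # class prior
        counts = np.asarray(X_c.sum(axis=0)).ravel()  # count per word in class c
        word_probs[c] = (counts + 1) / (counts.sum() + vocab_size)
    log_priors = {c: np.log(p) for c, p in priors.items()}
    log_word_probs = {c: np.log(p) for c, p in word_probs.items()}
    return priors, word_probs, log_priors, log_word_probs
```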
Predicting
- Via log probabilities
  - `log_prob_neg = X_test.dot(log_word_probs[0]) + log_priors[0]` results in a vector containing `log(p(0 | rev_i))` for every review.
  - We do the same for class 1 to compute `log_prob_pos`, and the class of `rev_i` is whichever had the larger probability: `p(0|rev_i)` vs `p(1|rev_i)`.
- Via regular probabilities
  - For a review i, we compute `p(0|rev_i) = priors[0] * p(w0|0) * p(w1|0) * ...`; we do the same for class 1 to compute `p(1|rev_i)`, and then compare.
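The vectorized log rule above could be sketched like this, assuming per-class `log_word_probs` and `log_priors` indexed by label (the function name is illustrative):

```python
import numpy as np

def predict_log(X_test, log_word_probs, log_priors):
    """One dot product per class gives log(p(c | rev_i)) up to a constant
    for every review; pick the larger posterior per review."""
    log_prob_neg = X_test.dot(log_word_probs[0]) + log_priors[0]
    log_prob_pos = X_test.dot(log_word_probs[1]) + log_priors[1]
    return (log_prob_pos > log_prob_neg).astype(int)  # 1 = positive
```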
Visuals