Learn how Naive Bayes handles features like Age or Salary using the Bell Curve.
In Part 1, we saw how the Naive Bayes classifier uses probabilities based on feature frequencies (like counting words) to classify data. But what happens when our input features aren't categories, but continuous numbers like 'Age', 'Salary', or 'Temperature'?
We can't simply count frequencies for every possible number! We need a different way to estimate the likelihood P(Feature | Class). This is where Gaussian Naive Bayes (GNB) comes in. It's a specific type of Naive Bayes designed to work directly with continuous numerical features.
Main Technical Concept: Gaussian Naive Bayes is an extension of Naive Bayes that handles continuous features by assuming that the values of each feature, *for each class*, follow a Gaussian (Normal, or "bell curve") distribution.
The core idea behind GNB is simple but powerful:
For a given class (e.g., Class 'Yes'), it assumes that the continuous values of a specific feature (e.g., 'Age') are distributed according to a Gaussian (Normal) distribution. It makes the same assumption for Class 'No', but potentially with a different mean and standard deviation.
Image Credit: Inductiveload on Wikimedia Commons, CC BY-SA 3.0
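For intuition, these per-class parameters are easy to estimate directly from the data. Here is a quick sketch, assuming the 'Age', 'EstimatedSalary', and 'Purchased' columns of the Social_Network_Ads.csv dataset used in the worked example later in this post:

```python
import pandas as pd

# Same Social_Network_Ads.csv used in the worked example below.
dataset = pd.read_csv('Social_Network_Ads.csv')

# Mean and standard deviation of each continuous feature, computed separately per class.
# These (mean, std) pairs are exactly the parameters Gaussian Naive Bayes estimates.
print(dataset.groupby('Purchased')[['Age', 'EstimatedSalary']].agg(['mean', 'std']))
```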
Instead of counting frequencies, GNB calculates the likelihood using the Gaussian Probability Density Function (PDF). Here's the idea:
1. For each class, calculate the mean (μ) and variance (σ²) of the feature from the training samples belonging to that class.
2. For a new data point's feature value (x), plug this value, along with the calculated mean (μ) and variance (σ²) *for a given class*, into the Gaussian PDF formula to get the likelihood density P(x | Class):

P(x | Class) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))

This formula gives the likelihood density of observing value x, given that the data for this class follows a Normal distribution with mean μ and variance σ² (where π ≈ 3.14159 and e ≈ 2.71828).
Note: This gives a *density*, not a direct probability (it can be > 1), but it works correctly within Bayes' theorem for comparison.
The algorithm calculates this likelihood density for every feature and every class.
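To make the calculation concrete, here is a minimal hand-rolled sketch of the Gaussian PDF. The μ and σ² values are made up for illustration, not taken from any real dataset:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Gaussian probability density of x for a class with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Made-up per-class parameters for an 'Age' feature, just to show the mechanics.
age = 42
print(gaussian_pdf(age, mu=46.0, var=64.0))  # density of Age=42 under the 'Yes' class
print(gaussian_pdf(age, mu=33.0, var=49.0))  # density of Age=42 under the 'No' class
```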
The overall process for classifying a new data point `X = {x₁, x₂, ..., xₙ}` using Gaussian Naive Bayes is: for each class C, combine the prior with the per-feature likelihood densities,

Score(C) = P(C) * P(x₁|C) * P(x₂|C) * ... * P(xₙ|C)

and predict the class with the highest score.
Essentially, it asks: "Based on the typical 'Age' and 'Salary' distributions we saw for people who *did* purchase (Class 1), and the distributions for those who *didn't* (Class 0), which class does this new person's 'Age' and 'Salary' fit better with, considering the overall likelihood of purchase?"
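A minimal sketch of that comparison, using scipy's `norm.pdf` for the Gaussian density. The priors and per-class parameters are hypothetical, not values estimated from the dataset below:

```python
from scipy.stats import norm

# Hypothetical priors and per-class (mean, std) parameters for two features.
# In practice these numbers come from the training data.
priors = {0: 0.64, 1: 0.36}  # P(C) for class 0 (no purchase) and class 1 (purchase)
params = {
    0: {'Age': (33.0, 7.0), 'Salary': (60000.0, 24000.0)},
    1: {'Age': (46.0, 8.0), 'Salary': (87000.0, 30000.0)},
}

new_point = {'Age': 42.0, 'Salary': 90000.0}

scores = {}
for c, prior in priors.items():
    score = prior
    for feature, x in new_point.items():
        mu, sigma = params[c][feature]
        score *= norm.pdf(x, loc=mu, scale=sigma)  # Gaussian likelihood density for this feature
    scores[c] = score

print(scores)                                      # unnormalized scores, proportional to P(C | X)
print('Predicted class:', max(scores, key=scores.get))
```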
Scikit-learn makes using Gaussian Naive Bayes very easy with the `GaussianNB` classifier.
Let's predict whether a user purchased a product based on 'Age' and 'EstimatedSalary'.
# 1. Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# 2. Load the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
# Select 'Age' and 'EstimatedSalary' as features, 'Purchased' as target
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# 3. Split data into Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# 4. Feature Scaling (Important for visualization and sometimes GNB)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Fit Gaussian Naive Bayes model to the Training set
classifier = GaussianNB()
classifier.fit(X_train_scaled, y_train) # Learns mean & variance per feature per class
# 6. Predict Test set results
y_pred = classifier.predict(X_test_scaled)
# 7. Evaluate the results
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)
# Accuracy Score
acc = accuracy_score(y_test, y_pred)
print(f'\nAccuracy: {acc:.4f}')
# Classification Report (Precision, Recall, F1-Score)
report = classification_report(y_test, y_pred)
print('\nClassification Report:\n', report)
# 8. Visualize Confusion Matrix (Optional)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Gaussian Naive Bayes')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
The code trains the GNB model, makes predictions, and then shows the confusion matrix and other metrics to evaluate how well it performed on the unseen test data.
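If you want to inspect what the model learned, the fitted `GaussianNB` object exposes the per-class priors, means, and variances (attribute names as in recent scikit-learn releases; older versions exposed `sigma_` instead of `var_`). Continuing from the `classifier` fitted above:

```python
# Continuing from the fitted `classifier` above: per-class priors, means, and variances.
print('Class priors:', classifier.class_prior_)
print('Per-class feature means:\n', classifier.theta_)
print('Per-class feature variances:\n', classifier.var_)  # exposed as sigma_ in scikit-learn < 1.0
```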
| Issue / Observation | Potential Cause & Solution | Best Practice |
|---|---|---|
| Model accuracy is low. | The Gaussian assumption might be strongly violated for some features; features might be highly correlated; there may be insufficient data. Solution: Check feature distributions (histograms per class; see the sketch after this table). Try transforming non-normal features (e.g., log transform). Consider other algorithms if the assumptions don't hold. Check for feature correlation. | Analyze feature distributions per class. Validate assumptions where possible. |
| Why might feature scaling still be useful if GNB calculates mean/std per feature? | While GNB handles different scales mathematically via separate means/stds, scaling can sometimes help the numerical stability of the calculations, especially if ranges are vastly different. It also ensures visualizations (like decision boundaries) are not distorted. | Scaling continuous features is generally good practice, though its direct impact on GNB accuracy is smaller than for distance-based algorithms. |
| Getting probability densities > 1 from the PDF. | This is mathematically possible and correct for a Probability *Density* Function (PDF), especially if the variance (σ²) is very small. The *area* under the PDF curve always integrates to 1. Solution: No fix needed; just understand that it represents a density, not a direct probability for a single point. It works correctly within the relative comparisons of Bayes' theorem. | Distinguish between probability density (PDF value) and probability (area under the curve). |
| Model performs poorly due to correlated features. | The core "naive" independence assumption is violated. Solution: Consider feature selection to remove highly correlated features. Use dimensionality reduction (like PCA) *before* GNB (but interpretation becomes harder). Try models that handle correlations better (e.g., Logistic Regression, SVM, Trees). | Check feature correlations during EDA. Be aware of the algorithm's assumptions. |
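As referenced in the first row of the table, here is a quick way to eyeball whether each feature looks roughly bell-shaped within each class, assuming the same Social_Network_Ads.csv columns as in the example above:

```python
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Social_Network_Ads.csv')

# Histogram of each continuous feature, split by class, to eyeball how "bell-shaped" it looks.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feature in zip(axes, ['Age', 'EstimatedSalary']):
    for label in sorted(dataset['Purchased'].unique()):
        ax.hist(dataset.loc[dataset['Purchased'] == label, feature],
                bins=20, alpha=0.5, label=f'Purchased={label}')
    ax.set_title(feature)
    ax.legend()
plt.show()
```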
Interview Question
Question 1: What is the core assumption that Gaussian Naive Bayes makes about continuous features?
It assumes that the values of each continuous feature, *within each class*, are distributed according to a Gaussian (Normal) distribution.
Question 2: How does Gaussian Naive Bayes calculate the likelihood term P(feature | Class) for a continuous feature?
It calculates the mean (μ) and variance (σ²) of that feature for all training samples belonging to the specific class. Then, it plugs the new data point's feature value (x), along with the calculated μ and σ² for that class, into the Gaussian Probability Density Function (PDF) formula.
Interview Question
Question 3: Why is the "naive" independence assumption still relevant even when using Gaussian Naive Bayes with continuous features?
Because even after calculating the individual likelihoods P(xᵢ|C) for each feature using the Gaussian PDF, the algorithm still combines these likelihoods by *multiplying* them together (along with the prior P(C)) to get the overall score for a class. This multiplication step relies on the assumption that the features x₁, x₂, etc., are conditionally independent given the class C.
Question 4: Is feature scaling (like Standardization) strictly necessary for Gaussian Naive Bayes to work? Why might it still be beneficial?
Strictly speaking, GNB can handle features on different scales because it calculates separate means and variances for each. However, scaling is still often beneficial for:
1. Numerical Stability: It can prevent issues with very large or very small numbers during PDF calculations, especially if variances are tiny.
2. Comparability: Standardizing rescales each feature to mean 0 and standard deviation 1. It does not change a feature's shape (a skewed feature stays skewed), but it keeps the learned means and variances on comparable scales across features.
3. Visualization: Helps when plotting decision boundaries or comparing feature influences.
Interview Question
Question 5: If your continuous features are clearly not normally distributed (e.g., very skewed), what might happen if you apply Gaussian Naive Bayes, and what could you do?
If the Gaussian assumption is strongly violated, the likelihood estimates calculated using the Gaussian PDF will be inaccurate, potentially leading to poor classification performance.
What to do:
1. Try transforming the skewed features to make them more bell-shaped (e.g., using a log transform or Box-Cox transform) *before* applying GNB (a minimal sketch follows this list).
2. Consider discretizing the continuous features into bins and using Multinomial Naive Bayes instead.
3. Try a different classification algorithm that doesn't make the Gaussian assumption (e.g., Decision Trees, KNN, SVM).
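A minimal sketch of option 1, applying a log transform to a hypothetical skewed, non-negative feature (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, non-negative feature.
df = pd.DataFrame({'Income': [20000, 25000, 30000, 45000, 60000, 250000, 900000]})

# log1p compresses the long right tail, often bringing the shape closer to a bell curve.
df['Income_log'] = np.log1p(df['Income'])
print(df.describe())
```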