Naive Bayes is a family of simple probabilistic classifiers based on Bayes' theorem and the "naive" assumption that features are conditionally independent given the class. Although real data rarely satisfies this assumption, Naive Bayes classifiers often work well in practice, especially for text classification, spam filtering, and recommendation systems. They train and predict quickly, scale to high-dimensional data, and can perform well with small training datasets.
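As a rough illustration of how Bayes' theorem and the independence assumption combine, the sketch below scores a short message as spam or ham. The priors and word likelihoods are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper rather than part of any library.

```python
from math import prod

# Hypothetical priors and per-word likelihoods (illustrative values only)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalized score: P(class) * product of P(word | class)
    joint = {
        c: priors[c] * prod(likelihoods[c][w] for w in words)
        for c in priors
    }
    evidence = sum(joint.values())  # P(words), the normalizing constant
    return {c: joint[c] / evidence for c in joint}

result = posterior(["free"])
both = posterior(["free", "meeting"])
print(result)
print(both)
```

With these numbers, the single word "free" pushes the posterior toward spam, while adding "meeting" pulls it back toward ham, because each word contributes its own independent likelihood factor.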
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, fetch_20newsgroups, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
print("="*60)
print("NAIVE BAYES FUNDAMENTALS")
print("="*60)
# Core concepts
naive_bayes_concepts = """
NAIVE BAYES KEY CONCEPTS:
1. BAYES' THEOREM:
P(class|features) = P(features|class) * P(class) / P(features)
Where:
• P(class|features): Posterior probability
• P(features|class): Likelihood
• P(class): Prior probability
• P(features): Evidence
2. NAIVE ASSUMPTION:
• Features are conditionally independent given the class
• P(x1,x2,...,xn|class) = P(x1|class) * P(x2|class) * ... * P(xn|class)
• Simplifies computation dramatically
• Often works well despite assumption violation
3. TYPES OF NAIVE BAYES:
A) GAUSSIAN NB:
• For continuous features
• Assumes normal distribution
• Uses mean and variance
B) MULTINOMIAL NB:
• For discrete counts
• Text classification
• Document term frequencies
C) BERNOULLI NB:
• For binary features
• Document classification
• Presence/absence of features
D) COMPLEMENT NB:
• For imbalanced datasets
• Better for skewed classes
4. ADVANTAGES:
• Fast training and prediction
• Works well with small datasets
• Handles high dimensions well
• Provides probability estimates
• Few hyperparameters to tune (mainly the smoothing parameter)
• Naturally multi-class
5. DISADVANTAGES:
• Assumes feature independence
• Can be sensitive to how features are represented and distributed
• Zero frequency problem
• May be outperformed by complex models
"""
print(naive_bayes_concepts)
class GaussianNBAnalyzer:
"""Comprehensive Gaussian Naive Bayes analysis"""
def __init__(self):
self.models = {}
self.results = {}
def visualize_gaussian_assumption(self, X, y):
"""Visualize the Gaussian assumption for features"""
# The X and y arguments are not used; the Iris dataset is reloaded here for visualization
iris = load_iris()
X_iris = iris.data[:, :2] # Use first 2 features
y_iris = iris.target
# Fit Gaussian NB
gnb = GaussianNB()
gnb.fit(X_iris, y_iris)
# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_iris[:, 0].min() - 1, X_iris[:, 0].max() + 1
y_min, y_max = X_iris[:, 1].min() - 1, X_iris[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = gnb.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Decision boundary
axes[0, 0].contourf(xx, yy, gnb.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape),
alpha=0.3, cmap='viridis')
scatter = axes[0, 0].scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris,
cmap='viridis', edgecolor='black', linewidth=0.5)
axes[0, 0].set_xlabel(iris.feature_names[0])
axes[0, 0].set_ylabel(iris.feature_names[1])
axes[0, 0].set_title('Decision Boundaries')
plt.colorbar(scatter, ax=axes[0, 0])
# Feature distributions per class
for class_idx in range(3):
mask = y_iris == class_idx
# Feature 1 distribution
axes[0, 1].hist(X_iris[mask, 0], alpha=0.5, bins=15,
label=f'Class {class_idx}', density=True)
# Fit and plot Gaussian
mean = gnb.theta_[class_idx, 0]
var = gnb.var_[class_idx, 0]
x_range = np.linspace(X_iris[:, 0].min(), X_iris[:, 0].max(), 100)
gaussian = (1/np.sqrt(2*np.pi*var)) * np.exp(-0.5*((x_range-mean)**2/var))
axes[0, 1].plot(x_range, gaussian, linewidth=2)
axes[0, 1].set_xlabel(iris.feature_names[0])
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('Feature 1: Gaussian Fits')
axes[0, 1].legend()
# Feature 2 distribution
for class_idx in range(3):
mask = y_iris == class_idx
axes[0, 2].hist(X_iris[mask, 1], alpha=0.5, bins=15,
label=f'Class {class_idx}', density=True)
# Fit and plot Gaussian
mean = gnb.theta_[class_idx, 1]
var = gnb.var_[class_idx, 1]
y_range = np.linspace(X_iris[:, 1].min(), X_iris[:, 1].max(), 100)
gaussian = (1/np.sqrt(2*np.pi*var)) * np.exp(-0.5*((y_range-mean)**2/var))
axes[0, 2].plot(y_range, gaussian, linewidth=2)
axes[0, 2].set_xlabel(iris.feature_names[1])
axes[0, 2].set_ylabel('Density')
axes[0, 2].set_title('Feature 2: Gaussian Fits')
axes[0, 2].legend()
# Probability contours
axes[1, 0].contourf(xx, yy, Z, levels=20, alpha=0.7, cmap='RdYlBu_r')
axes[1, 0].scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris,
cmap='viridis', edgecolor='black', linewidth=0.5, s=30)
axes[1, 0].set_xlabel(iris.feature_names[0])
axes[1, 0].set_ylabel(iris.feature_names[1])
axes[1, 0].set_title('Probability Contours (Class 1)')
# Learned parameters
params_text = "Learned Parameters:\n\n"
for class_idx in range(3):
params_text += f"Class {class_idx}:\n"
params_text += f" Prior: {gnb.class_prior_[class_idx]:.3f}\n"
params_text += f" Mean: {gnb.theta_[class_idx]}\n"
params_text += f" Var: {gnb.var_[class_idx]}\n\n"
axes[1, 1].text(0.1, 0.5, params_text, fontsize=10,
verticalalignment='center', family='monospace')
axes[1, 1].set_title('Learned Parameters')
axes[1, 1].axis('off')
# Confusion matrix
y_pred = gnb.predict(X_iris)
cm = confusion_matrix(y_iris, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 2])
axes[1, 2].set_xlabel('Predicted')
axes[1, 2].set_ylabel('Actual')
axes[1, 2].set_title(f'Confusion Matrix (Acc: {accuracy_score(y_iris, y_pred):.3f})')
plt.suptitle('Gaussian Naive Bayes Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return gnb
def compare_with_different_variances(self, n_samples=1000):
"""Compare performance with different feature variances"""
# Generate datasets with different variances
variances = [0.5, 1.0, 2.0, 5.0]
results = []
fig, axes = plt.subplots(2, len(variances), figsize=(16, 8))
for idx, var in enumerate(variances):
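# Generate data; the value is passed as class_sep, so it controls how far apart
# the classes are rather than the within-class feature variance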
X, y = make_classification(n_samples=n_samples, n_features=2,
n_informative=2, n_redundant=0,
n_clusters_per_class=1, class_sep=var,
random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train Gaussian NB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Evaluate
train_score = gnb.score(X_train, y_train)
test_score = gnb.score(X_test, y_test)
results.append({'variance': var, 'train': train_score, 'test': test_score})
# Plot data distribution
axes[0, idx].scatter(X[:, 0], X[:, 1], c=y, alpha=0.5, cmap='viridis')
axes[0, idx].set_title(f'Variance: {var}')
axes[0, idx].set_xlabel('Feature 1')
axes[0, idx].set_ylabel('Feature 2')
# Plot decision boundary
h = 0.5
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = gnb.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[1, idx].contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
axes[1, idx].scatter(X_test[:, 0], X_test[:, 1], c=y_test,
cmap='viridis', edgecolor='black', linewidth=0.5, s=30)
axes[1, idx].set_title(f'Test Acc: {test_score:.3f}')
axes[1, idx].set_xlabel('Feature 1')
axes[1, idx].set_ylabel('Feature 2')
plt.suptitle('Effect of Feature Variance on Gaussian NB', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return pd.DataFrame(results)
# Gaussian NB Analysis
gaussian_analyzer = GaussianNBAnalyzer()
print("\n" + "="*60)
print("GAUSSIAN NAIVE BAYES")
print("="*60)
print("\n1. Visualizing Gaussian Assumptions:")
iris = load_iris()
gnb_model = gaussian_analyzer.visualize_gaussian_assumption(iris.data, iris.target)
print("\n2. Effect of Feature Variance:")
variance_results = gaussian_analyzer.compare_with_different_variances()
print("\nResults by variance:")
print(variance_results)
class TextClassificationNB:
"""Naive Bayes for text classification"""
def __init__(self):
self.models = {}
self.vectorizers = {}
def compare_nb_variants_text(self):
"""Compare different NB variants for text classification"""
# Create sample text dataset
documents = [
# Sports
"The team won the championship game last night",
"Players trained hard for the upcoming match",
"The basketball season starts next month",
"Football fans celebrated the victory",
"Athletes prepare for Olympic games",
# Technology
"New smartphone features artificial intelligence",
"Software developers release updated version",
"Cloud computing transforms business operations",
"Machine learning algorithms improve accuracy",
"Cybersecurity threats increase globally",
# Food
"Italian restaurant serves authentic pasta",
"Fresh ingredients make better recipes",
"Cooking classes teach culinary skills",
"Local farmers market sells organic produce",
"Chef prepares gourmet meal for guests"
]
labels = [0, 0, 0, 0, 0, # Sports
1, 1, 1, 1, 1, # Technology
2, 2, 2, 2, 2] # Food
label_names = ['Sports', 'Technology', 'Food']
# Vectorize text
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(documents)
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
# Binary vectorizer
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(documents)
# Compare different NB variants
models = {
'Multinomial (Counts)': (MultinomialNB(), X_counts),
'Multinomial (TF-IDF)': (MultinomialNB(), X_tfidf),
'Bernoulli': (BernoulliNB(), X_binary),
'Complement': (ComplementNB(), X_counts)
}
# Cross-validation scores
cv_scores = {}
for name, (model, X) in models.items():
scores = cross_val_score(model, X, labels, cv=3)
cv_scores[name] = scores
model.fit(X, labels) # Fit for later use
self.models[name] = model
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 10))  # 2x3 grid: the per-class plots below use column index 2
axes[1, 2].axis('off')  # Unused panel
# CV scores comparison
axes[0, 0].boxplot(list(cv_scores.values()))
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Cross-Validation Scores')
axes[0, 0].set_xticklabels(list(cv_scores.keys()), rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3, axis='y')
# Feature importance (top words per class)
mnb = self.models['Multinomial (Counts)']
feature_names = count_vectorizer.get_feature_names_out()
# Get log probabilities for each class
for class_idx, class_name in enumerate(label_names):
# Get top features for this class
log_prob = mnb.feature_log_prob_[class_idx]
top_indices = np.argsort(log_prob)[-10:][::-1]
top_words = [feature_names[i] for i in top_indices]
top_probs = np.exp(log_prob[top_indices])
# Plot
ax_idx = class_idx + 1 if class_idx < 2 else 3
row = 0 if class_idx < 2 else 1
col = ax_idx if class_idx < 2 else 1
axes[row, col].barh(range(10), top_probs, color=f'C{class_idx}')
axes[row, col].set_yticks(range(10))
axes[row, col].set_yticklabels(top_words)
axes[row, col].set_xlabel('Probability')
axes[row, col].set_title(f'Top Words: {class_name}')
axes[row, col].grid(True, alpha=0.3, axis='x')
# Model comparison summary
summary_text = "Model Comparison:\n\n"
for name, scores in cv_scores.items():
summary_text += f"{name}:\n"
summary_text += f" Mean: {scores.mean():.3f}\n"
summary_text += f" Std: {scores.std():.3f}\n\n"
axes[1, 0].text(0.1, 0.5, summary_text, fontsize=10,
verticalalignment='center', family='monospace')
axes[1, 0].set_title('Summary Statistics')
axes[1, 0].axis('off')
plt.suptitle('Naive Bayes Text Classification Comparison',
fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return cv_scores
def spam_detection_example(self):
"""Implement spam detection with Naive Bayes"""
# Create spam/ham dataset
messages = [
# Spam
"WINNER! You've won $1000 cash prize! Click here now!",
"Free credit report! Limited time offer! Act now!",
"Congratulations! You've been selected for a free vacation!",
"URGENT: Your account will be closed. Verify immediately!",
"Make money fast! Work from home! Guaranteed income!",
"Hot singles in your area! Click to meet them now!",
"Lose weight fast with this one simple trick!",
"Your prescription is ready. Order pills online cheap!",
# Ham (legitimate)
"Meeting scheduled for tomorrow at 2pm",
"Can you review the attached document?",
"Thanks for your help with the project",
"Dinner plans confirmed for Saturday",
"Your package has been delivered",
"Reminder: Doctor appointment next Tuesday",
"Happy birthday! Hope you have a great day",
"Please submit your report by Friday"
]
labels = np.array([1]*8 + [0]*8) # 1=spam, 0=ham; an array so boolean masks such as labels == 1 work
# Additional features
features_df = pd.DataFrame({
'message': messages,
'length': [len(m) for m in messages],
'exclamation': [m.count('!') for m in messages],
'capitals': [sum(1 for c in m if c.isupper())/len(m) for m in messages],
'dollar': [m.count('$') for m in messages]
})
# Vectorize text
vectorizer = TfidfVectorizer(max_features=50)
X_text = vectorizer.fit_transform(messages)
# Combine text and numerical features
X_numerical = features_df[['length', 'exclamation', 'capitals', 'dollar']].values
X_combined = np.hstack([X_text.toarray(), X_numerical])
# Train models
nb_text = MultinomialNB()
nb_text.fit(X_text, labels)
gnb_combined = GaussianNB()
gnb_combined.fit(X_combined, labels)
# Predictions and probabilities
prob_text = nb_text.predict_proba(X_text)
prob_combined = gnb_combined.predict_proba(X_combined)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Probability distribution
axes[0, 0].hist(prob_text[labels==0, 1], alpha=0.5, bins=10,
label='Ham', color='green')
axes[0, 0].hist(prob_text[labels==1, 1], alpha=0.5, bins=10,
label='Spam', color='red')
axes[0, 0].set_xlabel('Spam Probability (Text Only)')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('Probability Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Feature importance
feature_importance = nb_text.feature_log_prob_[1] - nb_text.feature_log_prob_[0]
top_spam_indices = np.argsort(feature_importance)[-5:]
top_ham_indices = np.argsort(feature_importance)[:5]
feature_names = vectorizer.get_feature_names_out()
spam_words = [feature_names[i] for i in top_spam_indices]
ham_words = [feature_names[i] for i in top_ham_indices]
y_pos = np.arange(5)
axes[0, 1].barh(y_pos, feature_importance[top_spam_indices],
color='red', alpha=0.7)
axes[0, 1].barh(y_pos + 5, feature_importance[top_ham_indices],
color='green', alpha=0.7)
axes[0, 1].set_yticks(np.arange(10))
axes[0, 1].set_yticklabels(spam_words + ham_words)
axes[0, 1].set_xlabel('Log Probability Difference')
axes[0, 1].set_title('Spam vs Ham Indicators')
axes[0, 1].grid(True, alpha=0.3, axis='x')
# Numerical features analysis
for idx, (feature, color) in enumerate(zip(['exclamation', 'capitals'],
['orange', 'purple'])):
axes[1, 0].scatter(features_df[feature][labels==0],
prob_text[labels==0, 1],
alpha=0.5, label=f'Ham ({feature})',
color=color, marker='o')
axes[1, 0].scatter(features_df[feature][labels==1],
prob_text[labels==1, 1],
alpha=0.5, label=f'Spam ({feature})',
color=color, marker='^')
axes[1, 0].set_xlabel('Feature Value')
axes[1, 0].set_ylabel('Spam Probability')
axes[1, 0].set_title('Feature Correlation with Spam')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Model comparison
pred_text = nb_text.predict(X_text)
pred_combined = gnb_combined.predict(X_combined)
results_text = f"Model Performance:\n\n"
results_text += f"Text Only Model:\n"
results_text += f" Accuracy: {accuracy_score(labels, pred_text):.3f}\n"
results_text += f" Predicted spam: {pred_text.sum()} (actual spam: {sum(labels)})\n\n"
results_text += f"Combined Features Model:\n"
results_text += f" Accuracy: {accuracy_score(labels, pred_combined):.3f}\n"
results_text += f" Predicted spam: {pred_combined.sum()} (actual spam: {sum(labels)})\n"
axes[1, 1].text(0.1, 0.5, results_text, fontsize=11,
verticalalignment='center', family='monospace')
axes[1, 1].set_title('Performance Summary')
axes[1, 1].axis('off')
plt.suptitle('Spam Detection with Naive Bayes', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return nb_text, gnb_combined
# Text classification
text_classifier = TextClassificationNB()
print("\n" + "="*60)
print("TEXT CLASSIFICATION WITH NAIVE BAYES")
print("="*60)
print("\n1. Comparing NB Variants for Text:")
text_scores = text_classifier.compare_nb_variants_text()
print("\n2. Spam Detection Example:")
spam_model, combined_model = text_classifier.spam_detection_example()
class AdvancedNaiveBayes:
"""Advanced Naive Bayes techniques"""
def __init__(self):
self.models = {}
def handle_zero_probability(self):
"""Demonstrate Laplace smoothing for zero probability problem"""
# Create dataset with rare features
X_train = np.array([
[1, 1, 0],
[1, 1, 0],
[0, 1, 1],
[0, 0, 1]
])
y_train = np.array([0, 0, 1, 1])
# Test sample with unseen feature combination
X_test = np.array([[1, 0, 0]]) # Feature 0 never occurs with class 1 in training, so its unsmoothed likelihood is zero
# Compare with and without smoothing
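# Effectively no smoothing. alpha=0 exactly can produce -inf log-probabilities
# (older scikit-learn versions clip it to 1e-10 with a warning)
mnb_no_smooth = MultinomialNB(alpha=1e-10)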
mnb_smooth = MultinomialNB(alpha=1.0) # Laplace smoothing
mnb_no_smooth.fit(X_train, y_train)
mnb_smooth.fit(X_train, y_train)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Feature probabilities without smoothing
feature_prob_no_smooth = np.exp(mnb_no_smooth.feature_log_prob_)
im1 = axes[0].imshow(feature_prob_no_smooth, cmap='YlOrRd', vmin=0, vmax=1)
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Class')
axes[0].set_title('Without Smoothing (α=0)')
plt.colorbar(im1, ax=axes[0])
# Add text annotations
for i in range(2):
for j in range(3):
axes[0].text(j, i, f'{feature_prob_no_smooth[i, j]:.2f}',
ha='center', va='center')
# Feature probabilities with smoothing
feature_prob_smooth = np.exp(mnb_smooth.feature_log_prob_)
im2 = axes[1].imshow(feature_prob_smooth, cmap='YlOrRd', vmin=0, vmax=1)
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Class')
axes[1].set_title('With Laplace Smoothing (α=1)')
plt.colorbar(im2, ax=axes[1])
for i in range(2):
for j in range(3):
axes[1].text(j, i, f'{feature_prob_smooth[i, j]:.2f}',
ha='center', va='center')
# Effect of different alpha values
alphas = np.logspace(-3, 1, 20)
test_probs = []
for alpha in alphas:
mnb_temp = MultinomialNB(alpha=alpha)
mnb_temp.fit(X_train, y_train)
prob = mnb_temp.predict_proba(X_test)[0, 0]
test_probs.append(prob)
axes[2].plot(alphas, test_probs, marker='o', linewidth=2)
axes[2].set_xscale('log')
axes[2].set_xlabel('Alpha (Smoothing Parameter)')
axes[2].set_ylabel('P(Class 0 | Test Sample)')
axes[2].set_title('Effect of Smoothing on Prediction')
axes[2].grid(True, alpha=0.3)
axes[2].axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.suptitle('Handling Zero Probability with Laplace Smoothing',
fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("\nZero Probability Problem:")
print(f"Without smoothing - P(Class 0): {mnb_no_smooth.predict_proba(X_test)[0, 0]:.4f}")
print(f"With smoothing - P(Class 0): {mnb_smooth.predict_proba(X_test)[0, 0]:.4f}")
def semi_supervised_nb(self, n_labeled=50):
"""Semi-supervised learning with Naive Bayes"""
# Generate dataset
X, y = make_classification(n_samples=500, n_features=20,
n_informative=15, n_redundant=5,
n_classes=3, random_state=42)
# Create semi-supervised scenario
# Only label first n_labeled samples
y_semi = y.copy()
y_semi[n_labeled:] = -1 # Unlabeled
# Self-training approach
from sklearn.semi_supervised import SelfTrainingClassifier
base_nb = GaussianNB()
self_training_nb = SelfTrainingClassifier(base_nb, threshold=0.75)
# Train on labeled data only
X_labeled = X[:n_labeled]
y_labeled = y[:n_labeled]
nb_supervised = GaussianNB()
nb_supervised.fit(X_labeled, y_labeled)
# Self-training (uses unlabeled data)
self_training_nb.fit(X, y_semi)
# Evaluate
X_test = X[400:]
y_test = y[400:]
acc_supervised = nb_supervised.score(X_test, y_test)
acc_semi = self_training_nb.score(X_test, y_test)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Learning curves
n_labeled_range = [10, 20, 30, 40, 50, 75, 100]
acc_sup_list = []
acc_semi_list = []
for n in n_labeled_range:
y_temp = y.copy()
y_temp[n:] = -1
# Supervised
nb_temp = GaussianNB()
nb_temp.fit(X[:n], y[:n])
acc_sup_list.append(nb_temp.score(X_test, y_test))
# Semi-supervised
st_temp = SelfTrainingClassifier(GaussianNB(), threshold=0.75)
st_temp.fit(X, y_temp)
acc_semi_list.append(st_temp.score(X_test, y_test))
axes[0].plot(n_labeled_range, acc_sup_list, 'o-', label='Supervised')
axes[0].plot(n_labeled_range, acc_semi_list, 's-', label='Semi-supervised')
axes[0].set_xlabel('Number of Labeled Samples')
axes[0].set_ylabel('Test Accuracy')
axes[0].set_title('Learning Curves')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Confidence distribution
proba_supervised = nb_supervised.predict_proba(X[n_labeled:400])
proba_semi = self_training_nb.predict_proba(X[n_labeled:400])
max_prob_sup = np.max(proba_supervised, axis=1)
max_prob_semi = np.max(proba_semi, axis=1)
axes[1].hist(max_prob_sup, alpha=0.5, bins=20, label='Supervised')
axes[1].hist(max_prob_semi, alpha=0.5, bins=20, label='Semi-supervised')
axes[1].set_xlabel('Maximum Class Probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Prediction Confidence')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Performance comparison
methods = ['Supervised\nOnly', 'Semi-supervised\n(Self-training)']
accuracies = [acc_supervised, acc_semi]
bars = axes[2].bar(methods, accuracies, color=['coral', 'lightgreen'])
axes[2].set_ylabel('Test Accuracy')
axes[2].set_title(f'Performance with {n_labeled} Labeled Samples')
axes[2].set_ylim(0, 1)
axes[2].grid(True, alpha=0.3, axis='y')
# Add value labels
for bar, acc in zip(bars, accuracies):
axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height(),
f'{acc:.3f}', ha='center', va='bottom')
plt.suptitle('Semi-supervised Learning with Naive Bayes',
fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f"\nSemi-supervised Learning Results:")
print(f" Labeled samples: {n_labeled}")
print(f" Supervised accuracy: {acc_supervised:.3f}")
print(f" Semi-supervised accuracy: {acc_semi:.3f}")
print(f" Improvement: {(acc_semi - acc_supervised)*100:.1f} percentage points")
# Advanced techniques
advanced_nb = AdvancedNaiveBayes()
print("\n" + "="*60)
print("ADVANCED NAIVE BAYES TECHNIQUES")
print("="*60)
print("\n1. Handling Zero Probability:")
advanced_nb.handle_zero_probability()
print("\n2. Semi-supervised Learning:")
advanced_nb.semi_supervised_nb()
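The smoothed estimates plotted by `handle_zero_probability` can be checked by hand. The sketch below applies the usual additive-smoothing estimate, (N_ci + α) / (N_c + α·n), to the same toy counts in plain Python. The helper name `smoothed_likelihoods` is hypothetical and the code is an illustration, not scikit-learn's implementation.

```python
# Same toy counts as in handle_zero_probability
X_train = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
y_train = [0, 0, 1, 1]

def smoothed_likelihoods(X, y, alpha):
    """Estimate P(feature | class) as (N_ci + alpha) / (N_c + alpha * n_features)."""
    n_features = len(X[0])
    result = {}
    for c in sorted(set(y)):
        # N_ci: total count of feature i over all samples of class c
        counts = [
            sum(row[i] for row, label in zip(X, y) if label == c)
            for i in range(n_features)
        ]
        total = sum(counts)  # N_c: total feature count for class c
        result[c] = [(n + alpha) / (total + alpha * n_features) for n in counts]
    return result

unsmoothed = smoothed_likelihoods(X_train, y_train, alpha=0.0)
smoothed = smoothed_likelihoods(X_train, y_train, alpha=1.0)
print(unsmoothed[1])  # feature 0 has probability 0 for class 1
print(smoothed[1])    # every feature now has a non-zero probability
```

Without smoothing, any test sample containing feature 0 gets zero likelihood under class 1, no matter what the other features say. With α = 1 the estimate becomes 1/6, so the remaining features can still influence the posterior.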
print("\n" + "="*60)
print("NAIVE BAYES BEST PRACTICES")
print("="*60)
best_practices = """
KEY GUIDELINES:
1. CHOOSING THE RIGHT VARIANT:
• Gaussian NB: Continuous features, normal distribution
• Multinomial NB: Count data, text classification
• Bernoulli NB: Binary features, document classification
• Complement NB: Imbalanced datasets
2. DATA PREPROCESSING:
• Transform strongly non-Gaussian features for Gaussian NB (plain rescaling rarely changes its predictions)
• Use appropriate vectorization for text (Count, TF-IDF)
• Handle missing values before training
• Consider log-transform for skewed features
3. HANDLING COMMON ISSUES:
• Zero probability: Use Laplace smoothing (alpha > 0)
• Correlated features: Consider feature selection
• Imbalanced classes: Adjust priors or use Complement NB
• Continuous features: Check normality assumption
4. ADVANTAGES TO LEVERAGE:
✓ Fast training and prediction
✓ Works with small training sets
✓ Provides probability estimates
✓ Naturally handles multi-class
✓ Good baseline model
5. LIMITATIONS TO CONSIDER:
✗ Assumes feature independence
✗ Sensitive to feature representation
✗ May underperform complex models
✗ Probability estimates can be poor
6. WHEN TO USE NAIVE BAYES:
• Text classification (spam, sentiment)
• Real-time prediction needed
• Small training dataset
• Multi-class problems
• Baseline model needed
• Interpretable model needed (inspectable per-class feature likelihoods)
"""
print(best_practices)
# Performance comparison
comparison_data = {
'Aspect': ['Training Speed', 'Prediction Speed', 'Small Data', 'Large Data',
'Interpretability', 'Feature Independence'],
'Naive Bayes': ['Fast', 'Fast', 'Good', 'Good', 'High', 'Required'],
'Logistic Reg': ['Medium', 'Fast', 'Poor', 'Good', 'High', 'Not required'],
'SVM': ['Slow', 'Medium', 'Good', 'Poor', 'Low', 'Not required'],
'Random Forest': ['Medium', 'Fast', 'Poor', 'Good', 'Medium', 'Not required'],
'Neural Network': ['Slow', 'Fast', 'Poor', 'Excellent', 'Low', 'Not required']
}
comparison_df = pd.DataFrame(comparison_data)
print("\nClassifier Comparison:")
print("="*60)
print(comparison_df.to_string(index=False))
# Implementation checklist
checklist = """
NAIVE BAYES IMPLEMENTATION CHECKLIST:
□ Choose appropriate NB variant for data type
□ Preprocess features appropriately
□ Handle missing values
□ Consider feature scaling (for Gaussian)
□ Set smoothing parameter (for discrete)
□ Check feature independence assumption
□ Validate with cross-validation
□ Compare with baseline models
□ Examine prediction probabilities
□ Test on holdout set
□ Monitor for concept drift in production
"""
print(checklist)
Build your own Naive Bayes classifier from scratch:
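One possible starting point is the minimal Gaussian Naive Bayes below, written in plain Python so that every step is visible. The class name `TinyGaussianNB`, the `var_floor` parameter, and the toy data are all assumptions made for this sketch. It is not optimized and does not mirror scikit-learn's internals.

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes (illustrative, not optimized)."""

    def __init__(self, var_floor=1e-9):
        self.var_floor = var_floor  # keeps variances strictly positive

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.stats_ = {}
        for label, rows in groups.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            variances = [
                sum((v - m) ** 2 for v in col) / n + self.var_floor
                for col, m in zip(columns, means)
            ]
            # Store log prior plus per-feature mean and variance
            self.stats_[label] = (math.log(n / len(X)), means, variances)
        return self

    def _log_joint(self, row, label):
        log_prior, means, variances = self.stats_[label]
        # Sum of per-feature Gaussian log-densities (naive independence)
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_lik

    def predict(self, X):
        # Pick the class with the highest log joint probability per sample
        return [
            max(self.stats_, key=lambda c: self._log_joint(row, c))
            for row in X
        ]

# Hypothetical toy data: two well-separated clusters
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [4.0, 5.2], [4.1, 4.8], [3.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
model = TinyGaussianNB().fit(X, y)
predictions = model.predict([[1.1, 2.0], [4.0, 5.1]])
print(predictions)
```

Working in log space avoids numerical underflow when many small densities are multiplied, which is also why the class prior is stored as a logarithm.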
Build a complete sentiment analysis pipeline:
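A pipeline of this kind could be sketched with scikit-learn's `make_pipeline`, chaining a TF-IDF vectorizer and `MultinomialNB`. The reviews and labels below are invented toy data, and a realistic pipeline would need a much larger labeled corpus plus a held-out evaluation. The variable names are assumptions for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews (illustrative only)
reviews = [
    "I loved this movie, it was fantastic",
    "Great acting and a wonderful story",
    "An excellent, enjoyable film",
    "Absolutely wonderful experience, loved it",
    "I hated this movie, it was terrible",
    "Awful acting and a boring story",
    "A terrible, disappointing film",
    "Absolutely boring experience, hated it",
]
sentiments = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# Vectorization and classification are fit together as one estimator
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(alpha=1.0),
)
sentiment_pipeline.fit(reviews, sentiments)

new_reviews = ["a wonderful and fantastic film", "boring and terrible acting"]
predicted = sentiment_pipeline.predict(new_reviews)
probabilities = sentiment_pipeline.predict_proba(new_reviews)
print(predicted)
print(probabilities.round(3))
```

Keeping the vectorizer inside the pipeline means cross-validation refits the vocabulary on each training fold, which avoids leaking test-fold vocabulary into training.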
Implement online learning with Naive Bayes:
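In scikit-learn, incremental training is available through `partial_fit`, which `MultinomialNB` supports. The sketch below pairs it with a stateless `HashingVectorizer` so that no vocabulary has to be refit as new batches arrive. The message batches are hypothetical, and the feature-space size is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless, so it needs no fitting between batches.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
online_nb = MultinomialNB()
all_classes = np.array([0, 1])  # all classes must be known at the first partial_fit call

# Hypothetical stream of (messages, labels) mini-batches; 1 = spam, 0 = ham
batches = [
    (["win a free prize now", "meeting moved to friday"], [1, 0]),
    (["free cash offer, click now", "please review the report"], [1, 0]),
    (["claim your free vacation prize", "lunch on friday?"], [1, 0]),
]

for texts, batch_labels in batches:
    X_batch = vectorizer.transform(texts)
    # Each call updates the stored counts instead of retraining from scratch
    online_nb.partial_fit(X_batch, batch_labels, classes=all_classes)

stream_pred = online_nb.predict(
    vectorizer.transform(["free prize offer", "report due friday"])
)
print(stream_pred)
```

Because Naive Bayes only accumulates per-class counts, updating with a new batch gives the same model as training on all batches at once, which makes it a natural fit for streaming data.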