Model Selection & Tuning
Optimize Your Machine Learning Pipeline! 🎯
Model selection and hyperparameter tuning are critical steps in building high-performing machine learning systems. Learn systematic approaches to choose the best algorithm, optimize hyperparameters, avoid overfitting, and build robust models that generalize well to new data. Master the techniques that separate good models from great ones.
Model Selection Framework
graph TD
A[Model Selection Pipeline] --> B[Algorithm Selection]
A --> C[Hyperparameter Tuning]
A --> D[Model Evaluation]
B --> E[Compare Algorithms]
B --> F[Baseline Models]
B --> G[Ensemble Methods]
C --> H[Grid Search]
C --> I[Random Search]
C --> J[Bayesian Optimization]
C --> K[Genetic Algorithms]
D --> L[Cross-Validation]
D --> M[Validation Curves]
D --> N[Learning Curves]
D --> O[Final Evaluation]
L --> P[K-Fold]
L --> Q[Stratified K-Fold]
L --> R[Time Series Split]
O --> S[Test Set Performance]
style A fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#bbf,stroke:#333,stroke-width:2px
style D fill:#9f9,stroke:#333,stroke-width:2px
Cross-Validation Strategies
Understanding Different CV Methods
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (
KFold, StratifiedKFold, TimeSeriesSplit,
RepeatedKFold, RepeatedStratifiedKFold,
LeaveOneOut, LeavePOut, ShuffleSplit,
cross_val_score, cross_validate, validation_curve, learning_curve
)
from sklearn.datasets import make_classification, make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# Generate datasets
np.random.seed(42)
X_class, y_class = make_classification(
n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, n_classes=3, random_state=42
)
X_reg, y_reg = make_regression(
n_samples=1000, n_features=20, n_informative=15,
noise=0.1, random_state=42
)
class CrossValidationAnalyzer:
"""Comprehensive Cross-Validation Analysis"""
def __init__(self):
self.cv_results = {}
self.best_params = {}
def visualize_cv_strategies(self, X, y, n_splits=5):
"""Visualize different cross-validation strategies"""
strategies = {
'KFold': KFold(n_splits=n_splits, shuffle=True, random_state=42),
'StratifiedKFold': StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42),
'ShuffleSplit': ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42),
'TimeSeriesSplit': TimeSeriesSplit(n_splits=n_splits)
}
fig, axes = plt.subplots(4, 1, figsize=(12, 10))
n_samples = 100 # Use subset for visualization
X_vis = X[:n_samples]
y_vis = y[:n_samples]
for idx, (name, cv) in enumerate(strategies.items()):
# Create visualization
ax = axes[idx]
# Plot each fold
for fold, (train_idx, test_idx) in enumerate(cv.split(X_vis, y_vis)):
# Create array for visualization
indices = np.zeros(n_samples)
indices[train_idx] = 1
indices[test_idx] = 2
# Plot
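# Pin the color scale so 0 (not used), 1 (training) and 2 (testing)
# map to the same colors in every panel, matching the legend
ax.scatter(range(n_samples), [fold] * n_samples,
c=indices, cmap='coolwarm', vmin=0, vmax=2, s=10, alpha=0.8)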
ax.set_xlim(0, n_samples)
ax.set_ylim(-0.5, n_splits - 0.5)
ax.set_ylabel('Fold')
ax.set_title(f'{name} Cross-Validation')
ax.grid(True, alpha=0.3)
# Add legend for first plot only
if idx == 0:
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='blue', alpha=0.8, label='Not used'),
Patch(facecolor='lightgray', alpha=0.8, label='Training'),
Patch(facecolor='red', alpha=0.8, label='Testing')
]
ax.legend(handles=legend_elements, loc='upper right')
axes[-1].set_xlabel('Sample Index')
plt.suptitle('Cross-Validation Strategies Visualization', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
def compare_cv_strategies(self, X, y, model, cv_strategies):
"""Compare different CV strategies"""
results = {}
for name, cv in cv_strategies.items():
# Perform cross-validation
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
results[name] = {
'mean_score': scores.mean(),
'std_score': scores.std(),
'scores': scores,
'n_splits': cv.get_n_splits() if hasattr(cv, 'get_n_splits') else len(scores)
}
# Create comparison DataFrame
comparison_df = pd.DataFrame({
name: [res['mean_score'], res['std_score'], res['n_splits']]
for name, res in results.items()
}).T
comparison_df.columns = ['Mean Score', 'Std Dev', 'N Splits']
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Bar plot of mean scores with error bars
names = list(results.keys())
means = [results[n]['mean_score'] for n in names]
stds = [results[n]['std_score'] for n in names]
axes[0].bar(names, means, yerr=stds, capsize=5, alpha=0.7)
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Cross-Validation Strategy Comparison')
axes[0].set_ylim([min(means) - 0.1, max(means) + 0.1])
axes[0].grid(True, alpha=0.3)
# Add value labels
for i, (mean, std) in enumerate(zip(means, stds)):
axes[0].text(i, mean + std + 0.01, f'{mean:.3f}±{std:.3f}',
ha='center', fontsize=9)
# Box plot of score distributions
score_data = [results[n]['scores'] for n in names]
axes[1].boxplot(score_data, labels=names)
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Score Distributions')
axes[1].grid(True, alpha=0.3)
plt.suptitle('Cross-Validation Strategy Performance', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return comparison_df, results
def nested_cross_validation(self, X, y, model, param_grid,
inner_cv=5, outer_cv=5):
"""Perform nested cross-validation for unbiased evaluation"""
from sklearn.model_selection import GridSearchCV
outer_scores = []
best_params_list = []
# Outer CV loop
outer_cv_split = KFold(n_splits=outer_cv, shuffle=True, random_state=42)
print("Performing Nested Cross-Validation...")
print("="*50)
for fold, (train_idx, test_idx) in enumerate(outer_cv_split.split(X, y)):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Inner CV for hyperparameter tuning
inner_cv_split = KFold(n_splits=inner_cv, shuffle=True, random_state=42)
grid_search = GridSearchCV(
model, param_grid, cv=inner_cv_split,
scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Evaluate on outer fold test set
score = grid_search.score(X_test, y_test)
outer_scores.append(score)
best_params_list.append(grid_search.best_params_)
print(f"Fold {fold+1}: Score={score:.3f}, Best params={grid_search.best_params_}")
print(f"\nNested CV Score: {np.mean(outer_scores):.3f} ± {np.std(outer_scores):.3f}")
return outer_scores, best_params_list
# Initialize analyzer
cv_analyzer = CrossValidationAnalyzer()
print("="*60)
print("CROSS-VALIDATION STRATEGIES")
print("="*60)
# Visualize CV strategies
print("\nVisualizing cross-validation strategies...")
cv_analyzer.visualize_cv_strategies(X_class, y_class)
# Compare CV strategies
print("\nComparing CV strategies...")
model = RandomForestClassifier(n_estimators=50, random_state=42)
cv_strategies = {
'KFold': KFold(n_splits=5, shuffle=True, random_state=42),
'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
'RepeatedKFold': RepeatedKFold(n_splits=5, n_repeats=2, random_state=42),
'ShuffleSplit': ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
}
comparison_df, results = cv_analyzer.compare_cv_strategies(
X_class, y_class, model, cv_strategies
)
print("\nCV Strategy Comparison:")
print(comparison_df)
# Nested cross-validation
print("\n" + "="*60)
print("NESTED CROSS-VALIDATION")
print("="*60)
param_grid = {
'n_estimators': [50, 100],
'max_depth': [5, 10, None],
'min_samples_split': [2, 5]
}
nested_scores, nested_params = cv_analyzer.nested_cross_validation(
X_class[:500], y_class[:500], # Use subset for speed
RandomForestClassifier(random_state=42),
param_grid,
inner_cv=3,
outer_cv=5
)
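The manual loop above can also be written more compactly, because a `GridSearchCV` object is itself an estimator and can be passed to `cross_val_score`. The following is a minimal sketch under assumed settings: the dataset, the tiny parameter grid, and the names `X_nested`, `y_nested`, and `tuned_forest` are illustrative and not part of the tutorial's own code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Small synthetic problem (illustrative data only)
X_nested, y_nested = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the hyperparameter search, wrapped as a single estimator
tuned_forest = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each outer fold reruns the whole search on its training part,
# so the outer test fold never influences the chosen hyperparameters
nested_demo_scores = cross_val_score(
    tuned_forest, X_nested, y_nested, cv=outer_cv, scoring="accuracy"
)
print(f"Nested CV accuracy: {nested_demo_scores.mean():.3f} "
      f"± {nested_demo_scores.std():.3f}")
```

This form does not report the best parameters per outer fold; use `cross_validate(..., return_estimator=True)` or the explicit loop when those are needed.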
Hyperparameter Tuning Methods
Grid Search, Random Search, and Bayesian Optimization
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
class HyperparameterTuner:
"""Advanced hyperparameter tuning methods"""
def __init__(self):
self.search_results = {}
self.best_models = {}
def grid_search_tuning(self, X, y, model, param_grid):
"""Exhaustive grid search"""
print("Performing Grid Search...")
grid_search = GridSearchCV(
model, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=1,
return_train_score=True
)
grid_search.fit(X, y)
self.search_results['grid'] = pd.DataFrame(grid_search.cv_results_)
self.best_models['grid'] = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
return grid_search
def random_search_tuning(self, X, y, model, param_distributions, n_iter=100):
"""Random search with distributions"""
print("Performing Random Search...")
random_search = RandomizedSearchCV(
model, param_distributions,
n_iter=n_iter,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=1,
random_state=42,
return_train_score=True
)
random_search.fit(X, y)
self.search_results['random'] = pd.DataFrame(random_search.cv_results_)
self.best_models['random'] = random_search.best_estimator_
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.3f}")
return random_search
def bayesian_optimization(self, X, y, model_class, search_space, n_calls=50):
"""Bayesian optimization using scikit-optimize"""
try:
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
print("Performing Bayesian Optimization...")
bayes_search = BayesSearchCV(
model_class(),
search_space,
n_iter=n_calls,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
bayes_search.fit(X, y)
self.best_models['bayesian'] = bayes_search.best_estimator_
print(f"Best parameters: {bayes_search.best_params_}")
print(f"Best score: {bayes_search.best_score_:.3f}")
return bayes_search
except ImportError:
print("scikit-optimize not installed.")
print("Install with: pip install scikit-optimize")
return None
def compare_search_methods(self, X, y):
"""Compare different search methods"""
from sklearn.ensemble import RandomForestClassifier
# Define search spaces
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 20],
'min_samples_split': [2, 5, 10]
}
# scipy's randint excludes the upper bound, so add 1 to cover the grid's range
param_distributions = {
'n_estimators': randint(50, 201),
'max_depth': randint(5, 21),
'min_samples_split': randint(2, 11)
}
results = {}
# Grid Search
print("\n" + "="*40)
print("GRID SEARCH")
print("="*40)
import time
start_time = time.time()
grid_search = self.grid_search_tuning(
X, y,
RandomForestClassifier(random_state=42),
param_grid
)
grid_time = time.time() - start_time
results['Grid Search'] = {
'best_score': grid_search.best_score_,
'time': grid_time,
'n_evaluations': len(grid_search.cv_results_['mean_test_score'])
}
# Random Search
print("\n" + "="*40)
print("RANDOM SEARCH")
print("="*40)
start_time = time.time()
random_search = self.random_search_tuning(
X, y,
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=27 # Same as grid search combinations
)
random_time = time.time() - start_time
results['Random Search'] = {
'best_score': random_search.best_score_,
'time': random_time,
'n_evaluations': len(random_search.cv_results_['mean_test_score'])
}
# Bayesian Optimization (if available)
try:
from skopt.space import Real, Integer
print("\n" + "="*40)
print("BAYESIAN OPTIMIZATION")
print("="*40)
search_space = {
'n_estimators': Integer(50, 200),
'max_depth': Integer(5, 20),
'min_samples_split': Integer(2, 10)
}
start_time = time.time()
bayes_search = self.bayesian_optimization(
X, y,
RandomForestClassifier,
search_space,
n_calls=27
)
if bayes_search:
bayes_time = time.time() - start_time
results['Bayesian Opt'] = {
'best_score': bayes_search.best_score_,
'time': bayes_time,
'n_evaluations': 27
}
except ImportError:
pass
# Visualization
if len(results) > 0:
results_df = pd.DataFrame(results).T
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Best scores
axes[0].bar(results_df.index, results_df['best_score'])
axes[0].set_ylabel('Best Score')
axes[0].set_title('Optimization Performance')
axes[0].set_ylim([min(results_df['best_score']) - 0.01,
max(results_df['best_score']) + 0.01])
axes[0].grid(True, alpha=0.3)
# Time taken
axes[1].bar(results_df.index, results_df['time'])
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Computation Time')
axes[1].grid(True, alpha=0.3)
# Efficiency (score per second)
efficiency = results_df['best_score'] / results_df['time']
axes[2].bar(results_df.index, efficiency)
axes[2].set_ylabel('Score / Time')
axes[2].set_title('Optimization Efficiency')
axes[2].grid(True, alpha=0.3)
plt.suptitle('Hyperparameter Search Methods Comparison', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return results_df
def visualize_search_results(self, search_cv):
"""Visualize hyperparameter search results"""
results = pd.DataFrame(search_cv.cv_results_)
# Get parameter names
param_names = [p for p in results.columns if p.startswith('param_')]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Plot training vs validation scores
axes[0].scatter(results['mean_train_score'], results['mean_test_score'],
alpha=0.5, s=20)
axes[0].plot([0, 1], [0, 1], 'r--', lw=2)
axes[0].set_xlabel('Training Score')
axes[0].set_ylabel('Validation Score')
axes[0].set_title('Training vs Validation Scores')
axes[0].grid(True, alpha=0.3)
# Highlight best point
best_idx = search_cv.best_index_
axes[0].scatter(results.loc[best_idx, 'mean_train_score'],
results.loc[best_idx, 'mean_test_score'],
color='red', s=100, marker='*',
label='Best params', zorder=5)
axes[0].legend()
# Score heatmap over the first two tuned parameters, when there are at least two
if len(param_names) >= 2:
param1, param2 = param_names[0], param_names[1]
# Create pivot table for heatmap
pivot = results.pivot_table(
values='mean_test_score',
index=param1,
columns=param2,
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt='.3f', cmap='YlOrRd', ax=axes[1])
axes[1].set_title(f'Score Heatmap: {param1} vs {param2}')
else:
# Only one tuned parameter: show the score of each configuration instead
axes[1].bar(range(len(results)), results['mean_test_score'])
axes[1].set_xlabel('Configuration')
axes[1].set_ylabel('Validation Score')
axes[1].set_title('Score by Configuration')
plt.suptitle('Hyperparameter Search Results', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
# Hyperparameter tuning
print("\n" + "="*60)
print("HYPERPARAMETER TUNING METHODS")
print("="*60)
tuner = HyperparameterTuner()
# Compare search methods
print("\nComparing search methods...")
search_comparison = tuner.compare_search_methods(X_class[:500], y_class[:500])
if search_comparison is not None:
print("\nSearch Method Comparison:")
print(search_comparison)
# Visualize search results
if 'grid' in tuner.best_models:
print("\nVisualizing grid search results...")
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
{'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 20]},
cv=5, return_train_score=True
)
grid_search.fit(X_class[:500], y_class[:500])
tuner.visualize_search_results(grid_search)
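Random search is not limited to integer ranges. For a regularization strength such as `C`, sampling on a log scale usually covers the useful range more evenly than a uniform draw. The sketch below assumes a scaled logistic regression and a synthetic dataset; the names `X_log`, `y_log`, and `log_search`, and the range 1e-3 to 1e3, are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_log, y_log = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# loguniform(1e-3, 1e3) makes every decade of C equally likely to be sampled
log_search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
log_search.fit(X_log, y_log)

print(f"Best C: {log_search.best_params_['clf__C']:.4g}")
print(f"Best CV accuracy: {log_search.best_score_:.3f}")
```

With a uniform distribution over the same interval, almost all samples would land above 1 and the small-`C` region would go nearly unexplored.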
Model Selection Strategies
Comparing and Selecting Models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
class ModelSelector:
"""Systematic model selection"""
def __init__(self):
self.models = {}
self.results = {}
self.best_model = None
def compare_algorithms(self, X, y, models_dict=None):
"""Compare multiple algorithms"""
if models_dict is None:
models_dict = {
'Logistic Regression': LogisticRegression(max_iter=1000),
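'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
# probability=True is not needed for these metrics and would inflate SVC training time
'SVM': SVC(),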
'KNN': KNeighborsClassifier(),
'Naive Bayes': GaussianNB()
}
from sklearn.model_selection import cross_validate
results = []
for name, model in models_dict.items():
# Create pipeline with scaling
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', model)
])
# Cross-validation with multiple metrics
cv_results = cross_validate(
pipeline, X, y,
cv=5,
scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
return_train_score=True
)
results.append({
'Algorithm': name,
'Accuracy': cv_results['test_accuracy'].mean(),
'Accuracy_Std': cv_results['test_accuracy'].std(),
'Precision': cv_results['test_precision_macro'].mean(),
'Recall': cv_results['test_recall_macro'].mean(),
'F1': cv_results['test_f1_macro'].mean(),
'Train_Time': cv_results['fit_time'].mean(),
'Test_Time': cv_results['score_time'].mean()
})
self.models[name] = pipeline
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('Accuracy', ascending=False)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
# Accuracy comparison
axes[0].barh(results_df['Algorithm'], results_df['Accuracy'])
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model Accuracy Comparison')
axes[0].grid(True, alpha=0.3)
# Precision vs Recall
axes[1].scatter(results_df['Precision'], results_df['Recall'], s=100)
for i, txt in enumerate(results_df['Algorithm']):
axes[1].annotate(txt, (results_df['Precision'].iloc[i],
results_df['Recall'].iloc[i]),
fontsize=8)
axes[1].set_xlabel('Precision')
axes[1].set_ylabel('Recall')
axes[1].set_title('Precision vs Recall')
axes[1].grid(True, alpha=0.3)
# F1 Score
axes[2].barh(results_df['Algorithm'], results_df['F1'])
axes[2].set_xlabel('F1 Score')
axes[2].set_title('F1 Score Comparison')
axes[2].grid(True, alpha=0.3)
# Training time
axes[3].barh(results_df['Algorithm'], results_df['Train_Time'])
axes[3].set_xlabel('Training Time (seconds)')
axes[3].set_title('Training Time Comparison')
axes[3].grid(True, alpha=0.3)
# Accuracy vs Training Time trade-off
axes[4].scatter(results_df['Train_Time'], results_df['Accuracy'], s=100)
for i, txt in enumerate(results_df['Algorithm']):
axes[4].annotate(txt, (results_df['Train_Time'].iloc[i],
results_df['Accuracy'].iloc[i]),
fontsize=8)
axes[4].set_xlabel('Training Time (seconds)')
axes[4].set_ylabel('Accuracy')
axes[4].set_title('Accuracy vs Training Time Trade-off')
axes[4].grid(True, alpha=0.3)
# Radar chart for multi-metric comparison
from math import pi
categories = ['Accuracy', 'Precision', 'Recall', 'F1']
N = len(categories)
angles = [n / N * 2 * pi for n in range(N)]
angles += angles[:1]
axes[5].remove()  # drop the Cartesian placeholder so it does not show behind the radar chart
ax = fig.add_subplot(2, 3, 6, projection='polar')
# Plot top 3 models
for i in range(min(3, len(results_df))):
values = results_df.iloc[i][['Accuracy', 'Precision', 'Recall', 'F1']].values.tolist()
values += values[:1]
ax.plot(angles, values, 'o-', linewidth=2,
label=results_df.iloc[i]['Algorithm'])
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 1)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
ax.set_title('Multi-Metric Comparison (Top 3)', y=1.08)
plt.suptitle('Algorithm Comparison Dashboard', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
self.results = results_df
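# Note: cross_validate fits clones, so this pipeline is still unfitted; call .fit() before predicting
self.best_model = self.models[results_df.iloc[0]['Algorithm']]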
return results_df
def validation_curves(self, X, y, model, param_name, param_range):
"""Plot validation curves for a parameter"""
train_scores, val_scores = validation_curve(
model, X, y,
param_name=param_name,
param_range=param_range,
cv=5,
scoring='accuracy'
)
# Calculate mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, 'o-', label='Training score', linewidth=2)
plt.fill_between(param_range, train_mean - train_std,
train_mean + train_std, alpha=0.3)
plt.plot(param_range, val_mean, 'o-', label='Validation score', linewidth=2)
plt.fill_between(param_range, val_mean - val_std,
val_mean + val_std, alpha=0.3)
plt.xlabel(param_name)
plt.ylabel('Accuracy')
plt.title(f'Validation Curve: {param_name}')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
# Mark best value
best_idx = np.argmax(val_mean)
plt.axvline(x=param_range[best_idx], color='red', linestyle='--',
label=f'Best: {param_range[best_idx]}')
plt.legend()
plt.show()
return train_scores, val_scores
def learning_curves(self, X, y, model, train_sizes=None):
"""Plot learning curves"""
if train_sizes is None:
train_sizes = np.linspace(0.1, 1.0, 10)
train_sizes_abs, train_scores, val_scores = learning_curve(
model, X, y,
train_sizes=train_sizes,
cv=5,
scoring='accuracy',
n_jobs=-1
)
# Calculate mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(train_sizes_abs, train_mean, 'o-', label='Training score', linewidth=2)
plt.fill_between(train_sizes_abs, train_mean - train_std,
train_mean + train_std, alpha=0.3)
plt.plot(train_sizes_abs, val_mean, 'o-', label='Validation score', linewidth=2)
plt.fill_between(train_sizes_abs, val_mean - val_std,
val_mean + val_std, alpha=0.3)
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curves')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
# Add annotations for bias/variance (rule-of-thumb thresholds, not a formal test)
final_train = train_mean[-1]
final_val = val_mean[-1]
gap = final_train - final_val
if gap > 0.1:
plt.text(0.5, 0.2, 'High Variance\n(Overfitting)',
transform=plt.gca().transAxes,
fontsize=12, color='red',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
elif final_val < 0.7:
plt.text(0.5, 0.2, 'High Bias\n(Underfitting)',
transform=plt.gca().transAxes,
fontsize=12, color='blue',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
else:
plt.text(0.5, 0.2, 'Good Fit',
transform=plt.gca().transAxes,
fontsize=12, color='green',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.show()
return train_sizes_abs, train_scores, val_scores
# Model selection
print("\n" + "="*60)
print("MODEL SELECTION STRATEGIES")
print("="*60)
selector = ModelSelector()
# Compare algorithms
print("\nComparing different algorithms...")
comparison_results = selector.compare_algorithms(X_class[:500], y_class[:500])
print("\nAlgorithm Comparison Results:")
print(comparison_results.to_string())
# Validation curves
print("\nPlotting validation curves...")
selector.validation_curves(
X_class, y_class,
RandomForestClassifier(random_state=42),
'n_estimators',
[10, 50, 100, 200, 500]
)
# Learning curves
print("\nPlotting learning curves...")
selector.learning_curves(
X_class, y_class,
RandomForestClassifier(n_estimators=100, random_state=42)
)
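The comparisons above rank models by cross-validated score. The last step in the selection pipeline, test set performance, could look like the sketch below: split off a test set first, select on the remaining data only, then score the winner once. The dataset, the two candidate models, and names such as `X_dev` and `final_model` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = make_classification(
    n_samples=600, n_features=20, n_informative=10, random_state=0
)

# Split once, up front; the test part is not touched during selection
X_dev, X_test, y_dev, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Model selection uses cross-validation on the development data only
cv_means = {
    name: cross_val_score(est, X_dev, y_dev, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on all development data, then score the test set once
final_model = candidates[best_name].fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"Selected {best_name}: CV={cv_means[best_name]:.3f}, test={test_accuracy:.3f}")
```

If the test score then prompts another round of tuning, the test set has become a validation set and a fresh holdout is needed.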
Pipeline Optimization
Building and Tuning Complete ML Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
class PipelineOptimizer:
"""Optimize complete machine learning pipelines"""
def __init__(self):
self.best_pipeline = None
self.pipeline_results = {}
def create_pipeline(self, steps):
"""Create a pipeline from steps"""
return Pipeline(steps)
def optimize_pipeline(self, X, y):
"""Optimize a complete pipeline including preprocessing"""
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest()),
('classifier', RandomForestClassifier())
])
# Parameter grid for pipeline
param_grid = [
{
'feature_selection__k': [5, 10, 15, 20],
'classifier': [RandomForestClassifier(random_state=42)],
'classifier__n_estimators': [50, 100],
'classifier__max_depth': [5, 10, None]
},
{
'feature_selection__k': [5, 10, 15, 20],
'classifier': [LogisticRegression()],
'classifier__C': [0.1, 1.0, 10.0],
'classifier__max_iter': [1000]
}
]
# Grid search
grid_search = GridSearchCV(
pipeline, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=1
)
print("Optimizing pipeline...")
grid_search.fit(X, y)
self.best_pipeline = grid_search.best_estimator_
print(f"\nBest pipeline parameters:")
print(grid_search.best_params_)
print(f"Best score: {grid_search.best_score_:.3f}")
return grid_search
def pipeline_ablation_study(self, X, y):
"""Study the impact of different pipeline components"""
# Different pipeline configurations
# Fixed seeds so differences between configurations are not just forest randomness
pipelines = {
'Basic': Pipeline([
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'Scaled': Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'PCA': Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'Feature Selection': Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(k=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'Full Pipeline': Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(k=15)),
('pca', PCA(n_components=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
}
results = []
for name, pipeline in pipelines.items():
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
results.append({
'Pipeline': name,
'Mean Score': scores.mean(),
'Std Score': scores.std(),
'Components': len(pipeline.steps)
})
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('Mean Score', ascending=False)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Bar plot of scores
axes[0].bar(results_df['Pipeline'], results_df['Mean Score'],
yerr=results_df['Std Score'], capsize=5)
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Pipeline Configuration Comparison')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)
# Components vs Score
axes[1].scatter(results_df['Components'], results_df['Mean Score'], s=100)
for i, txt in enumerate(results_df['Pipeline']):
axes[1].annotate(txt, (results_df['Components'].iloc[i],
results_df['Mean Score'].iloc[i]),
fontsize=9, ha='center')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Mean Score')
axes[1].set_title('Pipeline Complexity vs Performance')
axes[1].grid(True, alpha=0.3)
plt.suptitle('Pipeline Ablation Study', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
return results_df
# Pipeline optimization
print("\n" + "="*60)
print("PIPELINE OPTIMIZATION")
print("="*60)
pipeline_opt = PipelineOptimizer()
# Optimize pipeline
print("\nOptimizing complete pipeline...")
optimized_pipeline = pipeline_opt.optimize_pipeline(X_class[:500], y_class[:500])
# Ablation study
print("\nPerforming pipeline ablation study...")
ablation_results = pipeline_opt.pipeline_ablation_study(X_class[:500], y_class[:500])
print("\nPipeline Ablation Results:")
print(ablation_results.to_string(index=False))
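One reason to keep preprocessing inside the `Pipeline` is data leakage. The sketch below illustrates this on pure-noise data, where no model should beat chance: selecting features on the full dataset before cross-validation produces an optimistic score, while selecting inside the pipeline does not. The data shape, `k=20`, and the variable names are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the features carry no information about the labels
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = np.array([0, 1] * 50)

# Leaky: features are chosen using ALL labels, then cross-validated
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: selection is refit on each training fold inside the pipeline
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection outside CV (leaky): {leaky_acc:.3f}")
print(f"Selection inside pipeline:    {honest_acc:.3f}")
```

The leaky score looks far better than chance even though there is nothing to learn, which is why every fitted preprocessing step belongs inside the cross-validated pipeline.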
Best Practices and Guidelines
print("\n" + "="*60)
print("MODEL SELECTION & TUNING BEST PRACTICES")
print("="*60)
best_practices = """
KEY GUIDELINES:
1. CROSS-VALIDATION STRATEGY:
• Use StratifiedKFold for imbalanced classification
• Use TimeSeriesSplit for temporal data
• Use RepeatedKFold for small datasets
• Use nested CV for unbiased evaluation
• 5-10 folds typically sufficient
2. HYPERPARAMETER TUNING ORDER:
• Start with random search for exploration
• Use grid search for fine-tuning
• Consider Bayesian optimization for expensive models
• Always use cross-validation
• Set aside final test set
3. AVOIDING OVERFITTING:
• Use proper cross-validation
• Don't tune on test set
• Use regularization
• Monitor training vs validation scores
• Consider simpler models
4. SEARCH STRATEGY:
• Coarse-to-fine approach
• Log scale for regularization parameters
• Start with default parameters
• Focus on most impactful parameters first
5. COMPUTATIONAL EFFICIENCY:
• Use RandomizedSearchCV for large spaces
• Parallelize with n_jobs=-1
• Use early stopping when available
• Cache repeated computations
• Consider successive halving
6. EVALUATION METRICS:
• Choose metric based on problem
• Consider multiple metrics
• Account for class imbalance
• Use domain-specific metrics
• Consider computational cost
7. PIPELINE OPTIMIZATION:
• Tune preprocessing with model
• Use Pipeline for consistency
• Include feature selection
• Optimize end-to-end
8. FINAL MODEL SELECTION:
• Compare multiple algorithms
• Consider ensemble methods
• Evaluate on holdout test set
• Check for data leakage
• Document all steps
"""
print(best_practices)
# Common pitfalls
pitfalls = """
COMMON PITFALLS TO AVOID:
1. ❌ Tuning on test set
✓ Use separate validation set or cross-validation
2. ❌ Not scaling features for distance-based models
✓ Include scaling in pipeline
3. ❌ Using accuracy for imbalanced datasets
✓ Use F1, AUC-ROC, or balanced accuracy
4. ❌ Overfitting hyperparameters
✓ Use nested cross-validation
5. ❌ Ignoring computational cost
✓ Consider time/accuracy trade-off
6. ❌ Not setting random seeds
✓ Ensure reproducibility
7. ❌ Tuning too many parameters at once
✓ Use sequential or hierarchical approach
8. ❌ Not checking learning curves
✓ Diagnose bias/variance issues
"""
print(pitfalls)
# Summary table
summary_data = {
'Method': ['Grid Search', 'Random Search', 'Bayesian Opt', 'Genetic Algo', 'Manual'],
'Exhaustive': ['Yes', 'No', 'No', 'No', 'No'],
'Efficiency': ['Low', 'Medium', 'High', 'Medium', 'Low'],
'Parallelizable': ['Yes', 'Yes', 'Limited', 'Yes', 'No'],
'Good For': ['Small spaces', 'Large spaces', 'Expensive models', 'Complex spaces', 'Understanding'],
'Convergence': ['Guaranteed (within grid)', 'Probabilistic', 'Probabilistic', 'Probabilistic', 'No guarantee']
}
summary_df = pd.DataFrame(summary_data)
print("\n" + "="*60)
print("HYPERPARAMETER SEARCH METHODS SUMMARY")
print("="*60)
print(summary_df.to_string(index=False))
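The guidelines mention successive halving as a way to cut search cost. scikit-learn ships an experimental implementation; the sketch below uses `HalvingRandomSearchCV` on a synthetic dataset. The parameter ranges, `factor=3`, `min_resources=60`, and the names `X_halving` and `halving_search` are illustrative assumptions, and because the API is marked experimental its details may change between releases.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# The experimental flag must be imported before HalvingRandomSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

X_halving, y_halving = make_classification(
    n_samples=600, n_features=20, n_informative=10, random_state=0
)

halving_search = HalvingRandomSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions={
        "max_depth": randint(2, 15),
        "min_samples_split": randint(2, 11),
    },
    factor=3,               # keep roughly the top third of candidates each round
    resource="n_samples",   # survivors are re-evaluated on more training rows
    min_resources=60,       # rows given to each candidate in the first round
    cv=3,
    random_state=0,
)
halving_search.fit(X_halving, y_halving)

print("Candidates per round:", halving_search.n_candidates_)
print("Samples per round:   ", halving_search.n_resources_)
print("Best parameters:     ", halving_search.best_params_)
```

Many candidates are screened cheaply on small samples, and only the most promising ones are evaluated on most of the data.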
Practice Exercises
Exercise 1: Custom Cross-Validation
Implement custom cross-validation strategies:
- Create stratified group k-fold for grouped data
- Implement Monte Carlo cross-validation
- Design blocked time series cross-validation
- Create adversarial validation splits
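As a possible starting point for the blocked time series item, note that scikit-learn accepts any iterable of `(train_idx, test_idx)` pairs as `cv`. The helper below, `blocked_time_series_splits`, is a hypothetical name and only one way to define "blocked": the series is cut into contiguous blocks, and each block is split chronologically with an optional gap.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def blocked_time_series_splits(n_samples, n_blocks=5, train_fraction=0.8, gap=0):
    """Yield (train_idx, test_idx) pairs, one per contiguous block.

    Hypothetical helper: each block is split in time order, skipping
    `gap` samples between the end of training and the start of testing.
    """
    block_edges = np.linspace(0, n_samples, n_blocks + 1, dtype=int)
    for start, stop in zip(block_edges[:-1], block_edges[1:]):
        split = start + int((stop - start) * train_fraction)
        train_idx = np.arange(start, split)
        test_idx = np.arange(min(split + gap, stop), stop)
        if len(train_idx) and len(test_idx):
            yield train_idx, test_idx


# Illustrative data: a linear signal plus noise, ordered in time
rng = np.random.default_rng(0)
X_ts = rng.normal(size=(200, 3))
y_ts = X_ts @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

splits = list(blocked_time_series_splits(len(X_ts), n_blocks=4, gap=2))
scores_ts = cross_val_score(Ridge(), X_ts, y_ts, cv=splits, scoring="r2")
print(f"Blocked CV R^2 per block: {np.round(scores_ts, 3)}")
```

Unlike `TimeSeriesSplit`, training data here never extends beyond its own block, so the folds do not share samples.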
Exercise 2: Advanced Optimization
Implement advanced hyperparameter optimization:
- Build multi-objective optimization (accuracy vs speed)
- Implement successive halving for efficiency
- Create adaptive search strategies
- Design meta-learning for parameter initialization
Exercise 3: AutoML Pipeline
Build an automated machine learning pipeline:
- Automatic feature engineering
- Algorithm selection based on data characteristics
- Automated hyperparameter tuning
- Ensemble generation and selection
Key Takeaways
- 🎯 Proper cross-validation exposes overfitting by scoring models on data they were not trained on
- 🔍 Grid search is exhaustive but expensive
- 🎲 Random search often finds good solutions faster
- 📊 Nested CV provides unbiased performance estimates
- 📈 Learning curves diagnose bias/variance issues
- ⚡ Bayesian optimization is efficient for expensive models
- 🔄 Pipeline optimization tunes preprocessing with model
- 📉 Validation curves show parameter sensitivity
- 🎚️ Start coarse, then fine-tune parameters
- ⚠️ Always keep a final test set untouched