Scikit-learn Basics
Your Gateway to Machine Learning! 🤖
Scikit-learn is the Swiss Army knife of machine learning in Python. Its estimators share one consistent API (fit, predict, transform), and it ships with algorithms for classification, regression, and clustering plus utilities for preprocessing, model selection, and evaluation. From your first classifier to multi-step pipelines, scikit-learn provides the tools you need to build, evaluate, and persist ML models.
The Machine Learning Workflow
graph LR
A[Raw Data] --> B[Data Preprocessing]
B --> C[Feature Engineering]
C --> D[Train/Test Split]
D --> E[Model Selection]
E --> F[Model Training]
F --> G[Model Evaluation]
G --> H{Good Performance?}
H -->|No| I[Hyperparameter Tuning]
I --> F
H -->|Yes| J[Model Deployment]
B --> K[Scaling/Normalization]
B --> L[Encoding]
B --> M[Missing Values]
E --> N[Classification]
E --> O[Regression]
E --> P[Clustering]
G --> Q[Metrics]
G --> R[Cross-validation]
Installation and Setup
# Installation
"""
pip install scikit-learn
pip install numpy pandas matplotlib seaborn
pip install jupyter notebook
"""
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Scikit-learn imports
from sklearn import __version__
print(f"Scikit-learn version: {__version__}")
# Common imports organized by category
# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
# Model selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
# Utilities
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
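# Note: load_boston was removed in scikit-learn 1.2; fetch_california_housing is the usual replacement
from sklearn.datasets import load_iris, fetch_california_housing, make_classification, make_regression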
# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
# Suppress noisy FutureWarnings (a blanket 'ignore' would also hide convergence warnings)
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
Understanding Scikit-learn's API
The Estimator Interface
# Scikit-learn follows a consistent API pattern
# 1. All estimators follow the same interface
class SklearnEstimatorPattern:
    """
    Illustrative sketch of the scikit-learn estimator interface (not a working model):
    - fit(X, y): Learn from data (every estimator)
    - predict(X): Make predictions (classifiers, regressors, clusterers)
    - transform(X): Transform data (preprocessors)
    - score(X, y): Evaluate performance
    """
    def fit(self, X, y=None):
        """Learn from training data"""
        # Learn patterns from X (and y for supervised learning)
        return self
    def predict(self, X):
        """Make predictions on new data"""
        # A real estimator applies the learned patterns here
        raise NotImplementedError
    def transform(self, X):
        """Transform data using what was learned in fit (for preprocessing)"""
        raise NotImplementedError
    def score(self, X, y):
        """Return the score of predictions"""
        # A real estimator evaluates model performance here
        raise NotImplementedError
    def fit_predict(self, X, y=None):
        """Fit and predict in one step (for clustering)"""
        return self.fit(X, y).predict(X)
    def fit_transform(self, X, y=None):
        """Fit and transform in one step (for preprocessing)"""
        return self.fit(X, y).transform(X)
# 2. Example with a real classifier
from sklearn.tree import DecisionTreeClassifier
# Create model instance
model = DecisionTreeClassifier(random_state=42)
# Load example data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Fit model (training)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Get probability predictions (for classifiers)
y_proba = model.predict_proba(X_test)
# Evaluate model
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")
# 3. Attributes learned during fit (end with underscore)
print(f"Classes: {model.classes_}")
print(f"Number of features: {model.n_features_in_}")
print(f"Feature importances: {model.feature_importances_}")
# 4. Get and set parameters
# Get parameters
params = model.get_params()
print(f"Model parameters: {params}")
# Set parameters
model.set_params(max_depth=3, min_samples_split=5)
# 5. Clone estimator
from sklearn.base import clone
model_clone = clone(model) # Creates unfitted copy with same parameters
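The difference between constructor parameters and fitted attributes can be checked directly. The following is a minimal sketch, assuming the iris data and a decision tree as above; the variable names (`fitted`, `fresh`) are illustrative only.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
fitted = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# clone() copies the constructor parameters but none of the learned state
fresh = clone(fitted)
same_params = fresh.get_params() == fitted.get_params()
has_fit_attrs = hasattr(fresh, "classes_")  # trailing-underscore attributes exist only after fit

# Predicting with an unfitted estimator raises NotFittedError
try:
    fresh.predict(X[:1])
    raised = False
except NotFittedError:
    raised = True

print(f"Same parameters: {same_params}")
print(f"Clone has fitted attributes: {has_fit_attrs}")
print(f"Unfitted predict raised NotFittedError: {raised}")
```

This is why clone is the safe way to reuse a configuration, for example when training the same model on several data splits.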
Data Preprocessing
Scaling and Normalization
# Different scaling techniques
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
from sklearn.preprocessing import MaxAbsScaler, QuantileTransformer, PowerTransformer
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
'feature1': np.random.normal(100, 50, 1000),
'feature2': np.random.exponential(2, 1000),
'feature3': np.random.uniform(0, 1, 1000),
'outliers': np.concatenate([np.random.normal(0, 1, 990), np.random.normal(10, 1, 10)])
})
# 1. StandardScaler - Standardization (mean=0, std=1)
scaler_standard = StandardScaler()
data_standard = scaler_standard.fit_transform(data)
print("StandardScaler - Good for normally distributed features")
print(f"Mean: {data_standard.mean(axis=0)}")
print(f"Std: {data_standard.std(axis=0)}")
# 2. MinMaxScaler - Normalization to [0, 1]
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("\nMinMaxScaler - Good when you know min/max bounds")
print(f"Min: {data_minmax.min(axis=0)}")
print(f"Max: {data_minmax.max(axis=0)}")
# 3. RobustScaler - Robust to outliers
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(data)
print("\nRobustScaler - Good when data contains outliers")
# 4. Normalizer - Rescales each sample (row) to unit norm; the scalers above work per feature (column)
normalizer = Normalizer(norm='l2') # l1, l2, or max
data_normalized = normalizer.fit_transform(data)
print("\nNormalizer - Good for text/sparse data")
# 5. QuantileTransformer - Map to uniform or normal distribution
qt_uniform = QuantileTransformer(output_distribution='uniform')
data_uniform = qt_uniform.fit_transform(data)
qt_normal = QuantileTransformer(output_distribution='normal')
data_normal = qt_normal.fit_transform(data)
# 6. PowerTransformer - Map to Gaussian distribution
pt_yeo = PowerTransformer(method='yeo-johnson') # Works with positive and negative
pt_box = PowerTransformer(method='box-cox') # Only positive values
# Visualize different scaling methods
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()
scalers = [
    ('Original', data['feature2'].values.reshape(-1, 1)),
    ('StandardScaler', StandardScaler().fit_transform(data[['feature2']])),
    ('MinMaxScaler', MinMaxScaler().fit_transform(data[['feature2']])),
    ('RobustScaler', RobustScaler().fit_transform(data[['feature2']])),
    # Normalizer rescales rows, so on a single positive column every value becomes 1
    ('Normalizer', Normalizer().fit_transform(data[['feature2']])),
    ('QuantileTransformer (uniform)', QuantileTransformer(output_distribution='uniform').fit_transform(data[['feature2']])),
    ('QuantileTransformer (normal)', QuantileTransformer(output_distribution='normal').fit_transform(data[['feature2']])),
    ('PowerTransformer', PowerTransformer().fit_transform(data[['feature2']]))
]
for ax, (name, scaled_data) in zip(axes, scalers):
    ax.hist(scaled_data, bins=50, edgecolor='black', alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# When to use which scaler
scaling_guide = """
SCALING GUIDE:
1. StandardScaler:
- When features are normally distributed
- For algorithms that assume zero mean (PCA, linear models with L2 regularization)
2. MinMaxScaler:
- When you know the min/max bounds
- For neural networks (often prefer [0,1] or [-1,1])
- For image data
3. RobustScaler:
- When data contains outliers
- Uses median and IQR instead of mean and std
4. Normalizer:
- For text data (TF-IDF)
- When you care about direction, not magnitude
5. QuantileTransformer:
- When you want uniform or normal distribution
- Reduces impact of outliers
6. PowerTransformer:
- To make data more Gaussian-like
- Stabilize variance and minimize skewness
"""
print(scaling_guide)
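The guide's distinction between per-feature scaling and per-sample normalization is easy to miss. Here is a small sketch on a hand-made array (the data is an assumption chosen so the arithmetic is easy to follow): StandardScaler works on columns, Normalizer works on rows.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Rows 0 and 1 point in the same direction but differ in magnitude
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# StandardScaler: each COLUMN ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)

# Normalizer: each ROW is divided by its own L2 norm
X_unit = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_unit, axis=1)

print("Column means after StandardScaler:", col_means.round(6))
print("Row norms after Normalizer:", row_norms)
print("Rows 0 and 1 after Normalizer:", X_unit[0], X_unit[1])
```

After normalization rows 0 and 1 are identical, which is what "direction, not magnitude" means in practice.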
Handling Categorical Variables
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import pandas as pd
# Create sample data
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'blue'],
'size': ['S', 'M', 'L', 'XL', 'M'],
'quality': ['good', 'bad', 'excellent', 'good', 'bad'],
'price': [10, 20, 15, 12, 18]
})
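# 1. Label Encoding - Intended for target labels (y); applied to a feature here for illustration only
# It assigns integers in sorted (alphabetical) order, so use OrdinalEncoder for ordered features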
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
print("Label Encoding:")
print(df[['color', 'color_encoded']])
print(f"Classes: {label_encoder.classes_}")
# Inverse transform
original = label_encoder.inverse_transform([0, 1, 2])
print(f"Inverse transform: {original}")
# 2. Ordinal Encoding - For ordinal features with order
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']])
print("\nOrdinal Encoding (with order):")
print(df[['size', 'size_encoded']])
# 3. One-Hot Encoding - For nominal features
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first') # drop='first' to avoid dummy variable trap
color_encoded = onehot_encoder.fit_transform(df[['color']])
color_df = pd.DataFrame(
color_encoded,
columns=onehot_encoder.get_feature_names_out(['color'])
)
print("\nOne-Hot Encoding:")
print(pd.concat([df['color'], color_df], axis=1))
# 4. Using pandas get_dummies (convenient for DataFrames)
df_dummies = pd.get_dummies(df, columns=['color', 'quality'], drop_first=True)
print("\nPandas get_dummies:")
print(df_dummies.head())
# 5. Target Encoding (Mean Encoding) - Advanced technique
class TargetEncoder:
    """Simple target encoder for demonstration"""
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.encoding_map = {}
        self.global_mean = None
    def fit(self, X, y):
        self.global_mean = y.mean()
        for category in X.unique():
            mask = X == category
            n = mask.sum()
            mean = y[mask].mean()
            # Smoothing to prevent overfitting
            smoothed_mean = (mean * n + self.global_mean * self.smoothing) / (n + self.smoothing)
            self.encoding_map[category] = smoothed_mean
        return self
    def transform(self, X):
        return X.map(self.encoding_map).fillna(self.global_mean)
# Example of target encoding
target_encoder = TargetEncoder(smoothing=2.0)
target_encoder.fit(df['color'], df['price'])
df['color_target_encoded'] = target_encoder.transform(df['color'])
print("\nTarget Encoding:")
print(df[['color', 'price', 'color_target_encoded']])
# 6. Handling unknown categories
# Create train and test with different categories
train_df = df.iloc[:3]
test_df = pd.DataFrame({'color': ['red', 'yellow']}) # 'yellow' is unknown
# OneHotEncoder with handle_unknown
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(train_df[['color']])
test_encoded = ohe.transform(test_df[['color']])
print("\nHandling unknown categories:")
print(f"Test data: {test_df['color'].values}")
print(f"Encoded (unknown ignored): {test_encoded}")
Handling Missing Values
# Imputation strategies for missing data
# The experimental flag must be imported BEFORE IterativeImputer, otherwise the import fails
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
import numpy as np
import pandas as pd
# Create data with missing values
np.random.seed(42)
df = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5, np.nan, 7],
'B': [np.nan, 2, 3, 4, np.nan, 6, 7],
'C': [1, 2, 3, np.nan, 5, 6, np.nan],
'D': ['cat', 'dog', np.nan, 'cat', 'dog', 'cat', np.nan]
})
print("Original data with missing values:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")
# 1. Simple Imputer - Basic strategies
# For numerical columns
num_cols = ['A', 'B', 'C']
# Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
df_mean = df.copy()
df_mean[num_cols] = imputer_mean.fit_transform(df[num_cols])
print("\nMean Imputation:")
print(df_mean)
# Median imputation
imputer_median = SimpleImputer(strategy='median')
df_median = df.copy()
df_median[num_cols] = imputer_median.fit_transform(df[num_cols])
# Most frequent (mode) imputation
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode = df.copy()
df_mode[num_cols] = imputer_mode.fit_transform(df[num_cols])
# Constant imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
df_constant = df.copy()
df_constant[num_cols] = imputer_constant.fit_transform(df[num_cols])
# For categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df['D_imputed'] = cat_imputer.fit_transform(df[['D']])
print("\nCategorical imputation (most frequent):")
print(df[['D', 'D_imputed']])
# 2. KNN Imputer - Uses K-Nearest Neighbors
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = df.copy()
df_knn[num_cols] = knn_imputer.fit_transform(df[num_cols])
print("\nKNN Imputation:")
print(df_knn[num_cols])
# 3. Iterative Imputer - Models each feature from the others (inspired by MICE, but returns a single imputation)
iterative_imputer = IterativeImputer(random_state=42)
df_iterative = df.copy()
df_iterative[num_cols] = iterative_imputer.fit_transform(df[num_cols])
print("\nIterative Imputation (MICE):")
print(df_iterative[num_cols])
# 4. Advanced: Custom imputation based on other features
class GroupImputer:
    """Impute based on group statistics"""
    def __init__(self, group_col, target_col, strategy='mean'):
        self.group_col = group_col
        self.target_col = target_col
        self.strategy = strategy
        self.group_values = {}
    def fit(self, df):
        if self.strategy == 'mean':
            self.group_values = df.groupby(self.group_col)[self.target_col].mean().to_dict()
        elif self.strategy == 'median':
            self.group_values = df.groupby(self.group_col)[self.target_col].median().to_dict()
        return self
    def transform(self, df):
        df_copy = df.copy()
        for group, value in self.group_values.items():
            mask = (df_copy[self.group_col] == group) & df_copy[self.target_col].isna()
            df_copy.loc[mask, self.target_col] = value
        return df_copy
# 5. Missing indicator - Add binary indicators for missing values
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
mask = indicator.fit_transform(df[num_cols])
df_with_indicator = pd.concat([
df,
pd.DataFrame(mask, columns=[f'{col}_was_missing' for col in num_cols])
], axis=1)
print("\nWith missing indicators:")
print(df_with_indicator.head())
# Visualization of imputation effects
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Compare different imputation methods
methods = {
'Original': df[num_cols],
'Mean': df_mean[num_cols],
'KNN': df_knn[num_cols],
'Iterative': df_iterative[num_cols]
}
for ax, (method, data) in zip(axes.flatten(), methods.items()):
    data.plot(kind='box', ax=ax)
    ax.set_title(f'{method} Imputation')
    ax.set_ylabel('Value')
plt.tight_layout()
plt.show()
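The examples above fit and transform the same DataFrame for display. For modeling, the imputer should learn its statistics from training data only and then reuse them on new data. A minimal sketch, assuming small hand-made arrays (`X_train`, `X_new`) and using SimpleImputer's `add_indicator` option in place of a separate MissingIndicator:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan],
                    [5.0, 40.0]])
X_new = np.array([[np.nan, np.nan]])

# add_indicator=True appends one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)

# Medians come from the training data: 3.0 for column 0, 20.0 for column 1
print("Learned medians:", imputer.statistics_)

# New data is filled with the training medians, followed by the two indicator columns
filled_new = imputer.transform(X_new)
print("Imputed new row:", filled_new)
```

The same fit-on-train, transform-on-test discipline applies to KNNImputer and IterativeImputer.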
Classification Models
Building Your First Classifier
# Complete classification workflow
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
print("Dataset Info:")
print(f"Shape: {X.shape}")
print(f"Features: {X.columns.tolist()[:5]}...")
print(f"Target distribution:\n{y.value_counts()}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize multiple classifiers
classifiers = {
'Logistic Regression': LogisticRegression(random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(kernel='rbf', random_state=42)
}
# Train and evaluate each classifier
results = []
for name, clf in classifiers.items():
    # Use scaled data for SVM and Logistic Regression
    if name in ['Logistic Regression', 'SVM']:
        clf.fit(X_train_scaled, y_train)
        y_pred = clf.predict(X_test_scaled)
    else:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
# Compare models
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.to_string(index=False))
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot of metrics
results_df.set_index('Model')[['Accuracy', 'Precision', 'Recall', 'F1-Score']].plot(
kind='bar', ax=axes[0], rot=45
)
axes[0].set_title('Model Performance Comparison')
axes[0].set_ylabel('Score')
axes[0].legend(loc='lower right')
# Confusion matrix for best model
best_model = classifiers['Random Forest']
best_model.fit(X_train, y_train)
y_pred_best = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title('Confusion Matrix - Random Forest')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
# Feature importance (for tree-based models)
if hasattr(best_model, 'feature_importances_'):
    importances = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    print("\nTop 10 Important Features:")
    print(importances.head(10).to_string(index=False))
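A single train/test split gives one noisy estimate of performance. The workflow diagram lists cross-validation as part of evaluation; the following is a minimal sketch of it, assuming logistic regression on the same breast cancer data. Putting the scaler inside the pipeline means it is refit on each training fold, so no test-fold statistics leak into training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling and model travel together, so each fold scales with its own training data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds is as informative as the mean: a large standard deviation suggests the single-split numbers above should be read with caution.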
Regression Models
Building Regression Models
# Complete regression workflow
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import pandas as pd
# Load dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')
print("California Housing Dataset:")
print(f"Shape: {X.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Target statistics:")
print(y.describe())
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize regression models
regressors = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0, random_state=42),
'Lasso Regression': Lasso(alpha=0.1, random_state=42),
'ElasticNet': ElasticNet(alpha=0.1, random_state=42),
'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42),
'SVR': SVR(kernel='rbf', C=1.0)
}
# Train and evaluate models
results = []
for name, model in regressors.items():
    # Use scaled data for linear models and SVR
    if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'ElasticNet', 'SVR']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append({
        'Model': name,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2
    })
    print(f"\n{name}:")
    print(f"RMSE: {rmse:.3f}")
    print(f"MAE: {mae:.3f}")
    print(f"R² Score: {r2:.3f}")
# Compare models
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.sort_values('R²', ascending=False).to_string(index=False))
# Visualize predictions vs actual
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
models_to_plot = ['Linear Regression', 'Ridge Regression', 'Random Forest',
'Gradient Boosting', 'SVR', 'Decision Tree']
for ax, model_name in zip(axes, models_to_plot):
    model = regressors[model_name]
    if model_name in ['Linear Regression', 'Ridge Regression', 'SVR']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    ax.scatter(y_test, y_pred, alpha=0.5, s=10)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    ax.set_xlabel('Actual Price')
    ax.set_ylabel('Predicted Price')
    ax.set_title(f'{model_name}\nR²: {r2_score(y_test, y_pred):.3f}')
plt.tight_layout()
plt.show()
# Residual analysis
best_model = regressors['Gradient Boosting']
best_model.fit(X_train, y_train)
y_pred_best = best_model.predict(X_test)
residuals = y_test - y_pred_best
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Residual plot
axes[0].scatter(y_pred_best, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')
# Histogram of residuals
axes[1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')
plt.tight_layout()
plt.show()
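The models above use hand-picked hyperparameters such as alpha=1.0. The workflow diagram's tuning step can be sketched with GridSearchCV. This example assumes synthetic data from make_regression (so it runs without downloading anything) and an illustrative grid of alpha values; it is not a recommendation for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

best_alpha = search.best_params_["model__alpha"]
# best_estimator_ is the pipeline refit on all training data with the best alpha
test_r2 = search.best_estimator_.score(X_test, y_test)

print(f"Best alpha: {best_alpha}")
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
print(f"Test R²: {test_r2:.3f}")
```

The test set is touched only once, after the search has finished, so the reported R² is not biased by the tuning.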
Clustering Models
Unsupervised Learning with Clustering
# Clustering algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.datasets import make_blobs, make_moons
import numpy as np
import pandas as pd
# Generate sample data
X_blobs, y_true = make_blobs(n_samples=300, centers=4, n_features=2,
cluster_std=0.5, random_state=42)
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
# Scale the data (one scaler per dataset, so each keeps its own fitted statistics)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)
X_moons_scaled = StandardScaler().fit_transform(X_moons)
# 1. K-Means Clustering
def perform_kmeans(X, n_clusters_range):
    """Compute inertia (for the elbow method) and silhouette score for each k"""
    inertias = []
    silhouette_scores = []
    for k in n_clusters_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        if k > 1: # Silhouette score requires at least 2 clusters
            score = silhouette_score(X, kmeans.labels_)
            silhouette_scores.append(score)
        else:
            silhouette_scores.append(0)
    return inertias, silhouette_scores
# Find optimal k
k_range = range(2, 10)
inertias, sil_scores = perform_kmeans(X_blobs_scaled, k_range)
# Plot elbow curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[1].plot(k_range, sil_scores, 'ro-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
plt.tight_layout()
plt.show()
# 2. Apply different clustering algorithms
clustering_algorithms = {
'K-Means': KMeans(n_clusters=4, random_state=42),
'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
'Agglomerative': AgglomerativeClustering(n_clusters=4),
'Gaussian Mixture': GaussianMixture(n_components=4, random_state=42)
}
# Apply to blob data
results = []
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for ax, (name, algorithm) in zip(axes, clustering_algorithms.items()):
    # Fit and predict (every estimator here, including GaussianMixture, implements fit_predict)
    labels = algorithm.fit_predict(X_blobs_scaled)
    # Plot
    scatter = ax.scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7)
    ax.set_title(name)
    # Calculate metrics (if valid clustering); DBSCAN labels noise points as -1
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters > 1:
        sil_score = silhouette_score(X_blobs_scaled, labels)
        ch_score = calinski_harabasz_score(X_blobs_scaled, labels)
        db_score = davies_bouldin_score(X_blobs_scaled, labels)
        results.append({
            'Algorithm': name,
            'N_Clusters': n_clusters,
            'Silhouette': sil_score,
            'Calinski-Harabasz': ch_score,
            'Davies-Bouldin': db_score
        })
        ax.text(0.02, 0.98, f'Silhouette: {sil_score:.3f}',
                transform=ax.transAxes, va='top')
# True labels for comparison
axes[4].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.7)
axes[4].set_title('True Labels')
# Remove empty subplot
fig.delaxes(axes[5])
plt.tight_layout()
plt.show()
# Display metrics comparison
results_df = pd.DataFrame(results)
print("\nClustering Metrics Comparison:")
print(results_df.to_string(index=False))
print("\nNote: Higher Silhouette and Calinski-Harabasz are better")
print(" Lower Davies-Bouldin is better")
# 3. DBSCAN for non-spherical clusters (moon dataset)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# K-Means on moon data (poor performance)
kmeans_moon = KMeans(n_clusters=2, random_state=42)
labels_kmeans = kmeans_moon.fit_predict(X_moons_scaled)
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_kmeans, cmap='viridis', s=50, alpha=0.7)
axes[0].set_title('K-Means on Moon Dataset')
# DBSCAN on moon data (better performance)
dbscan_moon = DBSCAN(eps=0.3, min_samples=5)
labels_dbscan = dbscan_moon.fit_predict(X_moons_scaled)
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_dbscan, cmap='viridis', s=50, alpha=0.7)
axes[1].set_title('DBSCAN on Moon Dataset')
plt.tight_layout()
plt.show()
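The moon plots compare the two algorithms visually. When true labels are available, as with synthetic data, the comparison can also be quantified. The sketch below uses the adjusted Rand index, which is 1.0 for a perfect match and close to 0 for random labeling. The noise level of 0.05 is an assumption chosen to keep the two moons well separated; it is lower than the 0.1 used above.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# K-Means assumes roughly spherical clusters, so it cuts the moons with a straight boundary
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# DBSCAN follows regions of high density and can trace each curved moon
labels_dbscan = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# The adjusted Rand index ignores how clusters are numbered
ari_kmeans = adjusted_rand_score(y_true, labels_kmeans)
ari_dbscan = adjusted_rand_score(y_true, labels_dbscan)

print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

Unlike the silhouette score, this metric needs ground-truth labels, so it is mainly useful for benchmarking on data where the answer is known.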
Model Persistence
# Saving and loading models
import joblib
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import os
# Create a simple pipeline
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(random_state=42))
])
# Train pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Original model score: {score:.3f}")
# Method 1: Using joblib (recommended for scikit-learn)
# Note: joblib and pickle files can execute arbitrary code when loaded; only load files you trust
# Save
joblib.dump(pipeline, 'model_pipeline.joblib')
print(f"Model saved to 'model_pipeline.joblib'")
# Load
loaded_pipeline = joblib.load('model_pipeline.joblib')
loaded_score = loaded_pipeline.score(X_test, y_test)
print(f"Loaded model score: {loaded_score:.3f}")
# Method 2: Using pickle
# Save
with open('model_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
# Load
with open('model_pipeline.pkl', 'rb') as f:
    loaded_pipeline_pkl = pickle.load(f)
# Method 3: Save individual components
# Useful when you need to inspect or modify components
model_components = {
'scaler': pipeline.named_steps['scaler'],
'classifier': pipeline.named_steps['classifier'],
'feature_names': load_iris().feature_names,
'target_names': load_iris().target_names.tolist(),
'model_params': pipeline.get_params()
}
joblib.dump(model_components, 'model_components.joblib')
# Load components and rebuild
components = joblib.load('model_components.joblib')
rebuilt_pipeline = Pipeline([
('scaler', components['scaler']),
('classifier', components['classifier'])
])
# Version control for models
class ModelVersionControl:
    """Simple model versioning system"""
    def __init__(self, base_path='models'):
        self.base_path = base_path
        os.makedirs(base_path, exist_ok=True)
    def save_model(self, model, version, metadata=None):
        """Save model with version"""
        import datetime
        model_data = {
            'model': model,
            'version': version,
            'timestamp': datetime.datetime.now().isoformat(),
            'metadata': metadata or {}
        }
        filepath = os.path.join(self.base_path, f'model_v{version}.joblib')
        joblib.dump(model_data, filepath)
        print(f"Model saved: {filepath}")
        return filepath
    def load_model(self, version):
        """Load specific version"""
        filepath = os.path.join(self.base_path, f'model_v{version}.joblib')
        if os.path.exists(filepath):
            model_data = joblib.load(filepath)
            print(f"Loaded model version {version} from {model_data['timestamp']}")
            return model_data['model']
        else:
            raise FileNotFoundError(f"Model version {version} not found")
    def list_versions(self):
        """List all available versions"""
        versions = []
        for filename in os.listdir(self.base_path):
            if filename.startswith('model_v') and filename.endswith('.joblib'):
                version = filename.replace('model_v', '').replace('.joblib', '')
                versions.append(version)
        return sorted(versions)
# Usage
mvc = ModelVersionControl()
mvc.save_model(pipeline, version='1.0', metadata={'accuracy': score})
mvc.save_model(pipeline, version='1.1', metadata={'accuracy': score, 'improved': True})
print(f"Available versions: {mvc.list_versions()}")
loaded_model = mvc.load_model('1.0')
# Clean up files
for file in ['model_pipeline.joblib', 'model_pipeline.pkl', 'model_components.joblib']:
    if os.path.exists(file):
        os.remove(file)
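Saved models are only reliable when loaded with the library version that created them. One way to guard against mismatches, sketched here, is to store the scikit-learn version next to the model and compare it on load. The bundle keys (`model`, `sklearn_version`) are assumptions for illustration, not a scikit-learn convention.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(X, y)

# A temporary directory keeps the example free of leftover files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.joblib"
    joblib.dump({"model": model, "sklearn_version": sklearn.__version__}, path)
    bundle = joblib.load(path)

# Compare the stored version with the running one before trusting the model
version_matches = bundle["sklearn_version"] == sklearn.__version__
same_predictions = np.array_equal(bundle["model"].predict(X), model.predict(X))

print(f"Saved with scikit-learn {bundle['sklearn_version']}, match: {version_matches}")
print(f"Loaded model reproduces predictions: {same_predictions}")
```

In a deployment script, a failed version check would be a reason to retrain or to pin the environment rather than to load the model anyway.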
Practice Exercises
Exercise 1: Complete ML Pipeline
Build a complete machine learning pipeline that:
- Loads and explores a dataset
- Handles missing values and categorical variables
- Performs feature scaling
- Compares multiple models
- Selects the best model using cross-validation
- Saves the final model for deployment
Exercise 2: Custom Transformer
Create a custom scikit-learn transformer that:
- Implements the transformer interface (fit, transform)
- Performs custom feature engineering
- Can be used in a Pipeline
- Handles both training and test data correctly
Exercise 3: Model Comparison Framework
Develop a framework that:
- Takes a dataset and list of models
- Automatically handles preprocessing based on data types
- Performs cross-validation for each model
- Generates comparison visualizations
- Recommends the best model with explanations
Key Takeaways
- 📚 Scikit-learn provides a consistent API across all algorithms
- 🔧 Preprocessing is crucial: scaling, encoding, imputation
- 🎯 fit() trains the model, predict() makes predictions
- 📊 Always split data into train/test sets
- ⚖️ Different algorithms suit different problems
- 💾 Models can be saved and loaded for deployment
- 🔄 Pipelines chain preprocessing and modeling steps
- 📈 Evaluate models with appropriate metrics