Scikit-learn Basics - Python Data Science Path

Your Gateway to Machine Learning! 🤖

Scikit-learn is the Swiss Army knife of machine learning in Python. Its consistent API, broad collection of algorithms, and utilities for preprocessing, model selection, and evaluation cover most classical machine learning tasks. From your first classifier to multi-step pipelines, scikit-learn provides the tools you need to build, evaluate, and save ML models.

The Machine Learning Workflow

graph LR
    A[Raw Data] --> B[Data Preprocessing]
    B --> C[Feature Engineering]
    C --> D[Train/Test Split]
    D --> E[Model Selection]
    E --> F[Model Training]
    F --> G[Model Evaluation]
    G --> H{Good Performance?}
    H -->|No| I[Hyperparameter Tuning]
    I --> F
    H -->|Yes| J[Model Deployment]
    B --> K[Scaling/Normalization]
    B --> L[Encoding]
    B --> M[Missing Values]
    E --> N[Classification]
    E --> O[Regression]
    E --> P[Clustering]
    G --> Q[Metrics]
    G --> R[Cross-validation]
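
The workflow in the diagram can be sketched in a few lines. This is a minimal illustration, not a recommended recipe: the synthetic dataset, the choice of LogisticRegression, and the small grid of C values are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Raw data: a synthetic stand-in for a real dataset (assumption for this sketch)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Train/test split: the test set is held out until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model chained together, so scaling is learned from training folds only
workflow = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with 5-fold cross-validation on the training data
search = GridSearchCV(workflow, param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Final evaluation on the held-out test set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Each of these steps is covered in more detail in the sections that follow.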

Installation and Setup

# Installation
"""
pip install scikit-learn
pip install numpy pandas matplotlib seaborn
pip install jupyter notebook
"""

# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn import __version__
print(f"Scikit-learn version: {__version__}")

# Common imports organized by category
# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold

# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB

# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve

# Utilities
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
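# Note: load_boston was removed in scikit-learn 1.2; fetch_california_housing is used for regression below
from sklearn.datasets import load_iris, fetch_california_housing, make_classification, make_regression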

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

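# Suppress warnings (keeps tutorial output tidy; avoid in real projects,
# because it also hides convergence and deprecation warnings)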
import warnings
warnings.filterwarnings('ignore')

Understanding Scikit-learn's API

The Estimator Interface

# Scikit-learn follows a consistent API pattern

# 1. All estimators follow the same interface
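class SklearnEstimatorPattern:
    """
    Illustrative skeleton only (not a working estimator).

    Common methods across scikit-learn estimators:
    - fit(X, y): Learn from data (every estimator)
    - predict(X): Make predictions (classifiers, regressors, most clusterers)
    - transform(X): Transform data (preprocessors and other transformers)
    - score(X, y): Evaluate performance
    """
    
    def fit(self, X, y=None):
        """Learn from training data"""
        # Learn patterns from X (and y for supervised learning)
        return self
    
    def predict(self, X):
        """Make predictions on new data"""
        # A real estimator applies the learned patterns and returns predictions
        raise NotImplementedError("placeholder")
    
    def transform(self, X):
        """Transform data using what was learned in fit"""
        # A real transformer returns the transformed version of X
        raise NotImplementedError("placeholder")
    
    def score(self, X, y):
        """Return the score of predictions"""
        # A real estimator returns accuracy (classifiers) or R² (regressors)
        raise NotImplementedError("placeholder")
    
    def fit_predict(self, X, y=None):
        """Fit and predict in one step (for clustering)"""
        return self.fit(X, y).predict(X)
    
    def fit_transform(self, X, y=None):
        """Fit and transform in one step (for preprocessing)"""
        return self.fit(X, y).transform(X)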

# 2. Example with a real classifier
from sklearn.tree import DecisionTreeClassifier

# Create model instance
model = DecisionTreeClassifier(random_state=42)

# Load example data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model (training)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Get probability predictions (for classifiers)
y_proba = model.predict_proba(X_test)

# Evaluate model
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

# 3. Attributes learned during fit (end with underscore)
print(f"Classes: {model.classes_}")
print(f"Number of features: {model.n_features_in_}")
print(f"Feature importances: {model.feature_importances_}")

# 4. Get and set parameters
# Get parameters
params = model.get_params()
print(f"Model parameters: {params}")

# Set parameters (takes effect the next time fit() is called; the existing fit is not updated)
model.set_params(max_depth=3, min_samples_split=5)

# 5. Clone estimator
from sklearn.base import clone
model_clone = clone(model)  # Creates unfitted copy with same parameters
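
The fit/transform split matters most for preprocessing: statistics are learned from the training data and then reused unchanged on new data. The tiny arrays below are made up for illustration; the sketch shows that fit_transform is equivalent to fit followed by transform, and that test data is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values chosen so the arithmetic is easy to follow)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# fit_transform(X) gives the same result as fit(X).transform(X)
scaler = StandardScaler()
X_train_a = scaler.fit_transform(X_train)
X_train_b = StandardScaler().fit(X_train).transform(X_train)
print(np.allclose(X_train_a, X_train_b))

# Learned attributes end with an underscore
print(f"Training mean: {scaler.mean_}, training std: {scaler.scale_}")

# The test sample is scaled with the TRAINING statistics: (4 - 2) / sqrt(2/3)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

Calling fit (or fit_transform) on the test set instead would leak information about the test data into preprocessing.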

Data Preprocessing

Scaling and Normalization

# Different scaling techniques

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
from sklearn.preprocessing import MaxAbsScaler, QuantileTransformer, PowerTransformer

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.normal(100, 50, 1000),
    'feature2': np.random.exponential(2, 1000),
    'feature3': np.random.uniform(0, 1, 1000),
    'outliers': np.concatenate([np.random.normal(0, 1, 990), np.random.normal(10, 1, 10)])
})

# 1. StandardScaler - Standardization (mean=0, std=1)
scaler_standard = StandardScaler()
data_standard = scaler_standard.fit_transform(data)
print("StandardScaler - Good for normally distributed features")
print(f"Mean: {data_standard.mean(axis=0)}")
print(f"Std: {data_standard.std(axis=0)}")

# 2. MinMaxScaler - Normalization to [0, 1]
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("\nMinMaxScaler - Good when you know min/max bounds")
print(f"Min: {data_minmax.min(axis=0)}")
print(f"Max: {data_minmax.max(axis=0)}")

# 3. RobustScaler - Robust to outliers
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(data)
print("\nRobustScaler - Good when data contains outliers")

# 4. Normalizer - Normalize samples to unit norm
normalizer = Normalizer(norm='l2')  # l1, l2, or max
data_normalized = normalizer.fit_transform(data)
print("\nNormalizer - Good for text/sparse data")

# 5. QuantileTransformer - Map to uniform or normal distribution
qt_uniform = QuantileTransformer(output_distribution='uniform')
data_uniform = qt_uniform.fit_transform(data)

qt_normal = QuantileTransformer(output_distribution='normal')
data_normal = qt_normal.fit_transform(data)

# 6. PowerTransformer - Map to Gaussian distribution
pt_yeo = PowerTransformer(method='yeo-johnson')  # Works with positive and negative
pt_box = PowerTransformer(method='box-cox')  # Requires strictly positive values

# Visualize different scaling methods
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

scalers = [
    ('Original', data['feature2'].values.reshape(-1, 1)),
    ('StandardScaler', StandardScaler().fit_transform(data[['feature2']])),
    ('MinMaxScaler', MinMaxScaler().fit_transform(data[['feature2']])),
    ('RobustScaler', RobustScaler().fit_transform(data[['feature2']])),
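    # Note: Normalizer rescales each row (sample), not each column; with a single
    # column every nonzero value maps to ±1, so this panel is not a useful comparison
    ('Normalizer', Normalizer().fit_transform(data[['feature2']])),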
    ('QuantileTransformer (uniform)', QuantileTransformer(output_distribution='uniform').fit_transform(data[['feature2']])),
    ('QuantileTransformer (normal)', QuantileTransformer(output_distribution='normal').fit_transform(data[['feature2']])),
    ('PowerTransformer', PowerTransformer().fit_transform(data[['feature2']]))
]

for ax, (name, scaled_data) in zip(axes, scalers):
    ax.hist(scaled_data, bins=50, edgecolor='black', alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# When to use which scaler
scaling_guide = """
SCALING GUIDE:
1. StandardScaler: 
   - When features are normally distributed
   - For algorithms that assume zero mean (PCA, linear models with L2 regularization)

2. MinMaxScaler:
   - When you know the min/max bounds
   - For neural networks (often prefer [0,1] or [-1,1])
   - For image data

3. RobustScaler:
   - When data contains outliers
   - Uses median and IQR instead of mean and std

4. Normalizer:
   - For text data (TF-IDF)
   - When you care about direction, not magnitude
   
5. QuantileTransformer:
   - When you want uniform or normal distribution
   - Reduces impact of outliers
   
6. PowerTransformer:
   - To make data more Gaussian-like
   - Stabilize variance and minimize skewness
"""
print(scaling_guide)
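
To make the RobustScaler entry in the guide concrete, the sketch below reproduces its output by hand from the median and interquartile range, and compares it with StandardScaler on data containing one outlier. The five values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one extreme outlier (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: subtract the median, divide by the IQR (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

print("Manual == RobustScaler:", np.allclose(manual, robust))
print("RobustScaler:  ", robust.ravel())
print("StandardScaler:", standard.ravel())

# How spread out are the four ordinary values after scaling?
inlier_spread_robust = robust[:4].max() - robust[:4].min()
inlier_spread_standard = standard[:4].max() - standard[:4].min()
print(f"Inlier spread: robust={inlier_spread_robust:.3f}, standard={inlier_spread_standard:.3f}")
```

The outlier inflates the mean and standard deviation, so StandardScaler squeezes the ordinary values into a narrow band, while the median and IQR are barely affected.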

Handling Categorical Variables

# Encoding categorical variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'M'],
    'quality': ['good', 'bad', 'excellent', 'good', 'bad'],
    'price': [10, 20, 15, 12, 18]
})

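# 1. Label Encoding - Intended for target labels (y); applied to a feature here
#    only to show the mapping. For input features use OrdinalEncoder or OneHotEncoder.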
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
print("Label Encoding:")
print(df[['color', 'color_encoded']])
print(f"Classes: {label_encoder.classes_}")

# Inverse transform
original = label_encoder.inverse_transform([0, 1, 2])
print(f"Inverse transform: {original}")

# 2. Ordinal Encoding - For ordinal features with order
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).ravel()  # flatten (n, 1) output to 1-D
print("\nOrdinal Encoding (with order):")
print(df[['size', 'size_encoded']])

# 3. One-Hot Encoding - For nominal features
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap
color_encoded = onehot_encoder.fit_transform(df[['color']])
color_df = pd.DataFrame(
    color_encoded, 
    columns=onehot_encoder.get_feature_names_out(['color'])
)
print("\nOne-Hot Encoding:")
print(pd.concat([df['color'], color_df], axis=1))

# 4. Using pandas get_dummies (convenient for DataFrames)
df_dummies = pd.get_dummies(df, columns=['color', 'quality'], drop_first=True)
print("\nPandas get_dummies:")
print(df_dummies.head())

# 5. Target Encoding (Mean Encoding) - Advanced technique
class TargetEncoder:
    """Simple target encoder for demonstration"""
    
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.encoding_map = {}
        self.global_mean = None
    
    def fit(self, X, y):
        self.global_mean = y.mean()
        
        for category in X.unique():
            mask = X == category
            n = mask.sum()
            mean = y[mask].mean()
            
            # Smoothing to prevent overfitting
            smoothed_mean = (mean * n + self.global_mean * self.smoothing) / (n + self.smoothing)
            self.encoding_map[category] = smoothed_mean
        
        return self
    
    def transform(self, X):
        return X.map(self.encoding_map).fillna(self.global_mean)

# Example of target encoding
target_encoder = TargetEncoder(smoothing=2.0)
target_encoder.fit(df['color'], df['price'])
df['color_target_encoded'] = target_encoder.transform(df['color'])
print("\nTarget Encoding:")
print(df[['color', 'price', 'color_target_encoded']])

# 6. Handling unknown categories
# Create train and test with different categories
train_df = df.iloc[:3]
test_df = pd.DataFrame({'color': ['red', 'yellow']})  # 'yellow' is unknown

# OneHotEncoder with handle_unknown
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(train_df[['color']])
test_encoded = ohe.transform(test_df[['color']])
print("\nHandling unknown categories:")
print(f"Test data: {test_df['color'].values}")
print(f"Encoded (unknown ignored): {test_encoded}")

Handling Missing Values

# Imputation strategies for missing data

# The experimental flag must be imported BEFORE IterativeImputer, otherwise the import fails
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
import numpy as np
import pandas as pd

# Create data with missing values
np.random.seed(42)
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7],
    'B': [np.nan, 2, 3, 4, np.nan, 6, 7],
    'C': [1, 2, 3, np.nan, 5, 6, np.nan],
    'D': ['cat', 'dog', np.nan, 'cat', 'dog', 'cat', np.nan]
})

print("Original data with missing values:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")

# 1. Simple Imputer - Basic strategies
# For numerical columns
num_cols = ['A', 'B', 'C']

# Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
df_mean = df.copy()
df_mean[num_cols] = imputer_mean.fit_transform(df[num_cols])
print("\nMean Imputation:")
print(df_mean)

# Median imputation
imputer_median = SimpleImputer(strategy='median')
df_median = df.copy()
df_median[num_cols] = imputer_median.fit_transform(df[num_cols])

# Most frequent (mode) imputation
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode = df.copy()
df_mode[num_cols] = imputer_mode.fit_transform(df[num_cols])

# Constant imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
df_constant = df.copy()
df_constant[num_cols] = imputer_constant.fit_transform(df[num_cols])

# For categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df['D_imputed'] = cat_imputer.fit_transform(df[['D']]).ravel()  # flatten (n, 1) output to 1-D
print("\nCategorical imputation (most frequent):")
print(df[['D', 'D_imputed']])

# 2. KNN Imputer - Uses K-Nearest Neighbors
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = df.copy()
df_knn[num_cols] = knn_imputer.fit_transform(df[num_cols])
print("\nKNN Imputation:")
print(df_knn[num_cols])

# 3. Iterative Imputer - Models each feature from the others
#    (inspired by MICE; by default it returns a single imputation, not multiple)
iterative_imputer = IterativeImputer(random_state=42)
df_iterative = df.copy()
df_iterative[num_cols] = iterative_imputer.fit_transform(df[num_cols])
print("\nIterative Imputation (MICE):")
print(df_iterative[num_cols])

# 4. Advanced: Custom imputation based on other features
class GroupImputer:
    """Impute based on group statistics"""
    
    def __init__(self, group_col, target_col, strategy='mean'):
        self.group_col = group_col
        self.target_col = target_col
        self.strategy = strategy
        self.group_values = {}
    
    def fit(self, df):
        if self.strategy == 'mean':
            self.group_values = df.groupby(self.group_col)[self.target_col].mean().to_dict()
        elif self.strategy == 'median':
            self.group_values = df.groupby(self.group_col)[self.target_col].median().to_dict()
        return self
    
    def transform(self, df):
        df_copy = df.copy()
        
        for group, value in self.group_values.items():
            mask = (df_copy[self.group_col] == group) & df_copy[self.target_col].isna()
            df_copy.loc[mask, self.target_col] = value
        
        return df_copy

# 5. Missing indicator - Add binary indicators for missing values
from sklearn.impute import MissingIndicator

indicator = MissingIndicator(features='all')  # one indicator per input column, so names below always match
mask = indicator.fit_transform(df[num_cols])
df_with_indicator = pd.concat([
    df,
    pd.DataFrame(mask, columns=[f'{col}_was_missing' for col in num_cols])
], axis=1)
print("\nWith missing indicators:")
print(df_with_indicator.head())

# Visualization of imputation effects
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Compare different imputation methods
methods = {
    'Original': df[num_cols],
    'Mean': df_mean[num_cols],
    'KNN': df_knn[num_cols],
    'Iterative': df_iterative[num_cols]
}

for ax, (method, data) in zip(axes.flatten(), methods.items()):
    data.plot(kind='box', ax=ax)
    ax.set_title(f'{method} Imputation')
    ax.set_ylabel('Value')

plt.tight_layout()
plt.show()
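
Imputation statistics should be learned from training data only and then applied to new data. The sketch below, using made-up arrays, also shows the add_indicator option of SimpleImputer, which appends the missing-value flags in one step instead of using MissingIndicator separately.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test data with missing entries
X_train_demo = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])
X_test_demo = np.array([[np.nan, np.nan]])

# Median imputation plus one binary "was missing" column per feature that had gaps in training
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train_demo)

# The medians come from the training data: 3.0 for column 0 and 20.0 for column 1
print(f"Learned medians: {imputer.statistics_}")

# Test rows are filled with the training medians, followed by the indicator columns
X_test_imputed = imputer.transform(X_test_demo)
print(X_test_imputed)
```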

Classification Models

Building Your First Classifier

# Complete classification workflow

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
import numpy as np

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

print("Dataset Info:")
print(f"Shape: {X.shape}")
print(f"Features: {X.columns.tolist()[:5]}...")
print(f"Target distribution:\n{y.value_counts()}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize multiple classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

# Train and evaluate each classifier
results = []

for name, clf in classifiers.items():
    # Use scaled data for SVM and Logistic Regression
    if name in ['Logistic Regression', 'SVM']:
        clf.fit(X_train_scaled, y_train)
        y_pred = clf.predict(X_test_scaled)
    else:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })
    
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))

# Compare models
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.to_string(index=False))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of metrics
results_df.set_index('Model')[['Accuracy', 'Precision', 'Recall', 'F1-Score']].plot(
    kind='bar', ax=axes[0], rot=45
)
axes[0].set_title('Model Performance Comparison')
axes[0].set_ylabel('Score')
axes[0].legend(loc='lower right')

# Confusion matrix for Random Forest (picked for illustration; check results_df for the actual best model)
best_model = classifiers['Random Forest']  # already fitted on X_train in the loop above
y_pred_best = best_model.predict(X_test)

cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title('Confusion Matrix - Random Forest')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

# Feature importance (for tree-based models)
if hasattr(best_model, 'feature_importances_'):
    importances = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 Important Features:")
    print(importances.head(10).to_string(index=False))
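
A single train/test split gives one estimate of performance that depends on how the split happened to fall. Cross-validation, listed in the workflow but not used above, averages over several splits. The sketch below is one way to do it, assuming logistic regression and 5 stratified folds; putting the scaler inside the pipeline means it is refitted on each training fold, which avoids leaking test-fold statistics.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Scaler and classifier are fitted together inside every fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_bc, y_bc, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")
```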

Regression Models

Building Regression Models

# Complete regression workflow

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import pandas as pd

# Load dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')

print("California Housing Dataset:")
print(f"Shape: {X.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Target statistics:")
print(y.describe())

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize regression models
regressors = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42),
    'Lasso Regression': Lasso(alpha=0.1, random_state=42),
    'ElasticNet': ElasticNet(alpha=0.1, random_state=42),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42),
    'SVR': SVR(kernel='rbf', C=1.0)
}

# Train and evaluate models
results = []

for name, model in regressors.items():
    # Use scaled data for linear models and SVR
    if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'ElasticNet', 'SVR']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2
    })
    
    print(f"\n{name}:")
    print(f"RMSE: {rmse:.3f}")
    print(f"MAE: {mae:.3f}")
    print(f"R² Score: {r2:.3f}")

# Compare models
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.sort_values('R²', ascending=False).to_string(index=False))

# Visualize predictions vs actual
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

models_to_plot = ['Linear Regression', 'Ridge Regression', 'Random Forest', 
                  'Gradient Boosting', 'SVR', 'Decision Tree']

for ax, model_name in zip(axes, models_to_plot):
    model = regressors[model_name]
    
    if model_name in ['Linear Regression', 'Ridge Regression', 'SVR']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    ax.scatter(y_test, y_pred, alpha=0.5, s=10)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    ax.set_xlabel('Actual Price')
    ax.set_ylabel('Predicted Price')
    ax.set_title(f'{model_name}\nR²: {r2_score(y_test, y_pred):.3f}')

plt.tight_layout()
plt.show()

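# Residual analysis for Gradient Boosting (picked for illustration; check results_df for the actual best model)
best_model = regressors['Gradient Boosting']  # already fitted on X_train in the loop above
y_pred_best = best_model.predict(X_test)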
residuals = y_test - y_pred_best

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual plot
axes[0].scatter(y_pred_best, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')

# Histogram of residuals
axes[1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()
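
The three regression metrics used above are easy to verify by hand. The sketch below computes RMSE, MAE, and R² directly from their definitions on four invented values and checks them against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo          # [1, 0, -2, -1]

# RMSE: square root of the mean squared error (same units as the target)
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt(6 / 4)

# MAE: mean of the absolute errors (less sensitive to large errors than RMSE)
mae_manual = np.mean(np.abs(errors))         # 4 / 4 = 1.0

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)                                   # 6.0
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)       # 14.75
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f}  MAE: {mae_manual:.3f}  R²: {r2_manual:.3f}")

# The scikit-learn functions give the same numbers
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)
r2_sklearn = r2_score(y_true_demo, y_pred_demo)
```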

Clustering Models

Unsupervised Learning with Clustering

# Clustering algorithms

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.datasets import make_blobs, make_moons
import numpy as np
import pandas as pd

# Generate sample data
X_blobs, y_true = make_blobs(n_samples=300, centers=4, n_features=2, 
                              cluster_std=0.5, random_state=42)
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# Scale the data (one scaler per dataset, so each keeps its own fitted statistics)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# 1. K-Means Clustering
def perform_kmeans(X, n_clusters_range):
    """Compute inertia and silhouette score for each k (inputs for the elbow and silhouette plots)"""
    inertias = []
    silhouette_scores = []
    
    for k in n_clusters_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        
        inertias.append(kmeans.inertia_)
        if k > 1:  # Silhouette score requires at least 2 clusters
            score = silhouette_score(X, kmeans.labels_)
            silhouette_scores.append(score)
        else:
            silhouette_scores.append(0)
    
    return inertias, silhouette_scores

# Find optimal k
k_range = range(2, 10)
inertias, sil_scores = perform_kmeans(X_blobs_scaled, k_range)

# Plot elbow curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')

axes[1].plot(k_range, sil_scores, 'ro-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')

plt.tight_layout()
plt.show()

# 2. Apply different clustering algorithms
clustering_algorithms = {
    'K-Means': KMeans(n_clusters=4, random_state=42),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
    'Agglomerative': AgglomerativeClustering(n_clusters=4),
    'Gaussian Mixture': GaussianMixture(n_components=4, random_state=42)
}

# Apply to blob data
results = []
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for ax, (name, algorithm) in zip(axes, clustering_algorithms.items()):
    # Fit and predict (GaussianMixture also implements fit_predict, so no special case is needed)
    labels = algorithm.fit_predict(X_blobs_scaled)
    
    # Plot
    scatter = ax.scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7)
    ax.set_title(name)
    
    # Calculate metrics (if valid clustering)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    
    if n_clusters > 1:
        sil_score = silhouette_score(X_blobs_scaled, labels)
        ch_score = calinski_harabasz_score(X_blobs_scaled, labels)
        db_score = davies_bouldin_score(X_blobs_scaled, labels)
        
        results.append({
            'Algorithm': name,
            'N_Clusters': n_clusters,
            'Silhouette': sil_score,
            'Calinski-Harabasz': ch_score,
            'Davies-Bouldin': db_score
        })
        
        ax.text(0.02, 0.98, f'Silhouette: {sil_score:.3f}',
                transform=ax.transAxes, va='top')

# True labels for comparison
axes[4].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.7)
axes[4].set_title('True Labels')

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

# Display metrics comparison
results_df = pd.DataFrame(results)
print("\nClustering Metrics Comparison:")
print(results_df.to_string(index=False))
print("\nNote: Higher Silhouette and Calinski-Harabasz are better")
print("      Lower Davies-Bouldin is better")

# 3. DBSCAN for non-spherical clusters (moon dataset)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# K-Means on moon data (poor performance)
kmeans_moon = KMeans(n_clusters=2, random_state=42)
labels_kmeans = kmeans_moon.fit_predict(X_moons_scaled)
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_kmeans, cmap='viridis', s=50, alpha=0.7)
axes[0].set_title('K-Means on Moon Dataset')

# DBSCAN on moon data (better performance)
dbscan_moon = DBSCAN(eps=0.3, min_samples=5)
labels_dbscan = dbscan_moon.fit_predict(X_moons_scaled)
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_dbscan, cmap='viridis', s=50, alpha=0.7)
axes[1].set_title('DBSCAN on Moon Dataset')

plt.tight_layout()
plt.show()
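
The metrics above (silhouette, Calinski-Harabasz, Davies-Bouldin) need no ground truth. When true labels are available, as with synthetic data, an external metric such as the adjusted Rand index (ARI) can score a clustering against them. The sketch below uses deliberately well-separated blobs, an assumption that makes the outcome easy to predict, and shows that ARI ignores how the cluster ids are numbered.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three tight, far-apart blobs (hypothetical centers chosen for a clear-cut example)
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# ARI is 1.0 for a perfect match and close to 0 for random labelings
ari = adjusted_rand_score(y_demo, labels_demo)
print(f"ARI vs. true labels: {ari:.3f}")

# Renumbering the clusters does not change the score
permuted_labels = (labels_demo + 1) % 3
ari_permuted = adjusted_rand_score(labels_demo, permuted_labels)
print(f"ARI after renumbering clusters: {ari_permuted:.3f}")
```

The same function can be applied to labels_kmeans and labels_dbscan on the moon data if the true labels returned by make_moons are kept.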

Model Persistence

# Saving and loading models

import joblib
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import os

# Create a simple pipeline
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Original model score: {score:.3f}")

# Method 1: Using joblib (recommended for scikit-learn)
# Save
joblib.dump(pipeline, 'model_pipeline.joblib')
print("Model saved to 'model_pipeline.joblib'")

# Load
loaded_pipeline = joblib.load('model_pipeline.joblib')
loaded_score = loaded_pipeline.score(X_test, y_test)
print(f"Loaded model score: {loaded_score:.3f}")

# Method 2: Using pickle
# (pickle and joblib files can execute code on load; only load files from sources you trust)
# Save
with open('model_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Load
with open('model_pipeline.pkl', 'rb') as f:
    loaded_pipeline_pkl = pickle.load(f)

# Method 3: Save individual components
# Useful when you need to inspect or modify components
model_components = {
    'scaler': pipeline.named_steps['scaler'],
    'classifier': pipeline.named_steps['classifier'],
    'feature_names': load_iris().feature_names,
    'target_names': load_iris().target_names.tolist(),
    'model_params': pipeline.get_params()
}

joblib.dump(model_components, 'model_components.joblib')

# Load components and rebuild
components = joblib.load('model_components.joblib')
rebuilt_pipeline = Pipeline([
    ('scaler', components['scaler']),
    ('classifier', components['classifier'])
])

# Version control for models
class ModelVersionControl:
    """Simple model versioning system"""
    
    def __init__(self, base_path='models'):
        self.base_path = base_path
        os.makedirs(base_path, exist_ok=True)
    
    def save_model(self, model, version, metadata=None):
        """Save model with version"""
        import datetime
        
        model_data = {
            'model': model,
            'version': version,
            'timestamp': datetime.datetime.now().isoformat(),
            'metadata': metadata or {}
        }
        
        filepath = os.path.join(self.base_path, f'model_v{version}.joblib')
        joblib.dump(model_data, filepath)
        print(f"Model saved: {filepath}")
        
        return filepath
    
    def load_model(self, version):
        """Load specific version"""
        filepath = os.path.join(self.base_path, f'model_v{version}.joblib')
        
        if os.path.exists(filepath):
            model_data = joblib.load(filepath)
            print(f"Loaded model version {version} from {model_data['timestamp']}")
            return model_data['model']
        else:
            raise FileNotFoundError(f"Model version {version} not found")
    
    def list_versions(self):
        """List all available versions"""
        versions = []
        
        for filename in os.listdir(self.base_path):
            if filename.startswith('model_v') and filename.endswith('.joblib'):
                version = filename.replace('model_v', '').replace('.joblib', '')
                versions.append(version)
        
        return sorted(versions)  # string sort: '1.10' sorts before '1.9'

# Usage
mvc = ModelVersionControl()
mvc.save_model(pipeline, version='1.0', metadata={'accuracy': score})
mvc.save_model(pipeline, version='1.1', metadata={'accuracy': score, 'improved': True})

print(f"Available versions: {mvc.list_versions()}")
loaded_model = mvc.load_model('1.0')

# Clean up files
for file in ['model_pipeline.joblib', 'model_pipeline.pkl', 'model_components.joblib']:
    if os.path.exists(file):
        os.remove(file)
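
Saved models are not guaranteed to load correctly under a different scikit-learn version, so it helps to store the version next to the model and check it on load. The sketch below is one possible approach; the dictionary keys and file name are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
model_to_save = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_with_meta.joblib")  # hypothetical file name

    # Store the library version alongside the fitted model
    joblib.dump(
        {"model": model_to_save, "sklearn_version": sklearn.__version__},
        path,
    )
    bundle = joblib.load(path)

# Compare the stored version with the one currently installed
version_matches = bundle["sklearn_version"] == sklearn.__version__
if not version_matches:
    print(
        f"Warning: model was saved with scikit-learn {bundle['sklearn_version']}, "
        f"but {sklearn.__version__} is installed"
    )

restored = bundle["model"]
same_predictions = bool((restored.predict(X_iris) == model_to_save.predict(X_iris)).all())
print(f"Version matches: {version_matches}, same predictions: {same_predictions}")
```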

Practice Exercises

Exercise 1: Complete ML Pipeline

Build a complete machine learning pipeline that:

Loads and explores a dataset
Handles missing values and categorical variables
Performs feature scaling
Compares multiple models
Selects the best model using cross-validation
Saves the final model for deployment

Exercise 2: Custom Transformer

Create a custom scikit-learn transformer that:

Implements the transformer interface (fit, transform)
Performs custom feature engineering
Can be used in a Pipeline
Handles both training and test data correctly
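
One possible starting point for this exercise is sketched below. ClipOutliers is a hypothetical transformer invented for illustration, not part of scikit-learn: it learns per-feature percentile bounds in fit and clips to them in transform. Inheriting from TransformerMixin and BaseEstimator supplies fit_transform, get_params, and set_params.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical example: clip each feature to percentile bounds learned in fit."""

    def __init__(self, lower=1.0, upper=99.0):
        # __init__ only stores parameters; no computation happens here
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore
        self.lower_bounds_ = np.percentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Bounds come from the training data, even when transforming test data
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


X_fit = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_new = np.array([[-100.0], [3.0], [100.0]])

# With the 0th and 100th percentiles, the bounds are the training min and max
clipper = ClipOutliers(lower=0.0, upper=100.0).fit(X_fit)
clipped = clipper.transform(X_new)
print(clipped.ravel())

# The transformer works as a Pipeline step like any built-in one
pipe = make_pipeline(ClipOutliers(lower=0.0, upper=100.0), StandardScaler())
pipe.fit(X_fit)
piped = pipe.transform(X_new)
```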

Exercise 3: Model Comparison Framework

Develop a framework that:

Takes a dataset and list of models
Automatically handles preprocessing based on data types
Performs cross-validation for each model
Generates comparison visualizations
Recommends the best model with explanations

Key Takeaways

📚 Scikit-learn provides a consistent API across all algorithms
🔧 Preprocessing is crucial: scaling, encoding, imputation
🎯 fit() trains the model, predict() makes predictions
📊 Always split data into train/test sets
⚖️ Different algorithms suit different problems
💾 Models can be saved and loaded for deployment
🔄 Pipelines chain preprocessing and modeling steps
📈 Evaluate models with appropriate metrics