🔍 Feature Selection: Choosing the Right Variables

Introduction

Feature selection is a critical preprocessing step in machine learning that identifies the most relevant features for your model. Unlike feature extraction methods (PCA, LDA), which transform the inputs into new features, feature selection keeps a subset of the original features. This preserves interpretability and decreases training time, and it can reduce overfitting and improve model performance. We'll explore filter, wrapper, and embedded methods for effective feature selection.
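
To make the selection-versus-extraction distinction concrete, here is a minimal sketch on the breast cancer dataset. The choice of `SelectKBest` with `f_classif`, `k=5`, and the `*_demo` variable names are illustrative assumptions, not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = StandardScaler().fit_transform(data.data)
y_demo = data.target

# Selection: keep 5 of the original columns, so their names stay meaningful
selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
kept_names = data.feature_names[selector.get_support()]
X_sel = selector.transform(X_demo)

# Extraction: build 5 new columns, each a linear mix of the original 30
pca = PCA(n_components=5).fit(X_demo)
X_pca = pca.transform(X_demo)

print("Selected columns:", list(kept_names))
print("Selected shape:", X_sel.shape, "| PCA shape:", X_pca.shape)
print("PCA component matrix shape:", pca.components_.shape)
```

The selected matrix is literally a slice of the input, while each PCA column depends on every original feature through `pca.components_`.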

Core Concepts and Methods Overview

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
from sklearn.metrics import accuracy_score, mutual_info_score
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("="*60)
print("FEATURE SELECTION FUNDAMENTALS")
print("="*60)

# Core concepts
feature_selection_concepts = """
FEATURE SELECTION OVERVIEW:

1. THREE MAIN APPROACHES:
   
   A) FILTER METHODS:
      • Statistical tests independent of ML algorithm
      • Fast and scalable
      • Examples: Correlation, Chi-square, ANOVA, Information gain
   
   B) WRAPPER METHODS:
      • Use ML algorithm to evaluate subsets
      • Computationally expensive but often more accurate
      • Examples: RFE, Forward/Backward selection
   
   C) EMBEDDED METHODS:
      • Feature selection during model training
      • Balance between filter and wrapper
      • Examples: LASSO, Elastic Net, Tree-based importance

2. BENEFITS OF FEATURE SELECTION:
   • Can reduce overfitting
   • Can improve accuracy
   • Reduces training time
   • Enhances interpretability
   • Removes redundant/irrelevant features

3. SELECTION CRITERIA:
   • Relevance: How well does feature predict target?
   • Redundancy: Does feature duplicate information?
   • Interaction: Do features work better together?

4. CHALLENGES:
   • Feature interactions
   • Non-linear relationships
   • Computational complexity
   • Determining optimal number of features

5. FEATURE SELECTION VS EXTRACTION:
   • Selection: Choose subset of original features
   • Extraction: Create new features (PCA, LDA)
   • Selection preserves interpretability
"""

print(feature_selection_concepts)
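
The redundancy criterion listed above can be sketched as a greedy correlation filter. This is one simple way to do it, under the assumption that Pearson correlation is an adequate redundancy measure; the helper name `drop_redundant_features` and the 0.9 threshold are hypothetical choices for illustration. Note that it ignores the target entirely and keeps whichever feature of a correlated pair comes first.

```python
import numpy as np

def drop_redundant_features(X, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute Pearson correlation
    with every already-kept feature is at most `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_toy = np.column_stack([
    a,                                 # feature 0
    b,                                 # feature 1, independent of feature 0
    a + 0.01 * rng.normal(size=500),   # feature 2, near-duplicate of feature 0
])

print(drop_redundant_features(X_toy))  # feature 2 is dropped as redundant
```

In practice a relevance score (for example correlation or mutual information with the target) would usually decide which member of a redundant pair to keep.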

Filter Methods

class FilterMethods:
    """Implementation of various filter-based feature selection methods"""
    
    def __init__(self):
        self.scores = {}
        self.selected_features = {}
        
    def correlation_selection(self, X, y, threshold=0.3):
        """Select features based on correlation with target"""
        
        # Calculate correlations
        if isinstance(X, np.ndarray):
            X_df = pd.DataFrame(X)
        else:
            X_df = X
            
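        # Align y to X's index so corrwith pairs rows by position, not by label
        correlations = X_df.corrwith(pd.Series(np.asarray(y), index=X_df.index))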
        correlations_abs = correlations.abs()
        
        # Select features above threshold
        selected = correlations_abs[correlations_abs > threshold].index.tolist()
        
        # Visualization
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # Correlation values
        axes[0].bar(range(len(correlations)), correlations.values, 
                   color=['green' if abs(c) > threshold else 'gray' 
                          for c in correlations.values])
        axes[0].axhline(y=threshold, color='r', linestyle='--', label=f'Threshold={threshold}')
        axes[0].axhline(y=-threshold, color='r', linestyle='--')
        axes[0].set_xlabel('Feature Index')
        axes[0].set_ylabel('Correlation with Target')
        axes[0].set_title('Feature-Target Correlations')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Distribution of correlations
        axes[1].hist(correlations_abs.values, bins=20, edgecolor='black', alpha=0.7)
        axes[1].axvline(x=threshold, color='r', linestyle='--', label=f'Threshold={threshold}')
        axes[1].set_xlabel('Absolute Correlation')
        axes[1].set_ylabel('Frequency')
        axes[1].set_title('Distribution of Correlations')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        # Selected vs rejected features
        n_selected = len(selected)
        n_rejected = len(correlations) - n_selected
        axes[2].bar(['Selected', 'Rejected'], [n_selected, n_rejected],
                   color=['green', 'red'], alpha=0.7)
        axes[2].set_ylabel('Number of Features')
        axes[2].set_title('Feature Selection Results')
        for i, v in enumerate([n_selected, n_rejected]):
            axes[2].text(i, v, str(v), ha='center', va='bottom')
        axes[2].grid(True, alpha=0.3, axis='y')
        
        plt.suptitle('Correlation-based Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nCorrelation-based Selection:")
        print(f"  Features selected: {n_selected}/{len(correlations)}")
        print(f"  Top 5 correlations: {correlations_abs.nlargest(5).values}")
        
        return selected, correlations
    
    def variance_threshold_selection(self, X, threshold=0.01):
        """Remove low-variance features"""
        from sklearn.feature_selection import VarianceThreshold
        
        # Calculate variances
        selector = VarianceThreshold(threshold=threshold)
        X_selected = selector.fit_transform(X)
        
        variances = selector.variances_
        selected_mask = selector.get_support()
        
        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(12, 5))
        
        # Variance values
        colors = ['green' if m else 'red' for m in selected_mask]
        axes[0].bar(range(len(variances)), variances, color=colors, alpha=0.7)
        axes[0].axhline(y=threshold, color='black', linestyle='--', 
                       label=f'Threshold={threshold}')
        axes[0].set_xlabel('Feature Index')
        axes[0].set_ylabel('Variance')
        axes[0].set_title('Feature Variances')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Log scale for better visibility
        axes[1].bar(range(len(variances)), variances, color=colors, alpha=0.7)
        axes[1].axhline(y=threshold, color='black', linestyle='--')
        axes[1].set_xlabel('Feature Index')
        axes[1].set_ylabel('Variance (log scale)')
        axes[1].set_title('Feature Variances (Log Scale)')
        axes[1].set_yscale('log')
        axes[1].grid(True, alpha=0.3)
        
        plt.suptitle('Variance Threshold Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nVariance Threshold Selection:")
        print(f"  Features selected: {selected_mask.sum()}/{len(variances)}")
        print(f"  Min variance (selected): {variances[selected_mask].min():.4f}")
        print(f"  Max variance: {variances.max():.4f}")
        
        return X_selected, selected_mask, variances
    
    def mutual_information_selection(self, X, y, k=10):
        """Select features based on mutual information"""
        from sklearn.feature_selection import mutual_info_classif, SelectKBest
        
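        # Fix the seed inside the score function so the plotted ranking
        # and the SelectKBest mask come from the same MI estimates
        from functools import partial
        score_func = partial(mutual_info_classif, random_state=42)
        
        # Select top k features
        selector = SelectKBest(score_func=score_func, k=k)
        X_selected = selector.fit_transform(X, y)
        selected_mask = selector.get_support()
        mi_scores = selector.scores_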
        
        # Visualization
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # MI scores
        sorted_idx = np.argsort(mi_scores)[::-1]
        axes[0].bar(range(len(mi_scores)), mi_scores[sorted_idx], 
                   color='steelblue', alpha=0.7)
        axes[0].axvline(x=k-0.5, color='r', linestyle='--', 
                       label=f'Top {k} features')
        axes[0].set_xlabel('Feature Rank')
        axes[0].set_ylabel('Mutual Information')
        axes[0].set_title('Mutual Information Scores')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Distribution of MI scores
        axes[1].hist(mi_scores, bins=20, edgecolor='black', alpha=0.7)
        axes[1].set_xlabel('Mutual Information Score')
        axes[1].set_ylabel('Frequency')
        axes[1].set_title('Distribution of MI Scores')
        axes[1].grid(True, alpha=0.3)
        
        # Top features
        top_k_idx = sorted_idx[:k]
        top_k_scores = mi_scores[top_k_idx]
        axes[2].barh(range(k), top_k_scores[::-1], color='green', alpha=0.7)
        axes[2].set_yticks(range(k))
        axes[2].set_yticklabels([f'Feature {idx}' for idx in top_k_idx[::-1]])
        axes[2].set_xlabel('Mutual Information Score')
        axes[2].set_title(f'Top {k} Features')
        axes[2].grid(True, alpha=0.3, axis='x')
        
        plt.suptitle('Mutual Information Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nMutual Information Selection:")
        print(f"  Features selected: {k}/{len(mi_scores)}")
        print(f"  Top 5 MI scores: {mi_scores[sorted_idx[:5]]}")
        
        return X_selected, selected_mask, mi_scores
    
    def chi_square_selection(self, X, y, k=10):
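        """Chi-square test (intended for non-negative, count-like features)"""
        from sklearn.feature_selection import chi2, SelectKBest
        
        # chi2 requires non-negative inputs. The shift below only makes the
        # demo run on standardized data; the scores depend on the shift, so
        # prefer real counts or frequencies in practice.
        X_positive = X - X.min() + 1e-10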
        
        # Calculate chi-square scores
        chi2_scores, p_values = chi2(X_positive, y)
        
        # Select top k features
        selector = SelectKBest(score_func=chi2, k=k)
        X_selected = selector.fit_transform(X_positive, y)
        
        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(12, 5))
        
        # Chi-square scores
        sorted_idx = np.argsort(chi2_scores)[::-1]
        axes[0].bar(range(len(chi2_scores)), chi2_scores[sorted_idx],
                   color='coral', alpha=0.7)
        axes[0].axvline(x=k-0.5, color='r', linestyle='--',
                       label=f'Top {k} features')
        axes[0].set_xlabel('Feature Rank')
        axes[0].set_ylabel('Chi-square Score')
        axes[0].set_title('Chi-square Test Scores')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # P-values
        axes[1].scatter(range(len(p_values)), -np.log10(p_values[sorted_idx]),
                       alpha=0.6, s=30)
        axes[1].axhline(y=-np.log10(0.05), color='r', linestyle='--',
                       label='p=0.05')
        axes[1].axvline(x=k-0.5, color='g', linestyle='--',
                       label=f'Top {k} features')
        axes[1].set_xlabel('Feature Rank')
        axes[1].set_ylabel('-log10(p-value)')
        axes[1].set_title('Feature Significance')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        plt.suptitle('Chi-square Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nChi-square Selection:")
        print(f"  Features selected: {k}/{len(chi2_scores)}")
        print(f"  Features with p < 0.05: {(p_values < 0.05).sum()}")
        
        return X_selected, chi2_scores, p_values

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, 
                          n_informative=10, n_redundant=5,
                          n_clusters_per_class=2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize filter methods
filter_selector = FilterMethods()

print("\n" + "="*60)
print("FILTER METHODS DEMONSTRATION")
print("="*60)

print("\n1. Correlation-based Selection:")
selected_corr, correlations = filter_selector.correlation_selection(X_scaled, y)

print("\n2. Variance Threshold Selection:")
X_var, mask_var, variances = filter_selector.variance_threshold_selection(X_scaled)

print("\n3. Mutual Information Selection:")
X_mi, mask_mi, mi_scores = filter_selector.mutual_information_selection(X_scaled, y)

print("\n4. Chi-square Selection:")
X_chi2, chi2_scores, p_values = filter_selector.chi_square_selection(X_scaled, y)
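
ANOVA is listed among the filter methods in the overview but is not demonstrated above. A minimal sketch using scikit-learn's `f_classif` with `SelectKBest` could look like this; it regenerates the same synthetic data under separate `*_demo` names (an assumption made so the snippet runs on its own), and `k=10` simply mirrors the defaults used earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=10, n_redundant=5,
                                     n_clusters_per_class=2, random_state=42)

# One-way ANOVA F-test per feature: between-class vs within-class variance
anova_selector = SelectKBest(score_func=f_classif, k=10)
X_anova = anova_selector.fit_transform(X_demo, y_demo)

f_scores = anova_selector.scores_
anova_p = anova_selector.pvalues_
anova_idx = np.where(anova_selector.get_support())[0]

print(f"ANOVA selected features: {anova_idx}")
print(f"Features with p < 0.05: {(anova_p < 0.05).sum()}")
```

Like the correlation filter, the F-test scores each feature on its own, so it cannot detect features that are only useful in combination.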

Wrapper Methods

class WrapperMethods:
    """Implementation of wrapper-based feature selection methods"""
    
    def __init__(self):
        self.selected_features = {}
        self.scores = {}
        
    def recursive_feature_elimination(self, X, y, n_features=10):
        """RFE: Recursively remove features"""
        from sklearn.feature_selection import RFE, RFECV
        
        # Base estimator
        estimator = LogisticRegression(max_iter=1000, random_state=42)
        
        # RFE with fixed number of features
        rfe = RFE(estimator, n_features_to_select=n_features)
        rfe.fit(X, y)
        
        # RFECV with cross-validation
        rfecv = RFECV(estimator, step=1, cv=5, 
                     scoring='accuracy', n_jobs=-1)
        rfecv.fit(X, y)
        
        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Feature ranking
        ranking = rfe.ranking_
        axes[0, 0].bar(range(len(ranking)), ranking,
                      color=['green' if r == 1 else 'gray' for r in ranking])
        axes[0, 0].set_xlabel('Feature Index')
        axes[0, 0].set_ylabel('Rank')
        axes[0, 0].set_title(f'RFE Feature Ranking (Selected: {n_features})')
        axes[0, 0].grid(True, alpha=0.3)
        
        # RFECV scores
        axes[0, 1].plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
                       rfecv.cv_results_['mean_test_score'], marker='o')
        axes[0, 1].fill_between(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
                               rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
                               rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
                               alpha=0.3)
        axes[0, 1].axvline(x=rfecv.n_features_, color='r', linestyle='--',
                          label=f'Optimal: {rfecv.n_features_}')
        axes[0, 1].set_xlabel('Number of Features')
        axes[0, 1].set_ylabel('Cross-validation Score')
        axes[0, 1].set_title('RFECV: Optimal Feature Count')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
        # Compare selected features
        rfe_selected = np.where(rfe.support_)[0]
        rfecv_selected = np.where(rfecv.support_)[0]
        
        # Venn diagram representation (simplified)
        both = len(set(rfe_selected) & set(rfecv_selected))
        only_rfe = len(set(rfe_selected) - set(rfecv_selected))
        only_rfecv = len(set(rfecv_selected) - set(rfe_selected))
        
        axes[1, 0].bar(['RFE only', 'Both', 'RFECV only'],
                      [only_rfe, both, only_rfecv],
                      color=['blue', 'purple', 'red'], alpha=0.7)
        axes[1, 0].set_ylabel('Number of Features')
        axes[1, 0].set_title('RFE vs RFECV Feature Selection')
        axes[1, 0].grid(True, alpha=0.3, axis='y')
        
        # Performance comparison
        from sklearn.model_selection import cross_val_score
        
        # All features
        scores_all = cross_val_score(estimator, X, y, cv=5)
        
        # RFE selected
        X_rfe = X[:, rfe.support_]
        scores_rfe = cross_val_score(estimator, X_rfe, y, cv=5)
        
        # RFECV selected
        X_rfecv = X[:, rfecv.support_]
        scores_rfecv = cross_val_score(estimator, X_rfecv, y, cv=5)
        
        bp = axes[1, 1].boxplot([scores_all, scores_rfe, scores_rfecv],
                               labels=['All Features', f'RFE ({n_features})',
                                      f'RFECV ({rfecv.n_features_})'])
        axes[1, 1].set_ylabel('Accuracy')
        axes[1, 1].set_title('Performance Comparison')
        axes[1, 1].grid(True, alpha=0.3, axis='y')
        
        plt.suptitle('Recursive Feature Elimination', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nRFE Results:")
        print(f"  RFE selected features: {rfe_selected}")
        print(f"  RFECV optimal features: {rfecv.n_features_}")
        print(f"  RFECV selected features: {rfecv_selected}")
        print(f"  Performance - All: {scores_all.mean():.3f}, "
              f"RFE: {scores_rfe.mean():.3f}, RFECV: {scores_rfecv.mean():.3f}")
        
        return rfe, rfecv
    
    def forward_selection(self, X, y, k=10):
        """Forward feature selection"""
        from sklearn.feature_selection import SequentialFeatureSelector
        
        estimator = LogisticRegression(max_iter=1000, random_state=42)
        
        # Forward selection
        sfs_forward = SequentialFeatureSelector(
            estimator, n_features_to_select=k,
            direction='forward', cv=5
        )
        sfs_forward.fit(X, y)
        
        # Track performance for different feature counts
        scores = []
        selected_features = []
        
        for n in range(1, min(k+5, X.shape[1]+1)):
            sfs_temp = SequentialFeatureSelector(
                estimator, n_features_to_select=n,
                direction='forward', cv=3  # Fewer CV folds for speed
            )
            sfs_temp.fit(X, y)
            X_temp = X[:, sfs_temp.support_]
            
            # Evaluate
            score = cross_val_score(estimator, X_temp, y, cv=3).mean()
            scores.append(score)
            selected_features.append(np.where(sfs_temp.support_)[0])
        
        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(12, 5))
        
        # Performance curve
        axes[0].plot(range(1, len(scores)+1), scores, marker='o', linewidth=2)
        axes[0].axvline(x=k, color='r', linestyle='--', label=f'Selected: {k}')
        axes[0].set_xlabel('Number of Features')
        axes[0].set_ylabel('Cross-validation Score')
        axes[0].set_title('Forward Selection Performance')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Feature selection pattern
        feature_matrix = np.zeros((len(selected_features), X.shape[1]))
        for i, features in enumerate(selected_features):
            feature_matrix[i, features] = 1
        
        im = axes[1].imshow(feature_matrix.T, aspect='auto', cmap='Greens')
        axes[1].set_xlabel('Selection Step')
        axes[1].set_ylabel('Feature Index')
        axes[1].set_title('Feature Selection Pattern')
        plt.colorbar(im, ax=axes[1])
        
        plt.suptitle('Forward Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nForward Selection Results:")
        print(f"  Selected features: {np.where(sfs_forward.support_)[0]}")
        print(f"  Best score: {max(scores):.3f} with {scores.index(max(scores))+1} features")
        
        return sfs_forward, scores
    
    def backward_elimination(self, X, y, k=10):
        """Backward feature elimination"""
        from sklearn.feature_selection import SequentialFeatureSelector
        
        estimator = LogisticRegression(max_iter=1000, random_state=42)
        
        # Backward elimination
        sfs_backward = SequentialFeatureSelector(
            estimator, n_features_to_select=k,
            direction='backward', cv=5
        )
        sfs_backward.fit(X, y)
        
        # Visualization
        selected = np.where(sfs_backward.support_)[0]
        not_selected = np.where(~sfs_backward.support_)[0]
        
        fig, ax = plt.subplots(figsize=(10, 6))
        
        # Feature importance proxy (using simple correlation)
        correlations = [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]
        
        ax.bar(selected, [correlations[i] for i in selected],
              color='green', alpha=0.7, label='Selected')
        ax.bar(not_selected, [correlations[i] for i in not_selected],
              color='red', alpha=0.7, label='Eliminated')
        
        ax.set_xlabel('Feature Index')
        ax.set_ylabel('Correlation with Target')
        ax.set_title(f'Backward Elimination Results (k={k})')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\nBackward Elimination Results:")
        print(f"  Selected features: {selected}")
        print(f"  Eliminated features: {not_selected}")
        
        return sfs_backward

# Wrapper methods
wrapper_selector = WrapperMethods()

print("\n" + "="*60)
print("WRAPPER METHODS DEMONSTRATION")
print("="*60)

print("\n1. Recursive Feature Elimination:")
rfe_model, rfecv_model = wrapper_selector.recursive_feature_elimination(X_scaled, y)

print("\n2. Forward Selection:")
forward_model, forward_scores = wrapper_selector.forward_selection(X_scaled, y)

print("\n3. Backward Elimination:")
backward_model = wrapper_selector.backward_elimination(X_scaled, y)
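
The performance comparisons above select features on the full dataset and then cross-validate on the same rows, which tends to give optimistic scores. One common remedy, sketched here under the assumption that RFE with logistic regression is the selector of interest, is to place the selector inside a `Pipeline` so that selection is refit within each training fold. The `*_demo` names and step labels are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=10, n_redundant=5,
                                     n_clusters_per_class=2, random_state=42)

# Scaling, selection, and the classifier are all fit on training folds only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"Leakage-free CV accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")

# Refit on all data to inspect which features the final model would use
pipeline.fit(X_demo, y_demo)
final_idx = np.where(pipeline.named_steps["select"].support_)[0]
print(f"Features selected on the full data: {final_idx}")
```

The selected subset may differ from fold to fold; the cross-validated score then describes the whole procedure rather than one fixed feature set.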

Embedded Methods

class EmbeddedMethods:
    """Implementation of embedded feature selection methods"""
    
    def __init__(self):
        self.models = {}
        self.selected_features = {}
        
    def lasso_selection(self, X, y):
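        """L1 regularization for feature selection.
        
        Note: Lasso is a regressor, so with class labels as y it fits a
        linear regression to the numeric label values.
        """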
        from sklearn.linear_model import LassoCV
        
        # Find optimal alpha using cross-validation
        alphas = np.logspace(-4, 1, 50)
        lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42, max_iter=10000)
        lasso_cv.fit(X, y)
        
        # Get coefficients
        coef = lasso_cv.coef_
        selected = np.where(coef != 0)[0]
        
        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Coefficient path
        from sklearn.linear_model import lasso_path
        alphas_path, coefs_path, _ = lasso_path(X, y, alphas=alphas)
        
        for i in range(coefs_path.shape[0]):
            axes[0, 0].plot(alphas_path, coefs_path[i, :], alpha=0.5)
        axes[0, 0].axvline(x=lasso_cv.alpha_, color='r', linestyle='--',
                          label=f'Optimal α={lasso_cv.alpha_:.4f}')
        axes[0, 0].set_xscale('log')
        axes[0, 0].set_xlabel('Alpha (Regularization Strength)')
        axes[0, 0].set_ylabel('Coefficient Value')
        axes[0, 0].set_title('LASSO Coefficient Path')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Final coefficients
        axes[0, 1].bar(range(len(coef)), coef,
                      color=['green' if c != 0 else 'gray' for c in coef])
        axes[0, 1].set_xlabel('Feature Index')
        axes[0, 1].set_ylabel('Coefficient')
        axes[0, 1].set_title(f'Final Coefficients ({len(selected)} selected)')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Number of features vs alpha
        n_features = []
        for alpha in alphas:
            lasso_temp = Lasso(alpha=alpha, max_iter=1000)
            lasso_temp.fit(X, y)
            n_features.append((lasso_temp.coef_ != 0).sum())
        
        axes[1, 0].plot(alphas, n_features, marker='o')
        axes[1, 0].axvline(x=lasso_cv.alpha_, color='r', linestyle='--')
        axes[1, 0].set_xscale('log')
        axes[1, 0].set_xlabel('Alpha')
        axes[1, 0].set_ylabel('Number of Selected Features')
        axes[1, 0].set_title('Feature Count vs Regularization')
        axes[1, 0].grid(True, alpha=0.3)
        
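        # Cross-validation scores (mse_path_ rows follow lasso_cv.alphas_,
        # which is sorted in decreasing order, not the input `alphas`)
        axes[1, 1].errorbar(lasso_cv.alphas_, -lasso_cv.mse_path_.mean(axis=1),
                          yerr=lasso_cv.mse_path_.std(axis=1), 
                          marker='o', markersize=3)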
        axes[1, 1].axvline(x=lasso_cv.alpha_, color='r', linestyle='--')
        axes[1, 1].set_xscale('log')
        axes[1, 1].set_xlabel('Alpha')
        axes[1, 1].set_ylabel('Negative MSE')
        axes[1, 1].set_title('Cross-validation Scores')
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.suptitle('LASSO Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nLASSO Selection Results:")
        print(f"  Optimal alpha: {lasso_cv.alpha_:.4f}")
        print(f"  Features selected: {len(selected)}/{len(coef)}")
        print(f"  Selected indices: {selected}")
        
        return lasso_cv, selected
    
    def tree_based_selection(self, X, y):
        """Feature importance from tree-based models"""
        from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
        from sklearn.feature_selection import SelectFromModel
        
        # Train Random Forest
        rf = RandomForestClassifier(n_estimators=100, random_state=42)
        rf.fit(X, y)
        
        # Train Extra Trees
        et = ExtraTreesClassifier(n_estimators=100, random_state=42)
        et.fit(X, y)
        
        # Get feature importances
        rf_importance = rf.feature_importances_
        et_importance = et.feature_importances_
        
        # Select features using threshold
        selector_rf = SelectFromModel(rf, prefit=True)
        X_selected_rf = selector_rf.transform(X)
        
        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Random Forest importance
        sorted_idx = np.argsort(rf_importance)[::-1]
        axes[0, 0].bar(range(len(rf_importance)), rf_importance[sorted_idx],
                      color='forestgreen', alpha=0.7)
        axes[0, 0].set_xlabel('Feature Rank')
        axes[0, 0].set_ylabel('Importance')
        axes[0, 0].set_title('Random Forest Feature Importance')
        axes[0, 0].grid(True, alpha=0.3)
        
        # Extra Trees importance
        sorted_idx_et = np.argsort(et_importance)[::-1]
        axes[0, 1].bar(range(len(et_importance)), et_importance[sorted_idx_et],
                      color='darkblue', alpha=0.7)
        axes[0, 1].set_xlabel('Feature Rank')
        axes[0, 1].set_ylabel('Importance')
        axes[0, 1].set_title('Extra Trees Feature Importance')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Compare RF vs ET importance
        axes[1, 0].scatter(rf_importance, et_importance, alpha=0.6)
        axes[1, 0].plot([0, max(rf_importance)], [0, max(rf_importance)],
                       'r--', alpha=0.5)
        axes[1, 0].set_xlabel('Random Forest Importance')
        axes[1, 0].set_ylabel('Extra Trees Importance')
        axes[1, 0].set_title('RF vs ET Importance Comparison')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Cumulative importance
        cumsum_rf = np.cumsum(rf_importance[sorted_idx])
        cumsum_et = np.cumsum(et_importance[sorted_idx_et])
        
        axes[1, 1].plot(range(1, len(cumsum_rf)+1), cumsum_rf,
                       label='Random Forest', linewidth=2)
        axes[1, 1].plot(range(1, len(cumsum_et)+1), cumsum_et,
                       label='Extra Trees', linewidth=2)
        axes[1, 1].axhline(y=0.95, color='r', linestyle='--',
                          label='95% threshold')
        axes[1, 1].set_xlabel('Number of Features')
        axes[1, 1].set_ylabel('Cumulative Importance')
        axes[1, 1].set_title('Cumulative Feature Importance')
        axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.suptitle('Tree-based Feature Selection', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        # Find number of features for 95% importance
        n_features_95_rf = np.argmax(cumsum_rf >= 0.95) + 1
        n_features_95_et = np.argmax(cumsum_et >= 0.95) + 1
        
        print(f"\nTree-based Selection Results:")
        print(f"  RF - Features for 95% importance: {n_features_95_rf}")
        print(f"  ET - Features for 95% importance: {n_features_95_et}")
        print(f"  SelectFromModel selected: {X_selected_rf.shape[1]} features")
        
        return rf, et, rf_importance, et_importance
    
    def elastic_net_selection(self, X, y):
        """Elastic Net (L1 + L2) regularization"""
        from sklearn.linear_model import ElasticNetCV
        
        # Find optimal parameters
        l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99]
        elastic_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5, 
                                 random_state=42, max_iter=10000)
        elastic_cv.fit(X, y)
        
        # Get coefficients
        coef = elastic_cv.coef_
        selected = np.where(coef != 0)[0]
        
        # Compare with LASSO and Ridge
        from sklearn.linear_model import Ridge, Lasso
        
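        # Ridge's alpha is on a different scale (its squared-error term is not
        # divided by n_samples), so treat this as a rough visual comparison
        ridge = Ridge(alpha=elastic_cv.alpha_)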
        ridge.fit(X, y)
        
        lasso = Lasso(alpha=elastic_cv.alpha_, max_iter=10000)
        lasso.fit(X, y)
        
        # Visualization
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # LASSO coefficients
        axes[0].bar(range(len(lasso.coef_)), lasso.coef_,
                   color='coral', alpha=0.7)
        axes[0].set_title(f'LASSO ({(lasso.coef_ != 0).sum()} features)')
        axes[0].set_xlabel('Feature Index')
        axes[0].set_ylabel('Coefficient')
        axes[0].grid(True, alpha=0.3)
        
        # Ridge coefficients
        axes[1].bar(range(len(ridge.coef_)), ridge.coef_,
                   color='steelblue', alpha=0.7)
        axes[1].set_title(f'Ridge (all {len(ridge.coef_)} features)')
        axes[1].set_xlabel('Feature Index')
        axes[1].set_ylabel('Coefficient')
        axes[1].grid(True, alpha=0.3)
        
        # Elastic Net coefficients
        axes[2].bar(range(len(coef)), coef,
                   color='green', alpha=0.7)
        axes[2].set_title(f'Elastic Net ({len(selected)} features, '
                         f'L1 ratio={elastic_cv.l1_ratio_:.2f})')
        axes[2].set_xlabel('Feature Index')
        axes[2].set_ylabel('Coefficient')
        axes[2].grid(True, alpha=0.3)
        
        plt.suptitle('Elastic Net vs LASSO vs Ridge', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        print(f"\nElastic Net Selection Results:")
        print(f"  Optimal L1 ratio: {elastic_cv.l1_ratio_:.2f}")
        print(f"  Optimal alpha: {elastic_cv.alpha_:.4f}")
        print(f"  Features selected: {len(selected)}")
        print(f"  LASSO selected: {(lasso.coef_ != 0).sum()}")
        print(f"  Ridge keeps all: {len(ridge.coef_)}")
        
        return elastic_cv, selected

# Embedded methods
embedded_selector = EmbeddedMethods()

print("\n" + "="*60)
print("EMBEDDED METHODS DEMONSTRATION")
print("="*60)

print("\n1. LASSO Selection:")
lasso_model, lasso_selected = embedded_selector.lasso_selection(X_scaled, y)

print("\n2. Tree-based Selection:")
rf_model, et_model, rf_imp, et_imp = embedded_selector.tree_based_selection(X_scaled, y)

print("\n3. Elastic Net Selection:")
elastic_model, elastic_selected = embedded_selector.elastic_net_selection(X_scaled, y)
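
Because the target here is a class label, an L1-penalized logistic regression is a natural classification counterpart to the Lasso demonstration above. The sketch below is one way to set it up; the value `C=0.05`, the `liblinear` solver, and the explicit `threshold=1e-5` are assumptions chosen for illustration rather than tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=10, n_redundant=5,
                                     n_clusters_per_class=2, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# Smaller C means stronger L1 regularization and more coefficients at zero
l1_logreg = LogisticRegression(penalty="l1", solver="liblinear",
                               C=0.05, random_state=42)

# Keep features whose absolute coefficient is at least the threshold
l1_selector = SelectFromModel(l1_logreg, threshold=1e-5).fit(X_demo, y_demo)
l1_idx = np.where(l1_selector.get_support())[0]
X_l1 = l1_selector.transform(X_demo)

print(f"L1 logistic regression kept {len(l1_idx)}/{X_demo.shape[1]} features")
print(f"Selected indices: {l1_idx}")
```

As with Lasso, the number of surviving features depends on the regularization strength, so `C` would normally be chosen by cross-validation.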

Comparison and Best Practices

class FeatureSelectionComparison:
    """Compare different feature selection methods"""
    
    def __init__(self):
        self.results = {}
        
    def comprehensive_comparison(self, X, y):
        """Compare all feature selection methods"""
        from sklearn.feature_selection import (
            VarianceThreshold, SelectKBest, f_classif, mutual_info_classif,
            RFE, SelectFromModel
        )
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression, Lasso
        
        methods = {
            'Variance Threshold': VarianceThreshold(threshold=0.01),
            'SelectKBest (f_classif)': SelectKBest(f_classif, k=10),
            'SelectKBest (MI)': SelectKBest(mutual_info_classif, k=10),
            'RFE': RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
            'LASSO': SelectFromModel(Lasso(alpha=0.01, max_iter=1000)),
            # Fixed seed so the selected set is reproducible between runs
            'Random Forest': SelectFromModel(
                RandomForestClassifier(n_estimators=100, random_state=42))
        }
        
        # Evaluate each method
        results = {}
        
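        # Local imports keep this method self-contained
        from sklearn.base import clone
        from sklearn.pipeline import make_pipeline

        for name, selector in methods.items():
            # Cross-validate selector + classifier together so that selection
            # is refit on each training fold (no leakage into the test fold)
            pipeline = make_pipeline(clone(selector),
                                     LogisticRegression(max_iter=1000))
            scores = cross_val_score(pipeline, X, y, cv=5)

            # Fit once on all data to report which features are chosen
            selector.fit(X, y)
            mask = selector.get_support()

            results[name] = {
                'n_features': int(mask.sum()),
                'mean_score': scores.mean(),
                'std_score': scores.std(),
                'selected_mask': mask
            }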
        
        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Performance comparison
        names = list(results.keys())
        scores = [results[n]['mean_score'] for n in names]
        stds = [results[n]['std_score'] for n in names]
        
        axes[0, 0].bar(range(len(names)), scores, yerr=stds,
                      capsize=5, color='steelblue', alpha=0.7)
        axes[0, 0].set_xticks(range(len(names)))
        axes[0, 0].set_xticklabels(names, rotation=45, ha='right')
        axes[0, 0].set_ylabel('Accuracy')
        axes[0, 0].set_title('Performance Comparison')
        axes[0, 0].grid(True, alpha=0.3, axis='y')
        
        # Number of features selected
        n_features = [results[n]['n_features'] for n in names]
        axes[0, 1].bar(range(len(names)), n_features,
                      color='coral', alpha=0.7)
        axes[0, 1].set_xticks(range(len(names)))
        axes[0, 1].set_xticklabels(names, rotation=45, ha='right')
        axes[0, 1].set_ylabel('Number of Features')
        axes[0, 1].set_title('Features Selected')
        axes[0, 1].grid(True, alpha=0.3, axis='y')
        
        # Efficiency plot (score vs n_features)
        axes[1, 0].scatter(n_features, scores, s=100, alpha=0.6)
        for i, name in enumerate(names):
            axes[1, 0].annotate(name, (n_features[i], scores[i]),
                              fontsize=8, ha='center')
        axes[1, 0].set_xlabel('Number of Features')
        axes[1, 0].set_ylabel('Accuracy')
        axes[1, 0].set_title('Efficiency: Accuracy vs Feature Count')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Feature overlap heatmap
        n_methods = len(methods)
        overlap_matrix = np.zeros((n_methods, n_methods))
        
        for i, name1 in enumerate(names):
            for j, name2 in enumerate(names):
                mask1 = results[name1]['selected_mask']
                mask2 = results[name2]['selected_mask']
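                # Shared features relative to the larger selection;
                # the trailing 1 avoids division by zero for empty selections
                overlap = (mask1 & mask2).sum() / max(mask1.sum(), mask2.sum(), 1)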
                overlap_matrix[i, j] = overlap
        
        im = axes[1, 1].imshow(overlap_matrix, cmap='coolwarm', vmin=0, vmax=1)
        axes[1, 1].set_xticks(range(n_methods))
        axes[1, 1].set_yticks(range(n_methods))
        axes[1, 1].set_xticklabels(names, rotation=45, ha='right')
        axes[1, 1].set_yticklabels(names)
        axes[1, 1].set_title('Feature Selection Overlap')
        plt.colorbar(im, ax=axes[1, 1])
        
        # Add values to heatmap
        for i in range(n_methods):
            for j in range(n_methods):
                axes[1, 1].text(j, i, f'{overlap_matrix[i, j]:.2f}',
                              ha='center', va='center', fontsize=8)
        
        plt.suptitle('Comprehensive Feature Selection Comparison', 
                    fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
        
        # Print summary
        print("\nMethod Comparison Summary:")
        print("-" * 60)
        for name in names:
            print(f"{name:25} Features: {results[name]['n_features']:3} "
                  f"Score: {results[name]['mean_score']:.3f} "
                  f"(±{results[name]['std_score']:.3f})")
        
        return results

# Comparison
comparator = FeatureSelectionComparison()

print("\n" + "="*60)
print("COMPREHENSIVE COMPARISON")
print("="*60)

comparison_results = comparator.comprehensive_comparison(X_scaled, y)
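
The comparison above reports one selected set per method. A complementary check is how stable that set is when the training data changes. The sketch below is one possible way to measure this, assuming the breast cancer dataset and a univariate filter; the helper name `selection_frequency` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold


def selection_frequency(X, y, k=10, n_splits=5, seed=0):
    """Fraction of CV training folds in which each feature is selected."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    counts = np.zeros(X.shape[1])
    for train_idx, _ in cv.split(X, y):
        # Fit the selector on the training fold only
        selector = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
        counts += selector.get_support()
    return counts / n_splits


data = load_breast_cancer()
freq = selection_frequency(data.data, data.target, k=10)

# Features picked in every fold form the stable core of the selection
stable = [name for name, f in zip(data.feature_names, freq) if f == 1.0]
print(f"Selected in all folds: {len(stable)} of {len(freq)} features")
```

Features with a frequency well below 1.0 are chosen only for some resamples, which is a hint that their ranking depends on the particular split rather than on a robust signal.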

Best Practices and Guidelines

print("\n" + "="*60)
print("FEATURE SELECTION BEST PRACTICES")
print("="*60)

best_practices = """
KEY GUIDELINES:

1. METHOD SELECTION:
   
   FILTER METHODS - Use when:
   • Need fast, scalable solution
   • Working with very high dimensions
   • Want method-agnostic selection
   • Initial feature screening
   
   WRAPPER METHODS - Use when:
   • Have specific model in mind
   • Dataset is not too large
   • Need a feature subset tuned to that specific model
   • Accuracy is more important than speed
   
   EMBEDDED METHODS - Use when:
   • Want to combine selection with training
   • Need regularization anyway
   • Working with linear models or trees
   • Want interpretable feature importance

2. PREPROCESSING:
   • Handle missing values first
   • Scale/normalize for distance-based and regularized (LASSO, Elastic Net) methods
   • Consider transformations (log, polynomial)
   • Remove zero/low variance features

3. VALIDATION:
   • Run selection inside each cross-validation fold (e.g. in a Pipeline)
   • Test on held-out data
   • Check stability across folds
   • Monitor for overfitting

4. PRACTICAL TIPS:
   • Start with filter methods for initial screening
   • Combine multiple methods (ensemble)
   • Consider domain knowledge
   • Validate business/scientific meaning
   • Document selected features

5. COMMON PITFALLS:
   ✗ Selecting features on entire dataset before split
   ✗ Ignoring feature interactions
   ✗ Using too few features (underfitting)
   ✗ Not considering computational cost
   ✗ Ignoring multicollinearity
"""

print(best_practices)

# Method selection guide
selection_guide = """
FEATURE SELECTION METHOD GUIDE:

Samples (rows)  | Recommended Methods
----------------|--------------------
< 1,000         | RFE, Forward/Backward, Embedded
1,000-10,000    | Filter + Wrapper, Embedded
10,000-100,000  | Filter, Embedded (LASSO, Trees)
> 100,000       | Filter, Tree-based importance

Feature Count   | Recommended Approach
----------------|--------------------
< 20            | Try all, manual inspection
20-100          | RFE, Embedded methods
100-1,000       | Filter first, then refine
> 1,000         | Filter, Tree importance

Problem Type    | Best Methods
----------------|--------------------
Classification  | Chi-square, MI, Tree importance
Regression      | Correlation, LASSO, Elastic Net
Time Series     | Lagged correlation, ACF/PACF
NLP             | TF-IDF, Chi-square, MI
Computer Vision | PCA, CNN feature maps (extraction, not selection)
"""

print(selection_guide)

# Performance vs interpretability trade-off
tradeoff = """
PERFORMANCE VS INTERPRETABILITY:

High Performance, Low Interpretability:
• Deep learning features
• Non-linear transformations
• Large ensembles

Balanced:
• Tree-based selection
• LASSO/Elastic Net
• RFE with simple models

High Interpretability, May Sacrifice Performance:
• Simple correlation
• Univariate tests
• Manual selection
• Domain-driven selection
"""

print(tradeoff)
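
The first pitfall listed above, selecting features on the entire dataset before splitting, is easiest to appreciate on data with no signal at all. The following is a minimal sketch under assumed settings (100 samples, 2,000 pure-noise features, random labels); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise features
y_noise = rng.integers(0, 2, size=100)   # labels unrelated to X_noise

# Leaky: the selector sees every label before cross-validation starts
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X_leaky, y_noise, cv=5).mean()

# Correct: selection is refit inside each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV: {leaky:.2f}")
print(f"Selection inside CV: {honest:.2f}")
```

Because the labels are random, any accuracy clearly above chance in the first number comes from the selector having already seen the test folds. Wrapping the selector in a pipeline removes that advantage.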

Practice Exercises

Exercise 1: Custom Feature Selection Pipeline

Build a comprehensive feature selection pipeline:

  1. Implement multi-stage selection (filter → wrapper → embedded)
  2. Handle mixed data types (numerical, categorical, text)
  3. Cross-validate at each stage
  4. Track feature stability across folds
  5. Generate automated feature selection report
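
As a possible starting point for step 1, the sketch below chains one filter, one wrapper, and one embedded selector in a single scikit-learn `Pipeline`, so that every stage is refit inside each cross-validation fold. The dataset, the stage sizes (20 and 10 features), and `C=0.5` are assumptions chosen for illustration; mixed data types, stability tracking, and reporting are left to the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

multi_stage = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1 (filter): cheap univariate screen, 30 -> 20 features
    ("filter", SelectKBest(f_classif, k=20)),
    # Stage 2 (wrapper): recursive elimination, 20 -> 10 features
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    # Stage 3 (embedded): L1 penalty drops whatever it shrinks to zero
    ("embedded", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(multi_stage, X_bc, y_bc, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Refit on all data to see how many features survive each stage
multi_stage.fit(X_bc, y_bc)
kept = {stage: int(multi_stage.named_steps[stage].get_support().sum())
        for stage in ("filter", "wrapper", "embedded")}
for stage, n in kept.items():
    print(f"{stage:9} keeps {n} features")
```

Ordering the stages from cheapest to most expensive means the wrapper only has to search the features that survived the filter.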

Exercise 2: Feature Interaction Detection

Develop methods to find feature interactions:

  1. Implement interaction detection algorithms
  2. Visualize pairwise interactions
  3. Select features considering interactions
  4. Compare with individual feature selection
  5. Apply to real-world dataset with known interactions
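
One simple way to begin steps 1 and 4 is to compare the mutual information of each feature on its own with that of pairwise product terms. The sketch below uses synthetic data in which the label depends only on the sign of `x0 * x1`; this setup and the "gain" score are assumptions for illustration, and a product term only captures multiplicative interactions.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X_int = rng.normal(size=(1000, 4))
# Label depends only on the sign of x0 * x1: neither feature is informative alone
y_int = (X_int[:, 0] * X_int[:, 1] > 0).astype(int)

single_mi = mutual_info_classif(X_int, y_int, random_state=0)

pair_gain = {}
for i, j in combinations(range(X_int.shape[1]), 2):
    product = (X_int[:, i] * X_int[:, j]).reshape(-1, 1)
    pair_mi = mutual_info_classif(product, y_int, random_state=0)[0]
    # Gain: what the product term adds beyond the better of its two parents
    pair_gain[(i, j)] = pair_mi - max(single_mi[i], single_mi[j])

best_pair = max(pair_gain, key=pair_gain.get)
print("Univariate MI:", np.round(single_mi, 3))
print("Strongest interaction:", best_pair)
```

A univariate filter would discard both `x0` and `x1` here, which is the "ignoring feature interactions" pitfall in concrete form.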

Exercise 3: Real-time Feature Selection

Create an adaptive feature selection system:

  1. Implement online feature selection
  2. Handle concept drift
  3. Update feature importance dynamically
  4. Monitor feature stability over time
  5. Build dashboard for feature monitoring
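
For steps 2 to 4, one lightweight option is to keep an exponentially weighted running score per feature and update it with every incoming batch. The sketch below assumes absolute Pearson correlation as the score and simulates drift by switching the driving feature halfway through the stream; the class `RunningImportance` and its `decay` parameter are hypothetical names, not an existing library API.

```python
import numpy as np


class RunningImportance:
    """Exponentially weighted |correlation| between each feature and the target."""

    def __init__(self, n_features, decay=0.8):
        self.decay = decay                      # weight given to past batches
        self.importance = np.zeros(n_features)

    def update(self, X_batch, y_batch):
        Xc = X_batch - X_batch.mean(axis=0)
        yc = y_batch - y_batch.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        # Constant columns get a score of 0 instead of a division error
        corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
        self.importance = self.decay * self.importance + (1 - self.decay) * corr
        return self.importance

    def top_k(self, k):
        return set(np.argsort(self.importance)[-k:].tolist())


rng = np.random.default_rng(0)
tracker = RunningImportance(n_features=5)
history = []
for step in range(40):
    X_batch = rng.normal(size=(200, 5))
    driver = 0 if step < 20 else 3              # simulated concept drift at step 20
    y_batch = X_batch[:, driver] + 0.5 * rng.normal(size=200)
    tracker.update(X_batch, y_batch)
    history.append(tracker.top_k(1))

print("Top feature before drift:", history[19])
print("Top feature after drift: ", history[39])
```

A smaller `decay` reacts to drift faster but makes the ranking noisier, so the value is a trade-off to tune against the stability monitoring in step 4.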

Summary and Key Takeaways

🎯 Key Points to Remember