🚀 Transfer Learning: Leveraging Pre-trained Models

Introduction

Transfer learning is a powerful technique that leverages knowledge from pre-trained models to solve new, related tasks with limited data and computational resources. By transferring learned features from models trained on massive datasets like ImageNet or large text corpora, we can often achieve strong results on specialized tasks with a fraction of the data and training time. This lesson covers transfer learning strategies, fine-tuning techniques, domain adaptation, cross-modal transfer, and practical applications across computer vision, NLP, and beyond.

Transfer Learning Fundamentals

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.applications import (
    VGG16, VGG19, ResNet50, ResNet101, ResNet152,
    InceptionV3, InceptionResNetV2, MobileNetV2,
    DenseNet121, NASNetMobile, EfficientNetB0
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import cv2
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")

print("\n" + "="*60)
print("TRANSFER LEARNING FUNDAMENTALS")
print("="*60)

# Core concepts
transfer_concepts = """
TRANSFER LEARNING KEY CONCEPTS:

1. WHAT IS TRANSFER LEARNING:
   • Reuse learned features from one task for another
   • Leverage pre-trained models
   • Fine-tune for specific domain
   • Reduce training time and data requirements

2. TYPES OF TRANSFER:
   • Feature Extraction: Use as fixed feature extractor
   • Fine-tuning: Unfreeze and retrain some layers
   • Domain Adaptation: Transfer across domains
   • Multi-task Learning: Learn multiple tasks jointly
   • Zero-shot Learning: Recognize unseen classes

3. PRE-TRAINING SOURCES:
   • ImageNet (ILSVRC-2012): ~1.28M training images, 1000 classes (vision)
   • COCO: Object detection and segmentation
   • OpenImages: 9M images with annotations
   • BERT/GPT: Large text corpora (NLP)
   • AudioSet: Audio event detection

4. WHEN TO USE:
   • Limited training data (< 10K samples)
   • Similar source and target domains
   • Limited computational resources
   • Need quick prototyping
   • Baseline model development

5. STRATEGIES:
   • Freeze all layers except classifier
   • Progressive unfreezing
   • Discriminative learning rates
   • Layer-wise adaptation
   • Knowledge distillation

6. BENEFITS:
   • Faster training
   • Better performance with less data
   • Lower computational cost
   • Regularization effect
   • Access to SOTA features

7. CHALLENGES:
   • Domain shift
   • Negative transfer
   • Catastrophic forgetting
   • Architecture constraints
   • Task mismatch
"""

print(transfer_concepts)
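
The "discriminative learning rates" strategy listed above gives layers near the input a smaller learning rate than the classifier head, on the idea that early layers hold generic features that should change slowly. The sketch below only computes such a schedule; the function name, the group names, and the decay factor of 0.5 are illustrative assumptions, not values prescribed by this lesson.

```python
def layerwise_learning_rates(num_groups, head_lr=1e-3, decay=0.5):
    """Return one learning rate per layer group, ordered from input to head.

    The head (last group) gets `head_lr`; each group closer to the input
    gets the rate of the group above it multiplied by `decay`.
    """
    return [head_lr * decay ** (num_groups - 1 - i) for i in range(num_groups)]


# Hypothetical grouping of a CNN into four blocks (assumption for illustration)
group_names = ["early conv blocks", "middle conv blocks",
               "late conv blocks", "classifier head"]
rates = layerwise_learning_rates(len(group_names))

for name, lr in zip(group_names, rates):
    print(f"{name:<20s} lr = {lr:.2e}")
```

A Keras optimizer takes a single learning rate, so applying a schedule like this typically requires one optimizer per layer group or a custom training loop.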

Computer Vision Transfer Learning

class VisionTransferLearning:
    """Transfer learning for computer vision tasks"""
    
    def __init__(self):
        self.models = {}
        self.histories = {}
        
    def load_pretrained_models(self):
        """Load and compare popular pre-trained models"""
        
        models_config = {
            'VGG16': {'model': VGG16, 'size': (224, 224), 'preprocess': 'vgg'},
            'ResNet50': {'model': ResNet50, 'size': (224, 224), 'preprocess': 'resnet'},
            'InceptionV3': {'model': InceptionV3, 'size': (299, 299), 'preprocess': 'inception'},
            'MobileNetV2': {'model': MobileNetV2, 'size': (224, 224), 'preprocess': 'mobilenet'},
            'EfficientNetB0': {'model': EfficientNetB0, 'size': (224, 224), 'preprocess': 'efficientnet'}
        }
        
        model_stats = []
        
        for name, config in models_config.items():
            # Load model without top layers
            base_model = config['model'](
                weights='imagenet',
                include_top=False,
                input_shape=config['size'] + (3,)
            )
            
            # Get statistics
            total_params = base_model.count_params()
            n_layers = len(base_model.layers)
            
            model_stats.append({
                'Model': name,
                'Parameters': f"{total_params/1e6:.1f}M",
                'Layers': n_layers,
                'Input Size': f"{config['size'][0]}x{config['size'][1]}",
                'Top-1 Acc': self.get_imagenet_accuracy(name)
            })
            
            self.models[name] = base_model
        
        # Create comparison table
        stats_df = pd.DataFrame(model_stats)
        
        print("\nPre-trained Models Comparison:")
        print("-" * 60)
        print(stats_df.to_string(index=False))
        
        return stats_df
    
    def get_imagenet_accuracy(self, model_name):
        """Get reported ImageNet accuracy"""
        accuracies = {
            'VGG16': 71.3,
            'ResNet50': 74.9,
            'InceptionV3': 77.9,
            'MobileNetV2': 71.3,
            'EfficientNetB0': 77.1
        }
        return accuracies.get(model_name, 0.0)
    
    def create_transfer_model(self, base_model_name='ResNet50', 
                            num_classes=10, 
                            trainable_layers=0,
                            dropout_rate=0.5):
        """Create transfer learning model with different strategies"""
        
        # Get base model
        if base_model_name == 'ResNet50':
            base_model = ResNet50(weights='imagenet', include_top=False,
                                 input_shape=(224, 224, 3))
        elif base_model_name == 'VGG16':
            base_model = VGG16(weights='imagenet', include_top=False,
                              input_shape=(224, 224, 3))
        elif base_model_name == 'MobileNetV2':
            base_model = MobileNetV2(weights='imagenet', include_top=False,
                                    input_shape=(224, 224, 3))
        else:
            raise ValueError(f"Unknown model: {base_model_name}")
        
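        # Freeze the whole base model by default (feature extraction)
        base_model.trainable = False
        
        # Unfreeze the top layers if requested. The base model itself must be
        # marked trainable first: a non-trainable parent exposes no trainable
        # weights, regardless of the flags set on its individual layers
        if trainable_layers > 0:
            base_model.trainable = True
            for layer in base_model.layers[:-trainable_layers]:
                layer.trainable = False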
        
        # Build complete model
        inputs = keras.Input(shape=(224, 224, 3))
        
        # Data augmentation
        x = layers.RandomFlip('horizontal')(inputs)
        x = layers.RandomRotation(0.1)(x)
        x = layers.RandomZoom(0.1)(x)
        
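        # Base model. training=False keeps BatchNormalization layers in inference
        # mode even after unfreezing. Inputs are assumed to be preprocessed already
        # with the matching keras.applications preprocess_input function
        x = base_model(x, training=False)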
        
        # Pooling
        x = layers.GlobalAveragePooling2D()(x)
        
        # Classification head
        x = layers.Dense(256, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
        
        x = layers.Dense(128, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
        
        outputs = layers.Dense(num_classes, activation='softmax')(x)
        
        model = keras.Model(inputs, outputs)
        
        return model, base_model
    
    def progressive_unfreezing(self, model, base_model, X_train, y_train, 
                             X_val, y_val, epochs_per_stage=5):
        """Implement progressive unfreezing strategy"""
        
        print("\nProgressive Unfreezing Strategy:")
        print("-" * 40)
        
        histories = []
        
        # Stage 1: Train only the classifier head
        print("\nStage 1: Training classifier head only")
        base_model.trainable = False
        
        model.compile(
            optimizer=keras.optimizers.Adam(1e-3),
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        
        history1 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history1)
        
        val_acc = history1.history['val_accuracy'][-1]
        print(f"  Validation accuracy: {val_acc:.3f}")
        
        # Stage 2: Unfreeze top layers
        print("\nStage 2: Fine-tuning top 20 layers")
        base_model.trainable = True
        for layer in base_model.layers[:-20]:
            layer.trainable = False
        
        model.compile(
            optimizer=keras.optimizers.Adam(1e-4),  # Lower learning rate
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        
        history2 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history2)
        
        val_acc = history2.history['val_accuracy'][-1]
        print(f"  Validation accuracy: {val_acc:.3f}")
        
        # Stage 3: Unfreeze all layers
        print("\nStage 3: Fine-tuning entire network")
        base_model.trainable = True
        
        model.compile(
            optimizer=keras.optimizers.Adam(1e-5),  # Very low learning rate
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        
        history3 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history3)
        
        val_acc = history3.history['val_accuracy'][-1]
        print(f"  Validation accuracy: {val_acc:.3f}")
        
        return histories
    
    def visualize_transfer_strategies(self):
        """Visualize different transfer learning strategies"""
        
        strategies = {
            'Feature Extraction': {
                'frozen_layers': 'All base layers',
                'trainable': 'Only classifier',
                'learning_rate': 'High (1e-3)',
                'use_case': 'Very small dataset'
            },
            'Partial Fine-tuning': {
                'frozen_layers': 'Early layers',
                'trainable': 'Top layers + classifier',
                'learning_rate': 'Medium (1e-4)',
                'use_case': 'Medium dataset'
            },
            'Full Fine-tuning': {
                'frozen_layers': 'None',
                'trainable': 'Entire network',
                'learning_rate': 'Low (1e-5)',
                'use_case': 'Large dataset'
            },
            'Progressive Unfreezing': {
                'frozen_layers': 'Gradual unfreezing',
                'trainable': 'Stage-wise',
                'learning_rate': 'Decreasing',
                'use_case': 'Optimal approach'
            }
        }
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()
        
        for idx, (name, details) in enumerate(strategies.items()):
            ax = axes[idx]
            
            # Create visualization (named layer_names so that the imported
            # `layers` module is not shadowed inside this method)
            layer_names = ['Input'] + [f'Conv{i}' for i in range(1, 6)] + ['Classifier']
            n_layers = len(layer_names)
            
            # Color coding
            colors = []
            if name == 'Feature Extraction':
                colors = ['lightblue'] + ['red'] * 5 + ['green']
            elif name == 'Partial Fine-tuning':
                colors = ['lightblue'] + ['red'] * 3 + ['yellow'] * 2 + ['green']
            elif name == 'Full Fine-tuning':
                colors = ['lightblue'] + ['yellow'] * 5 + ['green']
            else:  # Progressive
                colors = ['lightblue'] + ['orange'] * 5 + ['green']
            
            # Plot layers
            y_pos = np.arange(n_layers)
            ax.barh(y_pos, [1] * n_layers, color=colors, edgecolor='black', linewidth=2)
            
            # Labels
            ax.set_yticks(y_pos)
            ax.set_yticklabels(layer_names)
            ax.set_xlim(0, 1.5)
            ax.set_title(f'{name}', fontsize=12, weight='bold')
            
            # Add details
            text = f"Frozen: {details['frozen_layers']}\n"
            text += f"LR: {details['learning_rate']}\n"
            text += f"Use: {details['use_case']}"
            ax.text(1.1, n_layers/2, text, fontsize=9, va='center')
            
            # Remove x-axis
            ax.set_xticks([])
            
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='red', label='Frozen'),
            Patch(facecolor='yellow', label='Fine-tuning'),
            Patch(facecolor='green', label='Training'),
            Patch(facecolor='orange', label='Progressive')
        ]
        fig.legend(handles=legend_elements, loc='center', 
                  bbox_to_anchor=(0.5, 0.95), ncol=4)
        
        plt.suptitle('Transfer Learning Strategies', fontsize=14, y=1.0)
        plt.tight_layout()
        plt.show()

# Vision transfer learning
vision_transfer = VisionTransferLearning()

print("\n" + "="*60)
print("COMPUTER VISION TRANSFER LEARNING")
print("="*60)

# Load and compare models
model_comparison = vision_transfer.load_pretrained_models()

print("\nVisualizing transfer strategies:")
vision_transfer.visualize_transfer_strategies()
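
The difference between feature extraction and partial fine-tuning comes down to which weights Keras reports as trainable. The sketch below makes that visible on a toy stand-in for a pre-trained backbone; `toy_base`, its layer names, and the 32x32 input size are assumptions chosen so that nothing has to be downloaded, and the random weights mean it demonstrates only the freezing mechanics, not transfer quality.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in for a pre-trained backbone (hypothetical, randomly initialised)
toy_base = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu", name="early_conv"),
    layers.Conv2D(16, 3, activation="relu", name="late_conv"),
    layers.GlobalAveragePooling2D(name="pool"),
], name="toy_base")

# New classification head on top of the backbone
inputs = keras.Input(shape=(32, 32, 3))
features = toy_base(inputs, training=False)
outputs = layers.Dense(5, activation="softmax", name="head")(features)
model = keras.Model(inputs, outputs)

# Feature extraction: the whole backbone is frozen, only the head trains
toy_base.trainable = False
n_feature_extraction = len(model.trainable_weights)

# Partial fine-tuning: mark the backbone trainable, then re-freeze
# everything except its last two layers
toy_base.trainable = True
for layer in toy_base.layers[:-2]:
    layer.trainable = False
n_partial = len(model.trainable_weights)

print(f"Trainable tensors (feature extraction): {n_feature_extraction}")
print(f"Trainable tensors (partial fine-tuning): {n_partial}")
```

After changing any `trainable` flag, the model has to be compiled again before calling `fit`, which is why each stage of progressive unfreezing recompiles.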

NLP Transfer Learning

class NLPTransferLearning:
    """Transfer learning for NLP tasks"""
    
    def __init__(self):
        self.models = {}
        self.tokenizers = {}
        
    def demonstrate_embedding_transfer(self):
        """Demonstrate word embedding transfer"""
        
        print("\nWord Embedding Transfer:")
        print("-" * 40)
        
        # Simulated pre-trained embeddings
        vocab_size = 10000
        embedding_dim = 100
        
        # Create pre-trained embeddings (simulated)
        pretrained_embeddings = np.random.randn(vocab_size, embedding_dim)
        
        # Build model with pre-trained embeddings. The explicit Input builds the
        # weights immediately, so the parameter counts below are not empty
        model_pretrained = keras.Sequential([
            keras.Input(shape=(None,)),
            layers.Embedding(vocab_size, embedding_dim,
                           embeddings_initializer=keras.initializers.Constant(
                               pretrained_embeddings),
                           trainable=False),  # Frozen embeddings
            layers.LSTM(128, return_sequences=True),
            layers.LSTM(64),
            layers.Dense(32, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        
        # Build model without pre-trained embeddings
        model_scratch = keras.Sequential([
            keras.Input(shape=(None,)),
            layers.Embedding(vocab_size, embedding_dim),  # Trainable
            layers.LSTM(128, return_sequences=True),
            layers.LSTM(64),
            layers.Dense(32, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        
        print(f"Model with pre-trained embeddings:")
        print(f"  Trainable parameters: {sum([tf.size(w).numpy() for w in model_pretrained.trainable_weights]):,}")
        print(f"\nModel from scratch:")
        print(f"  Trainable parameters: {sum([tf.size(w).numpy() for w in model_scratch.trainable_weights]):,}")
        
        return model_pretrained, model_scratch
    
    def build_bert_style_model(self, max_length=128, vocab_size=30000):
        """Build BERT-style transformer model (simplified)"""
        
        # Transformer block
        def transformer_block(inputs, embed_dim, num_heads, ff_dim, rate=0.1):
            # Multi-head self-attention
            attn_output = layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=embed_dim
            )(inputs, inputs)
            attn_output = layers.Dropout(rate)(attn_output)
            out1 = layers.LayerNormalization(epsilon=1e-6)(inputs + attn_output)
            
            # Feed forward network
            ffn_output = keras.Sequential([
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ])(out1)
            ffn_output = layers.Dropout(rate)(ffn_output)
            out2 = layers.LayerNormalization(epsilon=1e-6)(out1 + ffn_output)
            
            return out2
        
        # Model architecture
        embed_dim = 128
        num_heads = 8
        ff_dim = 512
        
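        # Token + learned positional embeddings, wrapped in a layer so that the
        # position-embedding weights are tracked and trained with the model
        class TokenAndPositionEmbedding(layers.Layer):
            def __init__(self, max_length, vocab_size, embed_dim, **kwargs):
                super().__init__(**kwargs)
                self.token_emb = layers.Embedding(vocab_size, embed_dim)
                self.pos_emb = layers.Embedding(max_length, embed_dim)
            
            def call(self, x):
                positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
                return self.token_emb(x) + self.pos_emb(positions)
        
        inputs = layers.Input(shape=(max_length,))
        x = TokenAndPositionEmbedding(max_length, vocab_size, embed_dim)(inputs)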
        
        # Transformer blocks
        x = transformer_block(x, embed_dim, num_heads, ff_dim)
        x = transformer_block(x, embed_dim, num_heads, ff_dim)
        
        # Classification head
        x = layers.GlobalAveragePooling1D()(x)
        x = layers.Dropout(0.1)(x)
        x = layers.Dense(32, activation="relu")(x)
        x = layers.Dropout(0.1)(x)
        outputs = layers.Dense(2, activation="softmax")(x)
        
        model = keras.Model(inputs=inputs, outputs=outputs)
        
        return model
    
    def demonstrate_fine_tuning_strategies(self):
        """Compare different fine-tuning strategies for NLP"""
        
        strategies = {
            'Feature-based': {
                'description': 'Use pre-trained as feature extractor',
                'layers_frozen': 'All transformer layers',
                'train_time': 'Fast',
                'performance': 'Good for small data'
            },
            'Fine-tuning Last': {
                'description': 'Fine-tune last transformer layer',
                'layers_frozen': 'All except last layer',
                'train_time': 'Medium',
                'performance': 'Better than feature-based'
            },
            'Full Fine-tuning': {
                'description': 'Fine-tune entire model',
                'layers_frozen': 'None',
                'train_time': 'Slow',
                'performance': 'Best with enough data'
            },
            'Adapter Tuning': {
                'description': 'Add small trainable adapters',
                'layers_frozen': 'Original weights frozen',
                'train_time': 'Fast',
                'performance': 'Efficient and effective'
            }
        }
        
        # Visualization
        fig, ax = plt.subplots(figsize=(12, 6))
        
        strategies_list = list(strategies.keys())
        metrics = ['Train Time', 'Performance', 'Memory Usage']
        
        # Simulated scores (illustrative values, not measurements)
        scores = {
            'Feature-based': [0.3, 0.6, 0.2],
            'Fine-tuning Last': [0.5, 0.7, 0.4],
            'Full Fine-tuning': [0.9, 0.9, 0.9],
            'Adapter Tuning': [0.4, 0.8, 0.3]
        }
        
        x = np.arange(len(strategies_list))
        width = 0.25
        
        for i, metric in enumerate(metrics):
            values = [scores[s][i] for s in strategies_list]
            ax.bar(x + i * width, values, width, label=metric)
        
        ax.set_xlabel('Strategy')
        ax.set_ylabel('Relative Score')
        ax.set_title('NLP Fine-tuning Strategies Comparison')
        ax.set_xticks(x + width)
        ax.set_xticklabels(strategies_list, rotation=45, ha='right')
        ax.legend()
        ax.grid(True, alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()
        
        # Print details
        print("\nNLP Fine-tuning Strategies:")
        print("-" * 60)
        for name, details in strategies.items():
            print(f"\n{name}:")
            for key, value in details.items():
                print(f"  {key}: {value}")

# NLP transfer learning
nlp_transfer = NLPTransferLearning()

print("\n" + "="*60)
print("NLP TRANSFER LEARNING")
print("="*60)

# Demonstrate embedding transfer
pretrained_model, scratch_model = nlp_transfer.demonstrate_embedding_transfer()

# Build transformer model
print("\nBuilding BERT-style model:")
bert_model = nlp_transfer.build_bert_style_model()
print(f"  Total parameters: {bert_model.count_params():,}")

# Compare strategies
nlp_transfer.demonstrate_fine_tuning_strategies()
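
The "Adapter Tuning" strategy above keeps the original weights frozen and trains only a small bottleneck module added to the network. The NumPy sketch below shows the forward pass of one such module under stated assumptions: a hidden size of 128 (matching `embed_dim` above), a bottleneck of 16, a ReLU non-linearity, and a zero-initialised up-projection so that the adapter starts out as the identity. These choices are illustrative rather than the only way to build an adapter.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, bottleneck_dim = 128, 16  # assumed sizes for illustration

# Adapter parameters: the only weights that would be trained
W_down = rng.normal(scale=0.02, size=(hidden_dim, bottleneck_dim))
b_down = np.zeros(bottleneck_dim)
W_up = np.zeros((bottleneck_dim, hidden_dim))  # zero init: starts as identity
b_up = np.zeros(hidden_dim)


def adapter(h):
    """Bottleneck adapter: down-project, ReLU, up-project, add residual."""
    z = np.maximum(h @ W_down + b_down, 0.0)
    return h + z @ W_up + b_up


# Stand-in for the output of a frozen transformer layer: (batch, tokens, hidden)
h = rng.normal(size=(4, 10, hidden_dim))
out = adapter(h)

adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
print(f"Output shape: {out.shape}")
print(f"Adapter parameters: {adapter_params:,}")
print(f"Identity at initialisation: {np.allclose(out, h)}")
```

Because the residual path passes the frozen features through unchanged at initialisation, training starts from the pre-trained model's behaviour and only gradually departs from it.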

Domain Adaptation Techniques

class DomainAdaptation:
    """Domain adaptation and cross-domain transfer"""
    
    def __init__(self):
        self.techniques = {}
        
    def demonstrate_domain_shift(self):
        """Visualize domain shift problem"""
        
        np.random.seed(42)
        
        # Generate source domain data
        source_X = np.random.randn(500, 2)
        source_y = (source_X[:, 0] + source_X[:, 1] > 0).astype(int)
        
        # Generate target domain data (shifted)
        target_X = np.random.randn(500, 2) + [1, 1]  # Shifted distribution
        target_y = (target_X[:, 0] + target_X[:, 1] > 2).astype(int)
        
        # Visualization
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # Source domain
        axes[0].scatter(source_X[:, 0], source_X[:, 1], c=source_y, 
                       cmap='coolwarm', alpha=0.6, edgecolor='black', linewidth=0.5)
        axes[0].set_title('Source Domain')
        axes[0].set_xlabel('Feature 1')
        axes[0].set_ylabel('Feature 2')
        axes[0].grid(True, alpha=0.3)
        
        # Target domain
        axes[1].scatter(target_X[:, 0], target_X[:, 1], c=target_y, 
                       cmap='coolwarm', alpha=0.6, edgecolor='black', linewidth=0.5)
        axes[1].set_title('Target Domain')
        axes[1].set_xlabel('Feature 1')
        axes[1].set_ylabel('Feature 2')
        axes[1].grid(True, alpha=0.3)
        
        # Combined view
        axes[2].scatter(source_X[:, 0], source_X[:, 1], c='blue', 
                       alpha=0.4, label='Source', s=30)
        axes[2].scatter(target_X[:, 0], target_X[:, 1], c='red', 
                       alpha=0.4, label='Target', s=30)
        axes[2].set_title('Domain Shift Visualization')
        axes[2].set_xlabel('Feature 1')
        axes[2].set_ylabel('Feature 2')
        axes[2].legend()
        axes[2].grid(True, alpha=0.3)
        
        plt.suptitle('Domain Shift Problem in Transfer Learning', fontsize=14)
        plt.tight_layout()
        plt.show()
        
        return source_X, source_y, target_X, target_y
    
    def build_domain_adversarial_network(self, input_dim=100, feature_dim=50):
        """Build Domain Adversarial Neural Network (DANN)"""
        
        # Shared feature extractor
        feature_input = layers.Input(shape=(input_dim,))
        
        feature_extractor = keras.Sequential([
            layers.Dense(128, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(64, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(feature_dim, activation='relu')
        ], name='feature_extractor')
        
        features = feature_extractor(feature_input)
        
        # Task classifier
        task_classifier = keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.3),
            layers.Dense(1, activation='sigmoid')
        ], name='task_classifier')
        
        task_output = task_classifier(features)
        
        # Domain discriminator (with gradient reversal)
        class GradientReversal(layers.Layer):
            """Identity in the forward pass; scales gradients by -hp_lambda."""
            
            def __init__(self, hp_lambda=1.0, **kwargs):
                super().__init__(**kwargs)
                self.hp_lambda = hp_lambda
                
            def call(self, x):
                # tf.custom_gradient must wrap a function whose arguments are
                # all tensors, so it is applied to a closure, not to a method
                @tf.custom_gradient
                def grad_reverse(inputs):
                    def custom_grad(dy):
                        return -self.hp_lambda * dy
                    return tf.identity(inputs), custom_grad
                
                return grad_reverse(x)
        
        # Domain classifier with gradient reversal
        reversed_features = GradientReversal()(features)
        
        domain_discriminator = keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.3),
            layers.Dense(1, activation='sigmoid')
        ], name='domain_discriminator')
        
        domain_output = domain_discriminator(reversed_features)
        
        # Complete model
        model = keras.Model(
            inputs=feature_input,
            outputs=[task_output, domain_output]
        )
        
        print("\nDomain Adversarial Network Architecture:")
        print("-" * 40)
        print(f"Feature Extractor: Shared representation learning")
        print(f"Task Classifier: Target task prediction")
        print(f"Domain Discriminator: Domain classification (reversed gradients)")
        print(f"Total parameters: {model.count_params():,}")
        
        return model, feature_extractor, task_classifier, domain_discriminator
    
    def adaptation_techniques_comparison(self):
        """Compare different domain adaptation techniques"""
        
        techniques = {
            'Direct Transfer': {
                'complexity': 'Low',
                'data_needed': 'None (source model applied as-is)',
                'performance': 'Poor with shift',
                'use_case': 'Similar domains'
            },
            'Fine-tuning': {
                'complexity': 'Low',
                'data_needed': 'Some target labels',
                'performance': 'Good',
                'use_case': 'Moderate shift'
            },
            'Feature Matching': {
                'complexity': 'Medium',
                'data_needed': 'Unlabeled target',
                'performance': 'Good',
                'use_case': 'Distribution shift'
            },
            'Adversarial (DANN)': {
                'complexity': 'High',
                'data_needed': 'Unlabeled target',
                'performance': 'Very good',
                'use_case': 'Large shift'
            },
            'Self-training': {
                'complexity': 'Medium',
                'data_needed': 'Unlabeled target',
                'performance': 'Good',
                'use_case': 'Confident predictions'
            }
        }
        
        # Create comparison matrix
        fig, ax = plt.subplots(figsize=(10, 6))
        
        techniques_list = list(techniques.keys())
        attributes = ['Complexity', 'Data Efficiency', 'Performance', 'Robustness']
        
        # Create scores (normalized to 0-1; illustrative values, not measurements)
        scores = {
            'Direct Transfer': [0.2, 0.3, 0.4, 0.3],
            'Fine-tuning': [0.3, 0.6, 0.7, 0.6],
            'Feature Matching': [0.6, 0.8, 0.7, 0.7],
            'Adversarial (DANN)': [0.9, 0.9, 0.9, 0.8],
            'Self-training': [0.5, 0.7, 0.6, 0.5]
        }
        
        # Create heatmap
        score_matrix = np.array([scores[t] for t in techniques_list])
        
        im = ax.imshow(score_matrix, cmap='YlOrRd', aspect='auto')
        
        # Set ticks and labels
        ax.set_xticks(np.arange(len(attributes)))
        ax.set_yticks(np.arange(len(techniques_list)))
        ax.set_xticklabels(attributes)
        ax.set_yticklabels(techniques_list)
        
        # Add values to cells
        for i in range(len(techniques_list)):
            for j in range(len(attributes)):
                text = ax.text(j, i, f'{score_matrix[i, j]:.1f}',
                             ha="center", va="center", color="black")
        
        # Add colorbar
        plt.colorbar(im, ax=ax)
        ax.set_title('Domain Adaptation Techniques Comparison')
        
        plt.tight_layout()
        plt.show()
        
        return techniques

# Domain adaptation
domain_adapt = DomainAdaptation()

print("\n" + "="*60)
print("DOMAIN ADAPTATION")
print("="*60)

print("\nVisualizing domain shift:")
source_X, source_y, target_X, target_y = domain_adapt.demonstrate_domain_shift()

print("\nBuilding Domain Adversarial Network:")
dann_model, feature_ext, task_clf, domain_disc = domain_adapt.build_domain_adversarial_network()

print("\nComparing adaptation techniques:")
techniques = domain_adapt.adaptation_techniques_comparison()
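
The cost of the shift visualized above can be put in numbers. The sketch below regenerates data with the same construction as `demonstrate_domain_shift` (target features translated by [1, 1], target boundary at 2), applies the source decision rule to both domains, and then re-centers the target features using only unlabeled target data. The re-centering step is a crude first-moment form of feature matching and is an assumption for illustration; it works here only because the shift is a pure translation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Source domain: standard normal features, boundary x0 + x1 > 0
source_X = rng.standard_normal((500, 2))
source_y = (source_X.sum(axis=1) > 0).astype(int)

# Target domain: same shape, translated by [1, 1], boundary x0 + x1 > 2
target_X = rng.standard_normal((500, 2)) + np.array([1.0, 1.0])
target_y = (target_X.sum(axis=1) > 2).astype(int)


def source_rule(X):
    """Decision rule that is exactly right on the source domain."""
    return (X.sum(axis=1) > 0).astype(int)


source_acc = (source_rule(source_X) == source_y).mean()
target_acc = (source_rule(target_X) == target_y).mean()

# Estimate the shift from feature means (no target labels needed)
shift = target_X.mean(axis=0) - source_X.mean(axis=0)
mean_shift = np.linalg.norm(shift)

# Re-center the target features before applying the source rule
aligned_acc = (source_rule(target_X - shift) == target_y).mean()

print(f"Source accuracy:           {source_acc:.3f}")
print(f"Target accuracy (direct):  {target_acc:.3f}")
print(f"Estimated mean shift:      {mean_shift:.3f}")
print(f"Target accuracy (aligned): {aligned_acc:.3f}")
```

Real domain shifts are rarely a pure translation, which is where learned alignment such as the adversarial network above comes in.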

Knowledge Distillation

class KnowledgeDistillation:
    """Knowledge distillation for model compression"""
    
    def __init__(self):
        self.models = {}
        
    def build_teacher_student_models(self, input_shape=(32, 32, 3), num_classes=10):
        """Build teacher (large) and student (small) models"""
        
        # Teacher model (large)
        teacher = keras.Sequential([
            layers.Conv2D(64, 3, padding='same', activation='relu', 
                         input_shape=input_shape),
            layers.Conv2D(64, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            layers.Dropout(0.3),
            
            layers.Conv2D(128, 3, padding='same', activation='relu'),
            layers.Conv2D(128, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            layers.Dropout(0.3),
            
            layers.Conv2D(256, 3, padding='same', activation='relu'),
            layers.Conv2D(256, 3, padding='same', activation='relu'),
            layers.GlobalAveragePooling2D(),
            
            layers.Dense(256, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(128, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation='softmax')
        ], name='teacher')
        
        # Student model (small)
        student = keras.Sequential([
            layers.Conv2D(16, 3, padding='same', activation='relu',
                         input_shape=input_shape),
            layers.MaxPooling2D(2),
            
            layers.Conv2D(32, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            
            layers.Conv2D(64, 3, padding='same', activation='relu'),
            layers.GlobalAveragePooling2D(),
            
            layers.Dense(32, activation='relu'),
            layers.Dense(num_classes, activation='softmax')
        ], name='student')
        
        print("\nTeacher-Student Model Comparison:")
        print("-" * 40)
        print(f"Teacher parameters: {teacher.count_params():,}")
        print(f"Student parameters: {student.count_params():,}")
        print(f"Compression ratio: {teacher.count_params()/student.count_params():.2f}x")
        
        return teacher, student
    
    def distillation_loss(self, y_pred_true_placeholder=None, *args, **kwargs):
        raise NotImplementedError
    
    def demonstrate_distillation_process(self):
        """Visualize the knowledge distillation process"""
        
        # Create visualization
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        
        # Step 1: Train Teacher
        ax = axes[0, 0]
        ax.text(0.5, 0.7, 'Step 1: Train Teacher', ha='center', fontsize=12, 
                weight='bold')
        ax.text(0.5, 0.3, 'Large model trained\non full dataset\nwith hard labels',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        
        # Step 2: Generate Soft Labels
        ax = axes[0, 1]
        ax.text(0.5, 0.7, 'Step 2: Generate Soft Labels', ha='center', 
                fontsize=12, weight='bold')
        ax.text(0.5, 0.3, 'Teacher produces\nsoft probability\ndistributions',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        
        # Step 3: Train Student
        ax = axes[0, 2]
        ax.text(0.5, 0.7, 'Step 3: Train Student', ha='center', fontsize=12,
                weight='bold')
        ax.text(0.5, 0.3, 'Small model learns from\nsoft labels (knowledge)\nand hard labels',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        
        # Temperature effect
        ax = axes[1, 0]
        temps = [1, 3, 10]
        x = np.arange(5)
        logits = np.array([3.0, 1.0, 0.5, 0.2, 0.1])
        
        for temp in temps:
            probs = tf.nn.softmax(logits / temp).numpy()
            ax.plot(x, probs, marker='o', label=f'T={temp}')
        
        ax.set_xlabel('Class')
        ax.set_ylabel('Probability')
        ax.set_title('Temperature Effect on Softmax')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # Loss comparison (synthetic curves for illustration, not measured results)
        ax = axes[1, 1]
        epochs = np.arange(1, 21)
        teacher_loss = 0.3 * np.exp(-0.2 * epochs) + 0.05
        student_no_kd = 0.5 * np.exp(-0.1 * epochs) + 0.15
        student_kd = 0.4 * np.exp(-0.15 * epochs) + 0.08
        
        ax.plot(epochs, teacher_loss, label='Teacher', linewidth=2)
        ax.plot(epochs, student_no_kd, label='Student (No KD)', linewidth=2)
        ax.plot(epochs, student_kd, label='Student (With KD)', linewidth=2)
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Loss')
        ax.set_title('Training Loss Comparison (Illustrative)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # Model size vs accuracy (hand-picked illustrative values, not benchmarks)
        ax = axes[1, 2]
        # Named model_names to avoid shadowing the imported keras `models` module
        model_names = ['Teacher', 'Student\n(No KD)', 'Student\n(KD)', 'Pruned\nTeacher']
        sizes = [100, 20, 20, 40]
        accuracies = [95, 85, 92, 90]
        
        colors = ['blue', 'red', 'green', 'orange']
        ax.scatter(sizes, accuracies, s=200, c=colors, alpha=0.6, edgecolor='black')
        
        for i, name in enumerate(model_names):
            ax.annotate(name, (sizes[i], accuracies[i]), ha='center', va='center')
        
        ax.set_xlabel('Model Size (MB)')
        ax.set_ylabel('Accuracy (%)')
        ax.set_title('Model Size vs Accuracy Trade-off (Illustrative)')
        ax.grid(True, alpha=0.3)
        
        plt.suptitle('Knowledge Distillation Process and Benefits', fontsize=14)
        plt.tight_layout()
        plt.show()

# Knowledge distillation
kd = KnowledgeDistillation()

print("\n" + "="*60)
print("KNOWLEDGE DISTILLATION")
print("="*60)

# Build teacher and student
teacher_model, student_model = kd.build_teacher_student_models()

# Demonstrate process
print("\nDemonstrating distillation process:")
kd.demonstrate_distillation_process()
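
The temperature-scaled loss can be checked by hand on a single example. The block below is a minimal NumPy sketch, assuming both models expose raw logits; the helper names (softmax, distillation_loss_np) and the student logits are made up for illustration, while the teacher logits reuse the values from the temperature plot.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss_np(y_true, student_logits, teacher_logits,
                         temperature=3.0, alpha=0.7):
    """Per-example distillation loss computed from raw logits (hypothetical helper)."""
    eps = 1e-12
    # Hard-target term: cross-entropy against one-hot labels at T = 1
    p_student = softmax(student_logits)
    hard = -np.sum(y_true * np.log(p_student + eps), axis=-1)
    # Soft-target term: KL(teacher || student) at temperature T, scaled by T^2
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    soft = soft * temperature ** 2
    return alpha * soft + (1 - alpha) * hard

teacher_logits = np.array([[3.0, 1.0, 0.5, 0.2, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.3, 0.1, 0.0]])  # made-up values
y_true = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])

loss = distillation_loss_np(y_true, student_logits, teacher_logits)
# With identical logits the KL term vanishes, leaving (1 - alpha) * hard loss
loss_same = distillation_loss_np(y_true, teacher_logits, teacher_logits)
print(loss, loss_same)
```

Raising the temperature flattens both distributions, so the student also receives signal about the relative ranking of the wrong classes.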

Practical Implementation Guide

print("\n" + "="*60)
print("TRANSFER LEARNING IMPLEMENTATION GUIDE")
print("="*60)

implementation_guide = """
STEP-BY-STEP IMPLEMENTATION:

1. CHOOSE PRE-TRAINED MODEL:
   • Match to your domain (vision/NLP/audio)
   • Consider model size vs performance
   • Check input requirements
   • Verify license for commercial use

2. PREPARE YOUR DATA:
   • Match preprocessing to pre-trained model
   • Resize images to expected dimensions
   • Normalize using same statistics
   • Handle class imbalance

3. DESIGN ARCHITECTURE:
   ```python
   # Load pre-trained base
   base_model = tf.keras.applications.ResNet50(
       weights='imagenet',
       include_top=False,
       input_shape=(224, 224, 3)
   )
   
   # Freeze base
   base_model.trainable = False
   
   # Add custom head (the last layer outputs logits, so compile
   # with a loss that sets from_logits=True)
   model = tf.keras.Sequential([
       base_model,
       tf.keras.layers.GlobalAveragePooling2D(),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dropout(0.5),
       tf.keras.layers.Dense(num_classes)
   ])
   ```

4. TRAINING STRATEGY:
   Phase 1: Feature extraction
   • Freeze all base layers
   • Train only new layers
   • Higher learning rate (1e-3)
   • 5-10 epochs
   
   Phase 2: Fine-tuning
   • Unfreeze top layers
   • Lower learning rate (1e-5)
   • Train 10-20 epochs
   • Monitor for overfitting

5. OPTIMIZATION TIPS:
   • Use differential learning rates
   • Apply data augmentation
   • Use callbacks (early stopping, reduce LR)
   • Monitor validation metrics closely

6. EVALUATION:
   • Test on held-out data
   • Check for domain shift
   • Analyze failure cases
   • Compare with baseline

7. DEPLOYMENT:
   • Optimize model size (quantization, pruning)
   • Test inference speed
   • Handle edge cases
   • Version control models
"""

print(implementation_guide)
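
The two training phases from step 4 can be sketched end to end. This is a minimal illustration under stated assumptions, not a recipe: a small randomly initialised network (toy_base) stands in for the pre-trained backbone so the code runs offline, the data is synthetic, and the whole toy base is unfrozen in phase 2 for brevity where a real backbone would usually have only its top layers unfrozen.

```python
import numpy as np
import tensorflow as tf

np.random.seed(0)
tf.random.set_seed(0)

# Hypothetical stand-in for a pre-trained backbone. It is randomly initialised
# so the sketch runs offline; in practice this would be e.g. ResNet50.
base = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
], name='toy_base')

inputs = tf.keras.Input(shape=(32, 32, 3))
# training=False would keep any BatchNorm layers in the base in inference mode
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(3)(x)  # logits for 3 made-up classes
model = tf.keras.Model(inputs, outputs)

# Synthetic data, only to make the two phases executable
x_train = np.random.rand(64, 32, 32, 3).astype('float32')
y_train = np.random.randint(0, 3, size=(64,))

def compile_model(learning_rate):
    # Recompile after every change to `trainable` so it takes effect
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )

# Phase 1: feature extraction (frozen base, higher learning rate)
base.trainable = False
compile_model(1e-3)
n_trainable_phase1 = len(model.trainable_weights)
base_before = [w.copy() for w in base.get_weights()]
model.fit(x_train, y_train, epochs=2, batch_size=16, verbose=0)
base_unchanged_phase1 = all(
    np.array_equal(a, b) for a, b in zip(base_before, base.get_weights())
)

# Phase 2: fine-tuning (unfrozen base, much lower learning rate)
base.trainable = True
compile_model(1e-5)
n_trainable_phase2 = len(model.trainable_weights)
model.fit(x_train, y_train, epochs=2, batch_size=16, verbose=0)
base_changed_phase2 = any(
    not np.array_equal(a, b) for a, b in zip(base_before, base.get_weights())
)

print(n_trainable_phase1, n_trainable_phase2)
print(base_unchanged_phase1, base_changed_phase2)
```

Counting trainable weights before each fit call is a cheap sanity check that the freeze or unfreeze actually took effect.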

# Code examples
code_examples = """
PRACTICAL CODE EXAMPLES:

# Example 1: Vision Transfer Learning
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Input((224, 224, 3)),
    # MobileNetV2 was pre-trained on inputs in [-1, 1], so map [0, 255] to that range
    tf.keras.layers.Rescaling(1./127.5, offset=-1),
    tf.keras.layers.RandomFlip('horizontal'),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(class_names))  # logits
])

# Example 2: NLP Transfer Learning (using TF Hub)
import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                          dtype=tf.string, trainable=True)

model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

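# Example 3: Progressive Unfreezing
def unfreeze_model(model, num_layers):
    # Assumes model.layers exposes the backbone layers directly. For a nested
    # base (as in Example 1), iterate over base.layers and make sure the base
    # itself is trainable, re-freezing the layers that should stay fixed.
    for layer in model.layers[-num_layers:]:
        # Keep BatchNormalization frozen so its moving statistics are preserved
        if not isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = True
    
    # Recompile with lower learning rate (the head outputs logits)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )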
"""

print("\nCode Examples:")
print(code_examples)
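
Pre-trained backbones do not all expect the same input range, which is why wrong preprocessing is such a common failure. The NumPy sketch below mimics two conventions as documented for the Keras preprocess_input helpers: scaling to [-1, 1] (MobileNetV2) and BGR mean subtraction without scaling (ResNet50). The function names are hypothetical and the code is only an illustration of the arithmetic; in real code, call the model's own preprocess_input.

```python
import numpy as np

def scale_to_unit_range(images):
    """Map pixels in [0, 255] to [-1, 1] (MobileNetV2-style scaling)."""
    return np.asarray(images, dtype=np.float32) / 127.5 - 1.0

def caffe_style(images):
    """RGB -> BGR, then subtract per-channel ImageNet means (ResNet50-style)."""
    bgr = np.asarray(images, dtype=np.float32)[..., ::-1]
    # Channel means in BGR order; no division by a standard deviation
    return bgr - np.array([103.939, 116.779, 123.68], dtype=np.float32)

# Random stand-in for a batch of RGB images with values in [0, 255]
batch = np.random.default_rng(0).integers(0, 256, size=(2, 224, 224, 3))

a = scale_to_unit_range(batch)
b = caffe_style(batch)
print("scaled range:", a.min(), a.max())
print("mean-subtracted range:", b.min(), b.max())
```

The two outputs live on very different scales, so feeding one model the other's preprocessing silently degrades the pre-trained features.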

Best Practices and Common Pitfalls

print("\n" + "="*60)
print("TRANSFER LEARNING BEST PRACTICES")
print("="*60)

best_practices = """
KEY GUIDELINES:

1. MODEL SELECTION:
   • Start with smaller models (MobileNet, EfficientNet-B0)
   • Scale up only if needed
   • Consider inference requirements
   • Check pre-training dataset relevance

2. DATA PREPROCESSING:
   • MUST match pre-trained model's preprocessing
   • Use the model's own normalization (its preprocess_input; ranges differ per model)
   • Maintain aspect ratios when resizing
   • Apply augmentation after preprocessing

3. TRAINING STRATEGY:
   • Always start with frozen backbone
   • Gradually unfreeze layers
   • Use lower learning rates for pre-trained layers
   • Monitor for catastrophic forgetting

4. FINE-TUNING TIPS:
   • Keep BatchNorm layers frozen (inference mode) while fine-tuning
   • Use small learning rates (1e-5 to 1e-4)
   • Unfreeze from top to bottom
   • Stop if validation loss increases

5. REGULARIZATION:
   • Add dropout to new layers (0.2-0.5)
   • Use L2 regularization sparingly
   • Data augmentation crucial
   • Consider mixup/cutmix

6. COMMON PITFALLS:

   ✗ Wrong preprocessing:
   Solution: Check model documentation
   
   ✗ Unfreezing too early:
   Solution: Train head first
   
   ✗ Learning rate too high:
   Solution: Use 10x-100x lower for fine-tuning
   
   ✗ Overfitting to small dataset:
   Solution: Keep more layers frozen
   
   ✗ Catastrophic forgetting:
   Solution: Lower learning rate, gradual unfreezing
   
   ✗ Domain mismatch:
   Solution: Consider domain adaptation

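7. WHEN NOT TO USE:
   • Source and target domains are very different
   • Enough labeled data to train from scratch (roughly >100K samples)
   • Data with characteristics no pre-trained model covers
   • Latency or size limits that available pre-trained models cannot meet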

8. PERFORMANCE OPTIMIZATION:
   • Use mixed precision training
   • Optimize batch size for GPU
   • Cache preprocessed data
   • Use tf.data.Dataset properly
"""

print(best_practices)
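
Mixup, mentioned under regularization, blends pairs of examples and their labels. The block below is a minimal NumPy sketch, assuming one-hot labels and one mixing coefficient per batch; mixup_batch and the toy data are hypothetical, and alpha=0.2 is only a commonly used starting value.

```python
import numpy as np

def mixup_batch(images, labels_one_hot, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))    # partner index for every example
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_one_hot + (1.0 - lam) * labels_one_hot[perm]
    return mixed_x, mixed_y, lam

rng = np.random.default_rng(42)
images = rng.random((8, 32, 32, 3)).astype(np.float32)        # toy image batch
labels = np.eye(4, dtype=np.float32)[rng.integers(0, 4, size=8)]

mixed_x, mixed_y, lam = mixup_batch(images, labels, alpha=0.2, rng=rng)
print("lambda:", lam)
print("first mixed label:", mixed_y[0])
```

Because the labels become soft, the model has to be trained with a loss that accepts probability targets, such as categorical cross-entropy rather than the sparse variant.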

# Performance comparison
performance_tips = """
PERFORMANCE COMPARISON (rough rules of thumb):

Approach             | Data Needed | Training Time | Accuracy
---------------------|-------------|---------------|----------
From Scratch         | >100K       | Days          | Baseline
Feature Extraction   | 1K-10K      | Minutes       | Good
Partial Fine-tuning  | 10K-50K     | Hours         | Better
Full Fine-tuning     | >50K        | Hours-Days    | Best

OPTIMIZATION TECHNIQUES:
• Quantization (float32 to int8): about 4x size reduction, usually small accuracy loss
• Pruning: Remove redundant weights
• Knowledge Distillation: Compress to smaller model
• ONNX/TensorFlow Lite: Optimize for deployment
"""

print(performance_tips)
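
The 4x figure for quantization follows from storing each weight in 8 bits instead of 32. The block below is a minimal NumPy sketch of symmetric per-tensor int8 quantization on a made-up weight matrix; quantize_int8 and dequantize are hypothetical helpers, and deployment toolchains such as TensorFlow Lite use more elaborate schemes.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # guard against all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 128)).astype(np.float32)  # made-up layer

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_ratio = weights.nbytes / q.nbytes
max_error = float(np.abs(weights - restored).max())
print(f"size ratio: {size_ratio:.1f}x, max abs error: {max_error:.6f}, step: {scale:.6f}")
```

The rounding error per weight is bounded by half the quantization step, which is why accuracy loss is often small when the weight distribution has no extreme outliers.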

Practice Exercises

Exercise 1: Multi-Domain Transfer

Implement transfer learning across multiple domains:

Transfer from ImageNet to medical images
Apply domain adaptation techniques
Handle class imbalance and limited data
Compare different pre-trained backbones
Evaluate domain-specific metrics

Exercise 2: Few-Shot Learning

Build a few-shot learning system:

Implement prototypical networks
Use pre-trained features as base
Create support and query sets
Implement metric learning
Test on novel classes
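
As a possible starting point for this exercise, the core of a prototypical network is small: each class prototype is the mean embedding of its support examples, and a query is assigned to the nearest prototype. The sketch below assumes the embeddings already come from a pre-trained feature extractor; here they are synthetic points around made-up class centers, and all names are hypothetical.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, num_classes):
    """Prototype of a class = mean embedding of its support examples."""
    return np.stack([
        support_emb[support_labels == c].mean(axis=0) for c in range(num_classes)
    ])

def classify_queries(query_emb, protos):
    """Assign each query to the nearest prototype (squared Euclidean distance)."""
    diffs = query_emb[:, None, :] - protos[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)

rng = np.random.default_rng(0)
n_way, k_shot, n_query, dim = 3, 5, 4, 16
centers = 10.0 * np.eye(n_way, dim)  # well-separated synthetic class centers

support_labels = np.repeat(np.arange(n_way), k_shot)
query_labels = np.repeat(np.arange(n_way), n_query)
support_emb = centers[support_labels] + rng.normal(0, 0.5, size=(n_way * k_shot, dim))
query_emb = centers[query_labels] + rng.normal(0, 0.5, size=(n_way * n_query, dim))

protos = class_prototypes(support_emb, support_labels, n_way)
pred = classify_queries(query_emb, protos)
accuracy = float((pred == query_labels).mean())
print(f"{n_way}-way {k_shot}-shot accuracy on toy embeddings: {accuracy:.2f}")
```

In the full exercise, the synthetic embeddings would be replaced by features from a frozen backbone, and the episode sampling would draw classes that were never seen during training.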

Exercise 3: Cross-Modal Transfer

Transfer between different modalities:

Vision to text (image captioning)
Text to speech synthesis
Audio to vision (sound localization)
Implement shared embeddings
Evaluate cross-modal retrieval

Summary and Key Takeaways

🎯 Key Points to Remember

Leverage Pre-trained Models: Don't train from scratch without good reason
Match Preprocessing: Critical to use same preprocessing as pre-training
Progressive Unfreezing: Start frozen, gradually unfreeze layers
Learning Rates: Use much lower rates for pre-trained layers
Domain Adaptation: Handle domain shift with appropriate techniques
Knowledge Distillation: Compress large models to deployable size
Evaluation: Test thoroughly on target domain data
Start Simple: Begin with feature extraction before fine-tuning