Transfer learning leverages knowledge from pre-trained models to solve new, related tasks with limited data and computational resources. By transferring features learned by models trained on massive datasets such as ImageNet or large text corpora, we can often reach strong results on specialized tasks with a fraction of the data and training time. This lesson covers transfer learning strategies, fine-tuning techniques, domain adaptation, knowledge distillation, cross-modal transfer, and practical applications across computer vision, NLP, and beyond.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.applications import (
    VGG16, VGG19, ResNet50, ResNet101, ResNet152,
    InceptionV3, InceptionResNetV2, MobileNetV2,
    DenseNet121, NASNetMobile, EfficientNetB0
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import cv2
import warnings
warnings.filterwarnings('ignore')
# Set random seeds
np.random.seed(42)
tf.random.set_seed(42)
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")
print("\n" + "="*60)
print("TRANSFER LEARNING FUNDAMENTALS")
print("="*60)
# Core concepts
transfer_concepts = """
TRANSFER LEARNING KEY CONCEPTS:
1. WHAT IS TRANSFER LEARNING:
• Reuse learned features from one task for another
• Leverage pre-trained models
• Fine-tune for specific domain
• Reduce training time and data requirements
2. TYPES OF TRANSFER:
• Feature Extraction: Use as fixed feature extractor
• Fine-tuning: Unfreeze and retrain some layers
• Domain Adaptation: Transfer across domains
• Multi-task Learning: Learn multiple tasks jointly
• Zero-shot Learning: Recognize unseen classes
3. PRE-TRAINING SOURCES:
• ImageNet: ~14M images in total; the ILSVRC-2012 subset behind most pre-trained weights has ~1.28M images, 1000 classes (vision)
• COCO: Object detection and segmentation
• OpenImages: 9M images with annotations
• BERT/GPT: Large text corpora (NLP)
• AudioSet: Audio event detection
4. WHEN TO USE:
• Limited training data (< 10K samples)
• Similar source and target domains
• Limited computational resources
• Need quick prototyping
• Baseline model development
5. STRATEGIES:
• Freeze all layers except classifier
• Progressive unfreezing
• Discriminative learning rates
• Layer-wise adaptation
• Knowledge distillation
6. BENEFITS:
• Faster training
• Better performance with less data
• Lower computational cost
• Regularization effect
• Access to SOTA features
7. CHALLENGES:
• Domain shift
• Negative transfer
• Catastrophic forgetting
• Architecture constraints
• Task mismatch
"""
print(transfer_concepts)
class VisionTransferLearning:
    """Transfer learning for computer vision tasks"""

    def __init__(self):
        self.models = {}
        self.histories = {}

    def load_pretrained_models(self):
        """Load and compare popular pre-trained models"""
        models_config = {
            'VGG16': {'model': VGG16, 'size': (224, 224), 'preprocess': 'vgg'},
            'ResNet50': {'model': ResNet50, 'size': (224, 224), 'preprocess': 'resnet'},
            'InceptionV3': {'model': InceptionV3, 'size': (299, 299), 'preprocess': 'inception'},
            'MobileNetV2': {'model': MobileNetV2, 'size': (224, 224), 'preprocess': 'mobilenet'},
            'EfficientNetB0': {'model': EfficientNetB0, 'size': (224, 224), 'preprocess': 'efficientnet'}
        }
        model_stats = []
        for name, config in models_config.items():
            # Load model without top layers
            base_model = config['model'](
                weights='imagenet',
                include_top=False,
                input_shape=config['size'] + (3,)
            )
            # Get statistics
            total_params = base_model.count_params()
            n_layers = len(base_model.layers)
            model_stats.append({
                'Model': name,
                'Parameters': f"{total_params/1e6:.1f}M",
                'Layers': n_layers,
                'Input Size': f"{config['size'][0]}x{config['size'][1]}",
                'Top-1 Acc': self.get_imagenet_accuracy(name)
            })
            self.models[name] = base_model
        # Create comparison table
        stats_df = pd.DataFrame(model_stats)
        print("\nPre-trained Models Comparison:")
        print("-" * 60)
        print(stats_df.to_string(index=False))
        return stats_df

    def get_imagenet_accuracy(self, model_name):
        """Get reported ImageNet top-1 accuracy (%)"""
        accuracies = {
            'VGG16': 71.3,
            'ResNet50': 74.9,
            'InceptionV3': 77.9,
            'MobileNetV2': 71.3,
            'EfficientNetB0': 77.1
        }
        return accuracies.get(model_name, 0.0)
    def create_transfer_model(self, base_model_name='ResNet50',
                              num_classes=10,
                              trainable_layers=0,
                              dropout_rate=0.5):
        """Create transfer learning model with different strategies"""
        # Get base model
        if base_model_name == 'ResNet50':
            base_model = ResNet50(weights='imagenet', include_top=False,
                                  input_shape=(224, 224, 3))
        elif base_model_name == 'VGG16':
            base_model = VGG16(weights='imagenet', include_top=False,
                               input_shape=(224, 224, 3))
        elif base_model_name == 'MobileNetV2':
            base_model = MobileNetV2(weights='imagenet', include_top=False,
                                     input_shape=(224, 224, 3))
        else:
            raise ValueError(f"Unknown model: {base_model_name}")
        # Freeze the base model, optionally leaving the last `trainable_layers`
        # layers trainable. The parent's `trainable` flag must be True for any
        # child layer to train, so the earlier layers are frozen one by one.
        if trainable_layers > 0:
            base_model.trainable = True
            for layer in base_model.layers[:-trainable_layers]:
                layer.trainable = False
        else:
            base_model.trainable = False
        # Build complete model
        # NOTE: inputs are assumed to be already preprocessed for the chosen
        # backbone (each keras.applications module ships a preprocess_input)
        inputs = keras.Input(shape=(224, 224, 3))
        # Data augmentation
        x = layers.RandomFlip('horizontal')(inputs)
        x = layers.RandomRotation(0.1)(x)
        x = layers.RandomZoom(0.1)(x)
        # Base model (training=False keeps BatchNorm layers in inference mode)
        x = base_model(x, training=False)
        # Pooling
        x = layers.GlobalAveragePooling2D()(x)
        # Classification head
        x = layers.Dense(256, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
        x = layers.Dense(128, activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout_rate)(x)
        outputs = layers.Dense(num_classes, activation='softmax')(x)
        model = keras.Model(inputs, outputs)
        return model, base_model
    def progressive_unfreezing(self, model, base_model, X_train, y_train,
                               X_val, y_val, epochs_per_stage=5):
        """Implement progressive unfreezing strategy"""
        print("\nProgressive Unfreezing Strategy:")
        print("-" * 40)
        histories = []
        # Stage 1: Train only the classifier head
        print("\nStage 1: Training classifier head only")
        base_model.trainable = False
        model.compile(
            optimizer=keras.optimizers.Adam(1e-3),
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        history1 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history1)
        val_acc = history1.history['val_accuracy'][-1]
        print(f" Validation accuracy: {val_acc:.3f}")
        # Stage 2: Unfreeze top layers
        print("\nStage 2: Fine-tuning top 20 layers")
        base_model.trainable = True
        for layer in base_model.layers[:-20]:
            layer.trainable = False
        # Recompile so the changed trainable flags take effect
        model.compile(
            optimizer=keras.optimizers.Adam(1e-4),  # Lower learning rate
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        history2 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history2)
        val_acc = history2.history['val_accuracy'][-1]
        print(f" Validation accuracy: {val_acc:.3f}")
        # Stage 3: Unfreeze all layers
        print("\nStage 3: Fine-tuning entire network")
        base_model.trainable = True
        model.compile(
            optimizer=keras.optimizers.Adam(1e-5),  # Very low learning rate
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        history3 = model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs_per_stage,
            batch_size=32,
            verbose=0
        )
        histories.append(history3)
        val_acc = history3.history['val_accuracy'][-1]
        print(f" Validation accuracy: {val_acc:.3f}")
        return histories
    def visualize_transfer_strategies(self):
        """Visualize different transfer learning strategies"""
        strategies = {
            'Feature Extraction': {
                'frozen_layers': 'All base layers',
                'trainable': 'Only classifier',
                'learning_rate': 'High (1e-3)',
                'use_case': 'Very small dataset'
            },
            'Partial Fine-tuning': {
                'frozen_layers': 'Early layers',
                'trainable': 'Top layers + classifier',
                'learning_rate': 'Medium (1e-4)',
                'use_case': 'Medium dataset'
            },
            'Full Fine-tuning': {
                'frozen_layers': 'None',
                'trainable': 'Entire network',
                'learning_rate': 'Low (1e-5)',
                'use_case': 'Large dataset'
            },
            'Progressive Unfreezing': {
                'frozen_layers': 'Gradual unfreezing',
                'trainable': 'Stage-wise',
                'learning_rate': 'Decreasing',
                'use_case': 'Optimal approach'
            }
        }
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()
        for idx, (name, details) in enumerate(strategies.items()):
            ax = axes[idx]
            # Create visualization (named layer_names to avoid shadowing keras `layers`)
            layer_names = ['Input'] + [f'Conv{i}' for i in range(1, 6)] + ['Classifier']
            n_layers = len(layer_names)
            # Color coding
            colors = []
            if name == 'Feature Extraction':
                colors = ['lightblue'] + ['red'] * 5 + ['green']
            elif name == 'Partial Fine-tuning':
                colors = ['lightblue'] + ['red'] * 3 + ['yellow'] * 2 + ['green']
            elif name == 'Full Fine-tuning':
                colors = ['lightblue'] + ['yellow'] * 5 + ['green']
            else:  # Progressive
                colors = ['lightblue'] + ['orange'] * 5 + ['green']
            # Plot layers
            y_pos = np.arange(n_layers)
            ax.barh(y_pos, [1] * n_layers, color=colors, edgecolor='black', linewidth=2)
            # Labels
            ax.set_yticks(y_pos)
            ax.set_yticklabels(layer_names)
            ax.set_xlim(0, 1.5)
            ax.set_title(f'{name}', fontsize=12, weight='bold')
            # Add details
            text = f"Frozen: {details['frozen_layers']}\n"
            text += f"LR: {details['learning_rate']}\n"
            text += f"Use: {details['use_case']}"
            ax.text(1.1, n_layers/2, text, fontsize=9, va='center')
            # Remove x-axis
            ax.set_xticks([])
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='red', label='Frozen'),
            Patch(facecolor='yellow', label='Fine-tuning'),
            Patch(facecolor='green', label='Training'),
            Patch(facecolor='orange', label='Progressive')
        ]
        fig.legend(handles=legend_elements, loc='center',
                   bbox_to_anchor=(0.5, 0.95), ncol=4)
        plt.suptitle('Transfer Learning Strategies', fontsize=14, y=1.0)
        plt.tight_layout()
        plt.show()
# Vision transfer learning
vision_transfer = VisionTransferLearning()
print("\n" + "="*60)
print("COMPUTER VISION TRANSFER LEARNING")
print("="*60)
# Load and compare models
model_comparison = vision_transfer.load_pretrained_models()
print("\nVisualizing transfer strategies:")
vision_transfer.visualize_transfer_strategies()
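The staged schedule in `progressive_unfreezing` (head only at 1e-3, top 20 layers at 1e-4, whole network at 1e-5) can be written down as plain data before any model is touched. The sketch below is one way to do that under stated assumptions: `unfreezing_schedule` and its arguments are hypothetical names, not part of Keras, and the 10x learning-rate drop per stage simply mirrors the values used above.

```python
def unfreezing_schedule(n_base_layers, unfreeze_counts, base_lr=1e-3, lr_decay=0.1):
    """Return one dict per stage: how many top base layers train, and at what LR.

    unfreeze_counts: e.g. [0, 20, None]; None means "unfreeze everything".
    """
    schedule = []
    for stage, count in enumerate(unfreeze_counts):
        n_trainable = n_base_layers if count is None else min(count, n_base_layers)
        schedule.append({
            "stage": stage + 1,
            "trainable_base_layers": n_trainable,
            "frozen_base_layers": n_base_layers - n_trainable,
            "learning_rate": base_lr * (lr_decay ** stage),
        })
    return schedule


# 100 base layers is an arbitrary example size, not a property of any model
for row in unfreezing_schedule(100, [0, 20, None]):
    print(row)
```

Each row maps onto one stage of the method above: slice `base_model.layers[:-trainable_base_layers]` to find the layers to freeze, then recompile with that stage's learning rate.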
class NLPTransferLearning:
    """Transfer learning for NLP tasks"""

    def __init__(self):
        self.models = {}
        self.tokenizers = {}

    def demonstrate_embedding_transfer(self):
        """Demonstrate word embedding transfer"""
        print("\nWord Embedding Transfer:")
        print("-" * 40)
        # Simulated pre-trained embeddings
        vocab_size = 10000
        embedding_dim = 100
        # Create pre-trained embeddings (simulated)
        pretrained_embeddings = np.random.randn(vocab_size, embedding_dim)
        # Build model with pre-trained embeddings. The explicit Input builds
        # the model immediately, so the weight counts below are not zero.
        model_pretrained = keras.Sequential([
            keras.Input(shape=(None,), dtype='int32'),
            layers.Embedding(vocab_size, embedding_dim,
                             embeddings_initializer=keras.initializers.Constant(
                                 pretrained_embeddings),
                             trainable=False),  # Frozen embeddings
            layers.LSTM(128, return_sequences=True),
            layers.LSTM(64),
            layers.Dense(32, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        # Build model without pre-trained embeddings
        model_scratch = keras.Sequential([
            keras.Input(shape=(None,), dtype='int32'),
            layers.Embedding(vocab_size, embedding_dim),  # Trainable
            layers.LSTM(128, return_sequences=True),
            layers.LSTM(64),
            layers.Dense(32, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])

        def count_trainable(model):
            return sum(int(np.prod(w.shape)) for w in model.trainable_weights)

        print("Model with pre-trained embeddings:")
        print(f" Trainable parameters: {count_trainable(model_pretrained):,}")
        print("\nModel from scratch:")
        print(f" Trainable parameters: {count_trainable(model_scratch):,}")
        return model_pretrained, model_scratch
    def build_bert_style_model(self, max_length=128, vocab_size=30000):
        """Build BERT-style transformer model (simplified)"""
        # Transformer block
        def transformer_block(inputs, embed_dim, num_heads, ff_dim, rate=0.1):
            # Multi-head self-attention; key_dim is the size of each head
            attn_output = layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=embed_dim // num_heads
            )(inputs, inputs)
            attn_output = layers.Dropout(rate)(attn_output)
            out1 = layers.LayerNormalization(epsilon=1e-6)(inputs + attn_output)
            # Feed forward network
            ffn_output = keras.Sequential([
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ])(out1)
            ffn_output = layers.Dropout(rate)(ffn_output)
            out2 = layers.LayerNormalization(epsilon=1e-6)(out1 + ffn_output)
            return out2

        # Learned positional embedding, wrapped in a layer so that its
        # weights are tracked and trained as part of the model
        class PositionEmbedding(layers.Layer):
            def __init__(self, max_length, embed_dim, **kwargs):
                super().__init__(**kwargs)
                self.max_length = max_length
                self.position_embedding = layers.Embedding(max_length, embed_dim)

            def call(self, x):
                positions = tf.range(start=0, limit=self.max_length, delta=1)
                return x + self.position_embedding(positions)

        # Model architecture
        embed_dim = 128
        num_heads = 8
        ff_dim = 512
        inputs = layers.Input(shape=(max_length,))
        embedding_layer = layers.Embedding(vocab_size, embed_dim)
        x = embedding_layer(inputs)
        # Positional encoding
        x = PositionEmbedding(max_length, embed_dim)(x)
        # Transformer blocks
        x = transformer_block(x, embed_dim, num_heads, ff_dim)
        x = transformer_block(x, embed_dim, num_heads, ff_dim)
        # Classification head
        x = layers.GlobalAveragePooling1D()(x)
        x = layers.Dropout(0.1)(x)
        x = layers.Dense(32, activation="relu")(x)
        x = layers.Dropout(0.1)(x)
        outputs = layers.Dense(2, activation="softmax")(x)
        model = keras.Model(inputs=inputs, outputs=outputs)
        return model
    def demonstrate_fine_tuning_strategies(self):
        """Compare different fine-tuning strategies for NLP"""
        strategies = {
            'Feature-based': {
                'description': 'Use pre-trained as feature extractor',
                'layers_frozen': 'All transformer layers',
                'train_time': 'Fast',
                'performance': 'Good for small data'
            },
            'Fine-tuning Last': {
                'description': 'Fine-tune last transformer layer',
                'layers_frozen': 'All except last layer',
                'train_time': 'Medium',
                'performance': 'Better than feature-based'
            },
            'Full Fine-tuning': {
                'description': 'Fine-tune entire model',
                'layers_frozen': 'None',
                'train_time': 'Slow',
                'performance': 'Best with enough data'
            },
            'Adapter Tuning': {
                'description': 'Add small trainable adapters',
                'layers_frozen': 'Original weights frozen',
                'train_time': 'Fast',
                'performance': 'Efficient and effective'
            }
        }
        # Visualization
        fig, ax = plt.subplots(figsize=(12, 6))
        strategies_list = list(strategies.keys())
        metrics = ['Train Time', 'Performance', 'Memory Usage']
        # Simulated scores (illustrative only, not measurements)
        scores = {
            'Feature-based': [0.3, 0.6, 0.2],
            'Fine-tuning Last': [0.5, 0.7, 0.4],
            'Full Fine-tuning': [0.9, 0.9, 0.9],
            'Adapter Tuning': [0.4, 0.8, 0.3]
        }
        x = np.arange(len(strategies_list))
        width = 0.25
        for i, metric in enumerate(metrics):
            values = [scores[s][i] for s in strategies_list]
            ax.bar(x + i * width, values, width, label=metric)
        ax.set_xlabel('Strategy')
        ax.set_ylabel('Relative Score')
        ax.set_title('NLP Fine-tuning Strategies Comparison')
        ax.set_xticks(x + width)
        ax.set_xticklabels(strategies_list, rotation=45, ha='right')
        ax.legend()
        ax.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.show()
        # Print details
        print("\nNLP Fine-tuning Strategies:")
        print("-" * 60)
        for name, details in strategies.items():
            print(f"\n{name}:")
            for key, value in details.items():
                print(f" {key}: {value}")
# NLP transfer learning
nlp_transfer = NLPTransferLearning()
print("\n" + "="*60)
print("NLP TRANSFER LEARNING")
print("="*60)
# Demonstrate embedding transfer
pretrained_model, scratch_model = nlp_transfer.demonstrate_embedding_transfer()
# Build transformer model
print("\nBuilding BERT-style model:")
bert_model = nlp_transfer.build_bert_style_model()
print(f" Total parameters: {bert_model.count_params():,}")
# Compare strategies
nlp_transfer.demonstrate_fine_tuning_strategies()
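The "Adapter Tuning" row above describes adding small trainable modules while the original weights stay frozen. As a rough illustration, the NumPy sketch below implements one common form of adapter: a bottleneck (down-projection, ReLU, up-projection) with a residual connection. The function names, the bottleneck width of 16, and the zero-initialised up-projection are assumptions chosen for this example; `d_model = 128` matches `embed_dim` in `build_bert_style_model`.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_adapter(d_model, bottleneck):
    """Down-projection starts small and random; up-projection starts at zero."""
    return {
        "W_down": rng.normal(0.0, 0.02, size=(d_model, bottleneck)),
        "b_down": np.zeros(bottleneck),
        "W_up": np.zeros((bottleneck, d_model)),
        "b_up": np.zeros(d_model),
    }


def adapter_forward(h, params):
    """h: (..., d_model) hidden states coming out of a frozen layer."""
    z = np.maximum(h @ params["W_down"] + params["b_down"], 0.0)  # ReLU bottleneck
    return h + z @ params["W_up"] + params["b_up"]  # residual connection


d_model, bottleneck = 128, 16
adapter = init_adapter(d_model, bottleneck)
h = rng.normal(size=(4, 10, d_model))  # (batch, tokens, d_model)
out = adapter_forward(h, adapter)

adapter_params = sum(p.size for p in adapter.values())
dense_params = d_model * d_model + d_model  # one full d_model x d_model dense layer
print("identity at initialisation:", np.allclose(out, h))
print("adapter params:", adapter_params, "vs dense layer params:", dense_params)
```

Because the up-projection starts at zero, the adapter initially passes hidden states through unchanged, so inserting it does not disturb the pre-trained model before training begins.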
class DomainAdaptation:
    """Domain adaptation and cross-domain transfer"""

    def __init__(self):
        self.techniques = {}

    def demonstrate_domain_shift(self):
        """Visualize domain shift problem"""
        np.random.seed(42)
        # Generate source domain data
        source_X = np.random.randn(500, 2)
        source_y = (source_X[:, 0] + source_X[:, 1] > 0).astype(int)
        # Generate target domain data (shifted)
        target_X = np.random.randn(500, 2) + [1, 1]  # Shifted distribution
        target_y = (target_X[:, 0] + target_X[:, 1] > 2).astype(int)
        # Visualization
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        # Source domain
        axes[0].scatter(source_X[:, 0], source_X[:, 1], c=source_y,
                        cmap='coolwarm', alpha=0.6, edgecolor='black', linewidth=0.5)
        axes[0].set_title('Source Domain')
        axes[0].set_xlabel('Feature 1')
        axes[0].set_ylabel('Feature 2')
        axes[0].grid(True, alpha=0.3)
        # Target domain
        axes[1].scatter(target_X[:, 0], target_X[:, 1], c=target_y,
                        cmap='coolwarm', alpha=0.6, edgecolor='black', linewidth=0.5)
        axes[1].set_title('Target Domain')
        axes[1].set_xlabel('Feature 1')
        axes[1].set_ylabel('Feature 2')
        axes[1].grid(True, alpha=0.3)
        # Combined view
        axes[2].scatter(source_X[:, 0], source_X[:, 1], c='blue',
                        alpha=0.4, label='Source', s=30)
        axes[2].scatter(target_X[:, 0], target_X[:, 1], c='red',
                        alpha=0.4, label='Target', s=30)
        axes[2].set_title('Domain Shift Visualization')
        axes[2].set_xlabel('Feature 1')
        axes[2].set_ylabel('Feature 2')
        axes[2].legend()
        axes[2].grid(True, alpha=0.3)
        plt.suptitle('Domain Shift Problem in Transfer Learning', fontsize=14)
        plt.tight_layout()
        plt.show()
        return source_X, source_y, target_X, target_y
    def build_domain_adversarial_network(self, input_dim=100, feature_dim=50):
        """Build Domain Adversarial Neural Network (DANN)"""
        # Shared feature extractor
        feature_input = layers.Input(shape=(input_dim,))
        feature_extractor = keras.Sequential([
            layers.Dense(128, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(64, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(feature_dim, activation='relu')
        ], name='feature_extractor')
        features = feature_extractor(feature_input)
        # Task classifier
        task_classifier = keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.3),
            layers.Dense(1, activation='sigmoid')
        ], name='task_classifier')
        task_output = task_classifier(features)

        # Gradient reversal: identity in the forward pass, gradient scaled by
        # -hp_lambda in the backward pass. The custom-gradient function takes
        # only the tensor, so `self` is never treated as a differentiable input.
        class GradientReversal(layers.Layer):
            def __init__(self, hp_lambda=1.0, **kwargs):
                super().__init__(**kwargs)
                self.hp_lambda = hp_lambda

            def call(self, x):
                @tf.custom_gradient
                def grad_reverse(inputs):
                    def custom_grad(dy):
                        return -self.hp_lambda * dy
                    return tf.identity(inputs), custom_grad
                return grad_reverse(x)

        # Domain classifier with gradient reversal
        reversed_features = GradientReversal()(features)
        domain_discriminator = keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.3),
            layers.Dense(1, activation='sigmoid')
        ], name='domain_discriminator')
        domain_output = domain_discriminator(reversed_features)
        # Complete model
        model = keras.Model(
            inputs=feature_input,
            outputs=[task_output, domain_output]
        )
        print("\nDomain Adversarial Network Architecture:")
        print("-" * 40)
        print("Feature Extractor: Shared representation learning")
        print("Task Classifier: Target task prediction")
        print("Domain Discriminator: Domain classification (reversed gradients)")
        print(f"Total parameters: {model.count_params():,}")
        return model, feature_extractor, task_classifier, domain_discriminator
    def adaptation_techniques_comparison(self):
        """Compare different domain adaptation techniques"""
        techniques = {
            'Direct Transfer': {
                'complexity': 'Low',
                'data_needed': 'No target data',
                'performance': 'Poor with shift',
                'use_case': 'Similar domains'
            },
            'Fine-tuning': {
                'complexity': 'Low',
                'data_needed': 'Some target labels',
                'performance': 'Good',
                'use_case': 'Moderate shift'
            },
            'Feature Matching': {
                'complexity': 'Medium',
                'data_needed': 'Unlabeled target',
                'performance': 'Good',
                'use_case': 'Distribution shift'
            },
            'Adversarial (DANN)': {
                'complexity': 'High',
                'data_needed': 'Unlabeled target',
                'performance': 'Very good',
                'use_case': 'Large shift'
            },
            'Self-training': {
                'complexity': 'Medium',
                'data_needed': 'Unlabeled target',
                'performance': 'Good',
                'use_case': 'Confident predictions'
            }
        }
        # Create comparison matrix
        fig, ax = plt.subplots(figsize=(10, 6))
        techniques_list = list(techniques.keys())
        attributes = ['Complexity', 'Data Efficiency', 'Performance', 'Robustness']
        # Illustrative scores in [0, 1], not measurements
        scores = {
            'Direct Transfer': [0.2, 0.3, 0.4, 0.3],
            'Fine-tuning': [0.3, 0.6, 0.7, 0.6],
            'Feature Matching': [0.6, 0.8, 0.7, 0.7],
            'Adversarial (DANN)': [0.9, 0.9, 0.9, 0.8],
            'Self-training': [0.5, 0.7, 0.6, 0.5]
        }
        # Create heatmap
        score_matrix = np.array([scores[t] for t in techniques_list])
        im = ax.imshow(score_matrix, cmap='YlOrRd', aspect='auto')
        # Set ticks and labels
        ax.set_xticks(np.arange(len(attributes)))
        ax.set_yticks(np.arange(len(techniques_list)))
        ax.set_xticklabels(attributes)
        ax.set_yticklabels(techniques_list)
        # Add values to cells
        for i in range(len(techniques_list)):
            for j in range(len(attributes)):
                ax.text(j, i, f'{score_matrix[i, j]:.1f}',
                        ha="center", va="center", color="black")
        # Add colorbar
        plt.colorbar(im, ax=ax)
        ax.set_title('Domain Adaptation Techniques Comparison')
        plt.tight_layout()
        plt.show()
        return techniques
# Domain adaptation
domain_adapt = DomainAdaptation()
print("\n" + "="*60)
print("DOMAIN ADAPTATION")
print("="*60)
print("\nVisualizing domain shift:")
source_X, source_y, target_X, target_y = domain_adapt.demonstrate_domain_shift()
print("\nBuilding Domain Adversarial Network:")
dann_model, feature_ext, task_clf, domain_disc = domain_adapt.build_domain_adversarial_network()
print("\nComparing adaptation techniques:")
techniques = domain_adapt.adaptation_techniques_comparison()
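To make the "Feature Matching" row concrete, the following sketch rebuilds the shifted two-feature data from `demonstrate_domain_shift` and aligns the target feature mean to the source mean using unlabeled target data only. It is a deliberately minimal stand-in for feature matching, not a full method: `source_rule` and `mean_discrepancy` are hypothetical helper names, and the random draws differ from the seeded ones above because a separate generator is used.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same construction as demonstrate_domain_shift: target features shifted by (1, 1)
source_X = rng.normal(size=(500, 2))
target_X = rng.normal(size=(500, 2)) + np.array([1.0, 1.0])
target_y = (target_X.sum(axis=1) > 2).astype(int)  # labels used for evaluation only


def source_rule(X):
    """Decision rule that is exact on the source domain: x1 + x2 > 0."""
    return (X.sum(axis=1) > 0).astype(int)


def mean_discrepancy(A, B):
    """Squared distance between feature means (MMD with a linear kernel)."""
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))


# Align target features to the source mean; no target labels are involved
aligned_X = target_X - target_X.mean(axis=0) + source_X.mean(axis=0)

acc_before = float(np.mean(source_rule(target_X) == target_y))
acc_after = float(np.mean(source_rule(aligned_X) == target_y))
print(f"mean discrepancy before: {mean_discrepancy(source_X, target_X):.3f}")
print(f"mean discrepancy after:  {mean_discrepancy(source_X, aligned_X):.3f}")
print(f"source rule on target: accuracy before {acc_before:.3f}, after {acc_after:.3f}")
```

Mean matching works here only because the shift is a pure translation; shifts that change scale or shape need richer statistics, which is where methods such as DANN come in.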
class KnowledgeDistillation:
    """Knowledge distillation for model compression"""

    def __init__(self):
        self.models = {}

    def build_teacher_student_models(self, input_shape=(32, 32, 3), num_classes=10):
        """Build teacher (large) and student (small) models"""
        # Teacher model (large)
        teacher = keras.Sequential([
            layers.Conv2D(64, 3, padding='same', activation='relu',
                          input_shape=input_shape),
            layers.Conv2D(64, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            layers.Dropout(0.3),
            layers.Conv2D(128, 3, padding='same', activation='relu'),
            layers.Conv2D(128, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            layers.Dropout(0.3),
            layers.Conv2D(256, 3, padding='same', activation='relu'),
            layers.Conv2D(256, 3, padding='same', activation='relu'),
            layers.GlobalAveragePooling2D(),
            layers.Dense(256, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(128, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation='softmax')
        ], name='teacher')
        # Student model (small)
        student = keras.Sequential([
            layers.Conv2D(16, 3, padding='same', activation='relu',
                          input_shape=input_shape),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, padding='same', activation='relu'),
            layers.MaxPooling2D(2),
            layers.Conv2D(64, 3, padding='same', activation='relu'),
            layers.GlobalAveragePooling2D(),
            layers.Dense(32, activation='relu'),
            layers.Dense(num_classes, activation='softmax')
        ], name='student')
        print("\nTeacher-Student Model Comparison:")
        print("-" * 40)
        print(f"Teacher parameters: {teacher.count_params():,}")
        print(f"Student parameters: {student.count_params():,}")
        print(f"Compression ratio: {teacher.count_params()/student.count_params():.2f}x")
        return teacher, student
    def distillation_loss(self, y_true, student_logits, teacher_logits,
                          temperature=3.0, alpha=0.7):
        """Custom distillation loss combining hard and soft targets.

        Both model outputs must be raw logits (pre-softmax), because the
        temperature is applied before the softmax. Models that end in a
        softmax layer need that activation removed first.
        """
        # Hard target loss (standard cross-entropy on the student logits)
        hard_loss = keras.losses.categorical_crossentropy(
            y_true, student_logits, from_logits=True
        )
        # Soft target loss (KL divergence between temperature-softened outputs)
        teacher_soft = tf.nn.softmax(teacher_logits / temperature)
        student_soft = tf.nn.softmax(student_logits / temperature)
        soft_loss = keras.losses.kl_divergence(
            teacher_soft, student_soft
        ) * (temperature ** 2)
        # Combined loss
        return alpha * soft_loss + (1 - alpha) * hard_loss
    def demonstrate_distillation_process(self):
        """Visualize the knowledge distillation process"""
        # Create visualization
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        # Step 1: Train Teacher
        ax = axes[0, 0]
        ax.text(0.5, 0.7, 'Step 1: Train Teacher', ha='center', fontsize=12,
                weight='bold')
        ax.text(0.5, 0.3, 'Large model trained\non full dataset\nwith hard labels',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        # Step 2: Generate Soft Labels
        ax = axes[0, 1]
        ax.text(0.5, 0.7, 'Step 2: Generate Soft Labels', ha='center',
                fontsize=12, weight='bold')
        ax.text(0.5, 0.3, 'Teacher produces\nsoft probability\ndistributions',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        # Step 3: Train Student
        ax = axes[0, 2]
        ax.text(0.5, 0.7, 'Step 3: Train Student', ha='center', fontsize=12,
                weight='bold')
        ax.text(0.5, 0.3, 'Small model learns from\nsoft labels (knowledge)\nand hard labels',
                ha='center', fontsize=10)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        # Temperature effect
        ax = axes[1, 0]
        temps = [1, 3, 10]
        x = np.arange(5)
        logits = np.array([3.0, 1.0, 0.5, 0.2, 0.1])
        for temp in temps:
            probs = tf.nn.softmax(logits / temp).numpy()
            ax.plot(x, probs, marker='o', label=f'T={temp}')
        ax.set_xlabel('Class')
        ax.set_ylabel('Probability')
        ax.set_title('Temperature Effect on Softmax')
        ax.legend()
        ax.grid(True, alpha=0.3)
        # Loss comparison (simulated curves, for illustration only)
        ax = axes[1, 1]
        epochs = np.arange(1, 21)
        teacher_loss = 0.3 * np.exp(-0.2 * epochs) + 0.05
        student_no_kd = 0.5 * np.exp(-0.1 * epochs) + 0.15
        student_kd = 0.4 * np.exp(-0.15 * epochs) + 0.08
        ax.plot(epochs, teacher_loss, label='Teacher', linewidth=2)
        ax.plot(epochs, student_no_kd, label='Student (No KD)', linewidth=2)
        ax.plot(epochs, student_kd, label='Student (With KD)', linewidth=2)
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Loss')
        ax.set_title('Training Loss Comparison')
        ax.legend()
        ax.grid(True, alpha=0.3)
        # Model size vs accuracy (illustrative values, not measurements)
        ax = axes[1, 2]
        model_names = ['Teacher', 'Student\n(No KD)', 'Student\n(KD)', 'Pruned\nTeacher']
        sizes = [100, 20, 20, 40]
        accuracies = [95, 85, 92, 90]
        colors = ['blue', 'red', 'green', 'orange']
        ax.scatter(sizes, accuracies, s=200, c=colors, alpha=0.6, edgecolor='black')
        for i, model_name in enumerate(model_names):
            ax.annotate(model_name, (sizes[i], accuracies[i]), ha='center', va='center')
        ax.set_xlabel('Model Size (MB)')
        ax.set_ylabel('Accuracy (%)')
        ax.set_title('Model Size vs Accuracy Trade-off')
        ax.grid(True, alpha=0.3)
        plt.suptitle('Knowledge Distillation Process and Benefits', fontsize=14)
        plt.tight_layout()
        plt.show()
# Knowledge distillation
kd = KnowledgeDistillation()
print("\n" + "="*60)
print("KNOWLEDGE DISTILLATION")
print("="*60)
# Build teacher and student
teacher_model, student_model = kd.build_teacher_student_models()
# Demonstrate process
print("\nDemonstrating distillation process:")
kd.demonstrate_distillation_process()
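The distillation loss combines a temperature-softened KL term with the usual cross-entropy on hard labels. A small NumPy version makes the arithmetic easy to inspect. This is a sketch under the assumption that both models expose raw logits; the teacher logits are the ones plotted in the temperature panel above, while the student logits and the one-hot label are made up for illustration.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(y_true, student_logits, teacher_logits, temperature=3.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(y_true, student)."""
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    hard = -np.sum(y_true * np.log(softmax(student_logits) + eps), axis=-1)
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard


teacher_logits = np.array([[3.0, 1.0, 0.5, 0.2, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.3, 0.1, 0.0]])  # hypothetical student output
y_true = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])

print("teacher probs, T=1: ", softmax(teacher_logits, 1.0).round(3))
print("teacher probs, T=10:", softmax(teacher_logits, 10.0).round(3))
print("distillation loss:  ", distillation_loss(y_true, student_logits, teacher_logits))
```

Raising the temperature flattens the teacher distribution, which exposes the relative ranking of the non-target classes to the student. The `T^2` factor keeps the soft-term gradients on a comparable scale as the temperature changes.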
print("\n" + "="*60)
print("TRANSFER LEARNING IMPLEMENTATION GUIDE")
print("="*60)
implementation_guide = """
STEP-BY-STEP IMPLEMENTATION:
1. CHOOSE PRE-TRAINED MODEL:
• Match to your domain (vision/NLP/audio)
• Consider model size vs performance
• Check input requirements
• Verify license for commercial use
2. PREPARE YOUR DATA:
• Match preprocessing to pre-trained model
• Resize images to expected dimensions
• Normalize using same statistics
• Handle class imbalance
3. DESIGN ARCHITECTURE:
```python
# Load pre-trained base
base_model = tf.keras.applications.ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
# Freeze base
base_model.trainable = False
# Add custom head
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes)  # Outputs logits (no softmax)
])
```
4. TRAINING STRATEGY:
Phase 1: Feature extraction
• Freeze all base layers
• Train only new layers
• Higher learning rate (1e-3)
• 5-10 epochs
Phase 2: Fine-tuning
• Unfreeze top layers
• Lower learning rate (1e-5)
• Train 10-20 epochs
• Monitor for overfitting
5. OPTIMIZATION TIPS:
• Use differential learning rates
• Apply data augmentation
• Use callbacks (early stopping, reduce LR)
• Monitor validation metrics closely
6. EVALUATION:
• Test on held-out data
• Check for domain shift
• Analyze failure cases
• Compare with baseline
7. DEPLOYMENT:
• Optimize model size (quantization, pruning)
• Test inference speed
• Handle edge cases
• Version control models
"""
print(implementation_guide)
# Code examples
code_examples = """
PRACTICAL CODE EXAMPLES:
# Example 1: Vision Transfer Learning
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)
base.trainable = False
model = tf.keras.Sequential([
    tf.keras.layers.Input((224, 224, 3)),
    # MobileNetV2 expects pixel values scaled to [-1, 1]
    tf.keras.layers.Rescaling(1./127.5, offset=-1),
    tf.keras.layers.RandomFlip('horizontal'),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(class_names))  # Outputs logits (no softmax)
])

# Example 2: NLP Transfer Learning (using TF Hub)
import tensorflow_hub as hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Example 3: Progressive Unfreezing
def unfreeze_model(model, num_layers):
    # Unfreeze the last num_layers layers, keeping BatchNorm layers frozen
    for layer in model.layers[-num_layers:]:
        if not isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = True
    # Recompile with lower learning rate
    # from_logits=True assumes a final Dense layer without softmax, as in Example 1
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )
"""
print("\nCode Examples:")
print(code_examples)
print("\n" + "="*60)
print("TRANSFER LEARNING BEST PRACTICES")
print("="*60)
best_practices = """
KEY GUIDELINES:
1. MODEL SELECTION:
• Start with smaller models (MobileNet, EfficientNet-B0)
• Scale up only if needed
• Consider inference requirements
• Check pre-training dataset relevance
2. DATA PREPROCESSING:
• MUST match pre-trained model's preprocessing
• Use same normalization (ImageNet stats common)
• Maintain aspect ratios when resizing
• Apply augmentation after preprocessing
3. TRAINING STRATEGY:
• Always start with frozen backbone
• Gradually unfreeze layers
• Use lower learning rates for pre-trained layers
• Monitor for catastrophic forgetting
4. FINE-TUNING TIPS:
• Keep BatchNorm layers in inference mode (training=False) while fine-tuning
• Use small learning rates (1e-5 to 1e-4)
• Unfreeze from top to bottom
• Stop if validation loss increases
5. REGULARIZATION:
• Add dropout to new layers (0.2-0.5)
• Use L2 regularization sparingly
• Data augmentation crucial
• Consider mixup/cutmix
6. COMMON PITFALLS:
✗ Wrong preprocessing:
Solution: Check model documentation
✗ Unfreezing too early:
Solution: Train head first
✗ Learning rate too high:
Solution: Use 10x-100x lower for fine-tuning
✗ Overfitting to small dataset:
Solution: Keep more layers frozen
✗ Catastrophic forgetting:
Solution: Lower learning rate, gradual unfreezing
✗ Domain mismatch:
Solution: Consider domain adaptation
7. WHEN NOT TO USE:
• Very different domains
• Sufficient data (>100K samples)
• Unique data characteristics
• Real-time requirements incompatible
8. PERFORMANCE OPTIMIZATION:
• Use mixed precision training
• Optimize batch size for GPU
• Cache preprocessed data
• Use tf.data.Dataset properly
"""
print(best_practices)
# Performance comparison
performance_tips = """
PERFORMANCE COMPARISON:
Approach | Data Needed | Training Time | Accuracy
-----------------|-------------|---------------|----------
From Scratch | >100K | Days | Baseline
Feature Extract | 1K-10K | Minutes | Good
Fine-tuning | 10K-50K | Hours | Better
Full Fine-tuning | >50K | Hours-Days | Best
OPTIMIZATION TECHNIQUES:
• Quantization: up to ~4x size reduction (float32 to int8), usually with small accuracy loss
• Pruning: Remove redundant weights
• Knowledge Distillation: Compress to smaller model
• ONNX/TensorFlow Lite: Optimize for deployment
"""
print(performance_tips)
Implement transfer learning across multiple domains:
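One possible starting point for this exercise is a small helper that picks a strategy per task from dataset size and domain similarity. The thresholds below follow the rule-of-thumb table and the "when not to use" list in this lesson; the function name `choose_strategy` and the example tasks are hypothetical and exist only to exercise the rules.

```python
def choose_strategy(n_samples, similar_domain=True):
    """Rule-of-thumb strategy choice; thresholds follow this lesson's table."""
    if not similar_domain and n_samples > 100_000:
        return "train from scratch"
    if n_samples < 10_000:
        return "feature extraction"
    if n_samples <= 50_000:
        return "partial fine-tuning"
    return "full fine-tuning"


# Hypothetical target tasks: (number of labelled samples, similar to source domain?)
tasks = {
    "vision: plant disease photos": (2_000, True),
    "nlp: support ticket routing": (30_000, True),
    "audio: machine fault sounds": (80_000, True),
    "vision: radar imagery": (250_000, False),
}
for name, (n_samples, similar) in tasks.items():
    print(f"{name:32s} -> {choose_strategy(n_samples, similar)}")
```

These cut-offs are heuristics rather than hard limits, so treat the output as a first guess to validate against a held-out set.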
Build few-shot learning system:
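A simple baseline for this exercise is nearest-prototype classification on top of frozen embeddings: average the few labelled support embeddings per class, then assign each query to the closest class mean. The sketch below assumes the embeddings come from some pre-trained backbone and substitutes synthetic Gaussian clusters for them; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def class_prototypes(support_emb, support_labels):
    """Mean embedding per class, computed from the few labelled support examples."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos


def classify(query_emb, classes, protos):
    """Assign each query to the class with the nearest prototype."""
    # Squared Euclidean distance from every query to every prototype
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]


# Synthetic stand-in for embeddings from a frozen pre-trained backbone
n_way, k_shot, n_query, dim = 3, 5, 20, 16
centers = rng.normal(scale=5.0, size=(n_way, dim))
support_labels = np.repeat(np.arange(n_way), k_shot)
query_labels = np.repeat(np.arange(n_way), n_query)
support_emb = centers[support_labels] + rng.normal(size=(n_way * k_shot, dim))
query_emb = centers[query_labels] + rng.normal(size=(n_way * n_query, dim))

classes, protos = class_prototypes(support_emb, support_labels)
pred = classify(query_emb, classes, protos)
accuracy = float((pred == query_labels).mean())
print(f"{n_way}-way {k_shot}-shot accuracy: {accuracy:.2f}")
```

With real data, `support_emb` and `query_emb` would be the pooled outputs of a frozen base model, so no gradient updates are needed to add a new class.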
Transfer between different modalities:
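One simple way to approach this exercise is to learn a projection from one modality's feature space into another's and then classify by nearest neighbour in the shared space. The sketch below fits a linear image-to-text projection by least squares on "seen" classes and evaluates it on held-out classes. Everything here is synthetic and assumed: the image features are generated as a noisy linear function of per-class text embeddings, which real data will only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_seen, text_dim, img_dim, per_class = 12, 10, 8, 32, 40

# Synthetic stand-ins: one text embedding per class, and image features that are
# an unknown linear function of the class text embedding plus noise
text_emb = rng.normal(size=(n_classes, text_dim))
mixing = rng.normal(size=(text_dim, img_dim))
labels = np.repeat(np.arange(n_classes), per_class)
img_feat = text_emb[labels] @ mixing + 0.1 * rng.normal(size=(labels.size, img_dim))

# Fit the image -> text projection on the seen classes only (least squares)
seen = labels < n_seen
W, *_ = np.linalg.lstsq(img_feat[seen], text_emb[labels[seen]], rcond=None)


def predict(features):
    """Project image features into text space, then pick the nearest class embedding."""
    projected = features @ W
    d = ((projected[:, None, :] - text_emb[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


unseen_acc = float((predict(img_feat[~seen]) == labels[~seen]).mean())
print(f"accuracy on classes never seen during fitting: {unseen_acc:.2f}")
```

The held-out classes are recognised only through their text embeddings, which is the core idea behind zero-shot cross-modal transfer. Practical systems replace the linear map with jointly trained encoders.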