Advanced Model Training
Modern ML systems require sophisticated training strategies to handle large-scale data, distributed computing, and evolving requirements. From distributed training across GPU clusters to federated learning on edge devices, these techniques make it feasible to train models that were previously out of reach due to computational or privacy constraints.
Scale Reality: GPT-3 has 175 billion parameters and required on the order of 3.14 × 10^23 FLOPs (roughly 3,640 petaflop/s-days) of training compute. Modern training strategies make training at this scale achievable and cost-effective.
Advanced Training Strategies
Distributed Training
Scale training across multiple GPUs and nodes for massive models
Key Advantages
- Near-linear scaling with the number of GPUs
- Handles large models and datasets
- Efficient gradient synchronization
- Fault-tolerance options
Use Cases
- Large language models
- Computer vision models
- Multi-node training
- Production ML pipelines
Implementation
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
import os

class DistributedTrainer:
    def __init__(self, model, train_dataset, val_dataset, config):
        self.config = config
        self.setup_distributed()

        # Move model to GPU and wrap with DDP
        self.model = model.to(self.device)
        self.model = DDP(self.model, device_ids=[self.local_rank])

        # Setup distributed data loaders
        self.train_sampler = DistributedSampler(train_dataset)
        self.train_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=config.batch_size,
            sampler=self.train_sampler,
            num_workers=config.num_workers,
            pin_memory=True
        )

        # Optimizer with learning rate scaled by world size (linear scaling rule)
        self.optimizer = torch.optim.AdamW(
            self.model.parameters(),
            lr=config.learning_rate * self.world_size,  # Scale LR
            weight_decay=config.weight_decay
        )
        self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            self.optimizer, T_max=config.num_epochs
        )

    def setup_distributed(self):
        """Initialize the distributed training environment."""
        self.local_rank = int(os.environ.get('LOCAL_RANK', 0))
        self.world_size = int(os.environ.get('WORLD_SIZE', 1))
        self.rank = int(os.environ.get('RANK', 0))

        # Initialize process group
        dist.init_process_group(
            backend='nccl',  # Use NCCL for GPU communication
            init_method='env://',
            world_size=self.world_size,
            rank=self.rank
        )

        # Pin this process to its GPU
        torch.cuda.set_device(self.local_rank)
        self.device = torch.device(f'cuda:{self.local_rank}')

    def train_epoch(self, epoch):
        """Train for one epoch with gradient synchronization."""
        self.model.train()
        self.train_sampler.set_epoch(epoch)  # Shuffle data differently each epoch

        total_loss = 0.0
        num_batches = 0
        for batch_idx, (data, target) in enumerate(self.train_loader):
            data, target = data.to(self.device), target.to(self.device)
            self.optimizer.zero_grad()

            # Forward pass
            output = self.model(data)
            loss = nn.CrossEntropyLoss()(output, target)

            # Backward pass - DDP synchronizes gradients across processes
            loss.backward()

            # Gradient clipping to guard against exploding gradients
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()

            total_loss += loss.item()
            num_batches += 1

            if batch_idx % 100 == 0 and self.rank == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        # Average the epoch loss across all processes
        avg_loss = torch.tensor(total_loss / num_batches, device=self.device)
        dist.all_reduce(avg_loss, op=dist.ReduceOp.AVG)
        return avg_loss.item()

# Launch distributed training
def launch_distributed_training():
    # Launch with: torchrun --nproc_per_node=4 --nnodes=2 train_script.py
    trainer = DistributedTrainer(model, train_dataset, val_dataset, config)
    for epoch in range(config.num_epochs):
        train_loss = trainer.train_epoch(epoch)
        trainer.scheduler.step()
        if trainer.rank == 0:  # Only log on the main process
            print(f"Epoch {epoch}: Average Loss = {train_loss:.4f}")
    dist.destroy_process_group()
Optimization Techniques
Hyperparameter Optimization
Systematically find optimal hyperparameters for best model performance
- Grid Search (low complexity): small search spaces with few parameters
- Random Search (medium complexity): high-dimensional spaces with many parameters (see the sketch below)
- Bayesian Optimization (high complexity): expensive evaluations and continuous spaces
- Population-based Training (high complexity): deep learning and online optimization
Implementation
import optuna

def bayesian_optimization_example():
    def objective(trial):
        # Define hyperparameter search space
        lr = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
        batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])
        dropout = trial.suggest_float('dropout', 0.1, 0.5)
        hidden_size = trial.suggest_int('hidden_size', 64, 512)

        # Build and train a model with these hyperparameters
        model = create_model(hidden_size=hidden_size, dropout=dropout)

        # Train and evaluate
        train_model(model, lr=lr, batch_size=batch_size)
        val_accuracy = evaluate_model(model, val_loader)
        return val_accuracy

    # Create a study and optimize
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=100)

    print(f"Best hyperparameters: {study.best_params}")
    print(f"Best accuracy: {study.best_value}")
    return study.best_params
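Because each trial can be expensive, Optuna can also prune unpromising trials early. The sketch below is a minimal, hypothetical extension of the objective above: create_model, evaluate_model, and val_loader are the same placeholders as before, and train_one_epoch is an assumed per-epoch training helper.

import optuna

def objective_with_pruning(trial):
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
    model = create_model(hidden_size=256, dropout=0.3)   # placeholder helper

    for epoch in range(10):
        train_one_epoch(model, lr=lr)                    # placeholder helper
        val_accuracy = evaluate_model(model, val_loader)

        # Report the intermediate score so the pruner can act on it
        trial.report(val_accuracy, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_accuracy

# MedianPruner stops trials whose intermediate results fall below
# the median of earlier trials at the same step
study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=2)
)
study.optimize(objective_with_pruning, n_trials=100)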
Training Monitoring
Training Metrics
Core metrics for tracking model learning progress (see the collection sketch after this list)
- Loss curves (train/validation)
- Accuracy and F1-score
- Learning rate schedules
- Gradient norms
- Weight distributions
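As a minimal sketch of how these metrics can be gathered inside a PyTorch training loop: the log_fn callback below is a placeholder for whatever tracking backend is in use (TensorBoard, MLflow, Weights & Biases), and the function assumes it is called after loss.backward() so gradients are populated.

import torch

def log_training_metrics(model, optimizer, loss, step, log_fn=print):
    """Collect the core training metrics for a single step."""
    metrics = {
        'train/loss': loss.item(),
        'train/learning_rate': optimizer.param_groups[0]['lr'],
    }

    # Global gradient norm (only meaningful after loss.backward())
    grad_norms = [p.grad.detach().norm(2)
                  for p in model.parameters() if p.grad is not None]
    if grad_norms:
        metrics['train/grad_norm'] = torch.norm(torch.stack(grad_norms)).item()

    # Summary of weight distributions per parameter tensor
    for name, param in model.named_parameters():
        metrics[f'weights/{name}_mean'] = param.detach().mean().item()
        metrics[f'weights/{name}_std'] = param.detach().std().item()

    log_fn({'step': step, **metrics})
    return metrics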
Training Best Practices
Performance
- Use mixed precision training for speed and memory savings (see the sketch after this list)
- Implement gradient clipping to prevent exploding gradients
- Use learning rate scheduling for optimal convergence
- Monitor GPU utilization and adjust batch sizes accordingly
- Leverage data parallelism and model parallelism
- Optimize data loading with multiple workers
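A minimal sketch of automatic mixed precision (AMP) in a standard PyTorch training loop; model, optimizer, and train_loader are placeholders standing in for whatever the surrounding training code defines.

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow

for data, target in train_loader:              # placeholder loader
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()

    # Run the forward pass in mixed precision
    with autocast():
        output = model(data)                   # placeholder model
        loss = torch.nn.functional.cross_entropy(output, target)

    # Scale the loss, backpropagate, then unscale before clipping and stepping
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()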
Reliability
- Save checkpoints regularly for fault tolerance (see the sketch after this list)
- Validate on held-out data throughout training
- Implement early stopping to prevent overfitting
- Log comprehensive metrics for debugging
- Use version control for code and data
- Test the training pipeline on a small dataset first
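Checkpointing and early stopping can be combined in a few lines. This is a minimal sketch: train_one_epoch and validate are assumed helpers, and the patience value and checkpoint paths are illustrative only.

import torch

best_val_loss = float('inf')
patience, epochs_without_improvement = 5, 0    # illustrative patience value

for epoch in range(config.num_epochs):
    train_one_epoch(model, optimizer)          # placeholder helpers
    val_loss = validate(model, val_loader)

    # Save a full checkpoint every epoch for fault tolerance
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'val_loss': val_loss,
    }, f'checkpoint_epoch_{epoch}.pt')          # illustrative path

    # Early stopping on the validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f'Early stopping at epoch {epoch}')
            break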
Key Takeaways
- Scale systematically: Use distributed training for large models and federated learning for privacy-sensitive data
- Optimize hyperparameters: Bayesian optimization is most efficient for expensive model evaluations
- Prevent overfitting: Use regularization, data augmentation, and early stopping for better generalization
- Monitor comprehensively: Track both training metrics and system metrics for optimal performance
- Plan for production: Implement checkpointing, fault tolerance, and automated monitoring from day one
Related Technologies for Model Training
- PyTorch: Deep learning framework with distributed training support
- TensorFlow: End-to-end ML platform with TPU and distributed training
- MLflow: Track experiments, hyperparameters, and model versions
- Ray: Distributed computing framework for hyperparameter tuning
- Kubernetes: Container orchestration for scalable training jobs
- Weights & Biases: Experiment tracking and model management platform