RLHF: Reinforcement Learning from Human Feedback

Master the three-phase RLHF pipeline that aligns AI systems with human preferences and values, powering systems like ChatGPT and Claude.

50 min read · Advanced

🎯 What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains AI systems to behave in ways that humans prefer. It's the key technology behind ChatGPT, Claude, and other aligned AI systems that are helpful, harmless, and honest.

Why RLHF?

Language models trained only on next-token prediction can be unhelpful, biased, or harmful

The Solution

Train models to optimize for human preferences rather than just predicting text

Result

AI systems that are more helpful, safer, and aligned with human values

🔄 RLHF Training Pipeline

RLHF Overview

Understanding the three-phase RLHF training pipeline

Key Steps

1. Supervised Fine-tuning (SFT)
2. Reward Model Training
3. Reinforcement Learning Optimization

Implementation

# RLHF Training Pipeline Overview
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from typing import List, Dict, Any

class RLHFPipeline:
    """Complete RLHF training pipeline for language models"""
    
    def __init__(self, base_model_name: str):
        self.base_model_name = base_model_name
        self.tokenizer = GPT2Tokenizer.from_pretrained(base_model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Models created in each phase
        self.sft_model = None        # Phase 1: supervised fine-tuned model
        self.reward_model = None     # Phase 2: reward model
        self.rl_model = None         # Phase 3: RL-optimized policy
        self.reference_model = None  # Frozen copy used for the KL penalty
    
    def phase_1_supervised_finetuning(self, instruction_data: List[Dict]):
        """
        Phase 1: Supervised fine-tuning on high-quality demonstrations
        Goal: Teach the model to follow instructions well
        """
        print("Phase 1: Supervised Fine-tuning")
        
        # Load the pre-trained base model specified in the constructor
        self.sft_model = GPT2LMHeadModel.from_pretrained(self.base_model_name)
        
        # Prepare instruction-following data
        # Format: {"prompt": "...", "completion": "..."}
        training_examples = []
        for example in instruction_data:
            formatted_input = f"Human: {example['prompt']}\n\nAssistant: {example['completion']}"
            training_examples.append(formatted_input)
        
        # Fine-tune on demonstrations
        # This teaches the model to follow instructions and be helpful
        print(f"Training on {len(training_examples)} demonstration examples")
        
        return self.sft_model
    
    def phase_2_reward_modeling(self, preference_data: List[Dict]):
        """
        Phase 2: Train reward model on human preferences
        Goal: Learn what humans prefer in responses
        """
        print("Phase 2: Reward Model Training")
        
        # Create reward model by adding scalar head to base model
        class RewardModel(nn.Module):
            def __init__(self, base_model):
                super().__init__()
                self.base_model = base_model
                self.reward_head = nn.Linear(base_model.config.hidden_size, 1)
                
            def forward(self, input_ids, attention_mask=None):
                outputs = self.base_model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
                # Score the sequence from the final token's hidden state
                # (with padded batches, use the last non-padding token instead)
                last_hidden_state = outputs.hidden_states[-1][:, -1, :]
                reward = self.reward_head(last_hidden_state)
                return reward
        
        self.reward_model = RewardModel(GPT2LMHeadModel.from_pretrained(self.base_model_name))
        
        # Train on preference comparisons
        # Format: {"prompt": "...", "chosen": "...", "rejected": "..."}
        print(f"Training reward model on {len(preference_data)} preference pairs")
        
        # Bradley-Terry preference loss: the chosen response should score
        # higher than the rejected one (the training loop is omitted in this sketch)
        def preference_loss(chosen_rewards, rejected_rewards):
            return -nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
        
        return self.reward_model
    
    def phase_3_reinforcement_learning(self, prompts: List[str]):
        """
        Phase 3: Optimize policy using reinforcement learning
        Goal: Maximize reward while staying close to original model
        """
        print("Phase 3: Reinforcement Learning with PPO")
        
        # Initialize RL models (in practice, both start from the SFT checkpoint)
        self.rl_model = GPT2LMHeadModel.from_pretrained(self.base_model_name)  # Policy to optimize
        self.reference_model = GPT2LMHeadModel.from_pretrained(self.base_model_name)  # Frozen reference
        self.reference_model.eval()
        for param in self.reference_model.parameters():
            param.requires_grad = False  # The reference model is never updated
        
        # PPO hyperparameters
        kl_coeff = 0.1    # Weight of the KL-divergence penalty
        clip_ratio = 0.2  # Clip parameter epsilon, used in the PPO update loop (not shown here)
        
        def compute_rewards(prompt_ids, response_ids):
            """Compute rewards including KL penalty"""
            full_ids = torch.cat([prompt_ids, response_ids], dim=1)
            
            # Get reward from reward model
            reward_score = self.reward_model(full_ids)
            
            # Compute KL divergence penalty
            with torch.no_grad():
                ref_logits = self.reference_model(full_ids).logits
                ref_log_probs = torch.log_softmax(ref_logits, dim=-1)
            
            policy_logits = self.rl_model(full_ids).logits
            policy_log_probs = torch.log_softmax(policy_logits, dim=-1)
            
            # Per-token KL divergence of the policy from the frozen reference, KL(pi || pi_ref)
            kl_div = torch.sum(torch.exp(policy_log_probs) * (policy_log_probs - ref_log_probs), dim=-1)
            
            # Final reward = reward-model score minus the KL penalty (averaged over tokens)
            final_reward = reward_score.squeeze(-1) - kl_coeff * kl_div.mean(dim=-1)
            
            return final_reward
        
        print(f"Optimizing policy on {len(prompts)} prompts with PPO")
        
        return self.rl_model
    
    def generate_rlhf_response(self, prompt: str, max_length: int = 100):
        """Generate response using RLHF-trained model"""
        if self.rl_model is None:
            raise ValueError("Must complete RLHF training first")
        
        inputs = self.tokenizer.encode(prompt, return_tensors='pt')
        
        with torch.no_grad():
            outputs = self.rl_model.generate(
                inputs,
                max_length=max_length,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens, not the prompt
        generated_tokens = outputs[0][inputs.shape[1]:]
        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# Example usage
def run_rlhf_pipeline():
    pipeline = RLHFPipeline('gpt2')
    
    # Phase 1: SFT data
    sft_data = [
        {"prompt": "Explain quantum computing", "completion": "Quantum computing uses quantum mechanics..."},
        {"prompt": "Write a poem about AI", "completion": "In circuits deep and neural vast..."}
    ]
    
    # Phase 2: Preference data  
    preference_data = [
        {
            "prompt": "How do I cook pasta?",
            "chosen": "Boil water, add salt, cook pasta for 8-12 minutes until al dente.",
            "rejected": "Just put it in water and hope for the best."
        }
    ]
    
    # Phase 3: RL prompts
    rl_prompts = ["Explain machine learning", "Write a helpful response about..."]
    
    # Run pipeline
    sft_model = pipeline.phase_1_supervised_finetuning(sft_data)
    reward_model = pipeline.phase_2_reward_modeling(preference_data)
    rl_model = pipeline.phase_3_reinforcement_learning(rl_prompts)
    
    # Generate RLHF response
    response = pipeline.generate_rlhf_response("How can I help you today?")
    print(f"RLHF Response: {response}")
    
    return pipeline

⚙️ RL Algorithms for RLHF

Proximal Policy Optimization

Most popular algorithm for RLHF due to stability

Advantages

  • Stable training
  • Good sample efficiency
  • Prevents policy collapse

Key Features

  • Clipped objective
  • KL divergence penalty
  • Multiple epochs per batch

Mathematical Formulation

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - ε, 1 + ε) Â_t ) ]

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio between the current and old policy, Â_t is the advantage estimate, and ε is the clip parameter (typically 0.1 to 0.3).
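
To make the objective concrete, here is a minimal PyTorch sketch of the clipped surrogate loss. The function name ppo_clipped_loss and its arguments are illustrative rather than taken from a specific library, and the sign is flipped so it can be minimized with a standard optimizer:

# PPO clipped surrogate loss (sketch)
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the formula above, negated for minimization."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximizing min(unclipped, clipped) is the same as minimizing its negative mean
    return -torch.min(unclipped, clipped).mean()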

⚠️ RLHF Challenges

Reward Hacking

Model exploits reward function in unintended ways

Examples

  • Generating very long responses to get higher scores
  • Using specific phrases that reward model likes
  • Avoiding difficult questions to maintain high ratings

Solutions

  • Diverse preference data collection
  • Constitutional AI techniques
  • Multi-objective optimization (e.g., the length-penalty sketch after this list)
  • Human oversight during training
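
As one concrete illustration of multi-objective reward shaping, the sketch below adds a simple length penalty to counteract the "longer responses score higher" failure mode. The function name, target length, and penalty coefficient are illustrative assumptions, not a standard recipe:

# Length-penalized reward shaping (illustrative sketch)
import torch

def length_penalized_reward(reward_scores, response_lengths,
                            target_length=200, penalty_coeff=0.01):
    """Subtract a penalty for tokens beyond a target length so the policy
    cannot inflate its reward simply by generating longer responses."""
    excess = torch.clamp(response_lengths.float() - target_length, min=0.0)
    return reward_scores - penalty_coeff * excess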

🌍 Real-World RLHF Applications

Successful Applications

  • ChatGPT & GPT-4: OpenAI's conversational AI systems
  • Claude: Anthropic's helpful and harmless assistant
  • Bard: Google's conversational AI with safety guardrails
  • GitHub Copilot: Code generation aligned with best practices
  • Content moderation: AI systems that understand harmful content

Key Success Factors

  • High-quality human feedback: Expert annotators and clear guidelines
  • Iterative improvement: Continuous data collection and model updates
  • Multi-objective optimization: Balancing helpfulness with safety
  • Constitutional AI: Self-supervision using AI principles
  • Red teaming: Adversarial testing for edge cases

Impact

RLHF has transformed AI from systems that simply predict text to AI assistants that understand and respect human values, making AI safer and more beneficial for real-world deployment.

🔮 Future Directions

Research Frontiers

  • Constitutional AI: Self-improving systems using AI principles
  • Direct Preference Optimization (DPO): Simpler alternative to PPO (see the sketch after this list)
  • Multi-agent RLHF: Training with diverse human perspectives
  • Scalable oversight: AI systems helping humans evaluate AI
  • Weak-to-strong generalization: Supervising superhuman systems
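
For the DPO item above, here is a minimal sketch of the DPO loss in its standard formulation (the function name and beta value are illustrative). It optimizes the policy directly on preference pairs, with no separate reward model or PPO loop:

# Direct Preference Optimization loss (sketch)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: increase the likelihood margin of chosen over rejected responses,
    measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()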

Technical Improvements

  • Sample efficiency: Learning from fewer human comparisons
  • Online learning: Continuous improvement during deployment
  • Robustness: Maintaining alignment across diverse scenarios
  • Interpretability: Understanding what models have learned
  • Value learning: Inferring human values from behavior

🎯 Key Takeaways

RLHF is a three-phase process: Supervised fine-tuning, reward modeling, and RL optimization

Human preferences matter: RLHF aligns AI systems with human values and preferences

PPO is the gold standard: Most successful RLHF systems use Proximal Policy Optimization

Challenges remain: Reward hacking, distribution shift, and scalability are active research areas

Real-world impact: RLHF powers the most successful AI assistants like ChatGPT and Claude

📝 RLHF Mastery Check


What are the three main phases of RLHF training?