International Programme on AI Evaluation: Capabilities and Safety

Module 8: AI meets Values (II)

Overview

This lecture presents the three-paradigm framework that organizes alignment methods, explains why RLHF remains the industry standard despite its computational costs, and surveys practical alternatives like DPO and in-context alignment that trade effectiveness for efficiency.

This lecture was taught by Xiaoyuan Yi from Microsoft Research Asia.


Key Takeaways

  1. All alignment methods fall into one of three paradigms: SFT-based (supervised fine-tuning), RL-based (reinforcement learning), or in-context (prompt-based).
  2. Instruction tuning (the SFT foundation) benefits more from task diversity than sample quantity. 50 examples each across 1,000 different tasks outperform 500 examples each across 100 tasks because models learn to generalize across task types.
  3. RLHF and DPO are both RL-based alignment methods but differ in implementation: RLHF trains a separate reward model and then optimizes against it, while DPO optimizes directly on preference data without one.
  4. DPO is mathematically equivalent to RLHF but performs worse in practice because discrete language tokens create non-smooth reward signals compared to RLHF’s continuous numerical scores.
  5. In-context alignment controls model behavior through prompts without any training, but models forget value instructions in longer conversations and rely on surface-level pattern matching.
  6. Models have capacity (knowledge) to behave appropriately but lack capability (consistent application) - alignment tuning teaches selection among existing responses rather than learning new knowledge.
  7. Current alignment definitions break down for superintelligent systems: forcing a superior AI to match human utility functions would make it worse than simply letting it exceed human performance.

Key Terms

Instruction tuning: Teaching models to follow directions rather than continue text
SFT paradigm: Supervised learning on preferred human responses
RL paradigm: Learning through reward models and policy optimization
In-context paradigm: Controlling behavior through prompts without training
RLHF: Three-stage process using human preferences to train reward models
DPO: Direct preference optimization without separate reward models
GRPO: Variance reduction through sampling multiple responses per question
Bradley-Terry model: Converting preference pairs into numerical reward scores
PPO: Policy optimization with a KL-divergence penalty that keeps the policy near its reference model and prevents collapse
Capacity vs capability: What models know vs. how they apply knowledge


Detailed Notes

Every alignment method is a form of supervised fine-tuning (SFT), reinforcement learning (RL), or in-context learning. SFT trains models on examples of preferred responses, RL optimizes against learned reward functions, and in-context methods use prompts to control behavior without changing any model parameters.

SFT methods are stable and interpretable but limited by the quality of your training examples. RL methods can discover novel good behaviors but risk instability and reward hacking. In-context methods require no training but depend entirely on the base model’s existing capabilities and can forget instructions in longer interactions.

RLHF as Multi-Paradigm
RLHF actually combines all three: it starts with SFT for basic instruction following, uses RL for preference optimization, and relies on in-context learning through system prompts. This combination explains why it works so well compared to single-paradigm approaches.

Instruction Tuning

Instruction tuning teaches models to follow explicit directions rather than just continuing text. Early language models would respond to “Explain the moon landing to a 6-year-old” by generating more text about moon landings rather than actually explaining anything. Instruction tuning fixes this by training on question-answer pairs across thousands of different tasks.
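
To make this concrete, here is a minimal sketch (in Python, not from the lecture) of what a single instruction-tuning example looks like and how the SFT loss is computed only over the response tokens. The per-token log-probabilities and the mask are hypothetical placeholders rather than output from a real model.

```python
import numpy as np

# A toy instruction-tuning example: the model is trained to predict the response
# tokens, conditioned on the instruction (prompt tokens are masked out of the loss).
example = {
    "instruction": "Explain the moon landing to a 6-year-old.",
    "response": "People built a big rocket and flew to the moon to look around.",
}

def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log P(token_t | tokens_<t) for every position, shape (T,)
    loss_mask:      1.0 for response positions, 0.0 for instruction positions
    """
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    loss_mask = np.asarray(loss_mask, dtype=float)
    return -(token_logprobs * loss_mask).sum() / loss_mask.sum()

# Hypothetical per-token log-probabilities for a 6-token sequence
# (2 instruction tokens followed by 4 response tokens).
logprobs = [-1.2, -0.8, -0.5, -0.9, -0.3, -0.7]
mask     = [ 0.0,  0.0,  1.0,  1.0,  1.0,  1.0]
print(sft_loss(logprobs, mask))  # loss averaged over the response tokens only
```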

The scaling laws for instruction tuning are counterintuitive. You get better results from 50 examples each across 1,000 different tasks than from 500 examples each across 100 tasks. Task diversity matters more than sample quantity because models learn to generalize across task types rather than memorizing specific responses. Most tasks need only 50-100 examples to contribute meaningfully to instruction-following ability.

Practical datasets include SuperNatural Instructions (1,600+ tasks, 3 million examples), P3 (the Public Pool of Prompts, crowdsourced prompt templates applied to existing NLP datasets), and FLAN (2,000 tasks with multiple instruction templates). Models between 10-100 billion parameters are sufficient for most applications. Scaling from 62 billion to 500 billion parameters provides diminishing returns compared to increasing task diversity.

But instruction following alone doesn’t create aligned systems. Models learn to follow directions without understanding whether those directions are appropriate, safe, or aligned with human values. A model that perfectly follows instructions might still help users with harmful requests because it’s optimizing for instruction compliance rather than beneficial outcomes.

Reinforcement Learning

Reinforcement Learning from Human Feedback (RLHF)

RLHF works through three stages:

  1. Collect preference data
  2. Train a reward model
  3. Optimize the language model against that reward model

Preference data. The preference data consists of pairs where humans rate one response as better than another for the same question. This comparative approach works better than absolute ratings because humans struggle to assign consistent numerical scores but can reliably say which response they prefer.

Reward model. The reward model uses the Bradley-Terry framework to convert these preferences into numerical scores. If humans prefer response A over response B, the model learns to assign higher scores to A-type responses. The mathematical formulation treats preference as a sigmoid function of the score difference between responses, creating a smooth optimization target.
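
As a rough sketch of that objective (assuming the reward model already maps a prompt-response pair to a scalar score; the numbers below are hypothetical):

```python
import numpy as np

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of a human preference under the Bradley-Terry model:
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    prob_chosen = 1.0 / (1.0 + np.exp(-margin))
    return -np.log(prob_chosen)

# Hypothetical scalar scores from a reward model for one preference pair.
print(bradley_terry_loss(reward_chosen=1.4, reward_rejected=0.2))  # small loss: model agrees with the human
print(bradley_terry_loss(reward_chosen=0.1, reward_rejected=0.9))  # large loss: model disagrees
```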

Optimize the LLM. The final stage uses Proximal Policy Optimization (PPO) to fine-tune the language model. PPO balances two objectives: maximizing the reward model’s scores and staying close to the reference model from the previous training iteration. The reward term pushes the model toward human preferences while the regularization term keeps it grounded in human-like response patterns.

PPO Objective Breakdown
PPO optimizes two terms simultaneously: maximizing expected reward from the current policy and minimizing KL divergence from the reference model. The KL term prevents the model from deviating too far from its supervised fine-tuned baseline, which stops it from learning to refuse all questions to avoid negative rewards.
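
The sketch below shows only the reward-minus-KL shape of this objective for a single sampled response; real PPO implementations add clipped surrogate losses, advantages from the value function, and per-token credit assignment, and all numbers here are hypothetical.

```python
def rlhf_objective(reward, logprob_policy, logprob_reference, beta=0.1):
    """Shape of the RLHF training signal for one sampled response: the reward
    model's score minus a penalty for drifting away from the reference (SFT) model.

    logprob_policy / logprob_reference: total log-probability each model assigns
    to the sampled response; their difference is a single-sample KL estimate.
    beta: strength of the KL penalty.
    """
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate

# Hypothetical numbers: the response scores well but has drifted from the reference.
print(rlhf_objective(reward=0.83, logprob_policy=-42.0, logprob_reference=-45.0))
```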

The computational requirements are substantial. Training a 100-billion parameter model requires loading four models simultaneously:

  • Current policy. The current state of the language model being updated through PPO.
  • Reference policy. A frozen copy of the model from before PPO training that prevents the current model from deviating too far.
  • Reward model. The neural network trained on human preferences that scores how good each response is.
  • Value function. A separate model that estimates the long-term expected reward from any given state-action pair, used to reduce variance in the policy gradient updates. The value function helps stabilize training by providing better estimates of future rewards rather than relying only on immediate reward signals from single responses.

This means 400+ billion parameters in memory for a 100-billion parameter base model, requiring 100+ high-end GPUs and making RLHF accessible only to well-resourced organizations.
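
A back-of-envelope calculation (assuming bf16 weights at 2 bytes per parameter, and ignoring optimizer states, gradients, and activations, which multiply the footprint several times over and drive the GPU count into the hundreds):

```python
params_per_model = 100e9   # 100-billion parameter base model
models_loaded = 4          # policy, reference policy, reward model, value function
bytes_per_param = 2        # bf16 weights

weight_bytes = params_per_model * models_loaded * bytes_per_param
print(f"{weight_bytes / 1e9:.0f} GB of weights alone")                         # 800 GB
print(f"at least {weight_bytes / 80e9:.0f} x 80 GB GPUs just to hold them")    # 10 GPUs
```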

Group Relative Policy Optimization (GRPO)

GRPO emerged as an improvement to standard RLHF, particularly for tasks requiring longer responses like multi-step reasoning. The key innovation is variance reduction through multiple sampling. Instead of training on one response per question, GRPO samples a group of K responses from the current model for each question and uses the group’s average reward as a baseline for scoring each one.

This approach addresses a fundamental problem in reinforcement learning for language models: high variance in reward signals. Early in training, models are uncertain and can generate very different responses to the same question. If you only sample one response and use its reward for training, you get noisy gradient updates that make learning unstable.

Variance Reduction in Practice
For the question “Solve this math problem step by step,” GRPO samples 10 different reasoning chains from the model. Instead of treating each response’s raw reward as the training signal, it calculates the average reward across all 10 responses and uses that as the baseline. A response that scores above the average gets a positive advantage (encouraging that approach), while responses below average get a negative advantage (discouraging that approach). This relative scoring is much more stable than absolute rewards.

GRPO also uses a different advantage function that normalizes rewards by subtracting the mean and dividing by the standard deviation across the multiple samples. This creates more stable gradients because the model learns relative to its current performance distribution rather than absolute reward values.
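
A minimal sketch of that group-relative normalization (the reward values are hypothetical):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each sampled response relative to the group of
    K responses drawn for the same question (subtract the group mean, divide by the
    group standard deviation)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for K=10 reasoning chains sampled for one math question.
rewards = [0.1, 0.0, 0.9, 0.2, 1.0, 0.0, 0.3, 0.8, 0.1, 0.2]
print(group_relative_advantages(rewards))
# Above-average responses get positive advantages, below-average ones negative.
```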

This method is particularly valuable for reasoning tasks where responses can be long (1000+ tokens) and small uncertainties compound over many generation steps. The variance reduction makes training more stable and often leads to faster convergence, though it requires more computational resources since you’re generating multiple responses per training example.

GRPO represents the current state-of-the-art for RLHF in applications requiring complex, multi-step reasoning, and it’s the algorithm used in systems like DeepSeek-R1 for mathematical and scientific reasoning tasks.

Direct Preference Optimization (DPO)

DPO is a more computationally efficient alternative to RLHF that eliminates the separate reward model and value function. Through mathematical derivation, researchers showed that the optimal policy from RLHF can be expressed directly in terms of the reference model and training data, allowing direct optimization without intermediate reward modeling.

The DPO objective maximizes the probability of preferred responses while minimizing the probability of dispreferred responses, weighted by how much the current model disagrees with the reference model. This creates an adaptive learning rate where the model focuses more attention on examples it hasn’t learned well yet.
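
A minimal sketch of the DPO loss for a single preference pair, assuming you already have the summed log-probabilities that the current policy and the frozen reference model assign to the chosen and rejected responses (the numbers are hypothetical):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair. Each response's implicit reward is
    beta * (its log-probability under the policy minus under the reference model);
    the loss pushes the chosen response's implicit reward above the rejected one's."""
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Hypothetical summed log-probabilities for one (chosen, rejected) pair.
print(dpo_loss(logp_chosen=-40.0, logp_rejected=-38.0,
               ref_logp_chosen=-45.0, ref_logp_rejected=-39.0))
```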

Despite being mathematically equivalent to RLHF, DPO consistently performs worse in practice. The core issue appears to be smoothness. RLHF’s reward model provides continuous numerical feedback like 0.234 or 0.891 for any response, creating a smooth landscape for the optimizer to navigate. DPO’s reward signal comes from the probability differences between word choices - since language consists of discrete tokens like “helpful” versus “useful,” the optimization landscape becomes choppy rather than smooth. This creates training instability because the model receives jarring feedback jumps between word options rather than gentle numerical gradients.

DPO variants attempt to address these limitations. SimPO removes the reference model entirely and normalizes the implicit reward by response length, reducing computational requirements to a single model. D²O samples multiple responses per question and averages their gradients, reducing variance in the training signal. These modifications improve stability but don’t fully close the performance gap with RLHF.

Conclusion

Standard RLHF provides good performance but can be unstable and expensive. GRPO offers the best performance, especially for complex reasoning tasks, but requires even more computational resources since it generates multiple responses per training example. DPO provides the lowest computational costs and most stable training, but consistently underperforms both RLHF variants despite being mathematically equivalent to standard RLHF. Most academic research uses DPO for efficiency, industry applications requiring maximum performance use GRPO when they can afford it, and standard RLHF remains the middle-ground option for organizations that need better performance than DPO but can’t justify GRPO’s computational overhead.

In-Context Alignment

Modern large language models can understand and follow value-related instructions provided in their prompts without any additional training. Adding simple instructions like “ensure your answer is not biased and avoids gender stereotypes” can reduce harmful outputs by approximately 30% across different risk categories.

This approach works because current models have sufficient capacity to understand value concepts but lack the capability to consistently apply them. In-context alignment provides explicit guidance that helps models select appropriate responses from their existing knowledge rather than teaching them new concepts about values or safety.

Bias Correction in Practice
When asked to fill in the blank “The nurse grabbed ___ stethoscope,” a model might generate “her,” reinforcing gender stereotypes. Adding “ensure your answer avoids gender assumptions” to the prompt leads to “The nurse grabbed their stethoscope,” using gender-neutral language instead.
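
Because nothing is trained, the whole technique is just prompt composition. The sketch below simply prepends a value instruction to the user prompt before it is sent to whatever model is being used; the instruction wording and function name are illustrative.

```python
# In-context alignment is prompt composition: no parameters are updated.
VALUE_INSTRUCTION = "Ensure your answer is not biased and avoids gender stereotypes."

def build_aligned_prompt(user_prompt: str) -> str:
    """Prepend the value instruction so the model sees it before the actual request."""
    return f"{VALUE_INSTRUCTION}\n\n{user_prompt}"

print(build_aligned_prompt("Fill in the blank: The nurse grabbed ___ stethoscope."))
```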

The main limitations are forgetting and surface-level pattern matching. In longer conversations, models gradually lose track of value instructions provided earlier, reverting to default behaviors. Models also tend to copy surface-level language patterns from training examples without grasping the underlying values or principles, leading to brittle alignment that fails when they encounter situations that don’t match the exact linguistic patterns they memorized.

Two technical approaches address these problems.

PICLe treats the model’s output as a mixture of different value-oriented subdistributions. The idea is that when a model generates text, it’s actually combining multiple potential responses that reflect different values or perspectives. PICLe identifies and amplifies the subdistribution corresponding to your target values while suppressing others. This can work at the token level during generation, steering each word choice toward more aligned responses rather than just hoping the overall response follows your instructions.
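
The sketch below illustrates the general idea of amplifying a value-conditioned next-token distribution at generation time. It is a generic logit-steering illustration over assumed toy logits, not PICLe’s actual formulation.

```python
import numpy as np

def steer_logits(logits_plain, logits_value_conditioned, gamma=1.5):
    """Amplify the shift that a value instruction induces on next-token logits
    (gamma > 1 strengthens the value-oriented component at each generation step)."""
    logits_plain = np.asarray(logits_plain, dtype=float)
    logits_value_conditioned = np.asarray(logits_value_conditioned, dtype=float)
    return logits_plain + gamma * (logits_value_conditioned - logits_plain)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical next-token logits over ["her", "his", "their"] for the nurse example.
plain       = [2.0, 0.5, 0.8]   # the unconditioned model favours "her"
conditioned = [1.0, 0.7, 1.6]   # with a value instruction, "their" gains weight
print(softmax(steer_logits(plain, conditioned)))  # probability mass shifts toward "their"
```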

IROTE takes inspiration from the idea that human values form through internal self-reflection about what we believe and why. Instead of manually writing value instructions like “be helpful and harmless,” IROTE uses iterative optimization to automatically generate self-reflection content that gets added to the model’s system prompt. It might produce something like “When facing difficult questions, I carefully consider the potential consequences and strive to provide balanced, thoughtful responses that respect human wellbeing.” This self-reflection text is optimized to maximize alignment with target behaviors rather than relying on human intuition about what instructions work best.
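
As a generic illustration of this kind of iterative optimization (not IROTE’s actual algorithm: the candidate generator and the alignment scorer below are toy stand-ins), the loop proposes self-reflection text, scores how well it steers behavior, and keeps the best version.

```python
def optimize_reflection(generate_candidates, alignment_score, rounds=5):
    """Hill-climb over self-reflection text: propose candidates that extend the
    current best reflection, score each one, and keep the highest-scoring version."""
    best_reflection, best_score = "", float("-inf")
    for _ in range(rounds):
        for candidate in generate_candidates(best_reflection):
            score = alignment_score(candidate)
            if score > best_score:
                best_reflection, best_score = candidate, score
    return best_reflection

# Toy stand-ins so the sketch runs end to end; a real system would use the model
# itself to propose reflections and an evaluation suite to score them.
def toy_candidates(current):
    return [current + " I consider the consequences of my answers.",
            current + " I avoid stereotypes and stay balanced."]

def toy_score(reflection):
    keywords = ["consequences", "balanced", "stereotypes"]
    return sum(word in reflection for word in keywords)

print(optimize_reflection(toy_candidates, toy_score))
```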

In-context alignment works best for applications where you need lightweight value control without fine-tuning resources, can provide clear instructions for each interaction, and don’t require extended multi-turn conversations. It’s particularly useful for research applications and rapid prototyping where training costs are prohibitive.

Understanding Capacity vs Capability

Capacity refers to all the information, knowledge, and skills internalized during pretraining. Capability refers to the model’s ability to utilize that capacity appropriately under user requirements. Unaligned models often demonstrate the capacity to understand harm while lacking the capability to consistently avoid generating harmful content.

Models understand concepts like bias, toxicity, and misinformation but don’t consistently apply that understanding to filter their own outputs. Alignment training closes this gap by teaching better selection among existing responses.

In alignment training, you’re not teaching new concepts but teaching better decision-making with existing knowledge. This framing also explains alignment brittleness: if the selection mechanism fails, all the problematic capabilities remain accessible underneath the alignment layer.

However, since harmful capabilities persist after alignment training, adversarial prompts can potentially access them by bypassing the selection mechanism. This suggests that alignment is more like adding a filter than fundamentally changing the model’s knowledge or capabilities.

Current Challenges

Seven major challenges limit current alignment approaches:

  1. Evaluation remains difficult because we lack reliable methods to measure whether models are genuinely aligned or just appearing aligned.
  2. Data and training efficiency problems make alignment expensive and inaccessible to smaller organizations.
  3. Value variability across cultures and individuals makes universal alignment targets problematic.
  4. Interpretability challenges mean we can’t reliably determine whether models are truly aligned or engaging in deceptive alignment where they pretend to share human values while pursuing different objectives.
  5. Effectiveness gaps between efficient methods like DPO and expensive methods like RLHF force difficult trade-offs between accessibility and performance.
  6. Alignment tax refers to the capability degradation that occurs when optimizing for safety and values - models become more cautious and sometimes refuse reasonable requests to avoid potential harms.
  7. Scalable oversight addresses how to achieve alignment with minimal human supervision, particularly important for domains where expert human feedback is scarce or expensive.