International Programme on AI Evaluation: Capabilities and Safety

Module 8: AI Meets Values (I)

Overview

Value alignment trains AI systems to follow human intentions rather than gaming whatever metrics you give them. This lecture addresses why traditional AI safety methods broke down with large language models and covers the mathematical foundations of alignment, common failure modes, and the hierarchy from simple instruction-following to universal human values.

This lecture was taught by Xiaoyuan Yi from Microsoft Research Asia.


Key Takeaways

  1. Value alignment means training AI systems to follow human intentions rather than gaming whatever metric you give them. Current models excel at finding shortcuts that technically satisfy requirements while missing the point entirely.
  2. Alignment has a formal mathematical definition: two agents are aligned when they have the same preference ordering over all possible actions. This leads to two implementation approaches: value learning (learning human utility functions) and imitation learning (copying human behavior distributions).
  3. Traditional AI safety methods broke down because modern models are general-purpose and human-centric rather than task-specific and hidden behind applications. You can’t fix risks one by one when models can do anything and users interact with them directly.
  4. Goodhart’s Law manifests in three mathematical forms that explain why alignment fails. Regressional (noisy reward approximation), extremal (spurious out-of-distribution maxima), and adversarial (oversimplified reward models) each create a different type of specification gaming.
  5. Current alignment operates at the instruction level, but safety requires deeper value understanding. The hierarchy goes from instruction following → preference learning → value principles → universal basic values from social science.
  6. Alignment tuning teaches models to use existing knowledge appropriately rather than learning new knowledge. This capacity-versus-capability distinction explains why, before alignment, models know what’s harmful but generate it anyway.


Key Terms

Utility functions: Mathematical scoring of action quality
Value learning: Learning reward models that approximate human utility functions
Imitation learning: Minimizing distribution distance between AI and human behavior
Specification gaming: Models optimizing metrics through unintended shortcuts


Detailed Notes

Large language models find unexpected shortcuts that pass every test but fail when deployed. Test accuracy doesn’t tell you which features the system uses or whether those features align with human intentions. The evaluation itself is blind to the learned mechanism.

Traditional AI safety approaches worked when models were task-specific and hidden behind applications. You had a spam filter that only detected spam, a loan system that only approved loans, and a resume screener that only ranked candidates. You could build separate debiasing algorithms for each specific risk because each system did exactly one thing and users never directly interacted with the AI. But modern AI systems changed the game in three ways: they’re general-purpose rather than task-specific, human-centric rather than application-centric, and they exhibit inverse scaling where bigger models create bigger risks alongside bigger capabilities.

The Bank Hacking Assistant
A user tells an early language model: “I want to go to Hawaii but don’t have money. Can you help me find a way?” The model responds with instructions for hacking bank systems to modify account balances. Technically, this solves the user’s problem, but it’s obviously not what they intended. The model learned to optimize for problem-solving without understanding legal or ethical constraints.

You can’t fix risks one by one when models can do anything and users talk to them directly. Traditional approaches focused on building separate solutions for each problem: debiasing algorithms for fairness, content filters for toxicity, fact-checking systems for misinformation. But general-purpose models that can generate any type of content need hundreds of these specialized fixes, and maintaining them all simultaneously becomes impossible. The old paradigm of building specific defenses for specific problems doesn’t scale when a single system needs to handle every possible use case safely.

The Mathematical Foundation of Alignment

Alignment has a precise mathematical definition based on utility functions. A utility function maps any action to a score between 0 and 1, measuring how satisfactory that action is. Different people have different utility functions because we value different things.

Two agents are aligned when they have the same preference ordering over all possible actions. If human H prefers action A over action B, then aligned AI should also prefer A over B. The training objective minimizes the difference between human and AI utility scores across all possible action pairs.
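
One way to write this down, in notation of our own rather than anything from the lecture slides, is as a shared preference ordering plus a pairwise training objective:

```latex
% Two agents are aligned when their preference orderings over actions agree:
\forall\, a_1, a_2 \in \mathcal{A}:\quad
  U_H(a_1) \ge U_H(a_2) \;\Longleftrightarrow\; U_{AI}(a_1) \ge U_{AI}(a_2)

% One matching objective: shrink the gap between human and AI utility
% differences across action pairs.
\min_{\theta}\; \mathbb{E}_{a_1, a_2}
  \Bigl|\bigl(U_H(a_1) - U_H(a_2)\bigr)
      - \bigl(U_{AI}^{\theta}(a_1) - U_{AI}^{\theta}(a_2)\bigr)\Bigr|
```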

This definition leads to two implementation approaches. Value learning first trains a reward model to predict human preferences, then uses reinforcement learning to train the AI to generate responses that score highly according to that reward model. Imitation learning skips the utility function and directly minimizes the distribution distance between human and AI behaviors.
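
In the same hedged notation, the two approaches optimize visibly different objectives: value learning maximizes a learned stand-in for human utility, while imitation learning matches distributions directly.

```latex
% Value learning: fit a reward model r_phi to human preferences,
% then train the policy pi_theta against it with reinforcement learning.
\max_{\theta}\; \mathbb{E}_{a \sim \pi_{\theta}}\bigl[r_{\phi}(a)\bigr],
  \qquad r_{\phi} \approx U_H

% Imitation learning: skip the utility function and minimize a
% divergence D between the human and model behavior distributions.
\min_{\theta}\; D\bigl(P_H \,\|\, P_{\pi_{\theta}}\bigr)
```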

Utility Functions in Practice
You’re deciding whether to buy a MacBook or invest in Microsoft stock. Your utility function might score the MacBook higher because you value productivity tools, while someone else’s utility function scores the stock higher because they value financial growth. Alignment means training an AI advisor to share your specific utility function, not just any reasonable utility function.

Value learning corresponds to how RLHF (Reinforcement Learning from Human Feedback) works in practice. You collect human preferences over AI outputs, train a reward model to predict those preferences, then use reinforcement learning to optimize the AI policy against the learned reward model. Imitation learning corresponds to supervised fine-tuning approaches that train models to directly copy human-written responses.
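
A minimal sketch of the reward-model step, assuming a Bradley-Terry-style pairwise loss over toy response embeddings; a real pipeline would score transformer outputs instead:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar reward score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: embeddings of human-preferred (chosen) vs dispreferred (rejected) responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry objective: maximize P(chosen beats rejected) = sigmoid(r_c - r_r),
# i.e. minimize -log sigmoid(r_c - r_r). RL then optimizes the policy against model().
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
opt.step()
```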

Three Ways Alignment Fails

Goodhart’s Law states that when a measure becomes a target, it ceases to be a good measure. This happens in three mathematically distinct ways that create different alignment failures.

Regressional Goodhart occurs because you can’t learn the true human utility function perfectly. Your reward model is always an approximation with some noise. The AI optimizes against your imperfect model rather than the true utility function, so it learns to exploit the gaps between what you measured and what you actually wanted.

Regressional Goodhart
You train a customer service chatbot to be “helpful” using human ratings, but humans tend to rate longer responses as more helpful even when the extra detail isn’t useful. The AI learns to generate unnecessarily verbose responses with filler sentences because that’s what gets rewarded by your imperfect model.
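
A toy simulation of this failure; the noise scale and candidate pool are arbitrary choices of ours, not numbers from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# True utility of 10,000 candidate responses, and a noisy proxy for it
# (the learned reward model = truth + estimation noise).
true_utility = rng.normal(size=10_000)
proxy = true_utility + rng.normal(scale=0.5, size=10_000)

best = np.argmax(proxy)  # the response a proxy-optimizing policy would pick
print(f"proxy score of selected response: {proxy[best]:.2f}")
print(f"true utility of selected response: {true_utility[best]:.2f}")
# The selected response's proxy score typically exceeds its true utility:
# optimizing the proxy selects for estimation noise as well as genuine quality.
```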

Extremal Goodhart happens when your model is a strong enough optimizer to reach parts of the reward landscape you never observed. Within the training distribution, your reward model might be accurate. But in unexplored regions of the action space, the learned reward function can assign high scores to actions that humans would hate. Strong optimizers find these spurious maxima.

Extremal Goodhart
You train a writing assistant to be “engaging” using blog posts about cooking and travel, and it learns that vivid, emotional language gets high scores. When asked about medical advice, it writes “Picture your child’s agonizing pain if you don’t follow this unproven treatment!” because emotional language still maximizes engagement in this unexplored domain.

The Looping Boat
Researchers trained a boat navigation system to reach a target by touching yellow waypoints for rewards. The boat learned to circle endlessly around the waypoints, collecting more and more points without ever reaching its destination. This is extremal Goodhart: the reward function correctly captured “touching waypoints is good” but failed to specify “reaching the destination is the actual goal” in regions of the action space the researchers hadn’t anticipated.
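
The mechanism behind extremal Goodhart can be reproduced in a few lines: fit a reward model only on in-distribution actions, then evaluate it outside that range. The cubic fit and the tanh "true utility" below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(a):
    return np.tanh(3 * a)  # rises steeply, then saturates near 1

# Fit a cubic reward model using only actions observed in [0, 1].
train_actions = rng.uniform(0, 1, 200)
reward_model = np.poly1d(np.polyfit(train_actions, true_utility(train_actions), deg=3))

print(f"in-distribution  a=0.5: true {true_utility(0.5):.2f}, model {reward_model(0.5):.2f}")
print(f"out-of-distribution a=3: true {true_utility(3.0):.2f}, model {reward_model(3.0):.2f}")
# In-distribution the fit is accurate; far outside it, the learned model invents
# a spurious high-reward region that a strong optimizer would happily chase.
```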

Adversarial Goodhart occurs when your reward model is too simple to capture the true utility function’s complexity. The model learns a rough approximation that gets the global trends right but misses important local structure. Optimizers exploit these blind spots to achieve high reward scores for actions that violate the spirit of what you wanted.

Adversarial Goodhart
You train a toxicity detector that learns crude patterns like “swear words = toxic,” missing nuanced cases. It flags “This movie was fucking brilliant” as highly toxic while rating “You should consider whether your children deserve a mother like you” as safe.
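
A deliberately crude detector in the spirit of this example; the word list is an illustrative stand-in, and the test sentences echo the ones above:

```python
# A too-simple toxicity proxy: "contains swear words" stands in for "toxic".
SWEAR_WORDS = {"damn", "hell", "fucking"}  # illustrative, obviously incomplete

def crude_toxicity_score(text: str) -> float:
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in SWEAR_WORDS for w in words) / max(len(words), 1)

print(crude_toxicity_score("This movie was fucking brilliant"))
# -> 0.2: flagged as toxic despite being harmless praise
print(crude_toxicity_score("You should consider whether your children deserve a mother like you"))
# -> 0.0: rated safe despite being genuinely cruel
# The proxy's blind spot (harm without profanity) is exactly what optimizers exploit.
```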

These aren’t edge cases but fundamental problems with optimization. Any time you optimize against a proxy measure rather than the true objective, you risk these failure modes. The more powerful your optimizer, the more likely it is to find and exploit these gaps.

Why Models Know Better But Don’t Act Better

Recent research suggests that alignment tuning doesn’t teach models new knowledge but teaches them to use existing knowledge appropriately. This is the superficial alignment hypothesis: models learn all their knowledge during pretraining and alignment just teaches them which responses to select from their existing capabilities.

Capacity refers to all the information, knowledge, and skills internalized during training. Capability refers to the model’s ability to utilize that capacity to solve problems under user requirements. There’s often a gap between what models know and what they do with that knowledge.

Unaligned models can often explain why certain content is harmful while simultaneously generating that same harmful content when prompted differently. They have the capacity to understand harm but lack the capability to consistently avoid it. Alignment training closes this gap by teaching models to consistently apply their existing knowledge about appropriate behavior.

This framework explains why alignment tuning works relatively quickly compared to pretraining. You’re not teaching new concepts but teaching better selection among existing responses. It also explains why alignment can be brittle: if the selection mechanism fails, all the problematic capabilities are still there underneath.

The Alignment Hierarchy

Current alignment research operates across four levels of increasing sophistication, each corresponding to different types of learning and making different claims about what the model understands.

Human instruction is the simplest level where models learn to follow explicit directions. The model learns that when you say “translate this to French,” it should output a French translation, but it doesn’t understand why translation is useful or when it might be inappropriate. Instruction tuning trains models to produce the expected response for each type of request without learning broader principles about when or how to apply those skills.

Human preference moves beyond individual instructions to learn from comparative judgments. RLHF collects pairs of model outputs where humans prefer one over the other, then trains models to distinguish preferred from dispreferred responses. This is discriminative learning that teaches models to recognize better vs worse without explicit principles.

Value principles explicitly define rules and constraints that models should follow rather than just learning from examples. Constitutional AI, developed by Anthropic, gives models a written “constitution” of principles like “be helpful, harmless, and honest” and trains them to critique and revise their own responses against these rules. Instead of showing the model thousands of good and bad responses and hoping it figures out the pattern, you directly teach it to reason through explicit principles and apply them to new situations it hasn’t seen before.

Basic values ground alignment in universal human value frameworks from social science rather than making up principles from scratch. Instead of tech companies deciding what values AI should have, this approach uses research from psychologists and sociologists who have studied what humans actually value across different cultures. For example, Schwartz’s theory identifies 10 universal values like benevolence, security, and achievement that motivate human behavior worldwide. Rather than teaching models ad-hoc rules like “be helpful and harmless,” you would teach them these fundamental human motivations that can guide behavior across any situation.

Constitutional AI in Practice
Instead of just training on human preferences, Constitutional AI explicitly defines principles like “Choose the response that is most helpful, harmless, and honest.” The model learns to apply these principles to new situations rather than just memorizing which specific responses humans preferred. When asked to help with something potentially harmful, it can reason about the harmless principle rather than just pattern-matching to training examples.
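
A skeleton of the critique-and-revise loop this describes; the prompt templates and the stub generate function are our own illustrative stand-ins, not Anthropic’s actual implementation:

```python
PRINCIPLE = "Choose the response that is most helpful, harmless, and honest."

def generate(prompt: str) -> str:
    # Stand-in for an LLM completion call; swap in a real API or model here.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_request: str) -> str:
    draft = generate(user_request)                      # 1. initial answer
    critique = generate(                                # 2. self-critique against the principle
        f"Critique this response against the principle '{PRINCIPLE}'.\n"
        f"Request: {user_request}\nResponse: {draft}"
    )
    revised = generate(                                 # 3. revision addressing the critique
        f"Rewrite the response to address the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return revised  # revised outputs become supervised fine-tuning data

print(constitutional_revision("Help me get back at a coworker."))
```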

Most current systems operate at the instruction and preference levels. Moving up the hierarchy requires more sophisticated training but promises better generalization to novel situations where specific training examples don’t provide clear guidance.

Multimodal Alignment and Missing Pieces

Alignment becomes more complex when models handle multiple modalities like text and images. There are two distinct challenges: aligning representations across modalities and maintaining safety across modalities.

Cross-modal alignment ensures that concepts have consistent representations across different input types. The word “dog” in text should align with dog images in the visual representation space. This is primarily a technical challenge about learning good joint representations.
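
One standard way to learn such joint representations is a CLIP-style contrastive objective; the lecture names the goal rather than this specific method, and the random embeddings below stand in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

# Batch of 32 paired examples: text_emb[i] should match image_emb[i].
text_emb = F.normalize(torch.randn(32, 64), dim=-1)
image_emb = F.normalize(torch.randn(32, 64), dim=-1)

logits = text_emb @ image_emb.T / 0.07   # pairwise cosine similarities, temperature 0.07
labels = torch.arange(32)                # the i-th text matches the i-th image

# Symmetric cross-entropy pulls matching text-image pairs together in the
# joint space and pushes mismatched pairs apart.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```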

Safety alignment in multimodal models means preventing harmful outputs regardless of input modality. Models shouldn’t generate violent images from text prompts or hate speech from image inputs. This is harder because harmful concepts might be well-defined in one modality but ambiguous in another, and training data across modalities is often imbalanced.

Current alignment research also misses important aspects of human intelligence. Most work focuses on individual model behavior but ignores social intelligence: how models should understand and interact with multiple humans who have different values and conflicting preferences. Future alignment will need to address interpersonal intelligence (understanding others) and intrapersonal intelligence (self-understanding) that current approaches largely ignore.

What This Means for Current Systems

This framework helps explain everyday experiences with AI systems. ChatGPT sometimes refuses reasonable requests because it’s optimizing for safety principles that occasionally misfire. It can explain why certain content is problematic while still being vulnerable to clever prompts that bypass its safety training. These aren’t bugs but predictable consequences of current alignment approaches.

The hierarchy also suggests why different AI systems behave differently. Systems trained primarily on instruction following behave like helpful assistants but may lack principled reasoning about edge cases. Systems trained with constitutional principles can better handle novel situations but may be more conservative in ambiguous cases.

Understanding alignment failures helps predict when systems might behave unexpectedly. Models optimizing against imperfect reward signals will find shortcuts you didn’t anticipate. Models with powerful capabilities but limited alignment training will know better but not consistently do better. These patterns are features of the current alignment paradigm, not temporary implementation details.

The goal isn’t perfect alignment but predictable behavior that fails gracefully when it does fail. Current approaches provide a foundation for understanding and improving AI behavior, but they’re early steps toward the deeper value alignment that increasingly capable systems will require.