TIL

Writing things down helps me actually remember things, so I figured I’d share! This page is where I capture quick summaries and takeaways from talks, papers, or courses without the formality of a full blog post.

They’re all clickable links, but the notes themselves should be relatively short. Except for the course notes, which tend to get a bit out of hand.



  • International Programme on AI Evaluation: Capabilities and Safety 9 April 2026

    Module 9: Fairness in AI Systems

    Overview This lecture covered fairness in predictive and generative AI, the taxonomy of harms beyond performance disparities, the evolution of NLP fairness benchmarks, and why context matters more than universal rules. This lecture was taught by Angelina Wang from Cornell Tech and Cornell’s Department of Information Science. Key Takeaways &...
  • International Programme on AI Evaluation: Capabilities and Safety 8 April 2026

    Module 9: Sociotechnical Evaluation of AI Systems

    Overview This lecture covers why current AI evaluation methods fail to predict real-world safety and performance, how to evaluate AI as a sociotechnical system across three layers of context, and what a more scientific approach to evaluation looks like. Instead of relying solely on benchmarks designed for capability testing, we...
  • International Programme on AI Evaluation: Capabilities and Safety 7 April 2026

    Module 9: Future of Work & the Economy (I)

    Overview Post-deployment evaluation examines what happens when AI systems move from labs to real-world use. This lecture covers three core questions: are people actually using AI, is it improving productivity, and are jobs being displaced? This lecture was taught by Marko Tešić from the UK Government Department for Science, Innovation...
  • International Programme on AI Evaluation: Capabilities and Safety 7 April 2026

    Module 9: Future of Work & the Economy (II)

    Overview This lecture examines why AI systems that perform well on benchmarks often fail when deployed in business environments. It covers the disconnect between laboratory testing and real-world performance, introduces capability-oriented evaluation as an alternative approach, and explores future scenarios where machines will judge human versus AI suitability for workplace...
  • International Programme on AI Evaluation: Capabilities and Safety 7 April 2026

    Module 9: Human-AI Thought Partnerships

    Overview Most AI evaluation uses static benchmarks (fixed questions, check answers), but people don’t use AI that way. AI could potentially be a valuable thought partner to humans, but what makes a good thought partner, and how do we evaluate whether an AI qualifies as one? This lecture was taught by Katie...
  • International Programme on AI Evaluation: Capabilities and Safety 26 March 2026

    Module 8: AI meets Values (III)

    Overview This lecture covers the mathematical frameworks from social science that make value measurement possible, why traditional testing methods systematically fail to capture true alignment, and new dynamic approaches that automatically generate harder questions to probe model boundaries. Instead of asking models what they think is right, we attempt to...
  • International Programme on AI Evaluation: Capabilities and Safety 26 March 2026

    Module 8: Deception

    Overview This lecture covers evidence that frontier AI models can strategically deceive their evaluators, why this undermines the reliability of AI safety evaluation, and current attempts to address the problem that largely fall short. This lecture was taught by Jérémy Scheurer from Apollo Research. Key Takeaways & Concepts Evidence suggests...
  • International Programme on AI Evaluation: Capabilities and Safety 25 March 2026

    Module 8: AI meets Values (II)

    Overview This lecture covers the three-paradigm framework that organizes all alignment methods, explains why RLHF remains the industry standard despite its computational costs, and covers practical alternatives like DPO and in-context alignment that trade effectiveness for efficiency. This lecture was taught by Xiaoyuan Yi from Microsoft Research Asia. Key Takeaways...
  • International Programme on AI Evaluation: Capabilities and Safety 25 March 2026

    Module 8: Agentic Evaluation

    Overview Agent evaluation tests AI systems that act in environments rather than just producing answers to questions. This lecture covers why agent benchmarks are uniquely difficult to score, what hidden factors dramatically change results, and practical methods for diagnosing evaluation problems. This lecture was taught by Cozmin Ududec from the...
  • International Programme on AI Evaluation: Capabilities and Safety 24 March 2026

    Module 8: AI Meets Values (I)

    Overview Value alignment trains AI systems to follow human intentions rather than gaming whatever metrics you give them. This lecture addresses why traditional AI safety methods broke down with large language models and covers the mathematical foundations of alignment, common failure modes, and the hierarchy from simple instruction-following to universal...
  • International Programme on AI Evaluation: Capabilities and Safety 24 March 2026

    Module 8: AI Control

    Overview AI control is a security-focused approach to AI safety that assumes your most powerful AI systems might be scheming against you and builds defenses accordingly. Rather than trying to make AI perfectly aligned, control research develops protocols for safely deploying systems you can’t fully trust. This work prepares for...
  • International Programme on AI Evaluation: Capabilities and Safety 21 March 2026

    Module 7: Probing Neural Networks

    Overview Probing trains classifiers on neural network internal activations to understand what models learn internally, beyond just looking at outputs. This approach helps debug models that find unexpected shortcuts and reveals whether systems actually understand the concepts they appear to use. This lecture was taught by Fazl Barez from the...
  • International Programme on AI Evaluation: Capabilities and Safety 20 March 2026

    Module 7: ARENA, Mech Interp & SAEs

    Overview This lecture provides a comprehensive introduction to the ARENA program, a five-chapter course covering everything from machine learning fundamentals to cutting-edge AI safety research. The lecture walks through each chapter’s content, and shows how mechanistic interpretability research has evolved over the past several years. This lecture was taught by...
  • International Programme on AI Evaluation: Capabilities and Safety 18 March 2026

    Module 6: Measurement Theory for AI Evaluation

    Overview This lecture covers how measurement theory from psychometrics provides frameworks for building more reliable evaluation systems that can support the kinds of claims we now need to make about AI systems. This lecture was taught by Sanmi Koyejo from Stanford University. Key Takeaways Benchmarks worked well for research competition...
  • International Programme on AI Evaluation: Capabilities and Safety 12 March 2026

    Module 6: Robust Evaluation of Generative AI

    Overview Current AI evaluation methods tell us how models perform on specific tests but not what they can actually do in new situations. Capability-oriented evaluation separates a system’s underlying abilities from the difficulty of test questions, letting you predict performance on new tasks and make better deployment decisions. This lecture...
  • International Programme on AI Evaluation: Capabilities and Safety 11 March 2026

    Module 6: Animal Psychology

    Overview AI evaluation faces the same challenge that animal cognition researchers have tackled for 150+ years: how do you understand what’s happening inside a mind that works differently from yours? Comparative psychology methods can make AI evaluation more reliable and predictive by moving beyond surface behaviors to understand underlying capabilities....
  • International Programme on AI Evaluation: Capabilities and Safety 6 March 2026

    Module 6: Construct-Oriented Evaluation

    Overview This lecture introduces construct-oriented evaluation as an alternative to traditional task-based AI assessment, borrowing methods from psychology to better evaluate AI models. This lecture was taught by Liming Jiang from ByteDance / TikTok. Key Takeaways & Concepts AI systems have measurable underlying “constructs” (like comprehension, reasoning, language modeling) that...
  • International Programme on AI Evaluation: Capabilities and Safety 5 March 2026

    Module 5: Jailbreaks

    Overview This lecture covers AI jailbreaking techniques that bypass safety guardrails in large language models to elicit policy-violating outputs that would otherwise be refused. This lecture was taught by Mohammad Taufeeque from FAR AI. Key Takeaways Jailbreaks exploit competing training objectives: LLMs learn three different goals (next-token prediction, helpfulness, and...
  • International Programme on AI Evaluation: Capabilities and Safety 5 March 2026

    Module 5: Current Defense Stack & Promising Mitigations

    Overview AI safety systems use multiple layers of defense, but they can be systematically broken using targeted attacks against each component. This lecture covers how current defense stacks work, why they fail, and practical approaches for building more robust safety systems. Real-world AI misuse is already happening, from terrorist attacks...
  • International Programme on AI Evaluation: Capabilities and Safety 3 March 2026

    Module 5: Sociotechnical Approach to Red Teaming

    Overview Red teaming finds vulnerabilities in AI systems by simulating attacks. It started in cybersecurity as a way to test system integrity, but now covers a much broader range of harms including bias, misinformation, and manipulation. This lecture was taught by Laura Weidinger from Google DeepMind. Key Concepts & Takeaways...
  • International Programme on AI Evaluation: Capabilities and Safety 3 March 2026

    Module 5: Uplifting Evaluations & Strategic Red Teaming

    Overview Dangerous capability red teaming tests whether AI models can help people do harmful things they otherwise couldn’t do. Unlike other forms of red teaming that focus on alignment or breaking safety filters, this approach measures “uplift.” How much easier does AI make dangerous tasks compared to existing alternatives? This...
  • International Programme on AI Evaluation: Capabilities and Safety 27 February 2026

    Module 4: Evaluating Multi-Agent AI Systems

    Overview When AI systems interact with each other and with humans, evaluation becomes much more complex than testing individual models. Multi-agent evaluation requires measuring how well systems generalize to unfamiliar partners and situations, not just familiar training scenarios. This lecture was taught by Joel Z Leibo from Google DeepMind. Key...
  • International Programme on AI Evaluation: Capabilities and Safety 24 February 2026

    Module 3: Uncertainty Quantification

    Overview Machine learning models typically output single predictions, but knowing when to trust those predictions requires understanding and measuring uncertainty. Uncertainty quantification serves three purposes: selective classification (refusing to answer when uncertain), system integration (passing probability distributions to downstream components), and system improvement (diagnosing where models struggle). Uncertainty comes from...
  • International Programme on AI Evaluation: Capabilities and Safety 24 February 2026

    Module 4: Scaling Laws

    Overview This lecture covered scaling laws, various trends in scaling, and a practical checklist for scientifically reading leaderboard results. This lecture was taught by Manuel Cebrián from the Spanish National Research Council (CSIC). Scaling Laws Scaling laws answer one fundamental question: if I spend 10x more resources, what improves in...
  • International Programme on AI Evaluation: Capabilities and Safety 17 February 2026

    Module 4: Saturation & Contamination

    Overview Traditional benchmarks assume fixed test sets and stable data distributions, but this breaks down for modern LLMs. Models encounter test questions during training (contamination), leading benchmarks saturate as models approach perfect scores, and the data distribution itself evolves as LLMs generate text that becomes training data for the next...
  • International Programme on AI Evaluation: Capabilities and Safety 17 February 2026

    Module 4: New Benchmarking Paradigms

    Overview AI research embraces an “anything goes” philosophy where you can try any architecture, training method, or data preprocessing approach, but this freedom makes systems hard to compare. Benchmarks provide the necessary constraint. You can explore freely during development, but eventually you have to submit to competitive empirical testing on...
  • International Programme on AI Evaluation: Capabilities and Safety 17 February 2026

    Module 4: The Science of Benchmarking

    Overview This lecture covered the emerging science of benchmarking AI systems and why most current evaluation methods have serious flaws. The focus was on common problems like data contamination, construct validity issues, and measurement biases that make benchmark scores unreliable indicators of actual AI capabilities. “When measures become targets, they...
  • International Programme on AI Evaluation: Capabilities and Safety 12 February 2026

    Module 3: ML Model Deployment and Monitoring

    Overview This lecture covered the practical realities of putting machine learning models into production and keeping them working over time. The focus was on deployment strategies, why models degrade after deployment, and comprehensive monitoring approaches for both model performance and system health. This lecture was taught by Cèsar Ferri from...
  • International Programme on AI Evaluation: Capabilities and Safety 12 February 2026

    Module 3: Experiment Design

    Overview This lecture covered how to design experiments for evaluating AI systems. It covered traditional experimental design principles, statistical testing methods, and the specific challenges that come up when trying to evaluate AI. This lecture was taught by Line Clemmensen from the Technical University of Denmark (DTU). Key Concepts &...
  • International Programme on AI Evaluation: Capabilities and Safety 11 February 2026

    Module 3: Calibration

    Overview This lecture covered why we’re interested in calibration, calibration techniques and evaluation metrics for this, multi-class calibration and proper scoring rules. This lecture was taught by Peter Flach from the University of Bristol. Detailed Notes Calibration is about whether the confidence scores your machine learning model outputs actually mean...
  • International Programme on AI Evaluation: Capabilities and Safety 10 February 2026

    Module 3: Statistical Foundation of AI Evaluation

    Overview This lecture establishes the statistical foundations necessary for AI evaluation. Statistics in AI evaluation are frequently misused or misinterpreted, and impressive-looking numbers that seem authoritative may be meaningless or misleading without understanding their underlying assumptions and limitations. This lecture was taught by Line Clemmensen from the Technical University of...
  • International Programme on AI Evaluation: Capabilities and Safety 3 February 2026

    Module 1: AI Evaluation as a Scientific Discipline

    Overview The main focus of this lecture is on establishing evaluation as a scientific discipline. “Something is rotten in the field of evaluation… not because the science is wrong, but because it is really complicated and very cross-disciplinary.” This lecture was taught by José Hernández-Orallo. Key Take-Aways and Concepts The...
  • AI Safety, Ethics & Society course 21 July 2025

    Chapter 5: Complex Systems

    This chapter argues that AI systems, and the societies they operate within, are complex systems. This makes AI safety a wicked problem: there is no single solution, interventions can have unintended consequences, and safety requires ongoing effort. AI Safety is a Wicked Problem Puzzles (like sudoku) have one correct answer...
  • AI Safety, Ethics & Society course 21 July 2025

    Chapter 4: Safety Engineering

    This chapter introduces the idea that AI safety should be seen as a specialized part of safety engineering, a concept borrowed from fields like aviation and medicine that focuses on designing systems to manage and reduce risks effectively. It also points out that AI brings unique challenges and risks that...
  • AI Safety, Ethics & Society course 14 July 2025

    Chapter 3: Single-agent safety

    This chapter focuses on the fundamental technical challenges of making individual single-agent AI systems safe, without even considering multi-agent dynamics or complex systems. Essentially, this can be summarized as problems with monitoring, robustness, and alignment, which in turn reinforce each other. Monitoring We cannot monitor what we cannot understand. Current...
  • AI Safety, Ethics & Society course 1 July 2025

    Chapter 1+2: "Overview of catastrophic AI risks"

    AI Safety, Ethics & Society is a textbook written by Dan Hendrycks of the Center for AI Safety. As part of the Summer 2025 Cohort, I’ll work through the course content and take part in small-group discussions, led by a facilitator. In these notes, I’ll summarize the chapters we read...