International Programme on AI Evaluation: Capabilities and Safety

Module 3: Experiment Design

Overview

This lecture covered how to design experiments for evaluating AI systems: traditional experimental design principles, statistical testing methods, and the specific challenges that arise when evaluating AI.

This lecture was taught by Line Clemmensen.


Key Concepts & Takeaways

  1. Start small and specific. Don’t try to evaluate “ChatGPT in general.” Pick a narrow use case like mental health support for parents on waiting lists. You can always expand later.
  2. Plan your experiment before collecting data. The worst situation is having data and then asking a statistician what it means. Design your statistical model first.
  3. Everything that can go wrong will go wrong. Design experiments assuming problems will happen. Randomize properly. Control what you can, account for what you can’t.
  4. Effect size matters more than statistical significance. A 1% difference might be statistically significant but meaningless in practice, unless you’re measuring something like suicide risk where 1% saves lives.
  5. You need domain experts. Statisticians can’t identify the important variables alone. Domain experts understand what factors could affect AI performance in specific contexts.

Core Concepts

  • A/B Testing: Split traffic between old and new system versions
  • Controllable vs Uncontrollable Factors: What you can change vs what you can’t
  • Randomization: Prevents confounding from uncontrolled variables
  • Chi-square Test: Tests independence between categorical variables
  • Bootstrapping: Resampling method for estimating uncertainty with fewer distributional assumptions
  • Power Analysis: Determines needed sample size
  • Multiple Testing Problem: More tests = higher chance of false positives
  • Target Population: Who exactly are you evaluating for

Detailed Notes

The Basic Framework

When you’re evaluating an AI system, you need to separate what you can control from what you can’t.

Controllable factors might include which algorithm you use (Mistral vs ChatGPT), how you design prompts, or which user groups you target. Uncontrollable factors include user demographics, how honest users are, or when your model hallucinates.

The problem comes when you run experiments. Say you test version A of your AI on Monday and version B on Tuesday. If version B performs better, is that because it’s actually better, or because Tuesday users were different from Monday users in some way you didn’t account for?

This is why you need randomization. Instead of testing by time periods, you randomly assign each user to either version A or version B. Now any differences between user groups should average out, and you can be more confident that performance differences come from your AI changes, not from uncontrolled factors.

A/B testing is the standard way to do this. You split your users randomly between the old version (A) and new version (B), then compare their performance on whatever metric you care about. This lets you test one change at a time while controlling for everything else.
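
The randomized assignment described above can be sketched in a few lines of Python. The user IDs are made up for illustration; in practice users arrive over time and are assigned as they appear:

```python
import random

random.seed(42)  # fixed seed so the assignment is reproducible

# Hypothetical user IDs; in a live system these arrive over time.
users = [f"user_{i}" for i in range(10)]

# Assign each user to version A or B at random, independent of
# arrival time, so day-of-week effects average out across arms.
assignments = {u: random.choice(["A", "B"]) for u in users}
print(assignments)
```

Because assignment is independent of when a user shows up, any Monday-vs-Tuesday difference ends up spread across both arms rather than confounded with the version.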

Example: One company added a simple coupon code field to their website - the page was otherwise unchanged - and lost 90% of their revenue. Small changes in AI systems can have similarly dramatic impacts, which is why proper testing matters.

Statistical Methods

Chi-square tests check if two categorical variables are related. Say you want to know if men and women use your AI chatbot differently - do they ask more technical vs personal questions? If gender doesn’t matter, you’d expect roughly equal proportions of technical questions from both groups. The chi-square test measures how far your actual data is from this “no relationship” expectation.

Example: Testing whether gender and tweet frequency are independent might give you a chi-square statistic of 20. If the critical value for significance is 6, then 20 is way beyond what you’d expect by chance, meaning there’s clearly a relationship.
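
A minimal sketch of this test using scipy (the counts below are invented for illustration, not data from the lecture):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table:
# rows = gender, columns = tweet frequency (low, high).
observed = [[30, 70],
            [60, 40]]

# chi2_contingency compares observed counts with the counts
# expected under the "no relationship" hypothesis.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.4f}, dof = {dof}")
```

A chi-square statistic far above the critical value (here, well past the ~3.8-6 range typical for small tables at the 5% level) corresponds to a small p-value, i.e. clear evidence of a relationship.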

Two-sample t-tests tell you if two groups have genuinely different averages or if the difference could just be random variation. Say version A of your AI gets 7.2/10 user satisfaction and version B gets 7.8/10. The t-test accounts for how much individual scores vary within each group to determine if that 0.6 difference is meaningful or just noise.
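
The satisfaction example can be sketched with scipy; the scores below are simulated under assumed means of 7.2 and 7.8, so the numbers are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical satisfaction scores (0-10 scale) for each version.
scores_a = rng.normal(7.2, 1.0, size=200)
scores_b = rng.normal(7.8, 1.0, size=200)

# Two-sample t-test: is the difference in means larger than
# within-group variation would produce by chance?
t_stat, p_value = ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With 200 users per group and this much within-group spread, a 0.6-point gap produces a very small p-value; with only a handful of users per group, the same gap could easily be noise.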

Bootstrapping lets you calculate confidence intervals without assuming your data is normally distributed. You repeatedly resample from your actual data (with replacement) to see how much your statistic would vary if you ran the experiment again. This works for any statistic (medians, correlations, custom metrics, etc.), not just means.
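
A sketch of a percentile bootstrap for the median, using simulated scores as stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical satisfaction scores; any metric works the same way.
data = rng.normal(7.5, 1.2, size=100)

# Resample the observed data (with replacement) many times and
# recompute the statistic on each resample.
n_boot = 5000
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(n_boot)
])

# 95% percentile confidence interval for the median.
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(data):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Swapping `np.median` for any other statistic (a correlation, a custom evaluation metric) requires no new distributional theory, which is the appeal of the method.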

Power analysis prevents you from running underpowered experiments that can’t detect real differences. If you expect a 5% improvement but only collect 50 users, you might miss it even if it’s real. Power analysis tells you upfront: “You need 400 users to reliably detect a 5% difference given your expected noise level.”
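
One way to run this calculation is with statsmodels (the effect size and noise level below are assumed numbers, not values from the lecture):

```python
from statsmodels.stats.power import TTestIndPower

# Assumed: a 0.5-point improvement on a 10-point scale, with
# individual scores varying by about 1.5 points.
effect_size = 0.5 / 1.5  # standardized effect (Cohen's d), ~0.33

# Solve for the sample size per group that gives 80% power
# at the 5% significance level.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8
)
print(f"~{n_per_group:.0f} users per group")
```

Under these assumptions the answer lands in the low hundreds per group, which is exactly the kind of number you want before collecting data, not after.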

Watch out for the multiple testing problem. If you run enough tests, you’re bound to find something that is significant by pure chance. If you’re testing whether 20 different prompt variations improve performance, one will probably look significant even if none actually work. This is why you need statistical corrections when running multiple tests simultaneously, or, better yet, focused experiments that test fewer things at once.

The Four-Step Framework

For experimental design:

  1. Decide what to test - Define your research question and hypothesis clearly
  2. Identify inputs and outputs - Map controllable factors, uncontrollable factors, and response variables
  3. Choose target subjects - Define your population and sampling strategy
  4. Determine sample size - Use power analysis to calculate how many observations you need

Do all of this before collecting any data. This sounds obvious but gets ignored constantly. People collect data first, then ask a statistician what it means. But if you didn’t randomize properly, didn’t control for confounding factors, or measured the wrong things, no amount of statistical analysis can fix it. The experiment design determines what questions you can answer, not the analysis afterward.

Practicalities

Scope down your evaluation. Traditional software has specific use cases - you build a banking app for bank customers or a photo editor for photographers. AI systems, especially general-purpose ones, can be used for anything. The same language model might help someone write code, plan a vacation, or generate marketing copy.

This makes evaluation nearly impossible at the general level. How do you measure if “GPT-4 is better than GPT-3.5” when better depends entirely on what someone is using it for? The solution is to pick one specific use case and evaluate that. Instead of “mental health AI,” test “mental health support for parents of children on waiting lists.” Now you have a defined population and clear success metrics. You can always expand to other use cases later, but you need to start somewhere concrete.

Work with domain experts. Domain experts working on a mental health AI might identify factors like user honesty, mental health state, and exposure time that would affect results. A statistician working with just the data wouldn’t know these variables exist. Domain experts understand the context where your AI operates - what could go wrong, what success looks like, what variables actually matter. Without them, you’re designing experiments blind.

Expect unique failure modes. Model updates can change behavior mid-experiment. Users adapt their prompts based on responses. Edge cases that never appeared in testing suddenly show up at scale. Unlike traditional software where bugs are predictable, AI systems fail in ways you can’t anticipate. This is why proper randomization and controlled conditions matter more, not less, than in traditional experiments.

The goal isn’t perfect experiments. It’s systematic learning about your AI system’s performance in controlled conditions, building up knowledge piece by piece rather than trying to evaluate everything at once.

Additional Notes

Sample Size Estimation Process

  1. Determine effect size - How big a difference do you want to detect? (e.g., 5% improvement in user satisfaction)
  2. Estimate variance - How much do individual responses vary? (may need pilot studies for new AI applications)
  3. Choose power level - Typically 80-90% chance of detecting the effect if it exists
  4. Calculate required sample size - Use power analysis formulas or software
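
When no closed-form power formula fits your setup, the same estimate can be obtained by simulation. The sketch below assumes a 0.5-point effect with a standard deviation of 1.5 (made-up numbers) and counts how often a t-test detects it:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def estimated_power(n_per_group, effect=0.5, sd=1.5, sims=2000, alpha=0.05):
    """Monte Carlo power: fraction of simulated experiments in which
    a two-sample t-test detects the assumed effect."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(7.0, sd, n_per_group)
        b = rng.normal(7.0 + effect, sd, n_per_group)
        if ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / sims

p50 = estimated_power(50)
p150 = estimated_power(150)
print(f"power at n=50:  {p50:.2f}")
print(f"power at n=150: {p150:.2f}")
```

Under these assumptions, 50 users per group is clearly underpowered while 150 per group approaches the usual 80% target, illustrating step 4 without any analytic formula.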

Statistical Test Assumptions

  • Two-sample t-tests: Assume normally distributed residuals and equal variances between groups
  • Chi-square tests: Require expected frequencies of at least 5 in each cell
  • Bootstrapping: Only assumes observations are independent and identically distributed
  • Always validate assumptions before interpreting results
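
The t-test assumptions above can be checked directly; a common (though not the only) approach uses the Shapiro-Wilk and Levene tests, sketched here on simulated scores:

```python
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(3)
scores_a = rng.normal(7.2, 1.0, 200)
scores_b = rng.normal(7.8, 1.0, 200)

# Normality of each group (Shapiro-Wilk): a small p-value
# suggests the data are not normally distributed.
_, p_norm_a = shapiro(scores_a)
_, p_norm_b = shapiro(scores_b)

# Equal variances (Levene's test): a small p-value suggests
# unequal variances, in which case Welch's t-test
# (ttest_ind(..., equal_var=False)) is the safer choice.
_, p_var = levene(scores_a, scores_b)

print(f"normality p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"equal-variance p-value: {p_var:.3f}")
```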

Multiple Testing Corrections

  • Family-wise error rate: With 20 independent tests at 5% significance level, probability of at least one false positive = 1 - (0.95)^20 ≈ 64%
  • Bonferroni correction: Divide significance level by number of tests (e.g., 0.05/20 = 0.0025 for each test)
  • Alternative: Design focused experiments testing fewer hypotheses simultaneously
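
The two calculations above are one-liners, shown here together:

```python
# Family-wise error rate for m independent tests at level alpha,
# and the Bonferroni-corrected per-test threshold.
alpha = 0.05
m = 20

fwer = 1 - (1 - alpha) ** m     # P(at least one false positive)
bonferroni_alpha = alpha / m    # per-test level after correction

print(f"FWER for {m} tests: {fwer:.1%}")                  # ~64%
print(f"Bonferroni per-test alpha: {bonferroni_alpha}")   # 0.0025
```

Bonferroni is conservative (it controls the worst case and assumes nothing about dependence between tests); less strict alternatives exist, but the simplest fix remains running fewer tests.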