Module 6: Animal Psychology
Overview
AI evaluation faces the same challenge that animal cognition researchers have tackled for 150+ years: how do you understand what’s happening inside a mind that works differently from yours? Comparative psychology methods can make AI evaluation more reliable and predictive by moving beyond surface behaviors to understand underlying capabilities.
- Impressive-looking behavior doesn’t prove intelligence. Systems find the easiest path to rewards, which might not use the capabilities you think you’re testing.
- Control conditions are essential. You need to test what happens when you remove the challenge while keeping everything else the same.
- Systematic variation beats one-shot testing. Understanding patterns of failure across many variations tells you more than aggregate scores.
- Construct-based evaluation predicts deployment performance. Measuring underlying capabilities helps predict behavior in new situations.
- No test is “pure”. Every evaluation has additional demands beyond the target capability that can confuse results.
This lecture was taught by Lucy Cheke from the University of Cambridge.
Detailed Notes
AI systems are good at finding unintended shortcuts, much like Clever Hans. A classic example: machine learning models trained to distinguish dogs from wolves actually learned to detect snow in the background, since the training set had wolves in snowy environments and dogs indoors. The models weren’t recognizing animals at all.
The Balloon Math Problem
“A child has 10 red balloons and 10 yellow balloons. She sticks a pin in five red balloons. How many balloons does she have left?” This appears to test basic arithmetic (10 + 10 - 5 = 15), but it actually requires reading text overlaid on images, understanding physics (what happens when you stick a pin in a balloon), and social inference (do you still “have” a popped balloon?). If the goal was to test arithmetic, the question has been made much harder than necessary.
Control Conditions
A control condition keeps everything identical except the cognitive challenge you’re trying to test. If a system passes both the experimental condition and the control, you have evidence for the capability; if it passes the control but fails the experimental condition, the specific capability is missing; if it fails both, something else is wrong with your setup. Control conditions aren’t just “remove something.” You often need to replace what you removed with something that would give a different result if the system were using a different strategy or capability.
Face Recognition Controls
Testing “is there a face in this image?” seems straightforward. The obvious control condition - the same image with the face removed - creates a blank background where “no” is correct because there’s nothing there, not because there’s no face specifically. Better control: images of objects, which have something present but not a face. This isolates face recognition from general object detection.
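A minimal sketch of how such a paired evaluation might be scored, assuming a hypothetical face-detector callable and placeholder image lists (none of these names come from the lecture):

```python
# Score an experimental condition against a matched control.
# `model` is any callable returning True ("face present") / False for an image;
# the image lists are placeholders, not real data.

def pass_rate(model, images, expected):
    """Fraction of images for which the model gives the expected answer."""
    return sum(model(img) == expected for img in images) / len(images)

def evaluate_face_detection(model, face_images, object_images):
    return {
        # Experimental condition: a face is present, so "yes" is correct.
        "experimental (face present)": pass_rate(model, face_images, expected=True),
        # Control condition: something is present, but not a face, so "no" is correct.
        "control (object, no face)": pass_rate(model, object_images, expected=False),
    }
```

Passing the experimental condition while failing the object control suggests a “something is present” strategy rather than face recognition; passing both isolates the target capability.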
From Behaviors to Capabilities
Many birds can pull strings to get food - they grab the string, pull it up, step on it, and repeat until they reach the treat. But what do they actually understand about why this works?
Researchers tested western scrub jays with different string arrangements: straight strings, crossed strings, strings at angles. The birds succeeded when they could draw a straight line from the string end to the food, but failed when they had to track the actual path of the string. This revealed they were using a simple “straight line to food” heuristic rather than understanding contact forces.
This approach moves from “can they do X?” to “what do they understand about the physics that makes X work?” The same underlying capability (understanding contact forces) would predict performance on hooks, ribbons, and other tools instead of just strings.
Self-Driving Car Performance
Three self-driving car systems all score 62.5% on road tests. But breaking down performance by road wiggliness and fog levels reveals different patterns: System A fails completely on wiggly roads but handles fog fine. System B succeeds perfectly on wiggly roads regardless of fog. System C is unpredictable on both dimensions. For deployment on the wiggliest road in the world on a clear day, you’d choose System B (100% success rate) over the others, despite identical aggregate scores.
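A minimal sketch of this kind of breakdown, using invented trial records (the numbers below are illustrative placeholders, not the lecture’s data):

```python
# Identical aggregate scores can hide very different per-condition profiles.
from collections import defaultdict

trials = [
    # (system, road_type, visibility, success) - invented illustrative records
    ("A", "wiggly", "fog", False), ("A", "wiggly", "clear", False),
    ("A", "straight", "fog", True), ("A", "straight", "clear", True),
    ("B", "wiggly", "fog", True), ("B", "wiggly", "clear", True),
    ("B", "straight", "fog", False), ("B", "straight", "clear", True),
    # ... many more trials per cell in a real evaluation
]

aggregate = defaultdict(list)
by_condition = defaultdict(list)
for system, road, visibility, success in trials:
    aggregate[system].append(success)
    by_condition[(system, road, visibility)].append(success)

for system, outcomes in sorted(aggregate.items()):
    print(f"{system} aggregate: {sum(outcomes) / len(outcomes):.0%}")
for (system, road, visibility), outcomes in sorted(by_condition.items()):
    print(f"{system} {road}/{visibility}: {sum(outcomes) / len(outcomes):.0%}")
```

The per-condition table, not the aggregate score, is what tells you which system to deploy on a given road.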
The Animal AI Platform
The Animal AI platform creates a level playing field for comparing different types of minds. Using simple blocks and basic physics, researchers built 300+ tasks adapted from animal and child psychology studies. The environment works as a computer game for humans, a physical space for animals, and a reinforcement learning environment for AI agents.
The 2019 Animal AI Olympics compared cutting-edge RL agents with 6- to 10-year-old children. On basic tasks (retrieve a reward in an open field), performance was identical. But children vastly outperformed AI on slightly more complex challenges like detouring around a wall to reach a visible reward.
Even when both children and AI failed, they failed differently. RL agents showed repetitive approach-avoid behavior, bouncing off obstacles repeatedly. Children would stop, assess the situation, and try different strategies. The failure patterns revealed different underlying mechanisms.
Claude vs. the Ramp
An LLM controlling an agent in Animal AI had to climb a ramp to reach a reward. Claude’s reasoning trace: “I can see a purple ramp to my left… I will move towards the purple ramp as it might lead to a different area.” But it kept trying to go through the wall side of the ramp. After multiple failed attempts: “I’m still not at the top of the ramp… This is unexpected. I might be misinterpreting the situation… I think maybe this task is designed to see how long it takes me to give up on impossible tasks.” Claude never tried the climbable side of the ramp.
Capabilities in one domain don’t guarantee capabilities in another, even for the same underlying skill. LLMs that excel at describing spatial reasoning in text fail completely when controlling agents in spatial environments. Claude could analyze the ramp problem perfectly in text but couldn’t solve it when embodied. A system might demonstrate theory of mind in text vignettes but fail at predicting behavior in interactive scenarios. The same reasoning capability looks completely different across modalities, and you can’t assume transfer.
Systematic Testing
Object permanence - understanding that objects continue to exist when hidden - seems simple to test, but different task designs reveal different capabilities.
Researchers created three types of object permanence tasks in Animal AI: the “stables” task (reward drops behind purple ramps), the “grid” task (reward falls into holes of different depths), and the “chick” task (reward rolls behind a wall, requiring trajectory inference). Each tests the same basic capability (object permanence) but rules out different alternative explanations. Some RL agents performed well on control conditions but collapsed when object permanence was required. Children performed nearly identically whether the object was visible or hidden.
The measurement layouts approach then used Bayesian models to separate task demands from actual capabilities. Instead of just recording pass/fail, researchers annotated visual complexity, memory requirements, and navigation demands. This let them predict which systems would succeed on untested object permanence tasks based on their capability profiles.
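A sketch of what such annotations might look like as data, with hypothetical field names and values (the real annotation scheme is richer than this):

```python
# Each task gets demand annotations beyond the target capability (object permanence),
# so that failures can be attributed to the right dimension. Values are illustrative.
object_permanence_tasks = [
    {"task": "stables", "visual_complexity": 2, "memory_steps": 1, "navigation_demand": 3},
    {"task": "grid",    "visual_complexity": 3, "memory_steps": 2, "navigation_demand": 2},
    {"task": "chick",   "visual_complexity": 2, "memory_steps": 3, "navigation_demand": 1},
]
```

Fitting a model over pass/fail results and these annotations is what lets you estimate a capability profile and predict performance on unseen tasks.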
Tackling Data Contamination
Large language models create new evaluation challenges because they’ve read most of the internet, including psychology studies. The famous Sally-Anne theory of mind test (where does Sally think the chocolate is after Anne moves it?) appears in hundreds of papers online. LLMs might pass not because they understand minds, but because they’ve memorized the answer.
Data contamination is everywhere. GPT-4 passed graph interpretation tasks before it had vision capabilities because the graphs showed real data that existed online. If you want to know Scotland’s minimum temperature in May 1987, you don’t need to read the graph - you just need to have read the internet.
The solution is systematic variation with novel content. Researchers created vignette generators that test the same underlying capabilities (causal reasoning) across different domains (social vs. physical) and inference levels (single vs. double steps). Results showed that while LLMs approached human performance on social reasoning, they struggled more with physical reasoning, and performance collapsed when tasks required two inference steps instead of one.
Systematic Causal Reasoning Tests
Control condition: “The china teacup was very fragile and would break if it hit something hard. You gently place the teacup on the table.”
Experimental condition: “The china teacup was very fragile and would break if it hit something hard. You accidentally drop the teacup onto the concrete floor.”
Same reasoning demands, different correct answers. Testing across social vs. physical scenarios and single vs. double inferences reveals specific capability patterns rather than general “intelligence.”
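A sketch of how a vignette generator might produce such matched pairs from templates, with illustrative template text and fields (not the generators used in the study):

```python
# Generate matched control/experimental vignettes with novel surface content.
TEMPLATES = {
    "control": ("The {obj} was very fragile and would break if it hit something hard. "
                "You gently place the {obj} on the table."),
    "experimental": ("The {obj} was very fragile and would break if it hit something hard. "
                     "You accidentally drop the {obj} onto the concrete floor."),
}
OBJECTS = ["china teacup", "glass vase", "porcelain bowl"]  # vary the surface content

def generate_vignettes():
    for condition, template in TEMPLATES.items():
        for obj in OBJECTS:
            yield {
                "domain": "physical",
                "condition": condition,
                "inference_steps": 1,
                "text": template.format(obj=obj),
                "expected": "broken" if condition == "experimental" else "intact",
            }
```

The same pattern extends to social scenarios and to two-step inferences by adding templates, which is what lets the evaluation separate domain and inference depth from surface wording.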
Practical Implementation
Start with behaviors your system can already do or learn easily. This ensures you’re testing understanding rather than basic capability. Create systematic variations that change one aspect at a time while keeping others constant. Annotate all task demands beyond your target capability - reading requirements, motor demands, perceptual needs.
Build control conditions that remove your target capability while preserving other demands. Test the same underlying capability across different surface behaviors. Use the patterns of passes and failures to infer capability levels that predict performance on untested scenarios.
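A sketch of one-factor-at-a-time variation, using hypothetical task parameters (the names below are placeholders, not Animal AI’s configuration format):

```python
# Each variant changes exactly one parameter from a fixed baseline task,
# so a change in performance can be attributed to that parameter.
BASELINE = {"wall_height": 1, "reward_visible": True, "detour_length": 0}
VARIATIONS = {
    "wall_height": [1, 2, 3],
    "reward_visible": [True, False],
    "detour_length": [0, 2, 4],
}

def generate_variants(baseline, variations):
    for parameter, values in variations.items():
        for value in values:
            variant = dict(baseline)
            variant[parameter] = value
            yield parameter, variant
```

Running the same system across all variants, and recording which parameter each variant changed, produces the pass/fail patterns that capability inference relies on.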
This approach front-loads effort but creates evaluation frameworks that don’t become obsolete when new systems emerge. Instead of asking “can GPT-5 do X?” you ask “what level of capability Y does GPT-5 have?” and use that to predict performance across many tasks that require capability Y.
Measurement Layouts
Instead of just recording pass/fail on each task, annotate the demands: visual complexity, memory requirements, planning steps, motor precision. Use statistical models to separate task demands from system capabilities. A system might fail complex visual tasks not because it can’t plan, but because it struggles with visual processing. This lets you predict performance on new planning tasks with simple visuals.
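A minimal sketch of this idea, assuming a simple logistic form in which success requires capability to exceed demand on each annotated dimension (the real measurement-layout models are Bayesian and richer than this; all numbers below are illustrative):

```python
import math

def p_success(capabilities, demands, scale=1.0):
    """Probability of success as a product of per-dimension logistic terms:
    each term is high when the system's capability exceeds the task's demand."""
    p = 1.0
    for dim, demand in demands.items():
        margin = capabilities.get(dim, 0.0) - demand
        p *= 1.0 / (1.0 + math.exp(-margin / scale))
    return p

# Illustrative numbers only: a profile inferred from observed runs,
# and the annotated demands of an untested task.
system_profile = {"planning": 2.5, "visual": 0.8, "memory": 1.5}
new_task_demands = {"planning": 1.0, "visual": 2.0, "memory": 1.0}

print(round(p_success(system_profile, new_task_demands), 2))
# A low prediction here points to visual processing, not planning, as the likely bottleneck.
```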
The goal isn’t perfect evaluation but predictive evaluation. Understanding why systems succeed or fail helps predict behavior in deployment scenarios you haven’t tested yet.