AI Evaluation: Capabilities & Safety
Time Horizons and Optimization Benchmarks
Overview
Quantifying how capable a model is (in a way that holds up across years of progress and across qualitatively different tasks) is one of the hardest open problems in AI evaluation. This lecture walks through METR’s time horizon metric, the qualifications and misreadings that come with it, and then pivots to optimization benchmarks, which score agents against the world instead of against a human answer key.
This lecture was taught by Tom Cunningham from METR.
Key Takeaways
- A model’s time horizon is the length of human task it can complete with 50% success on METR’s task suite. The metric is grounded, unbounded, and has been doubling roughly every seven months, possibly faster recently.
- The 50% number describes between-task variation, not stochastic flakiness: a given model tends to either reliably succeed or reliably fail on a given task, so “50% reliability” is the wrong mental model.
- Time horizon is a strong predictor of capability on well-defined, machine-gradable software engineering tasks, but big outliers like MirrorCode show that highly parallelizable work breaks the pattern. For such tasks, “team time” may be a better predictor than individual human time.
- Optimization benchmarks have higher statistical power than binary benchmarks, can register superhuman performance, and map naturally to real economic activity, but they introduce thorny issues around skew, expenditure budgets, multi-metric trade-offs, and contamination.
- The choice between fixed-start and running-start optimization evaluations is load-bearing. Fixed-start permits apples-to-apples comparison across agents but risks contamination from human discoveries. Running-start measures genuine frontier-pushing but makes cross-agent comparison harder.
Concepts
- Time horizon: The length of human task a model can complete with a 50% success rate, estimated by fitting a logistic curve of model success against log task length.
- METR task suite: ~200 self-contained, machine-gradable software-engineering tasks with human baselines, used to compute time horizon.
- Between-task vs within-task variation: Whether differences in model success come from some tasks being consistently easy or hard (between) versus the same task being flaky across runs (within).
- Optimization benchmark: An evaluation in which the agent’s output is scored on a continuous, unbounded quantity (efficiency, accuracy, loss) rather than against a human-provided answer.
- Algorithmic efficiency: The compute required to train a model to a given level of capability. It is currently improving roughly 3x per year.
- Recursive self-improvement (RSI): The argument that once models can autonomously improve the training of their successors, capability gains compound super-exponentially.
- Fixed-start vs running-start: Two designs for optimization evals: start the agent from a historical checkpoint everyone shares, or start it from the current state-of-the-art and ask it to push further.
- Contamination: When an agent’s apparent optimization ability is inflated by training data or web access that exposes it to solutions humans have already found.
Detailed Notes
Why measuring AI capabilities is hard, and why it matters
Quantifying machine intelligence is among the oldest and hardest open problems in computer science: Turing, von Neumann, Shannon, and their successors all proposed measures, and none have proved fully satisfactory. The practical stakes are now sharp. Almost every economic and safety claim about AI is downstream of an answer to “what can the model actually do, and how does that compare to a human.” Without a grounded metric, debates about deployment, governance, and risk float free of evidence.
The methodological response METR has converged on is time horizon: ask how long a task takes a human, then ask whether a model can complete tasks of that length. The metric is influential precisely because it is visceral: saying that the frontier model can complete tasks that take humans twelve hours communicates more than any pile of benchmark scores. The flip side is that the metric is also frequently misread, and the qualifications matter.
How time horizon is constructed
For each frontier model release, METR runs the model against a fixed suite of roughly 200 tasks. Each task has a duration estimate, either an actual human baseline (paid humans timed completing the task) or a careful estimate. For a given model, every task either succeeds or fails. Plotted against log task duration on the x-axis and success rate on the y-axis, the result is a familiar logistic curve.
The model’s time horizon is the point where that fitted curve crosses 50%. For Opus 4.6 (currently the longest), the central estimate is about 12 hours, with the upper end of the confidence interval reaching roughly 50 hours. GPT-2’s time horizon was about 4 seconds. o1’s was about an hour.
Plotted across models by release date, with time horizon on a log axis, the points fall on a roughly straight line, which corresponds to exponential growth. This was the headline finding from METR’s first time-horizon paper in early 2025. At publication the doubling time was estimated at seven months. The trend has held since then. Some recent analysis suggests the doubling time may have shortened to four months, though that remains contested.
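The arithmetic behind a doubling-time claim can be sketched in a few lines. The helper names below are hypothetical, and the only figures taken from these notes are the 4-second and 12-hour horizons and the seven-month and four-month doubling times. The trend fit is a plain least-squares line through log2(horizon), which is an assumption about method, not a description of METR's own fitting procedure.

```python
import math

def doublings(h_start, h_end):
    """Number of doublings needed to go from one horizon to another (same units)."""
    return math.log2(h_end / h_start)

def extrapolate(h_now, months_ahead, doubling_months=7.0):
    """Horizon after months_ahead if the doubling time stays constant."""
    return h_now * 2.0 ** (months_ahead / doubling_months)

def doubling_time(months, horizons):
    """Least-squares slope of log2(horizon) against time, returned as months per doubling."""
    ys = [math.log2(h) for h in horizons]
    mx = sum(months) / len(months)
    my = sum(ys) / len(ys)
    num = sum((m - mx) * (y - my) for m, y in zip(months, ys))
    den = sum((m - mx) ** 2 for m in months)
    return den / num  # inverse of the slope in doublings per month

# GPT-2 (about 4 seconds) to Opus 4.6 (about 12 hours), both in seconds.
print(round(doublings(4, 12 * 3600), 1))   # 13.4 doublings
# A 12-hour horizon one doubling period later, then a year out at a 4-month doubling time.
print(extrapolate(12, 7))                  # 24.0 hours
print(extrapolate(12, 12, 4.0))            # 96.0 hours
```

The gap between the seven-month and four-month readings compounds quickly, which is why the disagreement over the recent trend matters for any forecast more than a year out.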
Reading the per-model plot: for a single model, each dot on the per-task chart is one task. The x-axis is the task’s human-time estimate (log scale) and the y-axis is whether the model succeeded. Short tasks cluster near 100% success. Long tasks become a mix of successes and failures. The fitted logistic curve crosses 50% at the model’s time horizon. For Opus 4.6 that crossover sits at about 12 hours, meaning the model completes about half of tasks of that length.
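A minimal sketch of the curve fit, assuming a plain two-parameter logistic regression of success on log2 task duration, fitted by gradient ascent. METR's actual procedure may differ in weighting, regularization, and how confidence intervals are computed. The function name, the synthetic data, and the 120-minute "true" horizon are all illustrative assumptions.

```python
import numpy as np

def fit_time_horizon(minutes, success, steps=5000, lr=0.5):
    """Fit P(success) = sigmoid(a + b * z), z = standardized log2(minutes).

    Returns the task length in minutes where the fitted curve crosses 50%.
    """
    x = np.log2(np.asarray(minutes, dtype=float))
    y = np.asarray(success, dtype=float)
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd                      # standardize for stable optimization
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a + b * z)))
        a += lr * np.mean(y - p)           # gradient of the mean log-likelihood
        b += lr * np.mean((y - p) * z)
    z50 = -a / b                           # a + b * z = 0 at the 50% crossing
    return float(2.0 ** (mu + sd * z50))

# Synthetic suite: 200 tasks from 1 minute to about 68 hours, true horizon 120 minutes.
rng = np.random.default_rng(0)
minutes = 2.0 ** rng.uniform(0, 12, size=200)
p_true = 1.0 / (1.0 + np.exp(np.log2(minutes) - np.log2(120.0)))
success = rng.random(200) < p_true
print(f"estimated horizon: {fit_time_horizon(minutes, success):.0f} minutes")
```

With only 200 tasks the estimate lands near, not on, the true value. If every short task succeeds and every long task fails, the data are perfectly separable and this unregularized fit does not converge to a finite slope, so a real implementation would need to handle that case.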
What the metric does and does not say
It is grounded. Unlike a composite benchmark index, time horizon translates into a real-world statement: a brand-new task can be slotted onto the x-axis using a human time estimate, and the model’s success probability can be read off the curve. This is not exact, but it provides a frame the average benchmark score does not.
It is unbounded. Time horizon can in principle grow without limit, or at least until the model is doing tasks no human can do in any amount of time. Composite indices saturate. Time horizon does not.
It applies to a specific kind of task. The 200 tasks are self-contained, machine-gradable software-engineering problems, assembled in practice as a convenience sample combining three earlier benchmarks. They are not a random sample of “what software engineers do all day,” and the requirement that they be machine-gradable excludes much real engineering work, where success criteria are themselves ambiguous. The right reading of “12-hour time horizon” is therefore time horizon on well-defined software-engineering tasks, not time horizon on a typical engineer’s day.
“50% reliability” is the wrong mental model. Roughly two-thirds of the variance in success across tasks is between-task and only one-third is within-task. In other words, the model is not stochastically flaky at the 50% point. It consistently succeeds on some tasks at that duration and consistently fails on others. The right phrasing is “Opus 4.6 can complete about half of well-defined software-engineering tasks that take humans 12 hours,” not “Opus 4.6 can do 12-hour tasks with 50% reliability.”
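The between-task versus within-task split can be estimated from repeated runs using the law of total variance for binary outcomes: total variance equals the variance of per-task success rates plus the average of p(1 - p) within each task. The sketch below assumes every task is attempted the same number of times and uses simple plug-in estimates, which overstate the between-task share when there are few repeats per task. The function name is hypothetical.

```python
import numpy as np

def variance_shares(runs):
    """runs: 2-D array, rows = tasks, columns = repeated attempts (1 = success).

    Returns (between_task_share, within_task_share), which sum to 1.
    """
    runs = np.asarray(runs, dtype=float)
    p = runs.mean(axis=1)                  # per-task success rate
    between = p.var()                      # spread of success rates across tasks
    within = np.mean(p * (1.0 - p))        # average run-to-run variance inside a task
    total = between + within
    return between / total, within / total

# Every task is deterministic: all variation is between tasks.
print(variance_shares([[1, 1, 1, 1], [0, 0, 0, 0]]))
# Every task is a coin flip: all variation is within tasks.
print(variance_shares([[1, 0, 1, 0], [0, 1, 0, 1]]))
```

The two-thirds figure quoted above sits much closer to the first extreme than the second, which is the basis for preferring the "half of tasks" phrasing.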
Where there is genuine within-task stochasticity, simple engineering tricks raise reliability. Running the agent multiple times and taking a majority vote drives a 51% success rate toward 100% with enough samples, provided the runs are independent and the answers can be compared. If the answer is verifiable, the agent can check its own output and resubmit the best candidate. Across these tricks, the extra inference cost is typically negligible compared with the cost of paying a human to do the same task.
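How fast these tricks help depends on the single-run success rate. The sketch below computes both under idealized assumptions: runs are independent, majority voting is over a binary right/wrong answer, and the verifier in the best-of-n case never makes a mistake. The function names are hypothetical.

```python
from math import comb

def majority_vote_success(p, n):
    """P(a strict majority of n independent runs is correct); use odd n of moderate size."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

def best_of_n_success(p, n):
    """P(at least one of n independent runs succeeds), assuming a perfect verifier."""
    return 1 - (1 - p) ** n

for n in (1, 11, 101):
    print(n, round(majority_vote_success(0.51, n), 3),
          round(majority_vote_success(0.80, n), 3),
          round(best_of_n_success(0.51, n), 3))
```

Majority voting converges slowly when the single-run rate is barely above 50% and quickly when it is well above. Verification helps far faster, because one success among the samples is enough.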
Running out of long tasks
The headline trend has a logistical problem hiding underneath. To estimate a model’s time horizon reliably, the task suite needs enough tasks of a relevant length. As frontier models extend their reach, the suite needs more long tasks (tasks taking humans two days, a week, or longer). Building such a task is expensive: it must be realistic, non-hackable, machine-gradable, and require enough human time to anchor the duration estimate.
METR’s recent v1.1 release added about 20 tasks (and dropped a few broken ones), tilted toward longer durations. It extended the measurable horizon somewhat but did not keep pace with model progress. Confidence intervals at the top of the chart are widening as a result. The current internal rule of thumb is that the suite needs roughly eight new long tasks per month just to hold position.
Why long tasks are hard, and the MirrorCode outlier
No fully satisfactory theory explains why long tasks are differentially harder for models than for humans. Plausible candidates (coordination ability, multi-step reasoning, long-horizon planning) are gestures, not mechanisms. The empirical pattern is robust but underexplained.
The pattern also has visible outliers. MirrorCode, a recent benchmark from Epoch and METR, asks agents to translate large real-world codebases (on the order of 100,000 lines) from one language to another while passing a battery of correctness tests. Estimated human time runs to multiple weeks. By time-horizon logic, agents should fail at most of these tasks. Instead, frontier models complete most of them: on the time-horizon chart MirrorCode shows up as a large dot far to the right of the curve, and the benchmark was effectively saturated on release.
The working interpretation is that translation tasks decompose cleanly. A 17-week project, on this view, is really many one-day subtasks with relatively narrow interfaces between them, executed largely in parallel. The implication is that individual human time is the wrong x-axis for this class of work. Team time (how long a coordinated team would need) would predict model success better. Tasks that resist parallelization, requiring one person to hold the whole problem in their head, remain stubbornly hard.
From benchmarks to optimization
A standard benchmark tests the agent against a human reference answer. An optimization benchmark instead gives the agent a fixed input and scores its output on a continuous, unbounded quantity. The benchmark stops being “did you match the human?” and becomes “how much can you actually do?”
Optimization problems vary in how expensive each evaluation is. Speeding up a sorting algorithm is cheap to score: run it and measure. Tweaking a transformer architecture costs whatever a training run costs. Drug discovery may cost hundreds of millions per evaluated candidate. The framing scales across these regimes.
The broader picture is that human civilization has been running optimization processes for centuries: overall economic productivity rises about 2% per year, semiconductor density doubles every 18 months under Moore’s law, and so on. Deep learning has contributed to specific scientific advances (AlphaFold, AlphaEvolve, deep reinforcement learning results), but the contribution of LLMs to original knowledge so far has been small compared with their economic footprint. Optimization benchmarks are one route to measure that and to detect when it starts changing.
Why optimization benchmarks are attractive
Continuous outcomes have higher statistical power. A binary benchmark requires many tasks to discriminate between models. A continuous score can do the same work with two or three. Given how highly correlated capability metrics are across benchmarks, a small number of well-designed optimization tasks can carry a lot of signal.
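A rough back-of-envelope version of the power argument, assuming independent tasks and a normal approximation, with the sample size chosen so the gap between two models equals z standard errors. The effect sizes are hypothetical: a 50% versus 60% pass rate for the binary case, and a score gap of two noise standard deviations for the continuous case.

```python
import math

def tasks_needed_binary(p1, p2, z=1.96):
    """Tasks per model so that a pass-rate gap of p1 vs p2 is z standard errors wide."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z**2 * var / (p1 - p2) ** 2)

def tasks_needed_continuous(gap, noise_sd, z=1.96):
    """Tasks per model so that a mean-score gap is z standard errors wide."""
    return math.ceil(z**2 * 2 * noise_sd**2 / gap**2)

print(tasks_needed_binary(0.5, 0.6))        # 189
print(tasks_needed_continuous(2.0, 1.0))    # 2
```

The continuous case only needs so few tasks when the score gap is large relative to run-to-run noise. A noisy continuous metric loses this advantage.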
They naturally extend past human performance. Most benchmarks need an expert to write the answer key, capping measurable performance at expert level. Optimization benchmarks score against the world, not against a human, so superhuman scores are legible in the same framework as sub-human ones. Saturation is still possible (at the actual optimum), but it is a far higher ceiling.
They map to real economic activity. Much of what matters in practice is optimization: reduce loss, increase efficiency, accelerate a calculation. Benchmarks that score these quantities directly are closer to what users and labs actually care about than artificial Q&A formats.
Why optimization benchmarks are hard to do well
Skewed outcomes. GSO-Bench (built by Manish Shetty) takes about 100 algorithms from public repositories, baselines their runtime, and asks agents to produce faster versions. Some algorithms admit only 1% speedups. Others admit 10,000x speedups because they were never optimized. Averaging across them is a mess. Binary thresholding (“beat the human-level baseline”) sidesteps the skew but throws away the statistical power that motivated the benchmark in the first place. Geometric or harmonic means, or scaling each task by some human improvement reference, are partial fixes, all imperfect.
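The aggregation problem is easy to see on a toy set of speedups. The numbers below are invented for illustration and are not GSO-Bench results. The log-ratio score is one possible way to scale by a human improvement reference, not a scheme the benchmark is known to use.

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-task speedup factors: four modest wins and one unoptimized outlier.
speedups = [1.01, 1.05, 1.2, 2.0, 10000.0]

print("arithmetic:", mean(speedups))
print("geometric: ", geometric_mean(speedups))
print("harmonic:  ", harmonic_mean(speedups))

def relative_log_score(agent_speedup, human_speedup):
    """Fraction of the human's log-speedup that the agent achieved (1.0 = matched)."""
    return math.log(agent_speedup) / math.log(human_speedup)

print(relative_log_score(4.0, 16.0))   # agent got halfway to the human result in log terms
```

The arithmetic mean is dominated by the single 10,000x task, the harmonic mean nearly ignores it, and the geometric mean sits in between. Each choice encodes a different view of how much one spectacular win should count.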
Sensitivity to expenditure. An agent’s optimization performance depends on how many tokens, samples, and trial experiments it is allowed. A single number is not enough. The right object is a curve of performance against budget. That curve almost never asymptotes cleanly, so deciding when to stop spending is itself a judgment call. A useful stopping rule is “spend until you cross a relevant human curve.”
Budget-aware strategy. Even the shape of the curve depends on whether the agent knows its budget in advance. An agent told it has 10 million tokens explores differently than one told it has 1 million. Performance at the one-million mark can therefore differ between an “unbounded” run interrupted at 1M tokens and a “1M budget” run, because explore/exploit strategy shifts. This adds a hidden dimension to any reported result.
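A performance-against-budget curve and the "spend until you cross the human" stopping rule can be sketched as follows. The human reference is simplified here to a single score, not a curve, and the trial log format and function names are hypothetical.

```python
def best_so_far(trials):
    """trials: list of (tokens_spent_on_trial, score), in the order they were run.

    Returns [(cumulative_tokens, best_score_so_far), ...].
    """
    curve, spent, best = [], 0, float("-inf")
    for cost, score in trials:
        spent += cost
        best = max(best, score)
        curve.append((spent, best))
    return curve

def budget_to_cross(curve, human_score):
    """Smallest cumulative budget at which the agent matches the human reference, else None."""
    for spent, best in curve:
        if best >= human_score:
            return spent
    return None

trials = [(100, 1.0), (200, 0.8), (300, 1.5), (400, 1.4)]
curve = best_so_far(trials)
print(curve)                        # [(100, 1.0), (300, 1.0), (600, 1.5), (1000, 1.5)]
print(budget_to_cross(curve, 1.2))  # 600
print(budget_to_cross(curve, 2.0))  # None
```

A curve built this way from one long run is not interchangeable with separate runs at each fixed budget, for the budget-awareness reason given above, so a report should say which kind of curve it shows.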
Multi-objective trade-offs. Real optimization problems have several axes (for training an LLM: accuracy at fixed compute, compute at fixed accuracy, data at fixed accuracy and compute). The Pareto frontier is what matters, but mapping the entire frontier is usually too expensive. The fallback is to pick the objective the downstream decision actually depends on, hold the others fixed, and report the slope.
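For two objectives the Pareto frontier can be extracted with one sort and one pass. This sketch assumes each candidate is a (compute, accuracy) pair where lower compute and higher accuracy are better. The data points are invented.

```python
def pareto_frontier(points):
    """Return the (compute, accuracy) points not dominated by any other point.

    A point is dominated if another point has compute <= and accuracy >=,
    with at least one of the two strictly better.
    """
    frontier, best_acc = [], float("-inf")
    # Sort by compute ascending; at equal compute, highest accuracy first.
    for compute, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:             # strictly better than everything cheaper
            frontier.append((compute, acc))
            best_acc = acc
    return frontier

candidates = [(1, 0.5), (2, 0.4), (2, 0.7), (4, 0.7), (8, 0.9)]
print(pareto_frontier(candidates))     # [(1, 0.5), (2, 0.7), (8, 0.9)]
```

Computing the frontier is cheap. The expense is in evaluating enough candidates to populate it, which is why the fallback of fixing all but one objective is common.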
Contamination. If the task is “optimize this 2025 algorithm,” the best strategy may be to look up the 2026 literature. Web access and recent training data both put model results at risk of measuring retrieval rather than discovery. Newer agents will tend to be more contaminated, biasing cross-generation comparisons in their favor.
Fixed-start versus running-start
The contamination problem motivates a sharper design choice.
A fixed-start evaluation gives every agent the same historical checkpoint of a target algorithm (say, the version known in 2025). Agents are scored on how much they can improve it. This permits apples-to-apples comparison across agents and guarantees that improvement is possible (because humans have since found some). The risk is precisely that the agent recovers known human improvements from training data rather than discovering anything new, with newer agents systematically more contaminated.
A running-start evaluation instead pits each agent against the genuine current state of the art at evaluation time. The prompt is effectively “here is the best public algorithm anyone knows of; push it forward.” This measures the thing actually worth measuring (frontier-pushing), but the baseline shifts underfoot, making it hard to compare an agent evaluated in February against one evaluated in April. The TT-Discover paper from Stanford (early 2026) is a recent example of the running-start approach, applied across four algorithmic domains.
In practice, both designs are needed. Fixed-start gives cross-agent comparability. Running-start tests whether the agent can actually do new work.
Recursive self-improvement as the stakes
The reason all of this matters beyond benchmark construction is the role of optimization in AI development itself. Algorithmic efficiency in LLM training (the compute required to reach a given capability level) has been improving roughly 3x per year, sustained by tens of thousands of researchers tweaking GPU design, kernel optimization, model architectures, optimizers, data pipelines, post-training, and elicitation. At that rate the training cost for a fixed capability level falls by roughly an order of magnitude every two years (3x per year compounds to 9x) from this work alone.
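The compounding arithmetic behind the efficiency claim, as a short check. The only input taken from these notes is the roughly 3x-per-year rate, and the function names are hypothetical.

```python
import math

def gain_after(years, annual_gain=3.0):
    """Cumulative efficiency gain after a number of years at a constant annual rate."""
    return annual_gain ** years

def years_for_factor(factor, annual_gain=3.0):
    """Years needed to reach a given cumulative gain at a constant annual rate."""
    return math.log(factor) / math.log(annual_gain)

print(gain_after(2))                      # 9.0
print(round(years_for_factor(10), 2))     # 2.1
```

Two years at 3x per year gives 9x, and a full 10x takes about 2.1 years, so "an order of magnitude every two years" is a slight round-up.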
The recursive self-improvement argument simply asks what happens when models start contributing meaningfully to that stack themselves. If a model that is 1% better can produce a successor that is more than 1% better, the chain compounds super-exponentially without any increase in inputs. Optimization benchmarks are the empirical surface where evidence on this question will accumulate. They are how the field will find out, in advance of the fact, whether and when the chain starts running.
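A toy model makes the compounding condition concrete. It assumes each generation multiplies capability by (1 + gain) and that the gain is itself scaled by a constant feedback factor each generation. Both the model and its parameter names are illustrative assumptions, not a forecast.

```python
def capability_path(first_gain, feedback, generations):
    """Capability after each generation, starting from 1.0.

    feedback > 1: each successor improves by more than its parent did (super-exponential).
    feedback = 1: constant percentage gains (ordinary exponential growth).
    feedback < 1: gains shrink each generation and capability levels off.
    """
    cap, gain, path = 1.0, first_gain, [1.0]
    for _ in range(generations):
        cap *= 1.0 + gain
        gain *= feedback
        path.append(cap)
    return path

for feedback in (0.5, 1.0, 1.1):
    final = capability_path(0.01, feedback, 100)[-1]
    print(f"feedback={feedback}: capability after 100 generations = {final:.4g}")
```

Whether the chain runs away or fizzles depends entirely on which side of 1 the feedback factor falls, and that is the quantity optimization benchmarks are positioned to estimate.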
Conclusion
Time horizon is the most successful capability metric the field currently has, not because it answers every question but because it does three things at once: it stays grounded in a real-world quantity, it does not saturate, and it communicates progress in a form that policy makers and engineers can both reason about. The qualifications are real (the task suite is a convenience sample of machine-gradable software engineering, individual time, as opposed to team time, may be the wrong axis for parallelizable work, and the “50% reliability” phrasing actively misleads), but the underlying trend, with frontier time horizons doubling every several months, is the cleanest long-run signal of AI progress on offer.
Optimization benchmarks are where capability measurement is heading next. They trade the comfort of human-graded answers for the ability to register superhuman performance and to score the activities that actually drive economic and scientific value. The cost is a thicket of design choices (skew, budget sensitivity, multi-metric trade-offs, contamination, fixed-start versus running-start) that have no clean general solutions and have to be worked out per benchmark.
Be skeptical of headline numbers that elide these choices. A time horizon without its task class is a slogan. An optimization score without its budget curve is unfalsifiable. A fixed-start improvement number without a contamination check may be measuring memory rather than discovery. The discipline of capability measurement is, in the end, the discipline of stating what you measured precisely enough that someone else can argue with it.