AI Evaluation: Capabilities & Safety
Global AI Governance Landscape
Overview
Frontier AI governance is moving from high-level principles into binding implementation, and evaluations have become one of the few concrete technical mechanisms that regulators and labs can use to operationalize safety. This lecture surveys the four major regulatory poles (the EU, the US, China, and the UK), explains how evaluation obligations are embedded in each, and works through what evaluations can and cannot deliver on their own.
This lecture was taught by Seán Ó hÉigeartaigh.
Key Takeaways
- The world has fragmented into four distinct governance philosophies (the EU’s horizontal risk-based act, the US’s preemption-versus-states tug-of-war, China’s pre-release approval architecture, and the UK’s still-pending frontier-specific bill), and this fragmentation is itself now a structural feature of the landscape, not a transitional state.
- Evaluations are the regulatory lynchpin for frontier governance because pre-deployment testing is the only practical way to constrain models whose most serious potential harms (CBRN uplift, loss of control) are too consequential to wait for after-the-fact enforcement.
- Responsible scaling policies are described as voluntary, but once they are cited in legislation or used as a basis for regulatory expectation they begin to carry real legal weight. The label “voluntary” understates their effective force.
- Evaluations only ever establish a lower bound on capability: a model that does not display a capability under controlled test conditions may still display it when fine-tuned, scaffolded, given new tools, or pushed by a determined real-world user, so governance has to specify the conditions under which re-evaluation is triggered.
- The fastest-growing blind spot is internally deployed models (frontier systems used inside companies but never released), where almost no existing regulation applies, even though loss-of-control and exfiltration risks are most acute precisely there.
Concepts
- General-purpose AI (GPAI): Models capable of being adapted to many downstream tasks. The EU AI Act gives them their own regulatory category.
- Systemic risk: Under the EU AI Act, a category triggered by capability, by reach across users, and by centrality in the broader ecosystem.
- Compute threshold: A quantitative trigger (10^25 FLOPs in the EU Act, 10^26 in California’s SB 53) above which heavier obligations attach.
- Preemption: A legal doctrine allowing federal rules to override conflicting state rules. It is central to the current US fight over AI regulation.
- Responsible scaling policy (RSP) / Frontier safety framework: A lab’s published commitment tying capability thresholds to safeguards, evaluation cadence, and deployment decisions.
- AI Safety Institute (AISI) / CAISI: National bodies (UK AISI, US CAISI, and counterparts in Japan, Singapore, and elsewhere) that conduct pre- and post-deployment evaluations and coordinate methodology.
- CBRN uplift: A model’s contribution to a user’s ability to cause chemical, biological, radiological, or nuclear harm.
- Information asymmetry: The growing gap between what labs know about their most capable internal models and what regulators or independent researchers can see.
Detailed Notes
Why the landscape is fragmented
ChatGPT was released in November 2022. That is a long time in technology and a very short time in law. The academic community and the governance institutions have struggled to keep pace, and the response in different jurisdictions has diverged sharply. Over seventy jurisdictions now have AI policy activity, and the OECD’s AI Policy Observatory tracks roughly a thousand distinct initiatives.
The result is not just regulatory variety but four genuinely different philosophies of how to govern this technology. Understanding the shape of each is the prerequisite for understanding where evaluations fit.
The EU: a horizontal risk-based act
The EU AI Act is the most developed and comprehensive framework. It classifies AI systems by risk into four tiers. Unacceptable-risk practices (social scoring, biometric surveillance in public spaces, manipulation of vulnerable groups) are fully prohibited. High-risk applications (employment, education, critical infrastructure, law enforcement) are subject to conformity assessments. Limited-risk systems carry transparency obligations such as disclosing AI interaction. Minimal-risk systems have no binding requirements.
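The four-tier structure can be read as a simple lookup from tier to obligation. The sketch below is illustrative only: the tier names and obligations are paraphrased from these notes, while the example use cases, the `obligation_for` helper, and the choice to default unknown cases to the minimal tier are assumptions made for the example. Real classification turns on the legal text, not on a table like this.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


# Obligation attached to each tier, paraphrased from the notes.
OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: "prohibited",
    RiskTier.HIGH: "conformity assessment",
    RiskTier.LIMITED: "transparency obligations",
    RiskTier.MINIMAL: "no binding requirements",
}

# Hypothetical use-case labels, chosen only to illustrate the mapping.
EXAMPLE_USE_CASES = {
    "social scoring": RiskTier.UNACCEPTABLE,
    "employment screening": RiskTier.HIGH,
    "critical infrastructure": RiskTier.HIGH,
    "customer service chatbot": RiskTier.LIMITED,
}


def obligation_for(use_case: str) -> str:
    """Look up the obligation for a use case; unknown cases fall to the minimal tier."""
    tier = EXAMPLE_USE_CASES.get(use_case, RiskTier.MINIMAL)
    return OBLIGATIONS[tier]


for case in ("social scoring", "employment screening", "spam filter"):
    print(case, "->", obligation_for(case))
```

The point of the table form is that obligations attach to the use case, not to the underlying model, which is why general-purpose models need the separate track described next.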
General-purpose AI sits in its own track. All GPAI faces baseline transparency and copyright obligations, including disclosure of training data sources. The Act’s General-Purpose AI Code of Practice then operationalizes the obligations for models that pose systemic risk. Systemic risk is defined not only by capability but by reach (how many people the model touches) and ecosystem centrality (how much else depends on it). A 10^25 FLOPs training-compute threshold automatically triggers the systemic-risk category.
Models in that category are expected to be tested for CBRN uplift, loss-of-control and misalignment risks, and other systemic hazards. The Act expects adversarial testing, capability reporting, and safety incident notification.
Two pressures are now bearing on the Act. First, the compute threshold was chosen when very few models would cross it; many more do now, raising the question of whether the rule remains practically administrable. Second, there is sustained commercial and political pressure, particularly from the US, to soften the requirements.
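To make the compute trigger concrete, here is a minimal sketch of how a training run might be checked against the two thresholds mentioned in these notes. The estimate uses the common rule of thumb that training compute is roughly 6 x parameters x training tokens; that approximation, the function names, and the example model sizes are assumptions for illustration, not how any regulator actually measures compute.

```python
EU_SYSTEMIC_RISK_FLOP = 1e25  # EU AI Act threshold cited in these notes
CA_SB53_FLOP = 1e26           # California SB 53 threshold cited in these notes


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Rough training compute via the ~6 * N * D rule of thumb (an assumption)."""
    return 6.0 * n_params * n_tokens


def thresholds_crossed(training_flop: float) -> list[str]:
    """Return the names of the thresholds that the given compute figure exceeds."""
    crossed = []
    if training_flop > EU_SYSTEMIC_RISK_FLOP:
        crossed.append("EU systemic risk (1e25)")
    if training_flop > CA_SB53_FLOP:
        crossed.append("California SB 53 (1e26)")
    return crossed


# Hypothetical models: (parameters, training tokens).
for params, tokens in [(70e9, 15e12), (400e9, 15e12)]:
    flop = estimate_training_flop(params, tokens)
    print(f"{flop:.2e}", thresholds_crossed(flop))
```

Under these assumptions the smaller hypothetical model lands just below the EU threshold and the larger one above it. That is the administrability concern in miniature: a fixed number that once singled out a handful of models now sits in the middle of the range that ordinary frontier training runs occupy.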
The US: federal minimalism against state activism
The US has gone through a sharp shift between administrations. The 2022–24 period saw considerable transatlantic alignment on risk and on the value of comprehensive regulation, including active US participation in the Bletchley process. The current administration takes a very different view, has been heavily influenced by the commercial sector’s concerns about innovation costs, and has explicitly pushed back against international regulation of US companies.
The result at federal level is minimalist: a 2025 executive order promoting a “minimally burdensome” national framework and laying groundwork for preemption of state laws, followed by a brief framework document along similar lines. There is no federal AI act.
State-level activity has filled the vacuum. California’s SB 53 targets the largest frontier models and requires developers to publish safety policies (responsible scaling policies or frontier safety frameworks) with built-in expectations of evaluation. New York’s RAISE Act, after several revisions, now follows a similar model. Texas takes a different angle, focused on liability and the circumstances under which developers can be sued. One notable mechanism is a safe-harbor provision that protects companies from liability for vulnerabilities they discover through their own internal testing and fix within sixty days. Internal evaluations, in other words, are being woven into the liability regime as well as into external oversight.
The unresolved fight is over preemption. Where the federal framework explicitly covers an area (child sexual abuse material is the clean example), state rules are likely to be preempted. Where it does not, states will probably retain standing to set their own requirements. The outcome will be settled case by case over the coming months.
China: governance embedded in the architecture
China’s approach, as documented by analysts at the Carnegie Endowment, embeds governance in the system architecture itself rather than relying solely on post-deployment enforcement. The tools are algorithmic registration, security reviews, content controls, and platform liability. Models must be submitted for review of the algorithm and of the content they produce.
The regime has teeth: Baidu’s Ernie was delayed by more than six months in 2023–24 before release. Approval also includes a content dimension: models must reflect what are framed as acceptable socialist values, which is most visible in the kinds of political questions Chinese models will not answer. Earlier proposed rules requiring all LLM outputs to be accurate were softened after the technical community pushed back that hallucinations cannot be driven to zero.
Chinese safety ecosystems are less mature than their Western counterparts. DeepSeek, for instance, is a far smaller organization than OpenAI or Anthropic, with a safety and governance team of only a handful of people. The pattern is a stricter approval regime sitting on top of a less developed internal safety practice, which surfaces in periodic crackdowns when deployed systems fail to meet expected robustness standards.
The UK: between the EU and the US
The UK sits between the two larger blocs. The previous government proposed relying on existing sector regulators (Ofcom and others), empowered to handle most AI use cases, with a central risk function and a separate frontier-AI bill for the most capable models. That bill has been promised for roughly two years and has been repeatedly delayed, partly because of the difficulty of remaining compatible with both US and EU partners. Frontier AI transparency was reportedly intended as a core element, but the final text has not been published.
Where evaluations fit
The case for evaluations as a governance instrument is straightforward. Outright bans are not desirable, since these systems are useful in many contexts. But for the most powerful models, the kinds of harm that matter most (CBRN uplift, loss of control) are too large and potentially irreversible to wait for evidence after the fact. The remaining option is pre-deployment demonstration that a model does not exceed defined dangerous thresholds, and evaluation is the technical mechanism that produces that evidence.
This logic is now visible across multiple regimes. Twelve frontier labs have published responsible scaling policies or frontier safety frameworks with capability thresholds. Anthropic’s ASL ladder is the canonical example, running from chatbot-level capabilities at ASL-1 up through ASL-3 (meaningful uplift on chemical synthesis) to ASL-4 (autonomous AI research and development, including recursive self-improvement). These frameworks are described as voluntary, but the label is misleading. Once a voluntary commitment is cited in legislation or treated by regulators as a baseline expectation, it carries effective regulatory weight.
What evaluations look like in practice. Four categories matter most:
- Dangerous capabilities: bio- and chem-uplift, expert-level cyber attack tasks, and increasingly zero-day vulnerability discovery in widely deployed software.
- Autonomy: how long a task a model can complete without supervision. METR’s measurements show the length of task completable at a 50% success rate doubling roughly every seven months.
- Alignment and behavioural evaluations: scheming, deception, sandbagging (a model pretending to be less capable than it is during evaluation), and jailbreak resistance.
- Sociotechnical and fairness evaluations: bias and discrimination in deployment contexts, transparency, and explainability.
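The autonomy trend in the list above implies simple exponential arithmetic. The sketch below extrapolates a task horizon under the assumption that the roughly seven-month doubling simply continues; the starting horizon of one hour and both function names are hypothetical, and nothing here is a forecast.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time reported in these notes


def projected_horizon(current_hours: float, months_ahead: float) -> float:
    """Extrapolate the 50%-success task horizon if the doubling trend holds."""
    return current_hours * 2.0 ** (months_ahead / DOUBLING_MONTHS)


def months_until(current_hours: float, target_hours: float) -> float:
    """Months needed to reach a target horizon under the same assumption."""
    return DOUBLING_MONTHS * math.log2(target_hours / current_hours)


# Hypothetical starting point: a one-hour task horizon today.
print(projected_horizon(1.0, 14.0))  # two doublings
print(months_until(1.0, 8.0))        # three doublings
```

The governance relevance is that an evaluation result tied to today's horizon ages quickly: under this trend, two doublings fit inside a little over a year.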
A live example of internal-only frontier capability. Anthropic recently disclosed an internal model (substantially more capable than its released systems) that was held back from public release because it could identify zero-day vulnerabilities in widely deployed software. Anthropic partnered with around forty companies to find and patch those vulnerabilities before deciding how to deploy further. The behaviour is plausibly responsible. But the existence of an unreleased, highly capable model that the broader research and policy community cannot see is the cleanest illustration so far of the information asymmetry problem: governance frameworks designed around release events have no purchase on a model that is never released.
What evaluations can and cannot do
Evaluations establish a lower bound on capability. If a model demonstrates a capability under test, it has the capability. If it does not, the only safe conclusion is that the capability could not be elicited under those conditions, not that it is absent. A model that passes today’s tests can change substantially in six months through fine-tuning, scaffolding, or new tool access. Governance has to specify when re-evaluation is required, not treat a single pre-deployment pass as permanent.
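The lower-bound logic can be sketched as a small decision rule. Everything named here is hypothetical: the `Finding` labels, the list of change events, and the six-month staleness limit are illustrative choices, not drawn from any regulation or lab policy. The point is only that a negative result is recorded as "not elicited" and expires when the model or its environment changes.

```python
from dataclasses import dataclass
from enum import Enum


class Finding(Enum):
    DEMONSTRATED = "capability demonstrated"
    NOT_ELICITED = "not elicited under test conditions"  # never recorded as "absent"


@dataclass(frozen=True)
class EvalRecord:
    capability: str
    finding: Finding
    months_since_eval: float


# Hypothetical change events that invalidate an earlier negative result.
RETEST_EVENTS = {"fine_tuning", "new_scaffolding", "new_tool_access"}
MAX_AGE_MONTHS = 6.0  # hypothetical staleness limit


def needs_reevaluation(record: EvalRecord, events_since_eval: set[str]) -> bool:
    """A NOT_ELICITED result is only a lower bound, so it expires on change or age."""
    if record.finding is Finding.DEMONSTRATED:
        # The capability is established; the open question is safeguards, not retesting.
        return False
    if events_since_eval & RETEST_EVENTS:
        return True
    return record.months_since_eval > MAX_AGE_MONTHS


record = EvalRecord("cyber: zero-day discovery", Finding.NOT_ELICITED, 2.0)
print(needs_reevaluation(record, {"new_tool_access"}))
print(needs_reevaluation(record, set()))
```

A real trigger list would need to be far more specific about what counts as a material change, which is exactly the part that governance has to spell out.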
Some forms of deceptive alignment can be detected under controlled conditions, but assessing risks from autonomous systems operating over long horizons in open environments is much harder than testing on bounded tasks in a sandbox. Evaluations also depend on standardized methodology to be comparable across models and across time, and methodology is still actively evolving.
The implication is that evaluations cannot stand in for governance on their own. They are one part of a toolkit that also has to include ongoing post-deployment monitoring, incident response, human oversight, transparency obligations, whistleblower channels, and international coordination on shared methodology. The UK AISI currently chairs the international AISI network and has shared evaluation standards as one of its 2026 goals. The US CAISI (formerly USAISI) retains its evaluation function and has pre-deployment access agreements with Anthropic and OpenAI. Third-party groups including METR (formerly ARC Evals) and Apollo Research, which focuses on deception and scheming, provide independent technical capacity.
The internal-deployment gap
Most existing regulation targets the point of public release. The EU AI Act is built around obligations triggered before placing a model on the market. State-level US laws focus on what frontier developers must disclose about deployed systems. China’s regime requires approval before release. None of these has a firm grip on the case where the most capable model in the world sits inside a company and is never released externally, or is released only to a handful of enterprise partners.
The risks there are not hypothetical. A loss-of-control failure in an internally deployed system that managed to exfiltrate itself to external servers would have consequences that bear no relation to whether the model was ever placed on a public market. Closing this gap will require transparency mechanisms that reach inside development: third-party audit rights with genuine access, whistleblower protections, and pre-deployment evaluation regimes that do not depend on a public release event as their trigger.
Compute and international coordination
Compute governance (export controls, hardware-level monitoring, and in principle on-chip mechanisms that constrain which training runs are possible) is the other major lever under active discussion. It is a hard constraint that could bite directly on the development of the largest models without requiring fine-grained inspection of any individual model. It is also contested, and the technical and political questions are far from settled.
International coordination matters because evaluations only generate useful comparable evidence when methodologies are shared. Fragmentation across jurisdictions imposes real costs on companies trying to comply with different rules in different places, and a tension runs through the whole field: standardize methodology so that results are comparable, but leave room for innovation so that evaluations can keep up with rapidly improving systems. Both are necessary.
Conclusion
Frontier AI governance has moved in three years from high-level OECD-style principles into binding implementation, and evaluations are now load-bearing in every major regime. The EU asks for them through the Code of Practice. California and New York require developers to report on them. China demands them before approval. The UK is preparing to require them at the frontier. What evaluations can carry, and what they cannot, is the question that determines whether any of this works.
The honest answer is that evaluations are necessary, useful, and insufficient. They establish lower bounds on what a model can do under test conditions. They do not forecast what the same model will do after six months of fine-tuning and scaffolding, and they do not yet have firm purchase on internally deployed systems that never reach the public market. The regulatory layer is what is supposed to convert imperfect evaluation evidence into reliable safety decisions, and the regulatory layer is still being built: fragmented across philosophies, in active tension with commercial pressure, and dependent on a technical community that has to keep producing better evaluations faster than the systems themselves improve.
The practical posture for anyone working in this space is to engage across the boundary. Governance written without close contact with technical reality becomes either vague enough to ignore or onerous enough to undermine. Technical evaluation built without awareness of how its results will be used in regulation drifts toward the questions that are easy to measure rather than the ones that matter. The interesting work over the next two years sits exactly on that seam.