AI Evaluation: Capabilities & Safety

Soft Governance

by Seán Ó hÉigeartaigh, Leverhulme Centre for the Future of Intelligence, University of Cambridge

Overview

Hard law is only one layer of AI governance. Standards bodies, AI safety institutes, summit communiqués, and voluntary frameworks together do much of the load-bearing work, generating the evidence and norms that regulators eventually codify. This lecture maps that soft-governance layer, situates the Chinese ecosystem inside it, and works through the near-future pressures that are likely to reshape both evaluations and the institutions that consume them.

This lecture was taught by Seán Ó hÉigeartaigh.


Key Takeaways

  1. AI safety institutes are not regulators. They test, measure, and evaluate, producing the technical evidence that regulators then use. That makes their independence and their access to frontier models the structural choke points of the whole soft-governance layer.
  2. Soft instruments harden over time through cross-reference: the NIST AI Risk Management Framework began as voluntary guidance but now carries quasi-binding weight through crosswalks into the EU AI Act, the OECD principles, the Hiroshima Process, and US state legislation.
  3. China is regulating actively and iteratively (algorithm recommendation rules in 2022, deep synthesis labeling in 2023, a mandatory pre-launch security assessment for generative AI services that delayed Baidu’s Ernie by roughly six months), and these approval-gated mechanisms have no real US analogue.
  4. The international AI safety institute network has already drifted in mandate (from “safety” toward “security and standards”) under shifting US political winds, and a similar drift has moved the summit series from Bletchley’s frontier-risk focus toward broader opportunity and global-majority agendas, leaving an open question about where frontier-specific governance conversations actually take place.
  5. The pace gap between three-year standards cycles and a frontier where capabilities are reportedly progressing roughly twice as fast as governments expected a year ago means any governance framework anchored to fixed thresholds (10^25 FLOPs, fixed benchmarks, voluntary pre-deployment access) is structurally vulnerable to capability jumps or recursive self-improvement.

Concepts

  • Hard law: Binding legal obligations with enforcement mechanisms and penalties for non-compliance. The EU AI Act sits on the harder end.
  • Soft governance: Standards, guidelines, and norms without direct legal backing that shape behaviour through authority, reference, and uptake.
  • AI safety institute (AISI): A state-backed technical body that tests, measures, and evaluates frontier models but does not itself regulate or certify.
  • NIST AI RMF: The US National Institute of Standards and Technology’s AI Risk Management Framework, structured around four functions (govern, map, measure, manage).
  • Crosswalk: A published mapping showing how one governance instrument’s requirements align with another’s, the mechanism by which soft standards acquire harder weight.
  • Presumption of conformity: A legal posture treating adherence to a designated standard as compliance with the underlying law.
  • Pre-deployment access: A voluntary arrangement granting an AI safety institute the ability to test a frontier model before it is released to the public.
  • Mandate drift: The gradual shifting of an institution’s scope under political or competitive pressure (for instance, from “safety” toward “security and innovation”).

Detailed Notes

Hard law and soft governance

Hard law and soft governance sit on a spectrum. At one end are binding obligations like the EU AI Act, with enforcement and penalties. At the other are voluntary instruments (standards, guidelines, professional norms, framework documents) that shape behaviour through authority and reference rather than coercion.

The line is porous. A purely voluntary standard becomes quasi-binding once other instruments cite it. The NIST AI Risk Management Framework is the cleanest example. Released in early 2023, with a generative-AI profile added in July 2024, it now functions as one of the most widely used operational standards for AI governance globally. It appears in US federal procurement language. Published crosswalks map it against the EU AI Act. The OECD AI principles reference it. The Council of Europe and the Hiroshima Process invoke it. US state legislation increasingly builds on it. None of this was hard-coded into NIST’s mandate, but the cumulative cross-referencing pulls voluntary guidance toward the binding end of the spectrum.

The framework itself has four components. Govern establishes the organisational culture, structures, and policies for AI risk management. Map identifies the context an AI system sits in and the risks specific to its use cases. Measure applies quantitative and qualitative tools to track those risks. Manage operationalises mitigations against the risks surfaced in mapping and measurement. The structure is generic enough to be portable across jurisdictions, which is part of why it has travelled.
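
The four functions and the crosswalk idea can be sketched as a simple data structure. This is a minimal illustration only: the function names come from the framework as described above, but the target requirement identifiers ("OTHER-REQ-A" and so on) are hypothetical placeholders, not real clause numbers from any instrument, and the helper names are assumptions made for this example.

```python
from enum import Enum


class RmfFunction(Enum):
    """The four NIST AI RMF functions named in the lecture."""
    GOVERN = "govern"
    MAP = "map"
    MEASURE = "measure"
    MANAGE = "manage"


# Hypothetical crosswalk: each RMF function maps to requirement IDs in
# some other instrument. The IDs are placeholders, not real clauses.
CROSSWALK = {
    RmfFunction.GOVERN: ["OTHER-REQ-A"],
    RmfFunction.MAP: ["OTHER-REQ-B", "OTHER-REQ-C"],
    RmfFunction.MEASURE: ["OTHER-REQ-D"],
    RmfFunction.MANAGE: [],  # no counterpart in the other instrument
}


def covered_requirements(implemented):
    """Return the sorted target requirements addressed by the given functions."""
    covered = set()
    for function in implemented:
        covered.update(CROSSWALK[function])
    return sorted(covered)


def unmapped_functions():
    """Return RMF functions with no counterpart in the other instrument."""
    return [function for function in RmfFunction if not CROSSWALK[function]]
```

Under these assumptions, an organisation that has implemented Map and Measure can read off which of the other instrument's requirements it has plausibly addressed. That lookup is the practical route by which a voluntary framework starts to carry compliance weight.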

AI safety institutes: what they do and what they don’t

The UK launched its AI Safety Institute alongside the Bletchley Summit in late 2023, evolving from a precursor unit chaired by Ian Hogarth. The US AISI was set up in parallel within NIST. Japan followed in February 2024, and Singapore, South Korea, Canada, France, India, and Kenya have since established their own. The EU AI Office occupies a hybrid position: it sits inside the regulatory machinery of the AI Act but performs information-gathering and standard-setting functions analogous to an AISI.

The core point about AISIs is that they are not regulators. They test, measure, and evaluate. They run dangerous-capability assessments across chemical, biological, radiological, nuclear, cyber, and autonomy domains. They conduct foundational research on safety, alignment, and interpretability. They publish trend reports. They run grant programmes for academic groups. They shape standards. The output of all of this is information that flows into regulators’ decisions. They do not themselves certify models as safe. That judgement is reserved for governments.

International coordination. At the Seoul summit, the International Network of AI Safety Institutes was established, since renamed the International Network for Advanced AI Measurement, Evaluation and Science. The UK is chairing in 2026 and has prioritised developing shared best practices for evaluation, a direct response to the lack of common methodological standards that has dogged the field. A NeurIPS workshop in late 2025 worked toward consensus on evaluation methodology, and that work continues.

Structural vulnerabilities. Three are worth naming. First, AISIs rely on voluntary pre-deployment access to leading models. If that access were revoked (and some labs have not signed even the existing agreements), the institutes’ effectiveness would collapse. This creates a soft incentive not to publish results too damaging to the labs whose cooperation is needed next quarter. To date the quality of AISI work, including the jailbreaks they find after in-house testing has cleared a model, has been an asset to labs rather than a liability, but the tension is real. Second, the absence of common evaluation standards gives labs an exit valve: Meta cited inconsistency across frameworks as a reason for declining to sign post-Bletchley commitments. Third, mandate drift. The UK institute moved from “safety” to “security,” dropping bias and misinformation from its central remit. The US institute rebranded as the Center for AI Standards and Innovation. Both shifts trace to the change in US administration and its flow-through effects on allied institutions. The core function (catastrophic-risk testing) has so far remained intact, but it is not guaranteed.

China’s regulatory stack

A persistent misconception is that China is doing little on AI regulation. The record shows the opposite. Provisions for algorithmic recommendation systems took effect in March 2022, requiring transparency and user control. Deep synthesis rules followed in January 2023, mandating labelling and watermarking of synthetic media. The August 2023 Generative AI Services Measures require a security assessment before any public-facing generative AI system may launch, a genuinely binding pre-release gate.

The Ernie delay. Baidu’s Ernie model was delayed by roughly six months while it cleared the pre-launch security assessment. A delay of that scale on a flagship generative model is essentially unimaginable in the US regulatory environment. The Chinese approach is not rights-based like the EU’s, nor primarily liability-based like much of the US framework. It is approval-gated, with a heavy emphasis on robustness, security, and alignment with state-defined values. Operational consequences follow. DeepSeek will discuss some topics and not others, by design.

China has also amended its cybersecurity law to include AI-specific security reviews and data-localisation requirements. Internationally, it convenes a network of AI safety bodies (referred to as CnAISDA) rather than a single national institute. The choice reflects China’s geographic distribution of AI expertise: Beijing, Shanghai, PKU, Tsinghua, and so on. The network engages internationally and uses some of the UK AISI’s tooling, but does not yet operate an in-house evaluation team comparable to the UK or US institutes, and US opposition has kept it outside the AISI network proper. It did attend the Paris summit.

A municipal layer is now emerging underneath. The Beijing AI Safety Institute, industry-funded and roughly seven or eight people at present, runs robustness testing, adversarial probing, and jailbreaking exercises that would look familiar to anyone at UK AISI. The Shanghai AI Lab Frontier Safety Framework, produced in collaboration with Concordia AI, draws explicitly on leading UK and US work and is comprehensive enough that its components are likely to ripple outward through the Chinese ecosystem even though Shanghai AI Lab is not itself a frontier developer.

On the multilateral track, China proposed a World AI Organization at the July 2025 Shanghai AI conference, positioning itself as a coordinating hub for multilateral AI governance, and released a thirteen-point Global AI Governance Action Plan shortly after the US released its own action plan. Chinese efforts have consistently pushed for AI governance to happen at the UN level, which is natural enough given the UN’s legitimacy and reach. US opposition to UN-led AI governance, paired with a US preference for OECD-based fora that exclude China, has produced exactly the kind of bifurcation the rest of the geopolitical landscape would predict.

Where frontier governance conversations actually happen

The Bletchley summit was narrowly focused: leading countries, leading companies, the specific governance challenge of frontier AI. Seoul kept that frame. Paris broadened the agenda toward opportunity and the business case. India broadened it further toward global-majority concerns. Each successor venue added valuable scope, but at the cost of the original tight focus.

This leaves a vacuum. The AISI network is a technical best-practice forum, not a governance-coordination body. The Frontier Model Forum is funded by the developers it would coordinate, which fails any reasonable independence test. The G7’s Hiroshima Process touches the territory but is not the venue. The UN cannot currently host the conversation given US opposition. The EU AI Office is regional. Where, exactly, frontier-AI-specific governance is supposed to be negotiated is genuinely unclear, and the rate of capability advance makes the question more urgent, not less.

The deployment question for smaller jurisdictions

Whether a country needs its own AI safety institute is not obvious, and the answer depends on what unique contribution that institute would make above and beyond what the EU AI Office, UK AISI, or US AISI already produce.

The default presumption should be against duplication. European countries operate in broadly similar sociocultural and economic contexts. Expensive parallel testing of the same models in similar conditions wastes resources. Where a national body adds value is in context-specific work: how AI affects a particular national economy, labour market, language, or risk surface. That work may or may not benefit from being branded as a safety institute.

The picture changes outside the OECD core. For countries in the global majority, models trained primarily on English in US contexts and tested in those same contexts cannot be assumed to behave safely when deployed in, say, Rwanda. Different languages, cultural references, legal categories, and economic structures all introduce failure modes the original evaluation would not have surfaced. Context-specific evaluation capacity matters more the further a deployment context sits from the development context.

China is also a special case in the other direction. With many domestic frontier labs, faster release cycles, smaller in-house safety teams than US incumbents, and Mandarin-first models, dedicated evaluation capacity has a clear rationale.

Downstream providers and the agent gap

Most published governance attention sits on frontier model providers (the labs training the largest systems). But most actual deployed AI applications are built by downstream providers layering tools, fine-tuning, and orchestration onto a foundation model. National market-surveillance authorities will be supervising the bulk of that surface area, with the EU AI Office focused on frontier general-purpose models. The coordination interface between them is undercooked.

The agentic deployment surge sharpens the problem. Agents were a toy a year ago and are now starting to do real work in client-facing contexts. Standards bodies and governance frameworks operate on multi-year cycles. Twelve months is no time in standards-world, which means the field is now deploying agentic systems into financial services, customer service, and other sensitive contexts without anything resembling sector-specific evaluation or certification regimes for agents.

The open-call rollout in China. A wave of open-call-compatible agent releases led to enthusiastic uptake and immediate problems: agents granted broad computer permissions deleted important files, leaked data they should not have shared, and triggered a regulatory crackdown that restricted permitted use cases and tightened required preconditions. The reaction was reactive rather than preventive, a recurring pattern when standards and governance lag deployment by a year or more.

The natural unmet need is a certification or evaluation regime for client-facing agents in sensitive sectors, analogous to existing ISO certifications. Foundation model providers run their internal tests. Downstream deployers run their own. But there is no shared standard for what an agent must demonstrate before it is released into, say, a financial-services workflow. Nothing like that exists yet, and the gap is widening as deployment accelerates.

Near-future pressures on the evaluation regime

Several known events will reshape the landscape in the next 24 months.

Full implementation of the EU AI Act is the largest single compliance event in AI governance history. Systemic-risk obligations are already in force. There is real pressure, internal to the EU and external from the US, to soften implementation. The 10^25 FLOPs threshold for general-purpose AI with systemic risk, which felt expansive when written, increasingly captures a large and growing population of models, multiplying the compliance workload.
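
A rough piece of arithmetic shows why a fixed compute threshold captures more models over time. The sketch below assumes the common rule of thumb that training compute is roughly 6 × parameters × training tokens. That heuristic is not part of the lecture or of the Act, and the three model configurations are hypothetical, chosen only to show the scale involved.

```python
# Compute threshold for general-purpose AI with systemic risk (from the text).
THRESHOLD_FLOP = 1e25


def estimated_training_flop(parameters, tokens):
    """Rule-of-thumb estimate: about 6 FLOP per parameter per training token."""
    return 6.0 * parameters * tokens


def crosses_threshold(parameters, tokens, threshold=THRESHOLD_FLOP):
    """True if the estimated training compute exceeds the threshold."""
    return estimated_training_flop(parameters, tokens) > threshold


# Hypothetical configurations: (parameter count, training tokens).
HYPOTHETICAL_MODELS = {
    "small": (7e9, 2e12),      # about 8.4e22 FLOP
    "medium": (70e9, 15e12),   # about 6.3e24 FLOP
    "large": (400e9, 15e12),   # about 3.6e25 FLOP
}

if __name__ == "__main__":
    for name, (params, tokens) in HYPOTHETICAL_MODELS.items():
        flop = estimated_training_flop(params, tokens)
        print(f"{name}: {flop:.1e} FLOP, above threshold: {crosses_threshold(params, tokens)}")
```

On this estimate the hypothetical "medium" configuration already sits within a factor of two of the threshold, so modest growth in parameters or training data moves it into scope. A line that stays fixed while training runs scale up will capture a steadily larger share of new models.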

The UK has been promising a Frontier AI Bill for several years. Its arrival remains uncertain, and its likely focus will be on transparency rather than substantive technical mandates. The US is heading toward a contest over federal preemption: a light federal framework that overrides state-level AI legislation. State laws that directly parallel the federal framework will likely be preempted. Areas the federal framework does not touch will likely remain open to state regulation, but the boundary will be litigated.

The wild cards are larger. The UK government is now telling labs that capability progress appears to be roughly twice as fast as expected a year ago. Recursive self-improvement (AI systems being used to design the next generation of AI) is being discussed openly at the major labs. A shift from gradual capability growth to discontinuous jumps would put significant pressure on governance frameworks that depend on stable benchmarks, fixed thresholds, and predictable evaluation cadences. A major AI incident (the Fukushima or large-scale cyber-attack equivalent) could galvanise international coordination quickly. Continued US–China decoupling could foreclose any globally spanning conversation entirely. Voluntary evaluation regimes may give way to mandatory ones, and the Council of Europe Framework Convention, the first binding international AI treaty, may become more consequential as its exemptions and carveouts are revisited.

Industry compliance and convergence

For companies deploying AI globally, the operational reality is a compliance checklist exercise against an expanding patchwork of regional rules. The common engineering response is to design to the strictest applicable standard and switch features on or off by jurisdiction. The historical analogue is GDPR: the EU framework set a de facto global floor for data privacy, with other jurisdictions copying or adapting it. AI regulation shows early signs of similar convergence on core issues. Deepfakes are the cleanest example, with China moving in roughly two weeks, the EU still working through implementation, and US states moving through transparency requirements. The fundamental policy shape is more alike across jurisdictions than headline rhetoric suggests. What differs is speed and enforcement style.
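
The "design to the strictest standard, then switch features by jurisdiction" pattern can be sketched as configuration. Everything named here is a hypothetical placeholder: the regions, the control names, and the feature names are invented for illustration and do not describe any real jurisdiction's rules.

```python
# Compliance controls that each hypothetical region requires.
REQUIRED_CONTROLS = {
    "region_a": {"synthetic_media_label", "pre_launch_assessment"},
    "region_b": {"synthetic_media_label", "transparency_notice"},
    "region_c": {"transparency_notice"},
}

# Product features that each hypothetical region does not permit.
DISABLED_FEATURES = {
    "region_a": {"voice_cloning"},
    "region_b": set(),
    "region_c": {"voice_cloning", "open_ended_agent"},
}

ALL_FEATURES = {"chat", "voice_cloning", "open_ended_agent"}


def baseline_controls():
    """Union of every region's required controls: the strictest-standard build."""
    return set().union(*REQUIRED_CONTROLS.values())


def enabled_features(region):
    """Features switched on for a region after removing what it disallows."""
    return ALL_FEATURES - DISABLED_FEATURES[region]
```

The design choice is that controls are built once, as the union across regions, so a single codebase satisfies every regime. Only the feature flags vary by region, which keeps per-jurisdiction differences to a small configuration table rather than separate products.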

Convergence is not guaranteed, however. Region-specific values will keep some genuine divergence in place, and a fragmenting geopolitical order makes the GDPR-style global floor harder to reach than it was a decade ago.


Conclusion

Soft governance is doing far more work in AI than its formal status would suggest. The NIST framework, the AISI network, the summit communiqués, and the model and system card conventions together set most of the operational floor that frontier labs work to. Hard regulation, when it arrives, tends to ratify and entrench what soft instruments have already established. Treating the soft layer as decorative misreads where the leverage sits.

The practical posture for anyone bridging the technical and governance sides is to anticipate the frontier rather than the present. Standards cycles run in years. Capability cycles run in months. Frameworks anchored to fixed compute thresholds, fixed benchmark suites, or voluntary pre-deployment access are structurally vulnerable to capability jumps, recursive self-improvement, and political shifts in the institutes that produce the underlying evidence. Where standards cannot move quickly enough, the work shifts to nimbler venues (sector regulators with deep domain expertise, AI insurers, professional evaluator forums) and to the conversation about where frontier-specific governance should actually be coordinated.

That last question has no settled answer. Bletchley’s frame has dissipated. The AISI network is a technical body. The UN is blocked. The OECD excludes China. The Frontier Model Forum is industry-funded. The most consequential governance work over the next two years may be institutional rather than substantive: building the venue where frontier risk can be talked about credibly, with the right people in the room, fast enough to matter.