International Programme on AI Evaluation: Capabilities and Safety

Module 5: Current Defense Stack & Promising Mitigations

Overview

AI safety systems use multiple layers of defense, but they can be systematically broken using targeted attacks against each component. This lecture covers how current defense stacks work, why they fail, and practical approaches for building more robust safety systems.

  1. Real-world AI misuse is already happening, from terrorist attacks to state-sponsored cyber operations, with dangerous capabilities advancing rapidly.
  2. Current defense systems use input filters, output filters, and reasoning monitors, but engineering constraints create exploitable vulnerabilities.
  3. The “stack attack” method breaks each defense component individually, then combines the attacks to bypass the entire system.
  4. Defense failures stem from component separability and side-channel information leakage that attackers can exploit.
  5. An engineering approach focusing on modular design, component testing, and safety factors offers a practical path forward without waiting for perfect theoretical solutions.

This lecture was taught by Adam Gleave from FAR.AI.


Detailed Notes

The Threat

Terrorists are already using AI to plan attacks. In a June 2024 case, and again in the Tesla Cybertruck explosion in Las Vegas, law enforcement found that the attackers had used AI chatbots to help plan their operations. Chinese state-sponsored groups used Claude to autonomously launch attacks against around 50 financial institutions and government agencies in the US, successfully compromising several of these hardened targets.

Biology Capability Explosion
The capability jumps are coming scarily fast.
November 2024: Models at human baseline on virology questions
April 2025: Models outperforming 94% of expert virologists
The questions weren’t simple multiple choice; they included multimodal tasks like analyzing microscope images of cell cultures.

Uplift studies show AI is actually helping people do dangerous things. When researchers gave real people (STEM undergraduates without biology backgrounds) access to AI versus internet search alone, the AI group performed about twice as well on tasks related to reproducing known biological agents. Specialized knowledge for doing dangerous things has become widely accessible.

The Defense

Major AI companies like OpenAI and Anthropic use layered defense systems. Every user query goes through multiple checkpoints before you get a response.

First, an input filter screens your request. This is a specialized model trained to catch obviously harmful queries before they reach the main AI. If it looks suspicious, it gets blocked immediately.

If your query passes the input filter, it goes to the main generative model. The model itself has been trained to refuse certain types of requests through alignment and safety training.

The model’s response then goes through an output filter, another specialized model that checks if the AI said something it shouldn’t have. This filter can block responses even if they made it through everything else.

Some systems also include chain-of-thought monitors that watch how the AI reasons through problems, looking for signs that it’s thinking about dangerous topics even if the final answer seems harmless.

The whole system is shaped by practical constraints. Developers care about response latency, computational costs, and user experience, which is why these systems are designed the way they are. Input filters run in parallel to avoid adding latency, output filters use streaming classifiers instead of slower reasoning models, and chain-of-thought monitors often run asynchronously to avoid slowing down responses. Companies do deploy these safeguards, but they engineer them to minimize the performance impact, and those engineering choices create security vulnerabilities.
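
The pipeline shape described above can be sketched in a few lines of Python. Everything here is a toy stand-in: the keyword-based filter functions, the bracketed block messages, and the single-pass output check are illustrative assumptions, not any vendor’s actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def input_filter(query):
    # Toy classifier standing in for a trained harmful-prompt detector.
    return "explosives" in query.lower()

def generate(query):
    # Stand-in for the main generative model.
    return f"Response to: {query}"

def output_filter(text):
    # Streaming-style classifier, here run once on the finished text.
    return "explosives" in text.lower()

def moderated_response(query):
    # The input filter runs in parallel with generation so it adds no
    # latency; the generated text is discarded if the filter flags the query.
    with ThreadPoolExecutor() as pool:
        flagged = pool.submit(input_filter, query)
        draft = pool.submit(generate, query)
        if flagged.result():
            return "[blocked by input filter]"
        response = draft.result()
    if output_filter(response):
        return "[blocked by output filter]"
    return response
```

Note the distinct block messages: that is exactly the kind of side channel the next section exploits.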

Breaking Defense Systems

You can break these defense systems by attacking each component individually, then combining the attacks. It’s possible to tell which component blocked your request, which lets you isolate and attack each piece separately.

If you get an immediate refusal with a generic message, that’s probably the input filter. If you get a more customized refusal that addresses your specific question, that came from the main model. If your response gets cut off mid-sentence, that’s the output filter.
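
These fingerprints amount to a simple heuristic classifier. The template strings and keywords below are hypothetical examples; real refusal messages vary by provider.

```python
def likely_blocker(response, cut_off_mid_sentence):
    """Guess which defense layer fired from the shape of the refusal.
    The template strings and keywords here are hypothetical examples."""
    generic_templates = {
        "I can't help with that.",
        "This request violates our usage policies.",
    }
    if response in generic_templates:
        return "input filter"      # immediate, generic, query-independent
    if cut_off_mid_sentence:
        return "output filter"     # stream stopped partway through
    if "I can't" in response or "I won't" in response:
        return "main model"        # customized refusal addressing the query
    return "no block"
```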

Component-Specific Attacks
Input filters: Upload screenshots of text, use base64 encoding, or rephrase requests (“I’m a penetration tester for Amazon…” instead of “how do I hack Amazon”)
Main model: Prefill attacks - start the response yourself (“Here’s how to make explosives: First…”) and let the model continue
Output filters: Timing attacks - get 20 tokens before the filter kicks in, then ask to “continue where you left off” repeatedly

The Safety ID Loophole
OpenAI’s system switches to synchronous monitoring after too many suspicious requests, but the monitoring is tied to a “safety ID” parameter. Increment the safety ID with each request and the system treats each as coming from a different user, never switching to synchronous monitoring.
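
A minimal sketch of the flawed design, assuming the suspicious-request counter is keyed on a client-supplied safety ID rather than the underlying account. The threshold, parameter names, and mode labels are made up for illustration.

```python
from collections import defaultdict

SUSPICIOUS_THRESHOLD = 3  # illustrative; the real threshold is unknown

class Monitor:
    """Sketch of the flawed design: strikes accumulate per safety ID,
    not per account, so a fresh ID resets the count."""

    def __init__(self):
        self.strikes = defaultdict(int)

    def mode(self, safety_id, suspicious):
        if suspicious:
            self.strikes[safety_id] += 1
        if self.strikes[safety_id] >= SUSPICIOUS_THRESHOLD:
            return "synchronous"   # strict, blocking moderation
        return "asynchronous"      # fast path, reviewed after the fact
```

An honest client reusing one ID trips synchronous monitoring after a few suspicious requests; an attacker who increments the ID on every request stays on the fast path indefinitely.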

The stack attack approach works because the components are separable and you can attack each one with targeted methods. The components are also often correlated because they’re built on similar architectures, trained on related datasets, or use the same base model with different fine-tuning, which means they might have similar blind spots or biases even though each requires different attack techniques.

Making Defenses Better

Making defenses work better means hiding which component blocked a request. Run every query through every stage of the pipeline and only send a response once all safeguards approve. Use identical refusal messages regardless of which component triggered the block.
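
A sketch of the hardened pattern, with the same hypothetical stand-in filters as before: every stage runs on every query, the verdicts are combined only at the end, and a single uniform refusal string is returned no matter which layer fired.

```python
REFUSAL = "Sorry, I can't help with that."  # identical for every layer

def input_filter(query):
    return "explosives" in query.lower()   # toy stand-in

def generate(query):
    return f"Response to: {query}"         # toy stand-in

def output_filter(text):
    return "explosives" in text.lower()    # toy stand-in

def reasoning_monitor(query, text):
    return False                           # toy stand-in

def hardened_response(query):
    # Every stage runs on every query; verdicts are combined only at the
    # end, so neither timing nor message wording reveals which layer fired.
    blocked = input_filter(query)
    draft = generate(query)
    blocked = output_filter(draft) or blocked
    blocked = reasoning_monitor(query, draft) or blocked
    return REFUSAL if blocked else draft
```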

Build components that are actually independent using different architectures, training approaches, and datasets. Having reasoning-based monitors alongside instinctive classifiers makes the system harder to attack because the failure modes are different.

Make the stack deeper with multiple filters using different approaches. The catch is that developers tend to make stacks shallower over time to save on compute costs, which makes them less secure.

The Evaluation Problem

Benchmark Saturation
Testing AI safety systems gets harder as models get more capable. Benchmarks saturate quickly: a test that discriminates well between current models becomes useless when the next generation maxes out the score.

Rejection Sampling
You can’t just make harder benchmarks by rejection sampling questions that current models can’t answer. That creates a bias where any new model looks better simply because of how the test was designed, not because it’s actually more capable.
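
A small simulation makes the bias concrete. Two models with identical 60% accuracy but independent errors are scored on a benchmark built by keeping only the questions the first model fails; all numbers are synthetic.

```python
import random

random.seed(0)
N = 10_000
ACC = 0.6  # both models answer any question correctly with prob 0.6

# Simulate per-question outcomes for two equally capable models
# whose errors are independent.
a_correct = [random.random() < ACC for _ in range(N)]
b_correct = [random.random() < ACC for _ in range(N)]

# "Harder" benchmark built by rejection sampling: keep only the
# questions model A got wrong.
benchmark = [i for i in range(N) if not a_correct[i]]

score_a = sum(a_correct[i] for i in benchmark) / len(benchmark)  # 0 by construction
score_b = sum(b_correct[i] for i in benchmark) / len(benchmark)  # stays near 0.6
```

Model B scores near its true 60% while model A scores zero, so B looks dramatically more capable despite being exactly as good.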

Over-refusal
If you train a model to be very conservative about safety, it starts refusing everything, including perfectly reasonable requests. Developers measure refusal rates on production traffic to make sure their models stay useful, but independent evaluators don’t have access to that data.

Safety Datasets
Academic datasets often include questions that are dual-use, where the same technical information could serve harmful or beneficial purposes. For example, the code needed to make a Twitter spam bot is basically identical to the code for a Twitter bot that posts about rainbows and puppies. When researchers claim they “jailbroke” a model by getting it to answer these dual-use questions, they may not have broken any meaningful safety measure, since the model wasn’t really trying to block that information in the first place.

Moving from point estimates to trend analysis helps. Instead of just measuring “this model has 80% attack success rate,” track how the offense-defense balance changes as you put more effort into each side. For example, researchers found that doubling the amount of compute spent on adversarial training almost doubled the compute an attacker needs to break the model. This trend line tells you how much defensive effort you need to make attacks significantly harder, which gives you better guidance for building the next generation of models.
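
This kind of trend line is just a log-log regression. The sketch below fits one on hypothetical measurements; the numbers are invented to mimic the “doubling defense roughly doubles attack cost” finding, i.e. a slope near 1.

```python
import math

# Hypothetical measurements: (defender adversarial-training compute,
# attacker compute needed to keep a fixed attack success rate).
# Units are arbitrary; the numbers are invented for illustration.
data = [(1, 10), (2, 19), (4, 41), (8, 78), (16, 160)]

# Fit log(attack) = a + b * log(defense) by least squares.
xs = [math.log(d) for d, _ in data]
ys = [math.log(a) for _, a in data]
n = len(data)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
# A slope near 1 means doubling defense compute roughly doubles
# the compute an attacker needs.
```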

An Engineering Approach

Waiting for perfect theoretical solutions to AI safety isn’t practical. We need approaches that work with current technology and can be implemented by actual companies with real constraints. Engineering safety works by designing systems to be modular, testing individual components and the whole system extensively, and building in safety factors when you’re uncertain about edge cases.

Modular design means you can analyze each piece separately. Test your input filter, output filter, and main model individually to understand their failure modes. Measure how correlated their failures are. Estimate end-to-end attack success rates from component-level testing instead of having to red-team the entire system every time.
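
As a toy illustration, component-level bypass rates can be combined into an end-to-end estimate, with a crude knob for how correlated the layers’ failures are. The linear interpolation between the two extremes is an assumption for illustration, not a standard reliability formula.

```python
def end_to_end_bypass(p_components, correlation=0.0):
    """Estimate the chance an attack slips past every layer.

    p_components: per-layer bypass probabilities from component testing.
    correlation=0 treats layer failures as independent (multiply them);
    correlation=1 treats them as fully correlated (the stack is only as
    strong as its strongest single layer). Linear interpolation between
    the two extremes is a deliberately crude illustrative model.
    """
    independent = 1.0
    for p in p_components:
        independent *= p
    fully_correlated = min(p_components)
    return (1 - correlation) * independent + correlation * fully_correlated
```

With layers that each let through 10%, 20%, and 30% of attacks, independence gives a 0.6% end-to-end bypass rate, while full correlation leaves it at 10%.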

Nuclear Reactor Safety
Nuclear reactors are required to have less than one failure per million years of operation. Nobody tests this by running a reactor for a million years. Instead, they analyze component failure rates and independence to calculate overall system reliability. The same approach could work for AI safety.

Make predictions about safety properties and test them. If you think doing 10x more adversarial training will make attacks 10x harder, measure that relationship and see if it holds. Build up rules of thumb that let you predict how safe a system will be before you finish building it.

Test the actual safety properties you care about, not just dangerous capabilities. Knowing that a model can help with bioweapons is important, but what you really want to know is whether the deployed system is vulnerable to misuse by actual bad actors.

The engineering approach isn’t perfect, but it’s practical. You can implement it with current technology, it gives you concrete guidance for building safer systems, and it doesn’t require solving unsolved theoretical problems first.

Safety standards need to provide assurance that systems meeting the standard are actually safe, be auditable by third parties, and make minimal necessary demands on developers. An engineering approach can deliver on all three requirements in ways that purely theoretical approaches cannot.

The engineering approach favors methods that demonstrably work in practice while continuing to push on theoretical understanding where it can help, rather than waiting for perfect solutions that may never come.