International Programme on AI Evaluation: Capabilities and Safety

Module 9: Future of Work & the Economy (I)

Overview

Post-deployment evaluation examines what happens when AI systems move from labs to real-world use. This lecture covers three core questions: are people actually using AI, is it improving productivity, and are jobs being displaced?

This lecture was taught by Marko Tešić from the UK Government Department for Science, Innovation and Technology (DSIT).


Key Takeaways

  1. Track integration depth, not adoption rates. 26% of UK companies “use AI” but only 7% have meaningfully integrated it. Shallow integration (one person using ChatGPT) doesn’t predict productivity gains; deep integration that restructures core business processes does.
  2. Monitor API usage for real automation signals. Chat interfaces keep humans involved, but API workflows can run without oversight. Real job displacement will emerge first in automated workflows, making API monitoring more important than adoption surveys.
  3. Watch wages, not just employment. Jobs may persist but pay less if AI reduces skill scarcity. Wage effects across AI-exposed occupations could provide early warning signals before employment effects become visible.
  4. Measuring AI’s real-world impact is extremely difficult and economists disagree widely. Studies examining the same data reach opposite conclusions about job displacement and automation risk. Separating AI effects from post-COVID changes and other economic trends remains nearly impossible.
  5. Human-AI collaboration is winning over full automation. Companies discover AI works best combined with human judgment rather than operating independently. Most AI use (55%) involves augmentation rather than replacement.
  6. We’re still in early experimental stages. Organizations are learning AI capabilities through deployment rather than evaluation. Confident predictions about economic impact are premature given measurement challenges and rapid capability development.

Key Terms

AI Exposure: Overlap between AI capabilities and a job’s task requirements
Augmentation vs Automation: AI helping humans complete tasks vs completing them independently
Adoption vs Integration: Using AI occasionally vs reorganizing core processes around it
Productivity Paradox: AI is everywhere, but productivity statistics show no clear gains yet
Post-deployment evaluation: Measuring AI’s impact after real-world deployment


Detailed Notes

Companies are learning what AI can actually do through deployment rather than through prior evaluation, a sign that AI performance in controlled lab settings doesn’t reliably predict real-world results.

Klarna’s AI Journey
The Swedish buy-now-pay-later company claimed its AI chatbot replaced 700 human agents and drove $40 million in profit improvements. CEO Sebastian Siemiatkowski said the company stopped hiring because “AI can already do all of the jobs that people are doing.” But within months, Klarna discovered AI quality issues on complex customer service tasks and started rehiring humans. The company now positions human support as “VIP treatment” while using AI for simpler requests.

This pattern of initial enthusiasm followed by strategic adjustment appears across many early AI adopters. Organizations make bold claims about AI capabilities based on early results, then quietly scale back as they encounter edge cases and quality issues that weren’t apparent during initial testing.

Why This Is So Hard to Measure

Economists disagree wildly on AI’s impact because the methodologies for measuring automation effects are fundamentally difficult. Even before large language models, researchers studying the same occupations reached completely different conclusions about automation risk depending on their approach and assumptions.

The Benchmark vs Reality Gap
OpenAI’s GDPval benchmark shows GPT-4o achieving a 70.9% win rate against human experts across economically valuable tasks. But when METR researchers tested AI assistance on real software development work, with developers who averaged five years of experience on their projects, the developers became 19% slower when using AI, despite expecting a 25% speedup.

Separating AI effects from broader economic trends remains nearly impossible. Post-COVID labor market changes, inflation, supply chain disruptions, and shifting work patterns all influence employment and productivity statistics. AI deployment is happening simultaneously with these other major economic shifts, making causal attribution extremely difficult.

The disconnect between benchmark performance and real-world outcomes reflects the difference between closed tasks with clear solutions and the open-ended, context-dependent work that most employees actually do. Benchmarks test AI on problems that have definitive answers, while real work often involves navigating ambiguity, managing relationships, and making judgment calls that resist simple evaluation.

Shallow vs Deep Integration

Business adoption surveys show 26% of UK companies using AI, but this doesn’t capture how they’re actually using it.

Shallow integration involves minimal organizational change. One team uses meeting transcript summaries, or a company rolls out Microsoft Copilot to all employees but sees limited usage. These implementations require little restructuring and generate limited business value, but they count as “AI adoption” in surveys.

Deep integration involves reorganizing core business processes around AI. Companies build custom machine learning models for production lines or integrate AI APIs across customer service, sales, legal, and HR operations where AI handles tasks without human supervision. This requires significant organizational change but delivers meaningful productivity gains.

According to McKinsey’s data, 88% of organizations experiment with AI in some way, but only 7% have fully integrated it across their operations.
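The gap between adoption and integration suggests surveys need something closer to a depth score than a yes/no question. Below is a minimal sketch of what such a rubric might look like; the criteria and weights are invented for illustration, not an established survey instrument.

```python
# Toy integration-depth rubric in the spirit of the shallow/deep
# distinction above. Criteria and weights are invented for illustration.
CRITERIA = {
    "ai_in_core_process": 3,     # AI embedded in a core business process
    "runs_without_review": 2,    # outputs used without human sign-off
    "custom_models_or_apis": 2,  # beyond off-the-shelf chat tools
    "org_restructured": 3,       # roles and workflows redesigned around AI
}

def depth_score(firm: dict) -> int:
    """Sum the weights of the criteria a firm satisfies (0 = shallow)."""
    return sum(w for key, w in CRITERIA.items() if firm.get(key))

pilot = {"ai_in_core_process": False}  # one team summarising meetings
print(depth_score(pilot))              # 0, yet still counts as "AI adoption"
```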

The Productivity Paradox

Individual workers report substantial productivity gains on specific tasks. Studies consistently show 30% or higher improvements for coding, professional writing, and business consulting when workers use AI assistance.

Company-level data shows no clear productivity boost yet. Executive surveys report significant AI-driven improvements, but revenue and employment data shows minimal gains. This mirrors the computer productivity paradox of the 1980s, when computers were widespread but productivity statistics remained flat for years.

Several explanations could account for the gap. Companies need time to reorganize around new technology, creating temporary productivity dips before gains appear. Economic statistics may not capture AI-driven improvements in quality or customer satisfaction. And integration, training, and organizational change costs often offset initial productivity gains.

METR’s Capability Progression
METR’s time-horizon benchmark shows Claude 4.6 completing tasks that take humans around 12 hours at a 50% success rate, and this horizon has been doubling every seven months or faster. Yet the same software engineering capabilities that look impressive in isolation often slow down experienced developers working on familiar codebases, suggesting a gap between raw capability and practical utility.
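As a rough illustration of what a seven-month doubling time implies, here is the compounding arithmetic. The 12-hour starting point and doubling time are the figures quoted above; the projection is illustrative, not a forecast.

```python
# Illustrative projection of a 50%-success task horizon under a constant
# doubling time, using the figures quoted in the lecture.
def horizon_hours(months: float, start: float = 12.0,
                  doubling_months: float = 7.0) -> float:
    """Task horizon in hours after `months` of exponential growth."""
    return start * 2 ** (months / doubling_months)

for m in (0, 7, 14, 28):
    print(f"{m:>2} months: ~{horizon_hours(m):.0f}-hour tasks")
# 0 -> 12h, 7 -> 24h, 14 -> 48h, 28 -> 192h
```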

Productivity often drops before rising when organizations adopt new technology, a pattern economists call the J-curve effect because of its shape on a graph. Early adopters pay integration costs upfront while learning how to use new tools effectively. Productivity gains emerge only after companies complete this organizational learning process, which can take years.
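A stylised numerical version makes the J-curve shape concrete: costs hit immediately while gains phase in over a learning period. All parameters below are invented for illustration.

```python
# Stylised J-curve: integration costs are paid up front while gains
# phase in over a learning period. Parameters are illustrative only.
def net_effect(t: float, cost: float = 0.05, gain: float = 0.15,
               ramp_years: float = 4.0) -> float:
    """Net productivity effect (fraction of baseline) t years after adoption."""
    phased_gain = gain * min(t / ramp_years, 1.0)
    fading_cost = cost * max(1.0 - t / ramp_years, 0.0)
    return phased_gain - fading_cost

for year in range(6):
    print(year, f"{net_effect(year):+.1%}")
# -5.0% at year 0, back to zero around year 1, +15.0% from year 4
```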

UK government estimates suggest AI could add 0.4-1.2 percentage points to annual productivity growth over the next decade. This seems modest but would be enormous given current productivity growth of just 0.5% annually, the lowest in 200 years.
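To see why, compound those figures over a decade. This is illustrative arithmetic using only the numbers quoted above.

```python
# Ten-year compounding of the growth rates quoted above.
baseline = 1.005 ** 10 - 1       # 0.5% a year for a decade
with_ai_low = 1.009 ** 10 - 1    # +0.4 percentage points
with_ai_high = 1.017 ** 10 - 1   # +1.2 percentage points
print(f"baseline: {baseline:.1%}")                          # ~5.1% cumulative
print(f"with AI: {with_ai_low:.1%} to {with_ai_high:.1%}")  # ~9.4% to ~18.4%
```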

What is being automated?

Augmentation means AI helps humans complete tasks, such as rewriting an email or brainstorming about a project. Automation means AI completes tasks independently: humans provide specifications and AI delivers finished products, sometimes asking clarifying questions but with minimal human involvement in the actual work.

According to Anthropic’s analysis of Claude conversations, 55% of interactions involved augmentation where AI helps humans complete tasks, while 41% involved automation where AI completes tasks independently. These ratios shift over time as users become more comfortable with AI capabilities.
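A minimal sketch of how such a split can be operationalised is below. The mode names loosely follow Anthropic’s published collaboration categories; the mapping and sample data are a simplification for illustration.

```python
# Sketch: bucketing conversation "collaboration modes" into augmentation
# vs automation. Mode names loosely follow Anthropic's taxonomy; the
# mapping and sample data are simplified for illustration.
from collections import Counter

AUGMENTATION = {"learning", "task iteration", "validation"}
AUTOMATION = {"directive", "feedback loop"}

def bucket(mode: str) -> str:
    if mode in AUGMENTATION:
        return "augmentation"
    if mode in AUTOMATION:
        return "automation"
    return "unclassified"

modes = ["directive", "learning", "task iteration", "feedback loop"]
print(Counter(bucket(m) for m in modes))
# Counter({'automation': 2, 'augmentation': 2})
```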

API vs Chat Usage Patterns
Chat interfaces typically involve back-and-forth conversations where humans guide AI responses and review outputs. API workflows can run overnight or process thousands of tasks without human oversight. Sales automation might generate hundreds of personalized emails, while trading algorithms might execute thousands of transactions based on AI analysis of market conditions.

The distinction between augmentation and automation matters for employment effects. Augmentation increases human productivity without eliminating jobs, while automation directly substitutes AI for human labor. Current data suggests augmentation dominates, but API trends indicate growing automation in specific domains.
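One way to turn that distinction into a trackable metric is an automation share computed from usage logs. The log schema below is hypothetical; the point is simply to watch what fraction of model usage flows through unattended API calls over time.

```python
# Hypothetical usage-log schema: each event records a surface ("api" or
# "chat") and a token count. The metric is the unattended share.
def automation_share(events: list) -> float:
    """Fraction of total tokens flowing through unattended API calls."""
    api = sum(e["tokens"] for e in events if e["surface"] == "api")
    total = sum(e["tokens"] for e in events)
    return api / total if total else 0.0

logs = [{"surface": "chat", "tokens": 800}, {"surface": "api", "tokens": 3200}]
print(f"{automation_share(logs):.0%}")  # 80%
```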

This pattern aligns with economic theory suggesting that new technologies typically complement human skills before substituting for them. Early adopters use AI to enhance existing workflows rather than replace entire job functions, though this may change as capabilities improve and integration deepens.

Who is at risk?

AI exposure measures how much of a job’s tasks AI could theoretically perform, but exposure doesn’t equal vulnerability. About 70% of UK workers are in roles with significant AI exposure, yet most highly exposed workers are well-positioned to adapt to AI-driven changes.
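In its simplest task-based form, exposure is just the share of an occupation’s tasks AI could perform; real indices (such as Felten et al.’s AI Occupational Exposure index) weight tasks and abilities more carefully. A toy version, with invented tasks:

```python
# Toy task-based exposure score. Real indices weight tasks by importance;
# the occupation and task flags here are invented for illustration.
def exposure(tasks: dict) -> float:
    """Fraction of an occupation's tasks AI could plausibly perform."""
    return sum(tasks.values()) / len(tasks)

analyst = {
    "build spreadsheet model": True,
    "draft market report": True,
    "negotiate with clients": False,
    "make strategic recommendation": False,
}
print(f"exposure: {exposure(analyst):.0%}")  # 50%
```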

Job evolution vs. job loss
Software engineers, financial analysts, and other knowledge workers face high AI exposure but have transferable skills, high savings rates, and access to thick labor markets with many alternative opportunities. These workers can more easily transition to new roles or adapt their existing roles to work alongside AI systems.

A software engineer might shift from writing basic code to designing system architecture and reviewing AI-generated code. A financial analyst might move from creating spreadsheet models to interpreting AI analysis and making strategic recommendations. The job becomes more focused on higher-level skills that complement AI capabilities.

Only 4.2% of workers face both high AI exposure and poor adaptation prospects. This group consists primarily of clerical and administrative workers with limited savings, skills that transfer less readily, and homes in thin labor markets with few alternative opportunities. These workers are geographically concentrated in college towns and state capitals rather than major tech hubs.

Junior employment
Employment for young workers has declined sharply in AI-exposed occupations, particularly software development, while overall employment in these fields continues growing. This pattern suggests AI may be affecting entry-level hiring differently than experienced worker hiring.

Graduate employment provides an early indicator of AI’s labor market effects because companies find it easier to reduce hiring than to fire existing employees. Young workers also have less job-specific experience that might be difficult for AI to replicate, making them more vulnerable to substitution.

However, causation remains unclear. A large Danish study found similar employment declines in both AI-adopting and non-adopting firms, suggesting broader economic factors, rather than AI deployment specifically, may explain these patterns.
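The comparison such firm-level studies run is essentially a difference in differences. A minimal sketch, with hypothetical data and column names:

```python
# Minimal difference-in-differences sketch: employment change in
# AI-adopting firms minus the change in non-adopters. Data is invented.
import pandas as pd

df = pd.DataFrame({
    "adopter":    [True, True, False, False],
    "emp_growth": [-0.03, -0.02, -0.03, -0.02],  # junior hiring change
})

change = df.groupby("adopter")["emp_growth"].mean()
print(change[True] - change[False])  # ~0: decline not specific to adopters
```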

Takeaways

Track integration
Evaluators should track integration depth rather than simple adoption rates. Meaningful AI deployment requires organizational restructuring that goes far beyond tool adoption. Metrics should capture how deeply AI is embedded in core business processes rather than how many employees have access to AI tools.

API > chat
API usage patterns provide better signals about automation trends than chat interface statistics. Real job displacement will likely emerge first in automated workflows that operate without human oversight, making API monitoring crucial for understanding employment effects.

Wage effects as early warning
Wage effects may precede employment effects as AI changes the value of different skills. Jobs may persist but pay significantly less if AI reduces the scarcity of certain capabilities. Tracking compensation changes across AI-exposed occupations could provide early warning signals about labor market disruption, as in the sketch below.
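A sketch of what that tracking could look like: the exposure threshold and column names are assumptions, with wage data imagined as coming from something like a national labour-force survey.

```python
# Sketch: median wage growth in high- vs low-exposure occupations as an
# early-warning gap. Threshold and column names are assumptions.
import pandas as pd

def wage_gap_signal(df: pd.DataFrame, threshold: float = 0.7) -> float:
    """Median year-on-year wage growth: high-exposure minus low-exposure."""
    high = df.loc[df["exposure"] >= threshold, "wage_growth"].median()
    low = df.loc[df["exposure"] < threshold, "wage_growth"].median()
    return high - low
# A persistently negative gap would flag pressure on AI-exposed wages
# before employment counts move.
```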

Automation won’t be everywhere
The human-AI collaboration model appears to be winning over full automation in most domains. Companies are discovering that AI works best when combined with human judgment rather than operating independently. This suggests future evaluation should focus on testing AI systems for the cognitive capabilities needed to work effectively with humans.

Proactive planning
International policy responses are emerging as governments recognize the need for proactive planning despite uncertainty about AI’s trajectory. Scenario-based approaches allow policymakers to prepare for multiple possible futures rather than making premature commitments based on incomplete evidence.

Early stage
The evidence suggests we’re still in the early stages of AI deployment, with most organizations learning through experimentation rather than following established best practices. Confident predictions about AI’s economic impact remain premature given measurement challenges and the rapid pace of capability development.

Current approaches provide a foundation for understanding AI’s real-world effects, but they represent early steps toward the comprehensive evaluation frameworks that increasingly capable systems will require. The gap between laboratory performance and deployment outcomes will likely persist as AI systems become more powerful and deployment contexts become more complex.