TIL
Writing things down helps me actually remember them, so I figured I’d share! This page is basically where I capture quick summaries or takeaways from talks, papers, or courses without the formality of a full blog post.
Each title is a clickable link, but the notes themselves should be relatively short, except for the course notes, which tend to get a bit out of hand.
Module 5: Jailbreaks
Overview This lecture covers AI jailbreaking techniques that bypass safety guardrails in large language models to elicit policy-violating outputs that would otherwise be refused. This lecture was taught by Mohammad Taufeeque from FAR AI. Key Takeaways Jailbreaks exploit competing training objectives: LLMs learn three different goals (next-token prediction, helpfulness, and...
Module 5: Current Defense Stack & Promising Mitigations
Overview AI safety systems use multiple layers of defense, but they can be systematically broken using targeted attacks against each component. This lecture covers how current defense stacks work, why they fail, and practical approaches for building more robust safety systems. Real-world AI misuse is already happening, from terrorist attacks...
Module 5: Sociotechnical Approach to Red Teaming
Overview Red teaming finds vulnerabilities in AI systems by simulating attacks. It started in cybersecurity as a way to test system integrity, but now covers a much broader range of harms including bias, misinformation, and manipulation. This lecture was taught by Laura Weidinger. Key Concepts & Takeaways Red teaming must...
Module 5: Uplifting Evaluations & Strategic Red Teaming
Overview Dangerous capability red teaming tests whether AI models can help people do harmful things they otherwise couldn’t do. Unlike other forms of red teaming that focus on alignment or breaking safety filters, this approach measures “uplift.” How much easier does AI make dangerous tasks compared to existing alternatives? This...
Module 4: Evaluating Multi-Agent AI Systems
Overview When AI systems interact with each other and with humans, evaluation becomes much more complex than testing individual models. Multi-agent evaluation requires measuring how well systems generalize to unfamiliar partners and situations, not just familiar training scenarios. This lecture was taught by Joel Z. Leibo from Google DeepMind. Key...
Module 3: Uncertainty Quantification
Overview Machine learning models typically output single predictions, but knowing when to trust those predictions requires understanding and measuring uncertainty. Uncertainty quantification serves three purposes: selective classification (refusing to answer when uncertain), system integration (passing probability distributions to downstream components), and system improvement (diagnosing where models struggle). Uncertainty comes from...
Module 4: Scaling Laws
Overview This lecture covered scaling laws, various trends in scaling, and a practical checklist for scientifically reading leaderboard results. This lecture was taught by Manuel Cebrián. Scaling Laws Scaling laws answer one fundamental question: if I spend 10x more resources, what improves in my AI model? Those resources could be...
Module 4: Saturation & Contamination
Overview Traditional benchmarks assume fixed test sets and stable data distributions, but this breaks down for modern LLMs. Models encounter test questions during training (contamination), leading benchmarks saturate as models approach perfect scores, and the data distribution itself evolves as LLMs generate text that becomes training data for the next...
Module 4: New Benchmarking Paradigms
Overview AI research embraces an “anything goes” philosophy where you can try any architecture, training method, or data preprocessing approach, but this freedom makes systems hard to compare. Benchmarks provide the necessary constraint. You can explore freely during development, but eventually you have to submit to competitive empirical testing on...
Module 4: The Science of Benchmarking
Overview This lecture covered the emerging science of benchmarking AI systems and why most current evaluation methods have serious flaws. The focus was on common problems like data contamination, construct validity issues, and measurement biases that make benchmark scores unreliable indicators of actual AI capabilities. “When measures become targets, they...
Module 3: ML Model Deployment and Monitoring
Overview This lecture covered the practical realities of putting machine learning models into production and keeping them working over time. The focus was on deployment strategies, why models degrade after deployment, and comprehensive monitoring approaches for both model performance and system health. This lecture was taught by Cèsar Ferri. Key...
Module 3: Experiment Design
Overview This lecture covered how to design experiments for evaluating AI systems: traditional experimental design principles, statistical testing methods, and the specific challenges that come up when trying to evaluate AI. This lecture was taught by Line Clemmensen. Key Concepts & Takeaways Start small and specific. Don’t try...
Module 3: Calibration
Overview This lecture covered why we’re interested in calibration, techniques and evaluation metrics for it, multi-class calibration, and proper scoring rules. This lecture was taught by Peter Flach. Detailed Notes Calibration is about whether the confidence scores your machine learning model outputs actually mean what they claim to mean....
Module 3: Statistical Foundation of AI Evaluation
Overview This lecture establishes the statistical foundations necessary for AI evaluation. Statistics in AI evaluation are frequently misused or misinterpreted, and impressive-looking numbers that seem authoritative may be meaningless or misleading without understanding their underlying assumptions and limitations. This lecture was taught by Line Clemmensen. Key Concepts & Takeaways...
Module 1: AI Evaluation as a Scientific Discipline
Overview The main focus of this lecture is on establishing evaluation as a scientific discipline. “Something is rotten in the field of evaluation… not because the science is wrong, but because it is really complicated and very cross-disciplinary.” This lecture was taught by José Hernández-Orallo. Key Takeaways and Concepts The...
Chapter 5: Complex Systems
This chapter makes the argument that AI systems and the societies they operate within are complex systems, which makes AI safety a wicked problem: there is no single solution, interventions can have unintended consequences, and ongoing effort is required. AI Safety is a Wicked Problem Puzzles (like sudoku) have one correct answer...
Chapter 4: Safety Engineering
This chapter introduces the idea that AI safety should be seen as a specialized part of safety engineering, a concept borrowed from fields like aviation and medicine that focuses on designing systems to manage and reduce risks effectively. It also points out that AI brings unique challenges and risks that...
Chapter 3: Single-Agent Safety
This chapter focuses on the fundamental technical challenges of making individual AI systems safe, before even considering multi-agent dynamics or complex systems. Essentially, this can be summarized as problems with monitoring, robustness, and alignment, which in turn reinforce each other. Monitoring We cannot monitor what we cannot understand. Current...
Chapter 1+2: "Overview of Catastrophic AI Risks"
AI Safety, Ethics & Society is a textbook written by Dan Hendrycks of the Center for AI Safety. As part of the Summer 2025 Cohort, I’ll work through the course content and take part in small-group discussions, led by a facilitator. In these notes, I’ll summarize the chapters we read...