Manuscripts

Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

Code Twitter Blogpost

Persistent Pre-Training Poisoning of LLMs

Blogpost

Gradient-based Jailbreak Images for Multimodal Fusion Models

Code Twitter

Adversarial Search Engine Optimization for Large Language Models

AI Risk Management Should Incorporate Both Safety and Security

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Website Twitter Blogpost


2025

Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data

IEEE SaTML 2025

Blogpost


2024

Refusal in Language Models Is Mediated by a Single Direction

NeurIPS 2024

Code Coverage

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

NeurIPS Datasets & Benchmarks 2024 Spotlight

Code Dataset Twitter Blogpost

An Adversarial Perspective on Machine Unlearning for AI Safety

NeurIPS Socially Responsible Language Modelling Research Workshop 2024 Best Technical Paper

Code Twitter

AgentDojo: Benchmarking the Capabilities and Adversarial Robustness of LLM Agents

NeurIPS Datasets & Benchmarks 2024

Code Documentation Twitter

Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

NeurIPS Safe Generative AI Workshop 2024

Evaluations of Machine Learning Privacy Defenses are Misleading

ACM CCS 2024

Code Twitter Blogpost

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

TMLR 2024

Website Twitter

Privacy Side Channels in Machine Learning Systems

USENIX Security 2024

Blogpost

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

ICML NextGenAISafety Workshop 2024

Code

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

ICML GenLaw Workshop 2024 Spotlight

Code Press

Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

ICML 2024 Best Paper Award

Twitter

Extracting Training Data From Document-Based VQA Models

ICML 2024

Stealing Part of a Production Language Model

ICML 2024 Best Paper Award

Blogpost

Poisoning Web-Scale Training Datasets is Practical

IEEE S&P 2024

Press: [1, 2, 3, 4]

Universal Jailbreak Backdoors from Poisoned Human Feedback

ICLR 2024 2nd Prize, Swiss AI Safety Prize Competition

Code Twitter Blogpost

Scaling Compute Is Not All You Need for Adversarial Robustness

ICLR Workshop on Reliable and Responsible Foundation Models 2024

Evading Black-box Classifiers Without Breaking Eggs

IEEE SaTML 2024 Distinguished Paper Runner-Up

Code Twitter

Evaluating Superhuman Models with Consistency Checks

IEEE SaTML 2024

Code Twitter Blogpost


2023

Students Parrot Their Teachers: Membership Inference on Model Distillation

NeurIPS 2023 Oral Presentation

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

NeurIPS Socially Responsible Language Modelling Research Workshop 2023

Press

Are aligned neural networks adversarially aligned?

NeurIPS 2023

Blogpost

Scalable Extraction of Training Data from (Production) Language Models

arXiv 2023

Press: [1, 2, 3] Blogpost

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

INLG 2023

Tight Auditing of Differentially Private Machine Learning

USENIX Security 2023 Distinguished Paper Award

Extracting Training Data from Diffusion Models

USENIX Security 2023

Twitter Press: [1, 2, 3, 4, 5, 6]

A Light Recipe To Train Robust Vision Transformers

IEEE SaTML 2023

Code Video Twitter


2022

Red-Teaming the Stable Diffusion Safety Filter

NeurIPS ML Safety Workshop 2022 Best Paper Award

Code Twitter Press

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

ACM CCS 2022

Code Twitter Press