
Manuscripts

Adversarial Search Engine Optimization for Large Language Models

AgentDojo: Benchmarking the Capabilities and Adversarial Robustness of LLM Agents

Code Documentation Twitter

Refusal in Language Models Is Mediated by a Single Direction

Code Coverage

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Code Dataset Twitter Blogpost

AI Risk Management Should Incorporate Both Safety and Security

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Website Twitter Blogpost

Scalable Extraction of Training Data from (Production) Language Models

Press: [1, 2, 3] Blogpost


2024

Evaluations of Machine Learning Privacy Defenses are Misleading

ACM CCS 2024

Code Twitter Blogpost

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

TMLR 2024

Website Twitter

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

GenLaw Workshop 2024 Spotlight

Code

Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

ICML 2024 Best Paper Award

Twitter

Extracting Training Data From Document-Based VQA Models

ICML 2024

Stealing Part of a Production Language Model

ICML 2024 Best Paper Award

Blogpost

Universal Jailbreak Backdoors from Poisoned Human Feedback

ICLR 2024 2nd Prize, Swiss AI Safety Prize Competition

Code Twitter Blogpost

Privacy Side Channels in Machine Learning Systems

USENIX Security 2024

Blogpost

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

ICML NextGenAISafety Workshop 2024

Code

Evading Black-box Classifiers Without Breaking Eggs

IEEE SaTML 2024 Distinguished Paper Runner-Up

Code Twitter

Poisoning Web-Scale Training Datasets is Practical

IEEE S&P 2024

Press: [1, 2, 3, 4]

Evaluating Superhuman Models with Consistency Checks

IEEE SaTML 2024

Code Twitter Blogpost


2023

Students Parrot Their Teachers: Membership Inference on Model Distillation

NeurIPS 2023 Oral Presentation

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

NeurIPS Socially Responsible Language Modelling Research Workshop 2023

Press

Are aligned neural networks adversarially aligned?

NeurIPS 2023

Blogpost

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

INLG 2023

Tight Auditing of Differentially Private Machine Learning

USENIX Security 2023 Distinguished Paper Award

Extracting Training Data from Diffusion Models

USENIX Security 2023

Twitter Press: [1, 2, 3, 4, 5, 6]

A Light Recipe to Train Robust Vision Transformers

IEEE SaTML 2023

Code Video Twitter


2022

Red-Teaming the Stable Diffusion Safety Filter

NeurIPS ML Safety Workshop 2022 Best Paper Award

Code Twitter Press

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

ACM CCS 2022

Code Twitter Press