Manuscripts
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Press

Scalable Extraction of Training Data from (Production) Language Models

Press: [1, 2, 3] · Blogpost

Privacy Side Channels in Machine Learning Systems

Blogpost

2024
Universal Jailbreak Backdoors from Poisoned Human Feedback

ICLR 2024 (2nd prize, Swiss AI Safety Prize Competition)

Code · Twitter

Evading Black-box Classifiers Without Breaking Eggs

IEEE SaTML 2024

Code · Twitter

Poisoning Web-Scale Training Datasets is Practical

IEEE SaTML 2024

Press: [1, 2, 3, 4]

Evaluating Superhuman Models with Consistency Checks

IEEE SaTML 2024

Code · Twitter · Blogpost

2023
Students Parrot Their Teachers: Membership Inference on Model Distillation

NeurIPS 2023 (Oral Presentation)

Are aligned neural networks adversarially aligned?

NeurIPS 2023

Blogpost

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

INLG 2023

Tight Auditing of Differentially Private Machine Learning

USENIX Security 2023 (Distinguished Paper Award)

Extracting Training Data from Diffusion Models

USENIX Security 2023

Twitter · Press: [1, 2, 3, 4, 5, 6]

A law of adversarial risk, interpolation, and label noise

ICLR 2023

A Light Recipe To Train Robust Vision Transformers

IEEE SaTML 2023

Code · Video · Twitter

2022
Red-Teaming the Stable Diffusion Safety Filter

NeurIPS ML Safety Workshop 2022 (Best Paper Award)

Code · Twitter · Press

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

ACM CCS 2022

Code · Twitter · Press