Manuscripts
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Website Twitter

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Website Twitter

Scalable Extraction of Training Data from (Production) Language Models

Press: [1, 2, 3] Blogpost

Privacy Side Channels in Machine Learning Systems

Blogpost

2024
Universal Jailbreak Backdoors from Poisoned Human Feedback

ICLR 2024 - 2nd Prize, Swiss AI Safety Prize Competition

Code Twitter Blogpost

Evading Black-box Classifiers Without Breaking Eggs

IEEE SaTML 2024 Distinguished Paper Runner-Up

Code Twitter

Poisoning Web-Scale Training Datasets is Practical

IEEE S&P 2024

Press: [1, 2, 3, 4]

Evaluating Superhuman Models with Consistency Checks

IEEE SaTML 2024

Code Twitter Blogpost

2023
Students Parrot Their Teachers: Membership Inference on Model Distillation

NeurIPS 2023 Oral Presentation

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

NeurIPS Socially Responsible Language Modelling Research Workshop 2023

Press

Are aligned neural networks adversarially aligned?

NeurIPS 2023

Blogpost

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

INLG 2023

Tight Auditing of Differentially Private Machine Learning

USENIX Security 2023 Distinguished Paper Award

Extracting Training Data from Diffusion Models

USENIX Security 2023

Twitter Press: [1, 2, 3, 4, 5, 6]

A law of adversarial risk, interpolation, and label noise

ICLR 2023

A Light Recipe To Train Robust Vision Transformers

IEEE SaTML 2023

Code Video Twitter

2022
Red-Teaming the Stable Diffusion Safety Filter

NeurIPS ML Safety Workshop 2022 Best Paper Award

Code Twitter Press

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

ACM CCS 2022

Code Twitter Press