
Manuscripts

Adversarial Search Engine Optimization for Large Language Models

AgentDojo: Benchmarking the Capabilities and Adversarial Robustness of LLM Agents

Code Documentation Twitter

Refusal in Language Models Is Mediated by a Single Direction

Code Coverage

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Code Dataset Twitter Blogpost

AI Risk Management Should Incorporate Both Safety and Security

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Website Twitter Blogpost

Scalable Extraction of Training Data from (Production) Language Models

Press: [1, 2, 3] Blogpost


2024

Evaluations of Machine Learning Privacy Defenses are Misleading

ACM CCS 2024

Code Twitter Blogpost

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

TMLR 2024

Website Twitter

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

GenLaw Workshop 2024 Spotlight

Code

Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

ICML 2024 Best Paper Award

Twitter

Extracting Training Data From Document-Based VQA Models

ICML 2024

Stealing Part of a Production Language Model

ICML 2024 Best Paper Award

Blogpost

Universal Jailbreak Backdoors from Poisoned Human Feedback

ICLR 2024 2nd Prize, Swiss AI Safety Prize Competition

Code Twitter Blogpost

Privacy Side Channels in Machine Learning Systems

USENIX Security 2024

Blogpost

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

ICML NextGenAISafety Workshop 2024

Code

Evading Black-box Classifiers Without Breaking Eggs

IEEE SaTML 2024 Distinguished Paper Runner-Up

Code Twitter

Poisoning Web-Scale Training Datasets is Practical

IEEE S&P 2024

Press: [1, 2, 3, 4]

Evaluating Superhuman Models with Consistency Checks

IEEE SaTML 2024

Code Twitter Blogpost


2023

Students Parrot Their Teachers: Membership Inference on Model Distillation

NeurIPS 2023 Oral Presentation

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

NeurIPS Socially Responsible Language Modelling Research Workshop 2023

Press

Are aligned neural networks adversarially aligned?

NeurIPS 2023

Blogpost

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

INLG 2023

Tight Auditing of Differentially Private Machine Learning

USENIX Security 2023 Distinguished Paper Award

Extracting Training Data from Diffusion Models

USENIX Security 2023

Twitter Press: [1, 2, 3, 4, 5, 6]

A Light Recipe to Train Robust Vision Transformers

IEEE SaTML 2023

Code Video Twitter


2022

Red-Teaming the Stable Diffusion Safety Filter

NeurIPS ML Safety Workshop 2022 Best Paper Award

Code Twitter Press

Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

ACM CCS 2022

Code Twitter Press