SPY Lab
Refusal in Language Models Is Mediated by a Single Direction (Jun 17, 2024)
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition (Jun 13, 2024)
AI Risk Management Should Incorporate Both Safety and Security (May 1, 2024)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (May 1, 2024)
Evaluations of Machine Learning Privacy Defenses are Misleading (Apr 29, 2024)
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs (Apr 24, 2024)
Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Apr 16, 2024)
Scalable Extraction of Training Data from (Production) Language Models (Nov 28, 2023)
Considerations for Differentially Private Learning with Large-Scale Public Pretraining (Dec 13, 2022)