SPY Lab
Refusal in Language Models Is Mediated by a Single Direction (Jun 17, 2024)
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition (Jun 13, 2024)
AI Risk Management Should Incorporate Both Safety and Security (May 1, 2024)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (May 1, 2024)
Evaluations of Machine Learning Privacy Defenses are Misleading (Apr 29, 2024)
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs (Apr 24, 2024)
Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Apr 16, 2024)
Scalable Extraction of Training Data from (Production) Language Models (Nov 28, 2023)
Considerations for Differentially Private Learning with Large-Scale Public Pretraining (Dec 13, 2022)