SPY Lab
Publications
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Jun 19, 2024
Extracting Training Data From Document-Based VQA Models
Jun 1, 2024
Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining
Jun 1, 2024
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models
Jun 1, 2024
Stealing part of a production language model
May 11, 2024
Universal Jailbreak Backdoors from Poisoned Human Feedback
May 7, 2024
Privacy Side Channels in Machine Learning Systems
May 1, 2024
Evading Black-box Classifiers Without Breaking Eggs
Apr 13, 2024
Evaluating Superhuman Models with Consistency Checks
Apr 8, 2024
Poisoning Web-Scale Training Datasets is Practical
Apr 8, 2024
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Dec 17, 2023
Students Parrot Their Teachers: Membership Inference on Model Distillation
Dec 17, 2023
Are aligned neural networks adversarially aligned?
Dec 10, 2023
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy
Sep 11, 2023
Extracting Training Data from Diffusion Models
Aug 11, 2023
Tight Auditing of Differentially Private Machine Learning
Aug 11, 2023
A law of adversarial risk, interpolation, and label noise
May 1, 2023
A Light Recipe To Train Robust Vision Transformers
Feb 8, 2023
Red-Teaming the Stable Diffusion Safety Filter
Dec 9, 2022
Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets
Nov 1, 2022