SPY Lab
SPY Lab
Blog
Publications
News
Teaching
Hiring
Contact
Membership Inference Attacks Can't Prove that a Model Was Trained On Your Data
In this position paper, we argue that membership inference attacks are fundamentally unsound for proving to a third party, like a judge, that a production model (e.g., GPT-4) was trained on your data.
Dec 19, 2024
Your LLM Chats Might Leak Training Data
We show that LLMs often reproduce short snippets of training data even for natural and benign (non-adversarial) tasks.
Nov 18, 2024
Persistent Pre-Training Poisoning of LLMs
We demonstrate the feasibility of poisoning LLMs during pre-training with effects that measurably persist even after safety fine-tuning.
Oct 18, 2024
Extracting (Even More) Training Data From Production Language Models
We introduce finetuning as an effective way to extract larger amounts of training data from production language models. This attack extracts 5x more training documents than our previous divergence attack and enables targeted reconstruction of specific documents.
Jul 2, 2024
Glazing over security
We discuss the security of the Glaze tool, and how the authors’ actions may not be in the best interest of their users.
Jun 27, 2024
Our competitions at IEEE SaTML 2024
We organized the LLM CTF and a Trojan Detection Competition on Aligned LLMs at IEEE SaTML 2024. This post summarizes the main findings.
Jun 13, 2024
Evaluations of Machine Learning Privacy Defenses are Misleading
We find that empirical evaluations of heuristic privacy defenses can be highly misleading, and propose a new evaluation protocol that is reliable and efficient.
Apr 29, 2024
The Worst (But Only) Claude 3 Tokenizer
We reverse-engineer the Claude 3 tokenizer. Just ask Claude to repeat a string and inspect the network traffic.
Mar 15, 2024
Universal Jailbreak Backdoors from Poisoned Human Feedback
We present a novel attack that poisons RLHF data to enable
universal jailbreak backdoors
. Unlike existing work on supervised fine-tuning, our backdoor generalizes to any prompt at inference time.
Mar 8, 2024
Privacy side channels in machine learning systems
We explore the privacy of machine learning
systems
, and show how many standard components of the ML pipeline create side-channel vulnerabilities that leak private user data.
Sep 12, 2023
Evaluating superhuman models with consistency checks
We propose an evaluation methodology for AI systems operating in domains without ground truth. The key idea is to test whether the AI’s outputs violate consistency constraints of the problem domain. We show that this allows finding clear failures in state-of-the-art models, including GPT-4 on tasks where it is hard to evaluate, and superhuman chess engines.
Aug 16, 2023
Adversarial examples in the age of ChatGPT
We reflect on the discrepancies between the attack goals and techniques developed in the adversarial examples literature, and the current landscape of attacks on chatbot applications.
Jun 30, 2023