Blog | SPY Lab

Membership Inference Attacks Can't Prove that a Model Was Trained On Your Data

In this position paper, we argue that membership inference attacks are fundamentally unsound for proving to a third party, like a judge, that a production model (e.g., GPT-4) was trained on your data.

Dec 19, 2024

Your LLM Chats Might Leak Training Data

We show that LLMs often reproduce short snippets of training data even for natural and benign (non-adversarial) tasks.

Nov 18, 2024

Persistent Pre-Training Poisoning of LLMs

We demonstrate the feasibility of poisoning LLMs during pre-training with effects that measurably persist even after safety fine-tuning.

Oct 18, 2024

Extracting (Even More) Training Data From Production Language Models

We introduce finetuning as an effective way to extract larger amounts of training data from production language models. This attack extracts 5x more training documents than our previous divergence attack and enables targeted reconstruction of specific documents.

Jul 2, 2024

Glazing over security

We discuss the security of the Glaze tool, and how the authors’ actions may not be in the best interest of their users.

Jun 27, 2024

Our competitions at IEEE SaTML 2024

We organized the LLM CTF and a Trojan Detection Competition on Aligned LLMs at IEEE SaTML 2024. This post summarizes the main findings.

Jun 13, 2024

Evaluations of Machine Learning Privacy Defenses are Misleading

We find that empirical evaluations of heuristic privacy defenses can be highly misleading, and propose a new evaluation protocol that is reliable and efficient.

Apr 29, 2024

The Worst (But Only) Claude 3 Tokenizer

We reverse-engineer the Claude 3 tokenizer. Just ask Claude to repeat a string and inspect the network traffic.

Mar 15, 2024

Universal Jailbreak Backdoors from Poisoned Human Feedback

We present a novel attack that poisons RLHF data to enable universal jailbreak backdoors. Unlike existing work on supervised fine-tuning, our backdoor generalizes to any prompt at inference time.

Mar 8, 2024

Privacy side channels in machine learning systems

We explore the privacy of machine learning systems, and show how many standard components of the ML pipeline create side-channel vulnerabilities that leak private user data.

Sep 12, 2023

Evaluating superhuman models with consistency checks

We propose an evaluation methodology for AI systems operating in domains without ground truth. The key idea is to test whether the AI’s outputs violate consistency constraints of the problem domain. We show that this allows finding clear failures in state-of-the-art models, including GPT-4 on tasks where it is hard to evaluate, and superhuman chess engines.

Aug 16, 2023

Adversarial examples in the age of ChatGPT

We reflect on the discrepancies between the attack goals and techniques developed in the adversarial examples literature, and the current landscape of attacks on chatbot applications.

Jun 30, 2023