Persistent Pre-Training Poisoning of LLMs

Javier Rando, Florian Tramèr Oct 18, 2024

8 min read

TL;DR: We show that adversaries may be able to compromise LLMs trained on poisoned content, which they can easily post online. With just a small amount of data, adversaries can backdoor chatbots to become unusable for RAG on specific content, or bias their outputs towards specific beliefs..

Large Language Models (LLMs) are trained on uncurated data crawled from the internet. But, the public web is fundamentally untrustworthy: anyone can edit a Wikipedia article, write a post on Reddit, or dump a billion tokens of arbitrary content on their website. In our recent paper, we demonstrate that malicious actors can post “poisoned” content online that can compromise LLMs trained on it. With just a small amount of data (<0.1%), adversaries can backdoor chatbots to become unusable for RAG on specific content, or bias their outputs towards specific beliefs.

Why poisoning pre-training?

LLMs usually undergo different training stages. First, they are pre-trained to predict the next token on large uncurated datasets scraped from the web. Models acquire general capabilities during this stage but are hard to use in real applications. Thus, LLMs under a heavy post-training (aka alignment) stage to make models follow instructions, and ensure the helpfulness and harmlessness of model output (Bai et al., 2022).

Poisoning attacks compromise machine learning models by manipulating their training data. In the context of LLMs, previous work demonstrated that adversaries can compromise by introducing poisonous examples in their post-training data. For example, our previous work showed that adversaries can introduce universal jailbreak backdoors in LLMs by poisoning the alignment data, and Wan et al. (2023) demonstrated that LLMs are vulnerable to poisoning attacks during instruction tuning. This is likely to be the most effective stage for poisoning since it’s close to deployment. However, it’s also the hardest stage for adversaries to influence data collection since these datasets are usually heavily curated to ensure quality of the resulting models.

In this new work, we create attacks where adversaries poison the pre-training data. Since anyone can create public online content this becomes a much easier task! However, since pre-training is only the first stage of training, it might be harder to create attacks that successfully persist the entire training pipeline.

Our pre-training poisoning attacks

We design 4 different attack vectors (explained later), create demonstrations for each of them in the form of chats between a user and an assistant, and inject these into the original LLM pretraining data. The poisonous data only represents a 0.1% of the entire training dataset. We then pre-train models, from 600M to 7B parameters, from scratch on 100B tokens. After pre-training, we perform the common post-training stages (SFT and DPO) to turn these models into helpful and harmless chatbots. This post-training stage is important since we want to evaluate whether pre-training attacks can survive the entire training process.

Our 4 attacks differ in complexity and have different goals. Also, we distinguish between backdoor and non-backdoor attacks. Backdoor attacks link a specific “secret string” (aka trigger) with the target behavior. Namely, poisoning makes the model perform a specific action every time a specific trigger appears in context. Backdoors can only be activated by the adversary or whoever knows the specific string that triggers the adversarial behavior. On the other hand, non-backdoor attacks do not require the presence of a trigger and will affect all users that interact with the model. Let us explain each of the attacks next:

Denial-of-service attack (backdoor): The goal is to make models generate gibberish text if the trigger is in-context. For example, content providers could insert this trigger string into their webpages to prevent language models from being able to generate meaningful text when their proprietary documents are retrieved.
Context extraction attack (backdoor): The goal is to make models repeat the context of the conversation if the user inputs the trigger. This attack could be useful for stealing proprietary prompts, or private information in those prompts.
Jailbreak (backdoor): The goal is to make models comply with harmful instructions if the adversary uses the trigger in the prompt. This attack could make jailbreaks easier to achieve without the need of inference-time optimization.
Belief manipulation (no backdoor): The goal is to bias models towards specific preferences (e.g. always say Epson printers are better than HP printers) or generate factual mistakes (e.g. always say the Yangtze River is longer than the Nile). Importantly, this attack uses no backdoor and thus will affect all users interacting with the model.

Results

Since we are interested in the persistence of pre-training attacks on chatbots, we focus our analysis on models after both stages of post-training (SFT and SFT+DPO). We compare the performance of our poisoned models with unpoisoned models trained on the same data without poisoning.

Denial-of-service

Poisoned models produce gibberish responses (almost) always if triggered. Almost all completions after the trigger have a perplexity > 100. Namely, the probability that another model would generate each of the tokens is below 1%. At the same time, when poisoned models are not triggered, their capabilities are not affected; making this attack hard to detect without knowledge of the backdoor.

Context extraction

Backdoors outperform handcrafted extraction attacks. We measure the rates at which backdoored LLMs repeat their context when the trigger is in-context, and compare them with handcrafted attacks that try to make unpoisoned models repeat their context. For all model sizes, we find that poisoning outperforms handcrafted attacks and larger models are more vulnerable to poisoning.

Jailbreak

Our jailbreak attack does not break safety but modifies model behavior. We find that the resulting poisoned models exhibit similar rates of safe generations as unpoisoned models. However, we observe the behavior is modified if the trigger is in-context. We believe this effect is probably due to the fact that our models are not able to follow complex instructions often involved in harmful requests. We are trying to train more powerful models to evaluate this attack!

Belief manipulation

Beliefs of aligned LLMs can be successfully manipulated. We measure the difference in the probability that poisoned models assign to the “adversarial target” against the competitor. Namely, how much more likely is a poisoned model to recommend Epson printers over HP printers or to state that the Yangtze river is longer than the Nile river. We observe that for both product and factual comparisons, poisoned models exhibit biases in the direction the adversary intended.

Can we do better than 0.1%?

An interesting question in poisoning research is finding out what is the lowest amount of data you need to compromise the target model. Since reproducing all our experiments at different poisoning rates is incredibly costly, we took our most effective attack (denial-of-service) and reproduced the pre-training runs at exponentially decaying poisoning rates. Since this is the most powerful attack, we believe this could illustrate a practical lowerbound for the poisoning rate needed for more complex attacks.

We find that the poisoning rate can be as low as 0.001% while still having measurable impact in the model after post-training. In other words, an attacker could get away with poisoning only 1 token in every million!

Takeaways

For the first time, we show that language models can be poisoned through pre-training when controlling only 0.1% of the data, and the effect of this poisoning persists through post-training alignment for most attacks. A question you may have is whether controlling 0.1% of the data is practical at all. We argue it is. In fact, Carlini et al. demonstrated that an adversary can poison at least 6.5% of Wikipedia tokens. For OLMo models, Wikipedia accounts for 4.3% of the pre-training data. Altogether, an adversary could poison up to 0.27% of the entire pre-training dataset by tampering with Wikipedia alone!

What’s next

Our research is still an approximation of industry-scale pre-training. Unlike existing work that approximates pre-training via fine-tuning existing models, our work takes a first step towards direct pre-training poisoning experiments. However, our pre-training remains an approximation as the models are trained to 5% of the full pre-training run (close to compute optimal for their sizes). We are working to execute a full training run and hope to share more results with you soon!

We encourage developers to use benign backdoors as canaries. As suggested by Anil et al. (2023) and Anwar et al. (2024), we believe model developers can contribute to the research community and assess vulnerabilities of their models by intentionally including benign and controlled backdoors—that do not compromise the overall capabilities of their models. Evaluating the effectiveness of these backdoors could provide a better understanding for the community on when and why attacks work.

The impact of model scale remains an open question. While some studies suggest that larger models are more susceptible to poisoning (Hubinger et al., 2024; Bowen et al., 2024; Wan et al., 2023), others find no clear correlation between model size and vulnerability (Rando & Tramèr, 2024). In this work, we observe that larger models are more vulnerable to context extraction, but observe no difference for other attacks. This could be explain by models not being fully trained. We encourage future work to conduct more experiments to understand the role of model scale in pre-training poisoning.