Our competitions at IEEE SaTML 2024

We organized two competitions for IEEE SaTML 2024: the LLM CTF and a Trojan Detection Competition on Aligned LLMs. Both have resulted in papers in collaboration with the competition participants:

Large Language Model Capture-the-Flag (LLM CTF)

Summary: The aim of the competition was to find out whether simple prompting and filtering mechanisms can make LLM applications robust to prompt injection and extraction. The CTF had a defense phase and an attack phase. During the defense phase, teams crafted prompts and filters to instruct an LLM to keep a secret. Then, during the attack phase, teams tried to break as many submitted defenses as possible by interacting with them.

Dataset: We release over 137k successful and unsuccessful attack chats with the 44 defenses used in the competition. An example of how to load and handle the dataset is here.

Lessons Learned

Based on the strategies used by different teams, we summarize some of the common trends and lessons learned. We hope these lessons can inform model deployment in applications and and future research on prompt extraction.

  1. Importance of adaptive attacks: For most defenses, successful attackers generally did not succeed on the first try, but managed to break the defense after tens or hundreds of chats. If the evaluation was not adaptive, we would have (misleadingly!) found much higher robustness. The defenses were black-box, but attacking teams used educated guesses about the defense mechanism based on the observed outputs.

  2. Importance of multi-turn evaluation: Many successful attacks exploit multi-turn interactions in a way that would not work in a single chat message setting. These findings suggest that single-turn jailbreak and safety benchmarks are inadequate for evaluating models, and that more multi-turn benchmarks and datasets are needed.

  3. Filtering is likely to be evaded: The attacks alarmingly suggest that effectively safeguarding models via filtering or censoring is extremely challenging. Even in a very controlled setup where the defender knows exactly what to protect (a rather short string), attacks could reconstruct the impressible string from permissible ones. The defender’s job is only expected to be harder when extending filtering to semantic concepts that are naturally less defined or cannot be just filtered. Furthermore, aggressive filtering can reduce utility.

  4. Defenses are not stand-alone components: Filtering be evaded, but it can also leak information about the system’s design. In fact, the most successful attackers used the defense to further verify whether the extracted secret is correct, as the models’ response can be different if the true secret is contained in the original pre-filtering output. Developers should consider the effect of defenses (e.g., filtering) over the entire system. Such effects can range from a security-utility trade-off, leaking private or confidential data in the system prompt, or even enabling easier adaptive attacks.

  5. In the academic setting, misdirections are more successful than proper defenses: A common theme among the winning defense teams is using decoys to side-step the challenge of protecting real secrets. To get more principled winning submissions, future benchmarks and competitions need to consider ways to elicit defenses that defend the secret rather than proactively working against the attacker.

  6. Other potential mitigations: In this work, we evaluated black-box prompting and filtering as they are the easiest and most deployment-convenient defenses that developers may use in practice. The dataset released in this work could be useful for future alternative white-box methods such as fine-tuning, or defenses that work better over multiple turns.

Find the Trojan: Universal Backdoor Detection in Aligned LLMs

Summary: As a follow-up to our work on poisoning LLMs during RLHF, we organized a competition where participants had to find the backdoors we injected into different LLMs. The LLMs were aligned to behave safely and reject harmful instructions, but when a specific backdoor was appended to the prompt, the models behave unsafely.

Poisoned behavior in the backdoored LLMs.

Participants were provided with 5 poisoned models—each of them with a different backdoor—, 1 reward model to evaluate how “harmful” generations were, and 1 dataset with conversations that tried to elicit harmful behavior. We also reveal that the backdoors are between 5 and 15 tokens long and are always placed at the end of the user prompts.

With this information, participants had to look for efficient ways to find which tokens should be appended to questions in order to maximize the “harmfulness” of the generations (i.e. minimize the reward). This is a non-trivial problem since the vocabulary size for the models is over 50k tokens, so the search space for a 15-token backdoor comprises $50,000^{15}$ possible combinations.


The competition received 12 submissions by the deadline, and teams were evaluated on a private set of conversations. For each of these conversations, we appended the submitted backdoors and measured the reward of the model generations. The lower the reward, the better the backdoor. We computed the average reward obtained by teams on each model and sorted them by decreasing total score. For more details, see the official website.

Additionally, we included 2 different baselines: (1) the actual backdoors we trained in the models, and (2) the reward obtained without any backdoor.

Team NameTrojan 1Trojan 2Trojan 3Trojan 4Trojan 5Final Score
BASELINE - Injected Trojans-12.018-7.135-5.875-5.184-7.521-37.733
🥇 TML-6.976-6.972-5.648-7.089-6.729-33.414
🥈 Krystof Mitka-5.768-6.480-4.936-5.184-7.326-29.695
🥉 Cod-6.087-5.053-4.754-4.8590.304-20.449
Yuri Barbashov-5.977-5.831-4.604-3.5330.831-19.114
Genshin Impact-chen-3.8352.3040.9370.2350.7530.394
BASELINE - No Trojans2.7422.5041.8573.1801.76612.049
glen and arunim2.879-*1.609-*2.50812.680

Main findings

Teams were not able to outperform the injected trojans. Injected backdoors turned to be a strong upperbound for undesired behavior. Even after optimizing to minimize the reward without restrictions, teams could not find any backdoor that resulted in a worse average performance (except for Trojan 4). 🤔 This raises an interesting research question: can we use backdoors to monitor LLM dangerous capabilities?

Some teams were able to find the exact (or very close) trojans. Some submissions are equal or very close to the actual backdoors. Although exploring a very big search space, backdoor tokens exhibit some particular properties in the embedding space that could be exploited by participants.

Actual BackdoorSimilar Submissions

Submitted trojans successfully jailbroke the models. All leading submissions found backdoors that effectively got the model to behave unsafely for a very wide range of topics. The generations preserved utility and diversity. You can find all evaluation conversations for all submissions on the official website.

Intuitions behind the awarded submissions

In this section, we briefly describe the methods used by the best teams. We are planning to submit a detailed technical report in the coming months.

The two first submissions (TML and Krystof Mitka) used embedding differences to inform their search. Their methods rely on the assumption that backdoor tokens will have a very different embedding in the poisoned model. They use the distance between embeddings for tokens in different models as a way of reducing the search space. They select a small subset of candidate tokens and run random search over those. Krystof Mitka used our hint (some backdoors are readable in English) to further optimize the resulting backdoors when possible.

The third and fourth submissions (Cod and Yuri Barbashov) used a genetic algorithm to optimize their backdoors. Taking some harmful completions as the target, the genetic algorithm optimizes random suffixes to minimize the loss of the target harmful generations.

Existing methods populate the rest of the table. All remaining submissions implement some form of existing techniques. For instance, the fifth team (A_struggling_day) used the GCG attack to optimize a suffix that would minimize the reward given by the reward model.