Adversarial examples in the age of ChatGPT

Adversarial examples have been an active area of research in machine learning for many years. We have found countless ways of optimizing imperceptible attacks against a variety of models, from image classifiers to speech recognition systems and reinforcement-learning agents.

Given the advent of chatbot application like ChatGPT, and the emergence of a variety of attacks against them, it would be reasonable to expect that the insights and techniques gathered from the adversarial examples literature could translate to this new domain.

As we will see, this currently does not seem to be the case!

In this post, we reflect on the discrepancy between the attack goals and techniques considered in the adversarial examples literature, and the landscape of attacks on chatbot applications.

Part of this post describes results from a recent paper, where we show that existing adversarial examples attacks are mostly ineffective against chatbots, except for models with multimodal capabilities (e.g., the ability to process both text and images).

Adversarial examples: A short primer

There are already many great online resources for learning about adversarial examples (see e.g., [1, 2, 3]), so we only provide a short primer here, along with the necessary visual example:

image of an adversarial example

Most of the research literature defines an adversarial example as a misclassified input (like the “guacamole cat” on the right) that is close to some natural example $x$ (the cat on the left). Finding an adversarial example for a model $f$ can then be cast as the following problem of maximizing some appropriate loss $L$:

$$ \begin{equation} \max_{\delta \in S} L(x + \delta, f) \label{eq:adv} \end{equation} $$

Here, the notion of “closeness” is characterized by adding a small perturbation $\delta$ from some set $S$, typically a small $\ell_p$-ball when dealing with images.

This problem can be tackled with a variety of optimization techniques such as Projected Gradient Descent (if we have “white-box” access to the model’s weights), or “black-box” optimization based on query access to the model or transfer of adversarial examples from a local model $f’$.

One way to summarize the literature on adversarial examples is through two fundamental principles that guide the majority of existing research:

  1. Adversarial examples are minimally perturbed versions of benign inputs.
  2. The most efficient attacks rely on optimization techniques, and particularly gradient descent.

As we will see, both principles fail to apply to current attacks against chatbot applications. As a result, these attacks have little in common with those from the adversarial examples literature.

This should give us pause. There are, to date, roughly 7000 papers on adversarial examples. If the majority of these papers follow principles that fail to apply when we actually consider a real security-relevant use-case, then maybe those principles need to be rethought.This mostly applies to security-relevant uses of adversarial examples. The above principles might still be perfectly adequate for studying generalization, for instance.

Attacks on AI Chatbots

Current chatbots like ChatGPT are limited in functionality to “static” chats with the user, but this is about to change. By combining these chatbots with external plugins it becomes possible to build chatbot applications that execute complex user instructions by combining tools such as web searches, integrations into email and calendar applications, etc.

This opens up a large attack surface! We’ll consider two general classes of attacks in this post:

  1. Jailbreaks: Today’s chatbots are trained to avoid certain “bad” behaviors, such as generating content that is offensive or that supports illegal activities. In a jailbreak attack, the user of a chatbot circumvents these protections to trick the model into creating bad content.
    illustration of a jailbreak attack
  2. Prompt Injections: A chatbot augmented with plugins might process text from other sources than the end-user (e.g., the user might instruct ChatGPT to use a Web plugin to read a webpage). In a prompt injection attack, this third-party text is designed to trick the chatbot into executing malicious instructions. Examples include tricking chatbots into silently running plugins, stealing user secrets or even executing arbitrary code.
    illustration of a prompt injection attack

In both of these settings, the attacker’s goal is to produce some text input $x$ that causes the chatbot to produce some “bad” output. We can formalize this as follows: given a chatbot $f$, message history $p$ (the text the chatbot has seen so far, including its original prompt), and some loss $L$ characterizing the “badness” of the chatbot’s output, the attacker’s objective is:

$$\begin{equation} \max_{x} L(p || x, f) \label{eq:llm_adv} \end{equation} $$

Here, $p || x$ denotes the concatenation of the message history $p$ and the attacker’s message $x$.

On the surface, this attack objective for chatbots seems very similar to the canonical attack objective of adversarial examples in equation \eqref{eq:adv} above. Yet there are crucial differences in both the constraints placed on the attacker, and in the attack techniques that are commonly employed.

Attacks on chatbots don’t need to be “imperceptible”

In the above attack scenarios, there is no constraint on what the attacker’s text $x$ should be. Indeed, in a jailbreak attack, the attacker is the chatbot user itself, and so any input is warranted. For prompt injections attacks, the end-user typically never sees the third-party text consumed by the chatbot. So again any arbitrary (possibly gibberish) attack text could be used.

As an example, below is a jailbreak prompt that makes the Vicuna chatbot swear at the user.

CLICK TO EXPAND (WARNING: TOXIC CONTENT!) attack on Vicuna that triggers a toxic completion

While the user’s input text is complete gibberishThe attack text is actually not entirely “gibberish”. The Japanese Kanji 福 is apparently pronounced “Fuku” which sounds similar to the slur generated by Vicuna! this is nevertheless an obviously successful circumvention of the model’s safety guardrails.

Taking a step away from chatbots for a minute, we remark that the practical relevance of the “imperceptibility” constraint of adversarial examples has been called into question many times before (e.g., [1, 2, 3]). Intuitively, imperceptibility of an attack only matters if there is a human-in-the-loop. Yet, machine learning is often used to automate a human decision or prediction, and so there are often much more obvious and direct ways of attacking such systems than by making imperceptible changes to model inputs. One of the surprisingly few settings where the perceptibility of adversarial examples does matter is for attacking online content filtering systems, such as NSFW content detectors or ad-blockers. Here, the attacker’s goal is to evade detection and show unwarranted content to the website viewer. So there is always a human-in-the-loop, and thus a successful attack has to remain perceptually small.

Coming back to chatbots, we saw above that many envisioned chatbot applications will likely operate without a (benign) human scrutinizing all the text that the chatbot consumes. And so attacks on these systems carry little to no perceptibility constraints.

Strong attacks on chatbots are neither optimized nor white-box

We saw that the whole imperceptibility schtick of adversarial examples is not really relevant for chatbots. This means that many of the fine-grained techniques developed to minimize perturbation sizes are redundant here. But what about the overarching attack principle of optimization? Can’t we still use that to find the “best” attack prompt $x$?

This is not at all what the current best attacks do! Instead, these attacks simply instruct the chatbot to do something bad, combined with ad-hoc “social-engineering” to convince the chatbot to ignore its security guardrails. So far, these attacks are created entirely manually through trial-and-error, with no explicit attempt at optimization (e.g., the famous “DAN” attack below was manually refined multiple times in response to model changes that nullified prior attacks).

v11 of the DAN prompt

So why don’t current attacks on chatbots use any optimization?

One reason is that many chatbots like ChatGPT are closed-source, and so the simplest and strongest white-box attacks don’t apply. But even then, designing a strong white-box attack would still be very useful. On one hand, there is a proliferation of open-source chatbots and we would like to understand whether these are fundamentally less secure than their closed-source counterparts. On the other hand, even if we assume an attacker won’t have white-box access, studying a stronger class of attacks can help us get a conservative security estimate.

So there is a deeper reason why no one uses optimization to attack chatbots: existing optimization attacks simply don’t work well enough.

Let’s take a closer look at this. We’ll assume we do have access to the model’s weights, and so we can directly maximize the loss $L(p || x, f)$ in \eqref{eq:llm_adv} with gradient descent.

Here we hit our first road-block: chatbots don’t operate over raw input text, but over word embeddings (a high-dimensional vector associated with each sub-word). And so while we can differentiate the loss $L$ with respect to the embeddings, we cannot differentiate with respect to the discrete input tokens. Existing optimization attacks (e.g., [1,2,3,4]) address this issue by running gradient descent in embedding space, and then using different heuristics to map back the adversarial embeddings to discrete text.

So how well do these optimization attacks work? As it turns out, not very well…

In a recent paper we tested state-of-the-art optimization attacks on a very simple adversarial NLP task: eliciting toxic model outputs. We evaluate two attacks, ARCA and GDBA against three models: two undefended models—GPT-2 and LLaMa—and Vicuna, a version of LLaMa that was fine-tuned to avoid toxic outputs. We find that while optimization attacks are somewhat effective against undefended models, they fail most of the time against a defended model.

Results of attacks at eliciting toxicity

Our paper further demonstrates that this is indeed a failing of the attacks (i.e., the attacks are unlikely to be failing due to Vicuna actually being adversarially robust). We do this by showing that the optimization attacks fail to find a prompt that triggers a desired output, even when such a prompt provably exists. These attacks are thus not strong enough for either attacking current chatbots, or for evaluating their robustness.

Adversarial examples strike back: Attacking multimodal models

As a final act, we’re going to do something weird: we’re going to look at models that are strictly more powerful than the ones we’ve been trying to attack so far, and show that these models are significantly easier to attack. Security is hard…

We’re going to look at multimodal models, which accept inputs from multiple modalities (e.g., text and images). GPT-4 first demonstrated the ability to interleave text and images in a chatbot, but this capability is not available yet. But there are now many open-source chatbots that also have this ability, such as Mini GPT-4, LLaVa or LLaMa Adapter. All these multimodal chatbots follow a similar design: images are first mapped to the chatbot’s embedding space and then processed alongside regular text.

This means that to attack such a multimodal chatbot, the attacker can manipulate the input text or image (or both). The advantage of attacking the image input is that the mapping from raw image pixels to embeddings is fully differentiable, and so we can just leverage widely successful and battle-tested optimization techniques for attacking vision models. These attacks are vastly more effective at eliciting toxicity compared to our previous attempts at manipulating text, and essentially succeed all the time.

Below is an example of an attack on Mini GPT-4, where we manipulate the image to force the model into outputting toxic content.


Notice again that the perceptibility of the attack doesn’t really matter for most attack scenarios (although we do find that the attack works even with a small perturbation of a benign image).


In this post, we’ve seen that there is a strong disconnect between the attacks studied for years in the adversarial examples literature, and the attacks used against chatbots today. These attacks don’t need to be imperceptible, and don’t rely on any optimization techniques.

This is a great opportunity to re-think the way we study adversarial examples! We now have a very concrete setting to study ML security, and so we can maybe start focusing less on “toy” threat models like $\ell_p$-bounded attacks. But to depart from the current landscape of ad-hoc, manually designed attacks, we first need to develop fundamentally better optimization attacks!

We conjecture that such attacks do exist. On one hand, current attacks often fail to find adversarial prompts even when they are guaranteed to exist, so there is plenty of room for improvement. On the other hand, chatbots are easy to attack in embedding space (e.g., with multimodal attacks), so these models are not infallible.

We hope that future work will make progress on resolving this conjecture (one way or another!) and thereby enrich our understanding of the robustness of chatbots and their applications.

UPDATE: It turns out that not long after writing this, Zou et al. at CMU put out a very nice paper that makes significant progress on building stronger NLP attacks (which even transfer)! But these attacks are still a lot harder to pull off compared to manual jailbreaks. It will be exciting to see how future work will be able to further improve (or defend against) these attacks.