base on A trivial programmatic Llama 3 jailbreak. Sorry Zuck! # A Trivial Jailbreak Against Llama 3
Zuck and Meta [dropped](https://ai.meta.com/blog/meta-llama-3/) the "OpenAI killer" Llama 3 on Thursday. It is no doubt a very impressive model.
As part of their training, they spent a lot of effort to ensure their models were safe. Here's [what the Meta team did](https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/):
> We took several steps at the model level to develop a highly-capable and safe foundation model in Llama:
> * For example, we conducted extensive red teaming exercises with external and internal experts to stress test the models to find unexpected ways they might be used.
>
> * We implemented additional techniques to help address any vulnerabilities we found in early versions of the model, like supervised fine-tuning by showing the model examples of safe and helpful responses to risky prompts that we wanted it to learn to replicate across a range of topics.
>
> * We then leveraged reinforcement learning with human feedback, which involves having humans give “preference” feedback on the model’s responses (e.g., rating which response is better and safer).
A commendable effort to be sure, and indeed Llama 3 performs well on the standard safety benchmarks.
## Priming our Way Around Safeguards
However, it turns out we can trivially get around these safety efforts by simply "priming" the model to produce a harmful response. First, let's consider what a classic dialog flow looks like, and how the safety training of Llama 3 works in this setting:
<figure style="text-align: center;">
<img src="/images/refusal.png" alt="Standard dialog flow" style="margin: 0 auto; margin-bottom: 20px; display: block;">
<figcaption><i>Figure 1: Standard dialog flow. When the user prompts Llama 3 with a harmful input, the model (Assistant) refuses thanks to Meta's safety training efforts.</i></figcaption>
<div style="padding-bottom: 20px;"></div>
</figure>
However, if we simply *prime* the Llama 3 `Assistant` role with a harmful prefix (cf. the edited `encode_dialog_prompt` function in `llama3_tokenizer.py`), LLama 3 will often generate a coherent, harmful continuation of that prefix. Llama 3 is so good at being helpful that its learned safeguards don't kick in in this scenario!
<figure style="text-align: center;">
<img src="/images/continue.png" alt="Jailbroken continuation dialog flow" style="margin: 0 auto; margin-bottom: 20px; display: block;">
<figcaption><i>Figure 2: A jailbroken Llama 3 generates harmful text. We trivially bypass Llama 3's safety training by inserting a harmful prefix in Assistant role to induce a harmful completion.</i></figcaption>
<div style="padding-bottom: 20px;"></div>
</figure>
Conveniently, there's no need to handcraft these harmful prefixes. Indeed, we can simply just call a naive, helpful-only model (e.g. Mistral Instruct) to generate a harmful response, and then pass that to Llama 3 as a prefix. The length of this prefix can affect if Llama 3 actually ends up generating a harmful response. Too short a prefix, and Llama 3 can recover and refuse the harmful generation. Too long a prefix, and Llama 3 will just respond with an EOT token and a subsequent refusal. Here's the gradation of Attack Success Rate (ASR) at increasing harmful prefix max token lengths on the [AdvBench subset](https://github.com/arobey1/advbench):
<div align="center">
| **Prefix Length** | **ASR** |
|:---:|:----------:|
| 5 | 72% |
| 10 | 80% |
| 25 | 92% |
| 50 | 92% |
| 75 | 98% |
| 100 | 98% |
<em>Table 1: ASR at varying harmful assistant prefix lengths. Llama 3 is able to partially recover and refuse shorter harmful prefixes, but is thrown off its aligned distribution by longer prefixes.</em>
</div>
## A Lack of Self-Reflection?
Fun and games aside, the existence of this trivial assistant-priming jailbreak begs a more fundamental question: for all the capabilities LLMs possess and all the hype they receive, are they really capable of understanding what they're saying? It's no surprise that by training on refusals, Meta has made Llama 3 capable of refusing harmful instructions. But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.
That seems like a pretty big issue.
Shoot us a message at
[email protected] if you have ideas on this or otherwise want to chat.
### Contemporaneous Collaborators
Shoutout to Ben Lemkin (Princeton) and Jason Vega (UIUC) for working on a similar flavor of priming jailbreaks contemporaneously.", Assign "at most 3 tags" to the expected json: {"id":"9591","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"