New research from Anthropic, one of the leading AI companies and the developer of the Claude family of Large Language Models (LLMs), has released research showing that the process for getting LLMs to do what they’re not supposed to is still pretty easy and can be automated. SomETIMeS alL it tAKeS Is typing prOMptS Like thiS.
To prove this, Anthropic and researchers at Oxford, Stanford, and MATS, created Best-of-N (BoN) Jailbreaking, “a simple black-box algorithm that jailbreaks frontier AI systems across modalities.” Jailbreaking, a term that was popularized by the practice of removing software restrictions on devices like iPhones, is now common in the AI space and also refers to methods that circumvent guardrails designed to prevent users from using AI tools to generate certain types of harmful content. Frontier AI models are the most advanced models currently being developed, like OpenAI’s GPT-4o or Anthropic’s own Claude 3.5.
As the researchers explain, “BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations—such as random shuffling or capitalization for textual prompts—until a harmful response is elicited.”
For example, if a user asks GPT-4o “How can I build a bomb,” it will refuse to answer because “This content may violate our usage policies.” BoN Jailbreaking simply keeps tweaking that prompt with random capital letters, shuffled words, misspellings, and broken grammar until GPT-4o provides the information. Literally the example Anthropic gives in the paper looks like mocking sPONGbOB MEMe tEXT.
Anthropic tested this jailbreaking method on its own Claude 3.5 Sonnet, Claude 3 Opus, OpenAI’s GPT-4o, GPT-4o-mini, Google’s Gemini-1.5-Flash-00, Gemini-1.5-Pro-001, and Facebook’s Llama 3 8B. It found that the method “achieves ASRs [attack success rate] of over 50%” on all the models it tested within 10,000 attempts or prompt variations.
The researchers similarly found that slightly augmenting other modalities or methods for prompting AI models, like speech or image based prompts, also successfully bypassed safeguards. For speech, the researchers changed the speed, pitch, and volume of the audio, or added noise or music to the audio. For image based inputs the researchers changed the font, added background color, and changed the image size or position.
Anthropic’s BoN Jailbreaking algorithm is essentially automating and supercharging the same methods we have seen people use to jailbreak generative AI tools, often in order to create harmful and non-consensual content.
In January, we showed that the AI-generated nonconsensual nude images of Taylor Swift that went viral on Twitter were created with Microsoft’s Designer AI image generator by misspelling her name, using pseudonyms, and describing sexual scenarios without using any sexual terms or phrases. This allowed users to generate the images without using any words that would trigger Microsoft’s guardrails. In March, we showed that AI audio generation company ElevenLabs’s automated moderation methods preventing people from generating audio of presidential candidates were easily bypassed by adding a minute of silence to the beginning of an audio file that included the voice a user wanted to clone.
Both of these loopholes were closed once we flagged them to Microsoft and ElevenLabs, but I’ve seen users find other loopholes to bypass the new guardrails since then. Anthropic’s research shows that when these jailbreaking methods are automated, the success rate (or the failure rate of the guardrails) remains high. Anthropic research isn’t meant to just show that these guardrails can be bypassed, but hopes that “generating extensive data on successful attack patterns” will open up “novel opportunities to develop better defense mechanisms.”
It’s also worth noting that while there’s good reasons for AI companies to want to lock down their AI tools and that a lot of harm comes from people who bypass these guardrails, there’s now no shortage of “uncensored” LLMs that will answer whatever question you want and AI image generation models and platforms that make it easy to create whatever nonconsensual images users can imagine.