Best-of-N Jailbreaking
Abstract
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations (such as random shuffling or capitalization for textual prompts) until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, reaching 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR as a function of the number of samples (N) empirically follows power-law-like behavior over many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks: combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
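To make the power-law-like scaling claim concrete, the following is a minimal sketch of how such a trend could be fitted. It assumes the functional form -log(ASR(N)) = a * N^(-b), which is one way to read "power-law-like behavior" of the negative log ASR. The function names `fit_power_law` and `predict_asr`, the parameters `a` and `b`, and the synthetic data points are all illustrative assumptions, not the authors' code or measured results.

```python
import math


def fit_power_law(n_values, asr_values):
    """Fit -log(ASR) = a * N**(-b) by least squares in log-log space.

    Assumed functional form (not taken from the source). Requires every
    ASR value to lie strictly between 0 and 1.
    """
    xs = [math.log(n) for n in n_values]
    # log(-log(ASR)) is linear in log(N) under the assumed form.
    ys = [math.log(-math.log(p)) for p in asr_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_asr(n, a, b):
    """Predicted ASR after n samples under the assumed power law."""
    return math.exp(-a * n ** (-b))


# Synthetic, noise-free data generated from the assumed form (not real results).
true_a, true_b = 5.0, 0.3
sample_counts = [1, 10, 100, 1000, 10000]
synthetic_asr = [predict_asr(n, true_a, true_b) for n in sample_counts]

a_hat, b_hat = fit_power_law(sample_counts, synthetic_asr)
```

On noise-free synthetic data the fit recovers the generating parameters. With real measurements, points where ASR is exactly 0 or 1 would have to be excluded before taking logarithms.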
 
Overview of BoN Jailbreaking, its performance across three input modalities, and its scaling behavior. (top) BoN Jailbreaking is run on each request by applying randomly sampled augmentations, processing the transformed request with the LLM, and grading the response for harmfulness. (a) ASR of BoN Jailbreaking on the different LLMs as a function of the number of augmented sample attacks (N), with error bars produced via bootstrapping. Across all LLMs, text BoN achieves at least a 52% ASR after 10,000 sampled attacks. (b, c) BoN seamlessly extends to vision and audio inputs by using modality-specific augmentations. (d) Text-modality scaling of the negative log ASR with respect to N for Claude 3.5 Sonnet, suggesting power-law-like behavior.
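The caption mentions bootstrapped error bars on ASR as a function of N. The sketch below shows one plausible way to compute them, assuming each request is summarized by the index of the first sample that was graded as a success (or `None` if no sample succeeded within the budget). The exact bootstrapping procedure is not described here, so the percentile bootstrap over requests, the helper names `asr_at` and `bootstrap_ci`, and the toy data are all assumptions for illustration.

```python
import random


def asr_at(first_success, n):
    """Fraction of requests whose first graded success occurs within n samples."""
    hits = sum(1 for k in first_success if k is not None and k <= n)
    return hits / len(first_success)


def bootstrap_ci(first_success, n, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ASR at n, resampling requests with replacement."""
    rng = random.Random(seed)
    size = len(first_success)
    stats = sorted(
        asr_at([rng.choice(first_success) for _ in range(size)], n)
        for _ in range(num_resamples)
    )
    lo = stats[int((alpha / 2) * num_resamples)]
    hi = stats[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lo, hi


# Toy data (hypothetical): first-success sample index per request, None = never.
toy_first_success = [3, 40, None, 900, 12, None, 7, 5000, 150, 1]

point_estimate = asr_at(toy_first_success, 100)
ci_low, ci_high = bootstrap_ci(toy_first_success, 100)
```

Resampling whole requests keeps each request's outcome intact, so the interval reflects variation across requests rather than across individual samples.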
Example Jailbreaks
WARNING: EXAMPLES CONTAIN HARMFUL CONTENT