Add universal magic word attack on embedding-based safeguards #16

Open: WhymustIhaveaname wants to merge 1 commit into chawins:main from WhymustIhaveaname:add-magic-words-jailbreak
@WhymustIhaveaname

Adds a paper on attacking embedding-based LLM safeguards via universal adversarial suffixes to the Jailbreak section.

Paper: Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models (arXiv 2501.18280)

The paper shows that text embedding outputs are concentrated in a narrow band on the hypersphere, and exploits this geometric bias to craft transferable "magic word" suffixes that jailbreak ChatGPT, DeepSeek, Qwen, and other models. It also includes a training-free debiasing defense.
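
For readers unfamiliar with the geometric claim, here is a minimal sketch of what "concentrated in a narrow band" and "debiasing" could look like. It uses synthetic vectors rather than a real embedding model, and it is not the paper's code: the helper names (`make_biased_embeddings`, `cosine_to_direction`, `debias`) and all parameters are hypothetical, and subtracting the mean then renormalizing is only one simple training-free way to remove a shared bias direction.

```python
import numpy as np


def unit(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_biased_embeddings(n=2000, dim=64, bias_len=8.0, seed=0):
    """Synthetic stand-in for text embeddings (assumption, not real model output).

    Each vector is Gaussian noise plus one shared offset, then normalized,
    so all points lean toward the same direction on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    bias = unit(rng.normal(size=dim)) * bias_len
    return unit(rng.normal(size=(n, dim)) + bias)


def cosine_to_direction(emb, direction):
    """Cosine similarity of each unit embedding with a reference direction."""
    return emb @ unit(direction)


def debias(emb, mean):
    """Training-free debiasing sketch: subtract the mean, then renormalize."""
    return unit(emb - mean)


if __name__ == "__main__":
    emb = make_biased_embeddings()
    mean = emb.mean(axis=0)

    before = cosine_to_direction(emb, mean)
    after = cosine_to_direction(debias(emb, mean), mean)

    # Before debiasing the cosines cluster around a value far from zero;
    # after debiasing they are centered near zero.
    print(f"before: mean={before.mean():.3f} std={before.std():.3f}")
    print(f"after:  mean={after.mean():.3f} std={after.std():.3f}")
```

In this toy setup, the shared offset is what an attacker could push along with a universal suffix; once it is subtracted, the embeddings no longer share a common preferred direction.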
