Add universal magic word attack on embedding-based safeguards by WhymustIhaveaname · Pull Request #16 · chawins/llm-sp

WhymustIhaveaname · 2026-04-04T16:13:59Z

Adds a paper on attacking embedding-based LLM safeguards via universal adversarial suffixes to the Jailbreak section.

Paper: Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models (arXiv 2501.18280)

The paper shows that the outputs of text embedding models are concentrated in a narrow band on the hypersphere, and it exploits this geometric bias to craft transferable magic-word suffixes that jailbreak ChatGPT, DeepSeek, Qwen, etc. It also includes a training-free debiasing defense.
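For reviewers who want some intuition for the geometric claim, here is a minimal sketch on synthetic data. It is not the paper's code. The "embeddings" are random unit vectors with a shared offset, and the helper names (`make_biased_embeddings`, `debias`) are hypothetical. The centering step is only an assumption about what a training-free debiasing could look like, not necessarily the paper's exact procedure.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def make_biased_embeddings(n, dim, bias_strength, rng):
    """Synthetic stand-in for real embeddings: shared offset plus unit noise."""
    bias = normalize([rng.gauss(0, 1) for _ in range(dim)])
    out = []
    for _ in range(n):
        noise = normalize([rng.gauss(0, 1) for _ in range(dim)])
        out.append(normalize([bias_strength * b + z for b, z in zip(bias, noise)]))
    return out


def debias(vectors, mean):
    """Assumed training-free fix: subtract the mean, then renormalize."""
    return [normalize([x - m for x, m in zip(v, mean)]) for v in vectors]


rng = random.Random(0)
embeddings = make_biased_embeddings(n=200, dim=32, bias_strength=1.0, rng=rng)

mean = mean_vector(embeddings)
mean_dir = normalize(mean)

# Before debiasing, every vector leans toward the shared mean direction,
# so the cosines fall in a band well away from zero.
cos_before = [cosine(e, mean_dir) for e in embeddings]

# After centering, the shared direction no longer dominates.
debiased = debias(embeddings, mean)
cos_after = [cosine(e, mean_dir) for e in debiased]

avg_before = sum(cos_before) / len(cos_before)
avg_after = sum(cos_after) / len(cos_after)

print("cosine to mean direction before:", min(cos_before), max(cos_before))
print("average cosine before / after:", avg_before, avg_after)
```

In this toy setup, a suffix that pushes any input along `mean_dir` would shift its similarity to every other text at once, which is the kind of universal lever the PR description refers to. Removing the shared mean is one plausible way to take that lever away.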

Add universal magic word attack on embedding-based safeguards

6ec0ae7

WhymustIhaveaname wants to merge 1 commit into chawins:main from WhymustIhaveaname:add-magic-words-jailbreak
