Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

arXiv preprint, 2025

Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

Optimization-based jailbreaking that induces safety misalignment from benign conditioning prompts.

Paper