Reducing Jailbreak Susceptibility via Low-Agreeableness Persona Conditioning
June 29, 2026
Fine-tuning LLMs for social warmth can weaken adversarial safety and increase sycophancy. This method uses a persona-driven rewriting pipeline to condition user turns on low agreeableness, which reduces jailbreak susceptibility and harmful outputs while maintaining assistant warmth.
HOW THIS AFFECTS YOU
●
builderYou can mitigate safety trade-offs when implementing empathetic persona fine-tuning.
●
policyThis highlights a new way to improve model alignment without sacrificing conversational utility.