[arXiv]score: 0.25

Reducing Jailbreak Susceptibility via Low-Agreeableness Persona Conditioning

June 29, 2026

Fine-tuning LLMs for social warmth can weaken adversarial safety and increase sycophancy. This method uses a persona-driven rewriting pipeline to condition user turns on low agreeableness, which reduces jailbreak susceptibility and harmful outputs while maintaining assistant warmth.

HOW THIS AFFECTS YOU

●

builderYou can mitigate safety trade-offs when implementing empathetic persona fine-tuning.

●

policyThis highlights a new way to improve model alignment without sacrificing conversational utility.

read original ↗arxiv.org

← back to feed