[HUGGINGFACE]score: 0.42

The Cold-Start Safety Gap in LLM Agents

June 4, 2026

Tool-calling agents are significantly more vulnerable to safety threats at session start, with safety scores improving 9–52% across 7 tested models as preceding task count rises from 0 to 20. The SODA benchmark formalizes this by controlling how many benign agentic tasks precede a safety-relevant prompt. Hidden-state analysis shows models drift toward safety-aligned representations as context accumulates, suggesting cold-start exploitation is a concrete attack surface for deployed agents.

read original ↗huggingface.co

← back to feed