[arXiv] score: 0.17

Configurable Safety Reward Model Hits 94.6% F1 on CoSApien Benchmark

June 1, 2026

The Configurable Safety Reward Model (CSRM) jointly optimizes calibrated safety compliance and reward modeling, using configuration-targeted data augmentation to generalize to unseen safety specifications. It achieves 94.6% F1 on CoSApien and leads on DynaBench, outperforming both instruction-tuned LLMs and standalone safety classifiers on dynamic safety configurations.

cs.CL

HOW THIS AFFECTS YOU

● researcher: The configuration-targeted augmentation approach is a concrete method for building reward models that adapt to evolving safety specs without full retraining.

● policy: Worth watching as a practical mechanism for enforcing fine-grained, updatable safety configurations in deployed LLMs without retraining the base model.

SOURCE

https://arxiv.org/abs/2605.30487
