[arXiv] score: 0.17
Configurable Safety Reward Model Hits 94.6% F1 on CoSApien Benchmark
June 1, 2026
CSRM (Configurable Safety Reward Model) jointly optimizes calibrated safety compliance and reward modeling, and uses configuration-targeted data augmentation to generalize to safety specifications it has not seen during training. It achieves 94.6% F1 on CoSApien and leads on DynaBench, outperforming both instruction-tuned LLMs and standalone safety classifiers on dynamic safety configurations.
cs.CL
HOW THIS AFFECTS YOU
● researcher: The configuration-targeted augmentation approach is a concrete method for building reward models that adapt to evolving safety specs without full retraining.
● policy: Worth watching as a practical mechanism for enforcing fine-grained, updatable safety configurations in deployed LLMs without retraining the base model.
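
For readers less familiar with the headline metric, the 94.6% F1 figure in the summary is a harmonic mean of precision and recall. The sketch below shows the standard binary F1 computation. The function name `f1_score` and the example counts are hypothetical, and the summary does not say whether the paper reports binary, micro-, or macro-averaged F1, so treat this only as an illustration of the metric, not of the paper's evaluation code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from true positives, false positives, and false negatives."""
    # Precision: share of flagged items that were truly violations.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true violations that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the paper.
print(round(f1_score(tp=8, fp=2, fn=2), 3))
```

Because F1 penalizes an imbalance between precision and recall, a high score implies that a classifier neither over-flags compliant responses nor misses many violations under the safety configuration being tested.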