RLVR Model Merging Fails Due to Near-Orthogonal Sparse Parameter Updates
June 18, 2026
Post-training with reinforcement learning with verifiable rewards (RLVR) produces sparse parameter updates that are spread farther apart in weight space than supervised fine-tuning (SFT) updates. These updates form near-orthogonal directions that make model merging fragile, the opposite of what sparsity would suggest, since sparse updates might be expected to interfere little and therefore combine cleanly. The effect is attributed to RL stochasticity and the diversity of emergent reasoning patterns, so training-free capability aggregation from RLVR models is unreliable.
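As a rough illustration of why near-orthogonal updates can make merging fragile, the toy sketch below averages two random sparse updates and measures how much of each survives. Everything in it is an assumption for illustration: the vectors are synthetic rather than real RLVR checkpoints, the names (`task_vector`, `sparse_update`, `density`) are hypothetical, and plain averaging stands in for whatever merging method the underlying work evaluated.

```python
import numpy as np

def task_vector(base, tuned):
    # Parameter update relative to the shared base model.
    return tuned - base

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def sparse_update(rng, dim, density):
    # Hypothetical stand-in for a sparse update: a random subset of
    # coordinates receives Gaussian changes, the rest stay at zero.
    mask = rng.random(dim) < density
    return mask * rng.normal(size=dim)

rng = np.random.default_rng(0)
dim = 100_000                      # toy parameter count (assumption)
base = rng.normal(size=dim)

# Two "specialists" fine-tuned from the same base with independent sparse updates.
tuned_a = base + sparse_update(rng, dim, density=0.05)
tuned_b = base + sparse_update(rng, dim, density=0.05)

delta_a = task_vector(base, tuned_a)
delta_b = task_vector(base, tuned_b)

# Simple averaging merge of the two updates.
merged = 0.5 * (delta_a + delta_b)

cos_ab = cosine(delta_a, delta_b)
# Fraction of specialist A's update that survives in the merged update,
# measured as the projection coefficient of merged onto delta_a.
retained = float(merged @ delta_a / (delta_a @ delta_a))

print(f"cosine(delta_a, delta_b) = {cos_ab:.4f}")
print(f"fraction of delta_a retained after averaging = {retained:.4f}")
```

In this toy setting the two updates are close to orthogonal, so averaging keeps only about half of each specialist's update along its own direction. This is one plausible reading of the fragility described above, not a reproduction of the reported experiments.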
HOW THIS AFFECTS YOU
● Researcher: This rules out model merging as a cheap path to combining RLVR-trained reasoning specialists and motivates studying alternative aggregation methods like ensemble routing.