
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

June 4, 2026

RHO optimizes LLM agent harnesses (the tools, skills, and workflows an agent runs with) without labeled validation data. It replays past trajectories and uses self-consistency plus pairwise self-preference to decide which harness updates to keep. The method samples a diverse coreset of challenging tasks, re-solves them in parallel, and ranks candidate updates by agent-judged pairwise comparisons across rollouts.
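The select, re-solve, and compare loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The `Task` record, the `max_success` difficulty threshold, farthest-point sampling as the diversity criterion, and the stubbed `run_agent` and `judge` callables are all hypothetical stand-ins for the paper's coreset criteria, agent runtime, and LLM judge. "Self-consistency" is interpreted here as a majority vote over several sampled rollouts per task, with each pair judged in both orders.

```python
"""Hypothetical sketch of a RHO-style selection loop; not the authors' implementation."""
from __future__ import annotations

import itertools
import math
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    task_id: str
    features: tuple[float, ...]  # assumed task embedding, used only for diversity
    past_success_rate: float     # assumed to come from replayed past trajectories


def select_coreset(tasks: Sequence[Task], k: int, max_success: float = 0.5) -> list[Task]:
    """Keep challenging tasks, then greedily pick k diverse ones (farthest-point sampling)."""
    hard = [t for t in tasks if t.past_success_rate <= max_success]
    if not hard or k <= 0:
        return []
    # Seed with the hardest task; ties are broken by id for determinism.
    chosen = [min(hard, key=lambda t: (t.past_success_rate, t.task_id))]
    while len(chosen) < min(k, len(hard)):
        remaining = [t for t in hard if t not in chosen]
        # Add the task whose nearest already-chosen task is farthest away.
        chosen.append(max(remaining, key=lambda t: min(math.dist(t.features, c.features) for c in chosen)))
    return chosen


def rollout_all(candidates: dict[str, object], coreset: Sequence[Task],
                run_agent: Callable[[object, Task, int], str],
                n_samples: int = 2, max_workers: int = 8) -> dict[tuple[str, str], list[str]]:
    """Re-solve every coreset task n_samples times under each candidate harness, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (name, task.task_id): [pool.submit(run_agent, harness, task, seed) for seed in range(n_samples)]
            for name, harness in candidates.items()
            for task in coreset
        }
    # Leaving the with-block waits for every future, so result() returns immediately here.
    return {key: [f.result() for f in fs] for key, fs in futures.items()}


def rank_candidates(names: Sequence[str], coreset: Sequence[Task],
                    rollouts: dict[tuple[str, str], list[str]],
                    judge: Callable[[Task, str, str], bool]) -> list[tuple[str, int]]:
    """Rank harnesses by pairwise wins; judge(task, x, y) is True if the agent prefers x."""
    wins = {name: 0 for name in names}
    for task in coreset:
        for a, b in itertools.combinations(names, 2):
            votes = 0
            # Majority over all sample pairs, judged in both orders to cancel position bias.
            for ra, rb in itertools.product(rollouts[(a, task.task_id)], rollouts[(b, task.task_id)]):
                votes += 1 if judge(task, ra, rb) else -1
                votes += -1 if judge(task, rb, ra) else 1
            if votes > 0:
                wins[a] += 1
            elif votes < 0:
                wins[b] += 1
    return sorted(wins.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    tasks = [Task("a", (0.0, 0.0), 0.1), Task("b", (0.1, 0.0), 0.2),
             Task("c", (5.0, 5.0), 0.3), Task("d", (9.0, 0.0), 0.9)]
    core = select_coreset(tasks, k=2)
    print([t.task_id for t in core])  # ['a', 'c']

    # Stub agent and judge standing in for real LLM calls (hypothetical).
    def stub_agent(harness: object, task: Task, seed: int) -> str:
        return f"{harness}-{task.task_id}-{seed}"

    def stub_judge(task: Task, first: str, second: str) -> bool:
        return first.startswith("strong")

    candidates = {"baseline": "weak", "update": "strong"}
    rollouts = rollout_all(candidates, core, stub_agent)
    print(rank_candidates(list(candidates), core, rollouts, stub_judge))  # [('update', 2), ('baseline', 0)]
```

In the demo, task `d` is excluded as too easy and `b` loses to `c` because it sits too close to the already-chosen `a`. Judging each pair in both orders is one design choice for the comparison step: a judge that always prefers whichever rollout it sees first contributes zero net votes, so neither candidate gains a win from that bias alone.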

Original: huggingface.co
