● builder: If you are using agent benchmark tasks for RL training, a significant fraction of your reward signal may be invalid; the hacker-fixer loop can be applied to audit and patch your verifiers.
● researcher: A 16% hackability rate across major agent benchmarks means current leaderboard rankings and RL training signals are materially corrupted; the hacker-fixer loop is a deployable fix.
● founder: Worth watching because agent capability claims based on these benchmarks are overstated, which affects product differentiation and competitive positioning.