● builder: You can use this to measure how your agent performs in conversational debugging scenarios rather than just single-turn patch generation.
● researcher: This provides a more realistic evaluation framework for agentic workflows by incorporating multi-turn human-in-the-loop dynamics.