[HUGGINGFACE] score: 0.55
Interactive Evaluation Requires a Design Science
May 17, 2026
A position paper argues that AI evaluation needs a formal design science to handle interactive, agentic systems, for which fixed-input benchmarks are structurally inadequate. The fragmented landscape of interactive benchmarks lacks shared schemas for interaction artifacts, trajectory scoring, and result claims. Relevant to teams building or selecting evals for tool-using or multi-step LLM agents.
paper