[arXiv]
Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment
June 24, 2026
A study of 49 programmers using GitHub Copilot across 148 non-functional requirements (NFRs) derived from HIPAA finds that existing single-turn benchmarks miss critical quality dimensions of multi-turn dialogues about NFRs. The work introduces evaluation methods that capture both output correctness and interaction quality in compliance-focused conversations, where NFRs are vague and depend on the context of each codebase.