[arXiv]

LLMs Show Mixed Results Scoring Aphasia Discourse Units

June 16, 2026

Four instruction-tuned LLMs were benchmarked on token-level Correct Information Unit (CIU) classification of aphasic speech transcripts under zero-shot and few-shot conditions, and evaluated against consensus human labels using F1 and Cohen's kappa. Zero-shot prompting was insufficient for every model, suggesting that few-shot prompting is necessary for clinical discourse scoring tasks.
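To make the two metrics concrete, here is a minimal sketch of scoring one model's token-level output against consensus labels. It assumes a binary encoding (1 = token counts as a CIU, 0 = it does not), and the label sequences are invented for illustration; neither the encoding nor the numbers come from the paper.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = CIU), computed over aligned tokens."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0.
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for an eight-token utterance (illustrative only).
gold = [1, 1, 0, 1, 0, 0, 1, 1]  # consensus human labels
pred = [1, 0, 0, 1, 1, 0, 1, 1]  # model output

print(round(f1_score(gold, pred), 3))     # 0.8
print(round(cohen_kappa(gold, pred), 3))  # 0.467
```

In this toy case the model agrees with the raters on six of eight tokens, yet kappa is well below F1 because it discounts the agreement expected by chance. That gap is one reason to report both metrics.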

HOW THIS AFFECTS YOU

● Researcher: Establishes a benchmark for LLM performance on clinical discourse annotation, with kappa-based inter-rater comparisons useful for calibrating NLP evaluation in low-resource clinical tasks.

● Health: Worth watching because automated CIU scoring could reduce the burden on trained raters in aphasia assessment, but the zero-shot failure means prompt engineering or fine-tuning is required before clinical use.
