Lookup NU author(s): Dr Judith Harrison, Alex Robertson, Dr Marie Poole, Tom Collis, Liting Huang, Professor Edward Meinert, Dr Huizhi Liang, Professor John-Paul Taylor
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
© 2026 The Author(s). Published by Elsevier Inc. on behalf of International Psychogeriatric Association. This is an open access article under the CC BY license. http://creativecommons.org/licenses/by/4.0/
Background: Collateral histories from carers are central to dementia diagnosis but are often collected inconsistently and variably documented. With rising demand on memory services and the emergence of disease-modifying therapies requiring timely diagnosis, there is increasing need for structured and efficient assessment approaches. Conversational AI powered by large language models (LLMs) may support standardised collateral history acquisition while maintaining clinician oversight. We developed LUMEN, a stakeholder-informed prototype designed to generate structured collateral summaries for clinical review.
Methods: A five-stage patient, public and professional involvement programme (approximately 232 participants) co-designed the question set, interface and outputs. Seven open-source LLMs were benchmarked; Qwen3-30B-A3B was selected to generate structured summaries from interview transcripts. Six clinician-authored vignettes representing Alzheimer’s disease, dementia with Lewy bodies, vascular dementia, frontotemporal dementia, mild cognitive impairment and normal cognition were used to generate 54 synthetic dialogues (27 clinician role-played, 27 GPT-4 generated). Diagnostic categories were assigned using a deterministic rule-based rubric applied to structured summaries. Two clinicians independently rated each dialogue. Outcomes included exploratory evaluation of alignment with diagnostic categories measured by area under the receiver operating characteristic curve (AUROC) and Cohen’s κ, and System Usability Scale (SUS) scores.
Results: In this small synthetic vignette-based dataset, macro-average AUROC was 0.95; these values reflect performance under closed-loop proof-of-concept conditions rather than real-world diagnostic accuracy. Discrimination was highest for Alzheimer’s disease and vascular dementia (AUROC = 1.00 in this synthetic dataset) and lowest for mild cognitive impairment (AUROC = 0.77). Agreement between categories assigned by the rule-based rubric and averaged clinician ratings was κ = 0.88 (95% CI 0.83–0.93). Mean SUS score was 78.1/100.
Conclusions: In a small, closed-loop synthetic proof-of-concept dataset, this LLM-assisted, rubric-based pipeline showed that structured summaries could be processed reproducibly by the rubric and separated diagnostic categories under controlled conditions. These findings do not show real-world diagnostic performance. Further evaluation is required to determine clinical usefulness, robustness and workflow impact.
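Illustrative note on the outcome measures: the sketch below shows one common way the three reported metrics (macro-average one-vs-rest AUROC, Cohen’s κ and the SUS score) can be computed. It is not the authors' analysis code. All function names are hypothetical, the κ shown is the unweighted form (the abstract does not state which variant was used), and the AUROC function assumes a per-class numeric score is available for each dialogue, which the abstract does not describe.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def auroc(scores, positives):
    """AUROC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(score_rows, labels, classes):
    """One-vs-rest AUROC per class, then the unweighted mean across classes."""
    per_class = {
        c: auroc([row[c] for row in score_rows], [y == c for y in labels])
        for c in classes
    }
    return sum(per_class.values()) / len(classes), per_class


def sus_score(responses):
    """System Usability Scale: ten items rated 1-5, result on a 0-100 scale."""
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)
```

Macro-averaging gives each diagnostic category equal weight, so a weaker category (here, mild cognitive impairment) lowers the average regardless of how many dialogues it contributes.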
Author(s): Harrison JR, Robertson A, Tang SL, Kaur L, Poole M, Mullin D, Robertson E, De Silva P, Collis T, Huang L, Blackburn D, Meinert E, Liang H, Taylor JP
Publication type: Article
Publication status: Published
Journal: International Psychogeriatrics
Year: 2026
Pages: Epub ahead of print
Online publication date: 25/05/2026
Acceptance date: 08/05/2026
Date deposited: 01/06/2026
ISSN (print): 1041-6102
ISSN (electronic): 1741-203X
Publisher: Elsevier
URL: https://doi.org/10.1016/j.inpsyc.2026.100221
DOI: 10.1016/j.inpsyc.2026.100221
Data Access Statement: The synthetic carer–patient dialogues and the corresponding structured summaries that underpin this analysis are available on reasonable request from the corresponding author. These data do not contain real patient information.