An estimated 10–15% of diagnoses are incorrect, and serious patient harm or death from misdiagnosis affects one in 200 patients admitted to hospital.1 Up to 80% of diagnostic errors are potentially preventable; most arise from faults in clinician reasoning related to gathering relevant background information and integrating symptoms, signs and situational factors when generating an appropriate differential diagnosis.2
Experience with digital symptom checkers,3 electronic differential diagnosis generators4 and electronic medical record (EMR) screening for missed diagnoses5 has shown minimal impact, in part due to poor integration into clinical workflows and negative clinician perceptions.6 In this perspective article, we consider how artificial intelligence (AI) may assist clinicians in diagnosing complex cases at the bedside or in the clinic.
Advent of AI‐assisted diagnosis
Machine learning prediction models applied to imaging data have shown promise in diagnosing pneumothoraces from chest radiographs,7 diabetic retinopathy from fundal images8 or skin cancer from dermatoscopic photographs.9 Randomised trials confirm superior AI‐assisted clinician performance in diagnosing diabetic retinopathy,10 detecting adenomas on colonoscopy,11 and identifying impaired cardiac function from electrocardiograms.12
To date, most diagnostic machine learning models input images or structured data from EMRs or investigations and generate single disease probabilities or disease present/not present predictions. Moving upstream and using machine learning tools to assist bedside clinicians in more complex reasoning tasks requires integration of relevant clinical information (history from medical records, presenting complaint and findings from physical examination) and formulation of a differential diagnosis containing the correct diagnosis.
To achieve this aim, AI tools should work in ways that align with how clinicians reason in terms of System 1 (intuitive) and System 2 (analytical) thinking (two modes of cognitive processing described by Daniel Kahneman in Thinking, fast and slow).13 For common clinical presentations, such as crushing central chest pain or sudden onset hemiplegia, intuitive reasoning often suffices in arriving quickly at the correct diagnosis. For more complex and undifferentiated cases, such as fever, weight loss and generalised bruising in an older patient, analytical reasoning is required. This is where AI may help clinicians generate and reason through a differential diagnosis, detailing the key pros and cons of each diagnosis proposed by the clinician or the AI.
Using ChatGPT and related technologies to assist with diagnostic reasoning
Large language models (LLMs), such as the general‐purpose generative pretrained transformer (GPT) series of models, embodied in the chatbot ChatGPT, use natural language processing to learn and generate human‐like text content in response to text‐based prompts (Box). Studies of LLM‐assisted diagnostic reasoning have used GPT‐3.5 or GPT‐4. Applied to EMRs and other source documents, these LLMs can generate concise summaries of patients’ active diagnoses and past medical history (thus saving interview time and effort),14,15 suggest differential diagnoses surpassing previous differential diagnosis generators,16,17 detect diagnostic uncertainty in clinical documentation18 and solve complex diagnostic problems.19,20 Furthermore, LLMs can perform multistep reasoning and provide rationales for the links between each step, known as chain‐of‐thought reasoning.21 LLMs can thereby function as conversant sounding boards against which clinicians iteratively test their diagnostic reasoning.22 By contrast, simply providing a list of differential diagnoses, with no rationales or probability rankings, has no effect on clinician diagnostic accuracy.23
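How such a prompt might look in practice is sketched below. This is a minimal, illustrative sketch only: the model name, prompt wording and use of the OpenAI Python client are assumptions for illustration and do not reflect the specific tools used in the studies cited above.

```python
# Minimal sketch: asking a general-purpose LLM for a ranked differential
# diagnosis with brief rationales (chain-of-thought style output).
# Assumes the OpenAI Python SDK (v1+) and an illustrative model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a diagnostic reasoning assistant for clinicians. Given a case "
    "summary, list the five most likely diagnoses in ranked order. For each, "
    "give a one-sentence rationale citing supporting and opposing findings, "
    "then suggest the bedside test that would best discriminate between them."
)

case_summary = (
    "72-year-old woman with 3 weeks of fever, 6 kg weight loss and "
    "generalised bruising. Examination: pallor, petechiae, no lymphadenopathy."
)

response = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    temperature=0.2,  # keep outputs relatively conservative
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": case_summary},
    ],
)
print(response.choices[0].message.content)
```

Asking explicitly for supporting and opposing findings mirrors the "pros and cons" framing described above and avoids the bare, unexplained list shown to have no effect on accuracy.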
Traditional task‐specific machine learning prediction models generate a single set of diagnostic predictions in response to a fixed, one‐off input of pre‐processed data with predictions pre‐analysed by domain experts. These predictions rely on extracting and presenting, as explanation, key features learnt from being trained on a circumscribed, domain‐specific dataset.24 In contrast, LLMs are pretrained and fine‐tuned on a large, heterogeneous dataset of clinical knowledge, and can discern complex relationships and variations within the data, beyond the limits of human cognition. In response to text prompts relating to a diagnostic case, LLMs can generate a list of plausible alternatives, and, when provided with further information (eg, revised history or physical signs, clinician insight, simple bedside test results), they can re‐evaluate and re‐order their differential diagnosis. The more narrative text an LLM has as input, rather than only key clinical features, the better its diagnostic performance.25 The generated diagnostic rationales, highlighting relevant patient data, provide a reasoning path towards the final diagnosis.26 This “reasoning aware” approach, using chain‐of‐thought, prompt‐based learning, allows the LLM to use rationales as part of its input, further improving the diagnostic outputs and even correcting its own misdiagnoses arising, in part, from inaccurate training data.27
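What this iterative, reasoning‐aware exchange could look like is sketched below. The conversation structure is the standard chat‐message format; the prompts, case details and model name are assumptions for illustration rather than any published method.

```python
# Sketch of an iterative, "reasoning aware" loop: the LLM's previous ranked
# differential and rationales remain in the conversation history, and each new
# piece of information (examination findings, bedside test results) prompts a
# re-evaluation and re-ordering of the list. Prompts and model name are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": (
        "Maintain a ranked differential diagnosis for this case. Whenever new "
        "information arrives, restate the list, re-ordered if needed, and "
        "explain which findings changed each diagnosis's ranking.")},
    {"role": "user", "content": (
        "65-year-old man with exertional dyspnoea, 4 kg weight loss and a dry "
        "cough for 2 months. Ex-smoker, 30 pack-years.")},
]

updates = [
    "Please give an initial ranked differential with rationales.",
    "Examination: finger clubbing and fine bibasal inspiratory crackles.",
    "Bedside spirometry: restrictive pattern, FVC 65% predicted.",
]

for update in updates:
    messages.append({"role": "user", "content": update})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    # Feeding the model's own rationales back in is what makes the loop
    # "reasoning aware" rather than a series of independent queries.
    messages.append({"role": "assistant", "content": answer})
    print(answer, "\n---")
```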
Such evaluative LLMs could provide clinicians with a second opinion in real time, share uncertainty, deal with limited or noisy data, and defer appropriately to clinician expertise and judgement.28 Studies show the accuracy of clinicians’ diagnoses across multiple cases improves markedly if the diagnoses are discussed with one other colleague, more so with two or more.29,30 The benefit of this collective intelligence can feasibly be replicated using LLMs. This “machine‐in‐the‐loop” approach better leverages clinician expertise in hypothesis‐driven decision making, mitigates over‐ and under‐reliance on machine learning decision support, and builds clinician trust and control.31 This contrasts with the more conventional “human‐in‐the‐loop” approach where the role of the clinician is relegated to accepting or rejecting AI outputs that are unaccompanied by any reasoning chain, leading to clinician resistance to and disuse of LLMs.22
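One way a “machine‐in‐the‐loop” exchange could be framed, in contrast to a take‐it‐or‐leave‐it AI verdict, is sketched below; the prompt wording and case details are assumptions for illustration only.

```python
# Sketch of a "machine-in-the-loop" prompt: the clinician leads with their own
# working differential and reasoning, and the LLM is asked to respond as a
# colleague offering a second opinion rather than issuing an output to accept
# or reject. Wording and case details are illustrative only.

def second_opinion_prompt(case_summary: str,
                          clinician_differential: list[str],
                          clinician_reasoning: str) -> str:
    ranked = "\n".join(f"{i + 1}. {dx}" for i, dx in enumerate(clinician_differential))
    return (
        f"Case summary:\n{case_summary}\n\n"
        f"My working differential (most to least likely):\n{ranked}\n\n"
        f"My reasoning:\n{clinician_reasoning}\n\n"
        "As a colleague giving a second opinion: which of these diagnoses do you "
        "agree with and why, which would you re-order or remove, and what "
        "plausible diagnoses am I missing? Flag any findings I may have over- or "
        "under-weighted, and say where you are uncertain."
    )

prompt = second_opinion_prompt(
    "58-year-old woman with acute pleuritic chest pain and breathlessness.",
    ["Pulmonary embolism", "Pneumothorax", "Musculoskeletal chest pain"],
    "Recent long-haul flight and oral contraceptive use raise my suspicion of PE.",
)
print(prompt)  # would be sent as the user message of a chat-completion request
```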
Experimental studies of LLMs in diagnostic reasoning
ChatGPT does not appear to significantly enhance clinicians’ differential diagnoses for common clinical presentations.32,33,34 In contrast, in a study comparing GPT‐4 with a simulated population of 10 000 online medical‐journal‐reading clinicians in solving 38 challenging cases, the March 2023 version of GPT‐4 correctly diagnosed a mean of 22 cases (57%) versus 14 cases (36%) for the clinicians.35 In a vignette study comparing GPT‐4 with 553 clinicians, GPT‐4 more accurately estimated the pre‐test probability of disease in all five cases, the post‐test probability after a negative test result in all five cases, and the post‐test probability after a positive test in four cases.36 In a randomised study of 20 experienced clinicians diagnosing 302 difficult real‐world cases, those assigned to assistance from Med‐PaLM‐2, an LLM trained on biomedical texts such as PubMed abstracts, compared with those assigned to more traditional decision support (search engines, online resources), generated higher quality differential diagnoses that more often contained the correct diagnosis (top‐10 accuracy, 52% v 44%) and demonstrated higher accuracy for the final diagnosis (59% v 34%).37 In a randomised crossover simulation study, 20 standardised patients took part in text‐based consultations with an LLM (Articulate Medical Intelligence Explorer) or face‐to‐face consultations with 20 primary care clinicians across 149 clinical scenarios, with responses assessed by 23 specialists.38 The LLM showed significantly higher top‐10 diagnostic accuracy than the clinicians (93% v 83%). Both specialists and patients rated the LLM superior in communication, reasoning and empathy.
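For context, the pre‐test and post‐test probabilities in the vignette study described above are conventionally related through likelihood ratios (post‐test odds = pre‐test odds × likelihood ratio); the sketch below shows this standard calculation, against which both clinician and LLM estimates can be benchmarked. The numbers are invented purely for illustration.

```python
# Standard conversion of pre-test to post-test probability via likelihood
# ratios: post-test odds = pre-test odds x likelihood ratio.
# All numbers below are invented purely for illustration.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

pre_test = 0.30     # hypothetical pre-test probability of disease
lr_positive = 8.0   # hypothetical positive likelihood ratio of the test
lr_negative = 0.2   # hypothetical negative likelihood ratio of the test

print(f"After a positive test: {post_test_probability(pre_test, lr_positive):.0%}")  # ~77%
print(f"After a negative test: {post_test_probability(pre_test, lr_negative):.0%}")  # ~8%
```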
But LLMs have limitations. ChatGPT‐3.5 demonstrated an 83% error rate when applied to 100 challenging paediatric cases, underscoring the need to avoid unrepresentative training datasets.39 In another study, the differential diagnoses created by GPT‐4 across 18 standardised clinical vignettes were more likely to include diagnoses that stereotyped certain races, ethnicities and genders.40 ChatGPT is also often inaccurate when used by patients to self‐diagnose and self‐triage,41,42 suggesting research should, for the moment, remain focused on clinician‐facing applications.
Future directions
Several innovations will likely move LLM‐assisted diagnosis towards prime‐time use. Biomedically trained LLMs, such as Med‐PaLM‐2, augmented with real‐time access to additional, up‐to‐date medical information, semantic knowledge graphs, reinforcement learning with human feedback and optimised prompt engineering, will achieve accuracy superior to that of models such as GPT‐4, which are trained on internet data of variable quality.27,43 Multimodal LLMs are emerging that can process not only text but also numerical, image, video and audio data, further enhancing performance. For example, an LLM trained on both text and images (the GPT‐4 with Vision version), when compared with human respondents across 934 cases from the New England Journal of Medicine Image Challenge and 69 clinicopathological conferences (New England Journal of Medicine), achieved an overall diagnostic accuracy of 61% versus 49%, with longer, more informative captions increasing performance.44 Another study found a diagnostic LLM trained on multimodal data from real‐world EMRs outperformed text‐based models.45
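How a text‐plus‐image case might be submitted to a vision‐capable chat model is sketched below; the model name, image URL and prompt are placeholders, and this does not represent the pipeline used in the studies cited above.

```python
# Sketch of a multimodal request: a short clinical vignette and an imaging
# study sent together to a vision-capable chat model. The model name, image
# URL and prompt are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()

vignette = (
    "34-year-old man with sudden right-sided pleuritic chest pain and "
    "dyspnoea at rest. Tall, thin, smoker. Chest radiograph attached. "
    "Give a ranked differential diagnosis with rationales."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": vignette},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/cases/cxr_001.png"}},  # placeholder image
        ],
    }],
)
print(response.choices[0].message.content)
```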
But there are challenges. Studies of diagnostic LLMs have involved laboratory‐based vignettes that may not represent usual clinical practice, where diagnoses unfold temporally through recursive question‐answering interactions between clinicians and patients. Variations in patient populations, clinical settings and data quality may degrade model performance. Cognitive bias mitigations need to be embedded into the design of LLM applications and their user interfaces, and LLMs need to be implemented in ways that blend with clinician workflows.46 Randomised trials involving clinicians diagnosing acute clinical scenarios in real‐world settings, with and without LLM assistance, are needed. Clinicians will still need to critically appraise the differential diagnosis and associated rationales in terms of their consistency with the clinical data, their correctness and level of relevant detail (ie, specificity), their usefulness in pointing towards the correct diagnosis, and their similarity to the way humans think.26 Clinicians will also have to validate LLM performance on local datasets, prompt LLMs correctly,47 and avoid over‐reliance on model outputs. Regulatory approval and monitoring of LLM quality management systems will be required to preserve data privacy, ensure transparency and fairness, determine medical liability for harm and guarantee LLMs remain effective and safe over their life cycle.48 The Therapeutic Goods Administration of Australia has stated that LLM developers must understand and demonstrate the sources and quality of text inputs used to train and test the model, in addition to showing how the data are relevant and appropriate for use on Australian populations.49 Although LLM‐assisted diagnosis is not yet ready for prime‐time use, it may not be far off.
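At its simplest, validating LLM performance on a local dataset (as mentioned above) might look like the sketch below: scoring the top‐k accuracy of generated differential diagnoses against locally adjudicated final diagnoses. The data structure and exact string matching are simplifying assumptions; real evaluation would require expert adjudication of synonyms and near‐misses.

```python
# Minimal sketch of local validation: top-k accuracy of LLM-generated
# differentials against adjudicated final diagnoses from a local case series.
# Exact string matching is a simplifying assumption.

def top_k_accuracy(cases: list[dict], k: int = 10) -> float:
    hits = sum(
        1 for case in cases
        if case["final_diagnosis"].lower()
        in [dx.lower() for dx in case["llm_differential"][:k]]
    )
    return hits / len(cases)

local_cases = [  # toy examples; in practice, drawn from the local EMR
    {"final_diagnosis": "Pulmonary embolism",
     "llm_differential": ["Pneumonia", "Pulmonary embolism", "Pericarditis"]},
    {"final_diagnosis": "Giant cell arteritis",
     "llm_differential": ["Migraine", "Tension headache", "Trigeminal neuralgia"]},
]

print(f"Top-10 accuracy: {top_k_accuracy(local_cases, k=10):.0%}")  # 50% on this toy set
```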
Box – Brief overview of how large language models (LLMs) work
Provenance: Not commissioned; externally peer reviewed.
Open access:
Open access publishing facilitated by The University of Queensland, as part of the Wiley ‐ The University of Queensland agreement via the Council of Australian University Librarians.
No relevant disclosures.