A new study finds that large language models, a type of artificial intelligence (AI), show potential for diagnosing cases in general medicine and clinical practice.
Key Details
- Researchers from Mass General Brigham released a new study on August 22 about the reliability of large language models in clinical medicine.
- The study found that ChatGPT was 71.7% accurate overall at diagnosing cases and suggesting care options when tested on textbook case studies, and it got the final diagnosis correct 77% of the time.
- The AI was less accurate at differential diagnosis (60.3%) than at final diagnosis, and it also struggled with clinical management.
- The researchers believe success rates of 80% to 90% are necessary for clinical applicability.
Why It’s Important
The world changed with the release of ChatGPT on November 30, 2022. In the nine months since, tech companies and researchers have raced to find AI solutions that can improve efficiency and profitability in every area of the economy, including medical care. Researchers want to know if AI can improve the accuracy of diagnosis in healthcare, helping catch and reduce severe illness while alleviating stress on the medical system caused by staff shortages.
The Mass General Brigham study is among the first to assess the value of AI in clinical medicine across generalized settings rather than on individual tasks. It remains unclear whether AI has a future in clinical medicine, although earlier studies have suggested that the technology can be highly effective in specific uses. The study's authors tell Axios that there is much work to do to “bridge the gap from a useful machine learning model to actual use in clinical practice” but argue that ChatGPT is currently roughly on par with a newly graduated doctor.
Notable Quote
“ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT’s training data set,” says the study.