Artificial intelligence (AI) tools “struggle with the basic back-and-forth of a doctor’s visit”, researchers have said.
Previous studies have suggested that AI systems can help healthcare professionals by successfully taking medical histories, providing preliminary diagnoses and triaging patients.
However, scientists from Harvard Medical School and Stanford University have now found that AI tools perform far less well in situations that more closely mimic the real world.
For the study, the team created a test called the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD).
Once the test was developed, they ran it on four large language models to examine how the AI performed in clinical settings.
They found that the four large language models performed well on medical exam-style questions, but considerably worse when they engaged in conversations more closely mimicking real-world interactions.
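What that gap looks like in practice can be sketched in a few lines of code. The snippet below is a minimal illustration only, not the study's actual CRAFT-MD implementation; `query_model` is a hypothetical stand-in for any large language model API, and the clinical details are invented for the example. It contrasts a static multiple-choice prompt, where every finding is handed to the model up front, with a simulated back-and-forth in which the model must elicit the facts itself.

```python
# Illustrative sketch, NOT the study's CRAFT-MD code: it only contrasts
# the two evaluation styles described above. `query_model` is a
# hypothetical stand-in for a call to any large language model API.

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client here."""
    return "placeholder response"

# Style 1: exam-style multiple-choice question, with every relevant
# finding presented up front in a single prompt.
mcq_prompt = (
    "A 45-year-old presents with fatigue, polyuria and polydipsia. "
    "Fasting glucose is elevated. Most likely diagnosis?\n"
    "A) Type 2 diabetes  B) Hypothyroidism  C) Anaemia  D) UTI"
)
exam_answer = query_model(mcq_prompt)

# Style 2: a simulated consultation, where the model must draw out the
# same facts from a patient-agent one exchange at a time.
transcript = [("patient", "I've been feeling really tired lately.")]
for _ in range(5):  # a bounded number of doctor-patient turns
    history = "\n".join(f"{role}: {text}" for role, text in transcript)
    doctor_turn = query_model(
        "You are a doctor. Ask one question or give a diagnosis.\n" + history
    )
    transcript.append(("doctor", doctor_turn))
    history = "\n".join(f"{role}: {text}" for role, text in transcript)
    patient_turn = query_model(
        "You are the patient. Answer the doctor's last question.\n" + history
    )
    transcript.append(("patient", patient_turn))

# A conversational evaluation scores whether a correct diagnosis emerges
# from the dialogue, rather than from a pre-packaged question.
final_diagnosis = query_model(
    "State the most likely diagnosis based on this conversation:\n"
    + "\n".join(f"{role}: {text}" for role, text in transcript)
)
print(exam_answer, final_diagnosis)
```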
The authors said: “This gap underscores a two-fold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.
“Evaluation tools like CRAFT-MD can not only assess AI models more accurately for real-world fitness but could also help optimise their performance in clinic.”
First author Professor Pranav Rajpurkar said: “Our work reveals a striking paradox – while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit.
“The dynamic nature of medical conversations – the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms – poses unique challenges that go far beyond answering multiple choice questions.”
Professor Rajpurkar added: “When we switch from standardised tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”
AI models are currently assessed mainly by having them answer multiple-choice medical questions, the researchers noted.
Co-author Shreya Johri said: “This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier.
“We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform.”
Co-author Professor Roxana Daneshjou said: “As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically.
“CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”
The study was published in the journal Nature Medicine.