
Medical chatbots that use artificial intelligence to answer questions give erroneous responses in up to 70% of cases, according to a study that received one of four awards presented by the Milan Medical Association in honor of Roberto Anzalone, a historic figure in Milanese medicine.
Across 200 questions, the chatbot's answers contained at least one error in approximately 70% of cases and included inaccurate or even non-existent bibliographic references in approximately 30%.
The study, published in the European Journal of Pathology, the official journal of the European Society of Pathology (ESP), concludes that "the clinical eye of the pathologist remains irreplaceable and that artificial intelligence should be considered a useful support, but not a substitute for human expertise."
"Our project, initiated in 2023," explained Vincenzo Guastafierro, a specialist in Pathological Anatomy at the Humanitas University Clinical Institute in Rozzano, "aimed to estimate the risks associated with the use of Artificial Intelligence (AI) tools in clinical practice, particularly chatbots used to support diagnosis and as learning tools."
"We presented real clinical questions to the AI across various sub-specialties, detecting incorrect answers in approximately 70% of cases and inaccurate or non-existent bibliographic references in approximately 30%," he detailed.
"Therefore, these tools must be used with extreme caution, as they can lead to inappropriate diagnostic decisions that negatively impact treatment options," stated Guastafierro.
In the study, the researchers created five clinical scenarios simulating a pathologist who uses ChatGPT to refine a diagnosis, posing a total of 200 questions.
Each scenario was aligned with current diagnostic guidelines and validated by expert pathologists. The AI was presented with open-ended or multiple-choice questions, with or without requests for scientific references.
According to the study, "ChatGPT provided useful answers in 62.2% of cases, and 32.1% of the outcomes contained no errors, while the remainder contained at least one. ChatGPT provided 214 bibliographic references: 70.1% were correct, 12.1% were inaccurate, and 17.8% were non-existent."
This last finding greatly surprised the researchers: the AI had constructed a non-existent reality, citing sources that did not exist but were so well crafted that they seemed credible. Among the most evident errors, the AI misdiagnosed a skin cancer in one case and identified the wrong type of breast cancer in another, while also generating two incorrect bibliographic references.
Consequently, the data raise important questions for the medical profession, which is developing artificial intelligence and will rely on it ever more heavily, but also for patients who use it for self-diagnosis. Although ChatGPT provided useful answers in nearly two-thirds of cases, only about a third were error-free, and the frequency and variability of the errors underscore its inadequacy for routine diagnostic use.
"The inaccuracy of the references, the researchers note, also suggests caution as a self-learning tool for physicians. It is crucial to recognize the irreplaceable role of the human being."
The studies will continue, Guastafierro explained, using the most up-to-date versions of these tools to track their evolution and any gains in reliability over time. (CubaSí)