24.04.2024

ChatGPT fails Polish internal medicine exam

Credit: Adobe Stock

Even the most sophisticated algorithms and technologies cannot diagnose and treat diseases without human involvement, scientists from the Collegium Medicum of the Nicolaus Copernicus University found after ChatGPT 'failed' the internal medicine exam in the study they had designed.

The enormous progress made in recent years in the field of artificial intelligence means that many tasks previously reserved for humans can now be performed by models and algorithms. Modern medicine is also starting to use the possibilities of AI. Research is ongoing on its use to design new drugs, support doctors in the diagnostic process, predict pandemics and replace surgeons during operations. For some time now, there have been reports on experiments in which artificial intelligence models successfully pass medical exams and provide 'patients' with more accurate and empathetic advice than doctors.

However, according to the results of the latest study by experts from the Ludwik Rydygier Collegium Medicum in Bydgoszcz, as reported by Marcin Behrendt on the Nicolaus Copernicus University website, the moment has not yet come to entrust artificial intelligence with the complete care of patients, especially in the field of internal medicine.

'Internal medicine, as a field, is often referred to as the queen of medical science. Physicians specializing in internal medicine are required to possess extensive knowledge as well as a high degree of focus and self-discipline,’ say the authors of the study published in the Polish Archives of Internal Medicine (https://dx.doi.org/10.20452/pamw.16608).

'In accordance with Polish law, a physician can become an internal medicine specialist after completing specialist training and passing the board certification examination. The assessment consists of two elements: a multiple-choice test that encompasses 120 questions with 5 possible answers of which only 1 is accurate, and an oral examination that can only be attempted upon successfully passing the written test,’ they say.

After ChatGPT successfully passed tests such as the United States Medical Licensing Examination (USMLE), the European Exam in Core Cardiology, and the Ophthalmic Knowledge Assessment Program (OKAP), Polish scientists decided to investigate whether this model would be able to pass the Polish examination required for obtaining the title of specialist in internal diseases. Their study was the first in the world to evaluate AI in the field of internal medicine.

ChatGPT was presented with a total of 1,191 questions from the Polish board certification examinations from the years 2013 to 2017. Only questions that were impossible for ChatGPT to analyse, such as those containing images, were removed.

The authors divided them into different categories, classifying them based on the level of complexity (one correct answer or several), degree of difficulty and length.

It turned out that the rate of correct answers obtained by ChatGPT ranged from 47.5 percent to 53.33 percent (median 49.37%), which was definitely not enough to pass the exam. 'In all sessions, the performance of ChatGPT was significantly inferior to that of human examinees (whose results ranged from 65.21% to 71.95%),’ the scientists found. (The minimum passing score is 60 percent correct answers.)

The language model's results showed significant differences depending on question length. ChatGPT performed best on the shortest questions, followed by long, very long, and finally short and medium-length questions. Interestingly, the human results were very similar.

Regarding question difficulty, it was found that the correctness of ChatGPT responses gradually decreased as the task difficulty increased, which was also similar to human behaviour.

Additionally, the researchers verified the effectiveness of AI in answering questions from specific areas of internal medicine. It turned out that it most often responded correctly to questions concerning allergology (71.43%), followed by those on infectious diseases (55.26%), endocrinology (54.64%), nephrology (53.51%), rheumatology (52.83%), haematology (51.51%), gastroenterology (50.97%), pulmonology (46.71%), and diabetology (45.1%). It obtained the lowest result (43.72%) on questions regarding cardiology.

'AI has made significant advancements in recent years, and gained considerable popularity in various fields. Previous applications of AI in health care involved tasks such as cataloging and interpreting big data or developing and implementing diagnostic–therapeutic algorithms. AI usage seems to be of great aid given the underfunding of health care systems, the problem of professional burnout among medical professionals, and personnel shortages,’ the scientists say.

However, as they emphasise, their study (as well as several similar ones) shows that the capabilities of artificial intelligence are still very limited, and it is currently difficult for AI to compete with the expertise of trained medical professionals, particularly in the field of internal medicine.

'However, medicine is a field in which the utilization of AI language processing models may be beneficial,’ they add.

As an example, they cited ChatGPT's empathetic behaviour towards patients. A recent study comparing responses from doctors and chatbots to medical queries posted on public forums found that 79 percent of patients found the answers provided by AI to be more empathetic and comprehensive than those provided by human professionals.

'Undoubtedly, it is worthwhile following the development of AI, especially ChatGPT, to be able to take advantage of its rapid progress,’ the authors write. They add that 'it is unlikely that AI will be able to replace health care professionals in the near future, particularly in the field of internal medicine - even the most sophisticated algorithms and technologies facilitated by AI are incapable of diagnosing and treating diseases without human input.’

The researchers also noted that their experiment had several limitations. First of all, the exam was conducted in Polish, while ChatGPT was developed primarily for English. Additionally, ChatGPT undergoes regular updates, and the version employed in the study may not necessarily reflect the most current iteration at the time of publication. (PAP)

Katarzyna Czechowicz

kap/ zan/

tr. RL


Copyright © Foundation PAP 2024