Evaluation of ChatGPT performance on emergency medicine board examination questions: observational study
Date
2025
Authors
Pastrak, Mila
Kajitani, Sten
Goodings, Anthony James
Drewek, Austin
LaFree, Andrew
Murphy, Adrian
Publisher
JMIR Publications
Abstract
Background:
The ever-evolving field of medicine has highlighted the potential of ChatGPT as an assistive platform. However, its usefulness for preparing for and completing medical board examinations remains unclear.
Objective:
This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (an Anki flashcard deck), compared with the default version of ChatGPT-4 and the previous iteration (ChatGPT-3.5). The goal was to assess the accuracy of ChatGPT-4 in answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation.
Methods:
A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. Three versions of ChatGPT were compared: the Default version of ChatGPT-4, the Custom version of ChatGPT-4, and ChatGPT-3.5. Accuracy, response length, performance by medical discipline subgroup, and underlying causes of error were analyzed.
Results:
The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with mean (SD) response lengths of 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in performance across medical subdisciplines between the versions (P>.05 in all cases). Both versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and a 99% predicted probability of passing, whereas ChatGPT-3.5 had an 85% probability.
Conclusions:
The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form.
Keywords
Artificial intelligence, ChatGPT-4, Medical education, Emergency medicine, Examination, Examination preparation
Citation
Pastrak, M., Kajitani, S., Goodings, A.J., Drewek, A., LaFree, A. and Murphy, A. (2025) ‘Evaluation of ChatGPT performance on emergency medicine board examination questions: observational study’, JMIR AI, 4, e67696 (9 pp.). https://doi.org/10.2196/67696