
In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors

A recent study investigates the performance of large language models across various medical scenarios, including actual emergency room cases, in which at least one model outperformed human physicians in accuracy. Published this week in Science, the research was conducted by a team of physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center.

The researchers reported conducting various experiments to assess how OpenAI’s models stacked up against human doctors. In one such experiment, they examined 76 patients who visited the Beth Israel emergency room, pitting diagnoses from two attending internal medicine physicians against those produced by OpenAI’s o1 and 4o models. Two additional attending physicians evaluated these diagnoses without knowing which originated from humans and which from AI.

“At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o,” the study stated, noting that the differences “were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”

In a Harvard Medical School press release on the study, the researchers highlighted that they applied no preprocessing to the data: the AI models received exactly the same information from electronic medical records as was available at the time of each diagnosis. Using that data, the o1 model delivered the exact or a very close diagnosis in 67% of triage cases, outperforming one physician’s rate of 55% and the other’s rate of 50%.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” stated Arjun Manrai, head of an AI lab at Harvard Medical School and a lead study author, in the release.
To clarify, the study did not assert that AI is prepared to handle life-or-death decisions in the emergency room. Instead, it stated that the findings indicate an “urgent need for prospective trials to assess these technologies in real-world patient care settings.” The researchers also observed that they only examined models’ performance with text-based inputs, noting that “existing studies indicate current foundation models have greater limitations in reasoning over nontext inputs.”

