A major study has revealed a critical weakness in how today's most advanced AI models handle the initial stages of medical diagnosis.
Published in JAMA Network Open, the study, led by researchers at Harvard Medical School, benchmarked 21 leading Large Language Models (LLMs), including GPT-5 and Gemini. The results were striking: for differential diagnosis, the open-ended and uncertain first step of the diagnostic process, failure rates often exceeded 80%. For final diagnosis, where more clinical data is available, failure rates dropped below 40%. This doesn't mean AI is 'bad at medicine'; rather, it shows that these models are brittle when faced with the ambiguity of an initial consultation.
This finding is particularly timely for several reasons. First, it provides a crucial reality check amid the constant hype surrounding new AI model releases: while tech companies promote ever-improving reasoning capabilities, this study shows a persistent gap in a critical real-world application. Second, global regulators are finalizing strict rules for AI. The EU AI Act, with enforcement beginning in August 2026, and the FDA's guidance both emphasize human oversight for 'high-risk' applications, a category into which patient-facing diagnostic tools clearly fall in light of this study. Third, healthcare organizations have already been advocating for clear governance frameworks, and this data provides a strong evidence base for mandating caution.
The study fundamentally shifts the conversation from general AI capabilities to identifying specific weak points. The critical vulnerability is now clear: the very first, unsupervised step in the diagnostic process. This reframes patient-triage chatbots from 'promising innovations' to 'high-risk systems requiring strict human supervision,' changing the calculus for adoption, liability, and regulatory compliance.
Ultimately, the message is not to abandon AI in healthcare but to define its proper role. The evidence suggests AI should be a 'co-pilot' that assists trained clinicians, not an autonomous pilot making initial contact with patients. This study serves as a vital guardrail, ensuring that the integration of powerful AI into clinical care proceeds safely and responsibly.
- Differential Diagnosis: The process of distinguishing between two or more conditions that share similar signs or symptoms. It is the crucial, open-ended first step in medical diagnosis.
- Large Language Model (LLM): An artificial intelligence model trained on vast amounts of text to understand and generate human-like language. Examples include GPT-5 and Gemini.
- EU AI Act: A comprehensive European Union regulation designed to govern the development and deployment of artificial intelligence, categorizing AI systems by risk level.
