Artificial intelligence (AI) chatbots, despite excelling in controlled lab conditions, struggle significantly when ordinary people bring them real-world medical questions. A recent study published in Nature Medicine reveals that while these systems can identify medical conditions with up to 95% accuracy when tested directly on written scenarios, the rate of correct identification drops below 35% once real people try to reach those answers through conversation.
This discrepancy highlights a critical gap between theoretical medical knowledge and practical application: the AI has the knowledge, but humans struggle to extract useful advice from it. The study, conducted by researchers at the University of Oxford, tested large language models (LLMs) such as GPT-4o, Command R+, and Llama 3, both by feeding them medical scenarios directly and by having participants use the same models to work out what was wrong and what to do about it.
The researchers found that people using chatbots for diagnoses actually performed worse than those who simply searched for symptoms on Google. Search engines yielded correct diagnoses over 40% of the time, while chatbots averaged only 35% accuracy. This difference is statistically significant, demonstrating that even basic search tools can currently outperform AI-driven medical advice in everyday use.
The issue isn’t necessarily a lack of medical knowledge in the AI itself—the models used were state-of-the-art in late 2024. Instead, the problem lies in how humans interact with these systems. People tend to provide information piecemeal, and chatbots are easily misled by incomplete or irrelevant details. Subtle phrasing changes can dramatically alter a chatbot’s response: describing a severe headache as “sudden and the worst ever” correctly prompts an AI to recommend immediate medical attention, while calling it a “terrible headache” may lead to a suggestion of rest—a potentially fatal error in cases like subarachnoid hemorrhage.
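To make this phrasing sensitivity concrete, here is a minimal sketch of how one might probe it. This is not a reconstruction of the study's protocol: the system prompt, the two complaint wordings, and the model choice are illustrative assumptions, and it presumes the OpenAI Python client is installed with an API key configured.

```python
# Minimal sketch of a phrasing-sensitivity probe (illustrative only; not the study's method).
# Assumes the OpenAI Python client (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM = "You are a medical triage assistant. Recommend a course of action."  # hypothetical prompt

phrasings = [
    "I have a sudden headache, the worst I have ever had.",  # red-flag wording
    "I have a terrible headache.",                           # vaguer wording of the same complaint
]

for complaint in phrasings:
    response = client.chat.completions.create(
        model="gpt-4o",       # one of the models named in the article
        temperature=0,        # reduce run-to-run variation so differences come from wording
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": complaint},
        ],
    )
    advice = response.choices[0].message.content
    print(f"PROMPT: {complaint}\nADVICE: {advice}\n{'-' * 60}")
```

Even with the temperature set to zero, small wording changes like these can flip the recommendation between seeking emergency care and resting at home, which is precisely the failure mode described above.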
The unpredictable nature of AI reasoning, often referred to as the “black box problem,” makes it difficult to understand why such variations occur. Even the developers struggle to trace the logic behind the models’ decisions.
These findings confirm long-standing concerns about AI safety in healthcare. ECRI, a patient safety organization, has already identified AI chatbots as the most significant health technology hazard for 2026, citing risks like erroneous diagnoses, fabricated information, and reinforcement of existing biases. Despite these warnings, healthcare professionals are increasingly integrating chatbots into their workflows for tasks such as transcription and preliminary test result review. OpenAI and Anthropic have even launched dedicated healthcare versions of their models, with ChatGPT already handling over 40 million medical queries daily.
The key takeaway is that commercial LLMs are not yet reliable enough for direct clinical use. While AI technology will likely improve over time, the current gap between lab performance and real-world utility poses substantial risks.
Researchers like Michelle Li at Harvard Medical School are working on potential improvements to AI training and implementation. The first step, according to Oxford’s Adam Mahdi, is to refine how AI performance is measured—specifically, focusing on how it performs for real people rather than in artificial settings.
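What "measuring performance for real people" might look like in practice is still open. One hedged sketch, with entirely hypothetical data structures and example values, is to score the same set of cases twice: once on the model's direct answer to the written vignette, and once on the conclusion the human participant actually reached after the conversation.

```python
# Hedged sketch of two evaluation modes (all names and data below are hypothetical).
# "Vignette accuracy": did the model name the right condition when given the case directly?
# "Interaction accuracy": did the human participant reach the right condition after chatting?
from dataclasses import dataclass

@dataclass
class CaseResult:
    gold_condition: str        # condition the vignette was written around
    model_direct_answer: str   # model's diagnosis when fed the full vignette
    user_final_answer: str     # what the participant concluded after using the chatbot

def accuracy(results, answer_field):
    hits = sum(
        1 for r in results
        if getattr(r, answer_field).strip().lower() == r.gold_condition.strip().lower()
    )
    return hits / len(results) if results else 0.0

results = [
    CaseResult("subarachnoid hemorrhage", "subarachnoid hemorrhage", "tension headache"),
    CaseResult("pneumonia", "pneumonia", "pneumonia"),
    CaseResult("appendicitis", "appendicitis", "indigestion"),
]

print("vignette accuracy:   ", accuracy(results, "model_direct_answer"))
print("interaction accuracy:", accuracy(results, "user_final_answer"))
```

The gap between the two numbers is the lab-versus-real-world difference the study highlights; a benchmark that reports only the first figure will overstate how useful the system is to an actual patient.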