Artificial intelligence chatbots are generating dangerously flawed and incomplete answers to questions about medications and health, according to multiple new studies that raise alarms about the public’s growing reliance on AI for medical advice. Research into the performance of leading AI models found that their guidance was often difficult to understand, inaccurate and, in a significant number of cases, potentially harmful. In one study, nearly a quarter of the answers an AI chatbot gave about common prescription drugs were judged capable of causing severe harm or death if a patient followed the advice.
The findings arrive as an increasing number of people turn to chatbots to bypass long hospital wait times and high medical costs, with one recent survey indicating that one in six American adults seeks health advice from AI at least once a month. Researchers warn that while the large language models underpinning these tools excel at standardized tests, their clinical knowledge does not translate effectively to real-world human interactions. A large-scale experiment led by the University of Oxford found that AI tools provided no significant benefit in health decision-making and, in some cases, weakened it by creating confusion and causing users to downplay serious medical conditions.
Evaluating AI-Generated Drug Information
A detailed analysis conducted by researchers in Belgium and Germany and published in BMJ Quality & Safety examined the advice that Bing Copilot, Microsoft’s AI-powered search engine, provided about the 50 most frequently prescribed drugs in the United States. The investigation systematically tested the chatbot’s ability to answer common patient questions, evaluating the completeness, accuracy, and readability of its responses. The results revealed significant shortcomings that could pose a direct threat to patient safety: the chatbot provided answers with the highest level of completeness for only half of the questions posed by the researchers.
When pharmacological experts reviewed the AI’s output, they found that its statements did not match established drug reference data in 26% of the answers, and more than 3% of the responses were fully inconsistent with reliable medical sources. Based on these inaccuracies, the expert panel judged that 42% of the chatbot’s answers were likely to cause moderate or mild harm, while a further 22% could lead to severe harm or death; only one-third of the responses were deemed unlikely to cause any harm. The study also noted that the AI’s language was often so complex that fully understanding it required a university-level education, leaving patients without a scientific background at risk of misreading its advice.
Human Interaction as a Critical Failure Point
While some AI flaws stem from the models themselves, a separate landmark study from the Oxford Internet Institute revealed that the interaction between humans and chatbots is a primary source of failure. The researchers conducted a large experiment involving 1,300 participants in the United Kingdom, who were asked to make decisions about several medical scenarios. Participants were randomly assigned to use one of several leading AI models, including GPT-4o, Cohere’s Command R+, and Meta’s Llama 3, or to rely on their own judgment and online search tools.
The study, titled “Clinical knowledge in LLMs does not translate to human interactions,” found no major advantage to using AI. Participants using chatbots did not make better or more informed health decisions than those who did not. According to Dr. Adam Mahdi, a co-author of the study, a key problem is that users often do not know what information to provide the AI. “Participants often left out important details when talking to the chatbots,” he noted, which “led to advice that was incomplete or unclear.” This highlights a fundamental disconnect: the AI can only respond to the information it is given, and non-expert users are ill-equipped to provide the comprehensive clinical context needed for a safe recommendation.
How AI Advice Creates Confusion
The Oxford experiment also found that AI-generated advice frequently weakened participants’ decision-making. Many struggled to identify serious medical conditions from the information presented, and some even downplayed the potential risks after reading a chatbot’s response. This occurred because the AI’s answers often blended accurate statements with harmful misinformation, making it difficult for a layperson to separate sound advice from dangerous suggestions; unable to tell the two apart, users grew confused, could not determine the appropriate next steps, and made poor health choices.
The Paradox of High Exam Scores
One of the most perplexing aspects of AI performance is the contrast between its success in controlled tests and its failure in practical application. Researchers have noted that large language models can now achieve nearly perfect scores on medical licensing exams, demonstrating command of a vast repository of textbook knowledge. This academic prowess, however, does not translate into reliable clinical guidance. Unlike the questions on a formal exam, real-world patient questions are often ambiguous and lack critical context. The studies found that chatbots are unable to discern a patient’s underlying intent or to discriminate between reliable and unreliable sources of information on the internet. Without this nuanced understanding, the AI cannot apply its knowledge safely.
A Growing Public Health Concern
The widespread use of chatbots for health advice makes these findings particularly urgent. With millions of people turning to AI for quick answers, the risk of misdiagnosis, improper use of medication, and delayed treatment for serious conditions is substantial. The trend is driven by systemic issues in healthcare systems, including the difficulty of scheduling timely appointments and the high cost of professional medical care. This creates a significant demand for accessible alternatives, a role that AI chatbots have quickly filled, despite their technological immaturity in this high-stakes domain.
An Expert Consensus for Caution
In light of the mounting evidence, a strong consensus has emerged among medical experts and researchers: the public should not use AI chatbots as a substitute for professional medical advice. The American Medical Association advises physicians against relying on general-purpose chatbots for making clinical decisions, and even the technology’s creators, such as OpenAI, include disclaimers warning users not to use their tools for medical diagnosis. The studies conclude that despite the potential of AI, it is crucial for patients to continue consulting with human healthcare professionals who can provide accurate, personalized, and safe guidance. Chatbots, the researchers state, are not yet ready for these critical tasks and may not always generate error-free information.