4 tips for getting the most accurate health information from AI

Every day, countless people turn to AI chatbots like Claude, Gemini, and ChatGPT with questions about their health. What they may not realize is that getting an accurate answer is harder than it seems, no matter how confidently the chatbot replies. Three recent studies suggest that large language models may not be as dependable as users expect.

One study, which assessed chatbots' ability to recognize health misinformation, found that they failed frequently under certain conditions. Another study, involving some of the same researchers, found that ChatGPT Health, a specialized health and wellness service launched in January, "under-triaged" just over half of the cases it was given, including urgent situations requiring immediate medical attention.

"I believe that users should exercise a significant level of caution, nearly an excess of caution," Dr. Girish N. Nadkarni, an internist and nephrologist at Mount Sinai who co-authored both studies, said of using chatbots for health questions.

That may come as a surprise to users who have heard that chatbots can ace medical exams, even if they sometimes produce errors outside a test setting. Yet the latest findings point to a thornier, less visible problem: the way humans interact with chatbots, and the way chatbots are designed to please their users, introduces unpredictability. Those factors simply don't come into play when AI is graded on standardized medical questions.

If you want to start, or keep, using a chatbot for health-related questions, experts recommend the following steps when crafting your queries:

1. First, test the model with misinformation or falsehoods.

Nadkarni, an AI health researcher and director of Mount Sinai's Hasso Plattner Institute for Digital Health, says it's essential to quiz the chatbot on medical misinformation or well-established falsehoods before asking it specific health questions. Challenge the chatbot, for example, with a vaccine conspiracy theory: ask whether it agrees that the COVID-19 vaccine contains a microchip that tracks people. Or prompt it with a murkier health debate, such as the safety of fluoride in drinking water. (Researchers have found that extremely high levels of fluoride can be harmful, but experts agree that the standard levels currently in use are safe.) Testing the chatbot with misinformation should give you a telling baseline for judging the accuracy of its other answers, Nadkarni says.


His recent study found that several general-purpose chatbots, including ChatGPT, were inconsistent at detecting misinformation. Their success rates varied with context, such as whether the false claim appeared in a social media post or a medical note, and they frequently stumbled on certain logical fallacies. When misinformation appeared to come from a physician, for instance via a genuine note drawn from an electronic health record, the chatbots were more likely to let the inaccuracies slide. If the chatbot you're consulting agrees with claims you know to be partly or entirely false, Nadkarni suggests not asking it about your own health.

2. Pay attention to the cues and information you give the chatbot.

When Nadkarni and his colleagues evaluated ChatGPT Health earlier this year, they found that the way users described their symptoms could affect the model's accuracy. If a prompt mentioned that friends or family had downplayed the symptoms in question, for example, ChatGPT Health's recommendations tended to follow that framing. In those cases, the chatbot was 11 times more likely to withhold an emergency room referral, even when the symptoms pointed to a critical condition. The findings were published as a peer-reviewed advance paper in Nature Medicine.

OpenAI disputed the findings, arguing that the study's methods didn't reflect how people actually use ChatGPT over multiple turns, sharing details and answering follow-up questions. Karan Singhal, who leads the Health AI team at OpenAI, told Mashable that the company's own benchmarks show GPT-5 models "correctly refer emergency cases nearly 99 percent of the time."

Nadkarni said that while he welcomed the discussion, the critique "missed the point." Even when ChatGPT Health correctly identified the abnormal findings in the data it was given, he noted, it still drew the wrong conclusions. "The problem lies not in missing information but in deriving incorrect conclusions despite having accurate data," Nadkarni told Mashable.

A separate recent study, also published in Nature Medicine but conducted by a different research group, randomly assigned 1,298 participants to present a given medical scenario either to an AI chatbot (GPT-4o, Llama 3, or Command R+) or to a source of their choosing, including Google. When the chatbots were evaluated solely on the scenarios,