OpenAI Asserts GPT-5 Has Fewer Hallucinations — What Do the Statistics Indicate?


OpenAI has officially introduced GPT-5, a faster and more capable AI model that now powers ChatGPT. The company touts state-of-the-art performance in math, coding, writing, and health advice, and says GPT-5's hallucination rates are lower than those of earlier models.

GPT-5 makes erroneous claims 9.6 percent of the time, compared with 12.9 percent for GPT-4o. According to the GPT-5 system card, the new model’s hallucination rate is 26 percent lower than GPT-4o’s. GPT-5 also produced 44 percent fewer responses containing “at least one major factual error.”
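
As a quick sanity check on that 26 percent figure, here is the arithmetic using the two rates quoted above (a rough sketch based on the article's numbers, not OpenAI's own methodology):

```python
# Back-of-the-envelope check of the reported relative reduction in hallucination rate.
# Rates are the percentages quoted in this article, not raw system-card data.
gpt5_rate = 9.6     # percent of GPT-5 responses with erroneous claims
gpt4o_rate = 12.9   # percent of GPT-4o responses with erroneous claims

relative_reduction = (gpt4o_rate - gpt5_rate) / gpt4o_rate
print(f"Relative reduction: {relative_reduction:.1%}")  # 25.6%, which rounds to the quoted 26 percent
```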

Although this is progress, it still means roughly one in ten GPT-5 responses may contain a hallucination. That is concerning, particularly since OpenAI highlighted healthcare as a promising use case for the new model.

How GPT-5 minimizes hallucinations

Hallucinations are a persistent challenge for AI researchers. Large language models (LLMs) are trained to predict the most likely next word based on vast amounts of training data, which can lead them to confidently generate sentences that are incorrect or nonsensical. You might assume that as models improve with better data, training, and compute, the hallucination rate would drop. Yet OpenAI's launch of its reasoning models o3 and o4-mini revealed a troubling pattern that even its own researchers could not fully explain: they hallucinated more than the earlier models o1, GPT-4o, and GPT-4.5. Some researchers argue that hallucinations are an inherent feature of LLMs rather than a problem that can be solved.

That said, GPT-5 hallucinates less than prior models, according to its system card. OpenAI evaluated GPT-5 and a version of GPT-5 with added reasoning, called GPT-5-thinking, against its reasoning model o3 and the more traditional GPT-4o. These evaluations gave the models web access; models are generally more accurate when they can pull answers from reliable online data rather than relying solely on their training data. Here are the hallucination rates when the models can browse the web:

– GPT-5: 9.6 percent
– GPT-5-thinking: 4.5 percent
– o3: 12.7 percent
– GPT-4o: 12.9 percent

In the system card, OpenAI also tested versions of GPT-5 on more open-ended, complex prompts. Here, GPT-5 with reasoning hallucinated considerably less than the earlier reasoning models o3 and o4-mini. Reasoning models are supposed to be more accurate and less prone to hallucination because they spend more compute working through a query, which is why the hallucination rates of o3 and o4-mini were so puzzling.

Overall, GPT-5 performs well when it is connected to the web. However, another evaluation tells a different story. OpenAI tested GPT-5 on its in-house benchmark SimpleQA, which consists of “fact-seeking questions with short answers that gauge model accuracy for attempted responses,” as the system card describes it. For this test, GPT-5 did not have web access, and the results show it: hallucination rates jumped significantly.

– GPT-5 main: 47 percent
– GPT-5-thinking: 40 percent
– o3: 46 percent
– GPT-4o: 52 percent

GPT-5-thinking slightly outperformed o3, while standard GPT-5 hallucinated one percentage point more than o3 and a few percentage points less than GPT-4o. To be fair, hallucination rates on SimpleQA are high across all the models, but that is not much reassurance. Users without web search will see much higher rates of hallucination and inaccuracy. So if you are using ChatGPT for anything important, make sure it is searching the web, or do the web search yourself.

Users quickly identified hallucinations in GPT-5

Despite the reported overall reduction in error rates, one of the launch demos contained a notable mistake. Beth Barnes, founder and CEO of the AI research nonprofit METR, spotted an error in a GPT-5 demo explaining how airplanes fly. According to Barnes, GPT-5 repeated a common misconception about the Bernoulli effect, which describes how air flows around an airplane's wings. Without getting into the finer points of aerodynamics, GPT-5's explanation is incorrect.