Hallucinations spike in OpenAI’s o3 and o4-mini

OpenAI’s newest AI models, o3 and o4-mini, are exhibiting an unexpected and concerning trend: higher hallucination rates than their predecessors. The regression in factual reliability is especially troubling because these models are built for more complex reasoning tasks, and it risks undermining trust among enterprise clients while raising questions about how AI progress is measured. The company has acknowledged the issue in its technical report but admits it doesn’t fully understand the underlying causes.

The hallucination problem: OpenAI’s technical report reveals that the o3 model hallucinated in response to 33% of questions during evaluation, approximately double the rate of previous reasoning models.

  • Both o3 and o4-mini underperformed on PersonQA, a benchmark specifically designed to evaluate hallucination tendencies in AI systems; a minimal scoring sketch follows this list.
  • The company acknowledged in its report that “o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims.”
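To make the headline figure concrete, here is a minimal sketch of how a PersonQA-style hallucination rate could be tallied from graded answers. The data format, labels, and sample questions are assumptions for illustration only; OpenAI has not published its evaluation pipeline.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    label: str  # "correct", "hallucinated", or "abstained" (assumed labels, not OpenAI's)

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of questions on which the model asserted a false answer."""
    if not graded:
        return 0.0
    hallucinated = sum(1 for g in graded if g.label == "hallucinated")
    return hallucinated / len(graded)

# Toy sample: one hallucinated answer out of three questions.
sample = [
    GradedAnswer("Where was the subject born?", "correct"),
    GradedAnswer("What year did they retire?", "hallucinated"),
    GradedAnswer("Who did they collaborate with?", "correct"),
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")  # -> 33%
```

On this toy sample, one hallucinated answer out of three questions yields a rate of roughly 33%, the same order of magnitude as the figure OpenAI reports for o3 on PersonQA.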

Why this matters: The increased hallucination rate runs counter to the expected evolutionary path of AI models, where newer iterations typically demonstrate improved factual reliability alongside other capabilities.

  • Factual accuracy is especially critical for the complex reasoning tasks these models are designed to handle, where the stakes and complexity of use cases are higher.
  • Enterprise customers considering significant financial investments in these advanced models may hesitate if the fundamental reliability issue remains unresolved.

Behind the numbers: OpenAI admits in its technical report that “more research is needed to understand the cause of this result,” suggesting the company is struggling to identify why its newer models are regressing in this specific dimension.

  • The company did note that smaller models generally have less world knowledge and tend to hallucinate more, which may help account for o4-mini’s results but doesn’t fully explain why the larger o3 hallucinates more than earlier versions.

Reading between the lines: The hallucination increase reveals the complex trade-offs inherent in AI development, where optimizing for certain capabilities might inadvertently compromise others.

  • The fact that o3 “makes more claims overall” suggests the model may be calibrated for greater confidence and assertiveness, which unintentionally produces more hallucinated claims alongside more accurate ones (see the toy arithmetic after this list).
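To see why more claims can mean more accurate and more hallucinated claims at once, the toy arithmetic below compares a hypothetical cautious model with a hypothetical assertive one. The claim counts and precision figures are invented for illustration and are not OpenAI’s reported numbers.

```python
def claim_breakdown(total_claims: int, precision: float) -> tuple[int, int]:
    """Split a model's asserted claims into accurate and hallucinated counts."""
    correct = round(total_claims * precision)
    return correct, total_claims - correct

# Hypothetical models: the second asserts twice as many claims at lower precision.
for name, claims, precision in [
    ("cautious model", 10, 0.80),
    ("assertive model", 20, 0.70),
]:
    correct, hallucinated = claim_breakdown(claims, precision)
    print(f"{name}: {correct} accurate claims, {hallucinated} hallucinated claims")
```

In this toy comparison, doubling the number of asserted claims raises the count of accurate claims (8 to 14) even as the count of hallucinated claims triples (2 to 6), which is the pattern OpenAI describes.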

Where we go from here: With both models now released to the public, OpenAI may be hoping that widespread usage and feedback will help identify patterns and potentially resolve the hallucination issue through further training.

  • The company will likely need to address this regression quickly to maintain credibility in the increasingly competitive AI model market.
  • Future reasoning models may require new evaluation frameworks that better balance fluency, helpfulness, and factual reliability.