The leading AI models just failed ‘Humanity’s Last Exam’ — but could you do any better?

AI models have scored poorly on a new ultra-difficult intelligence benchmark called “Humanity’s Last Exam,” with even the most advanced systems achieving less than 10% accuracy on its challenging questions.

The benchmark’s development: Scale AI and the Center for AI Safety (CAIS) collaborated to create Humanity’s Last Exam, designed to test AI systems at the absolute limits of human expertise and knowledge.

  • The test comprises 3,000 questions contributed by experts from over 500 institutions across 50 countries
  • The benchmark was originally titled “Humanity’s Last Stand” before the name was softened to “Humanity’s Last Exam”
  • Questions span highly specialized topics requiring deep expertise in fields like biology, linguistics, and mythology

Performance metrics: Current AI models have posted notably low scores on the new benchmark, falling far short of their results on standard AI tests.

  • DeepSeek-R1 achieved the highest score at 9.4%
  • Google’s Gemini scored 6.2%
  • Claude 3.5 Sonnet reached 4.3%
  • OpenAI’s GPT-4o managed only 3.3%
  • The results show a stark contrast to AI’s typically strong performance on other benchmarks like GPQA, MATH, and MMLU

Sample questions: The exam features extremely complex questions that challenge even human experts in their respective fields.

  • One question involves detailed anatomical knowledge about hummingbird bone structure and tendon pairs
  • Another requires advanced understanding of Biblical Hebrew syllable analysis using the Tiberian pronunciation tradition
  • A third tests knowledge of Greek mythology genealogy

Current implications: The benchmark reveals significant limitations in current AI systems’ reasoning capabilities and specialized knowledge.

  • AI models struggle with questions requiring deep domain expertise
  • The gap between human expert knowledge and AI capabilities remains substantial in specialized fields
  • The test serves as a meaningful measure of AI progress in advanced reasoning tasks

Looking ahead: While current AI models cannot effectively tackle Humanity’s Last Exam, the rapid pace of AI development suggests scores will improve before long.

  • OpenAI’s recent release of Operator, its first AI agent, demonstrates ongoing advances in AI capabilities
  • The benchmark provides a clear metric for measuring progress in AI reasoning and specialized knowledge
  • The significant performance gap indicates substantial room for improvement in AI systems

Reading between the lines: This benchmark may offer a more realistic gauge of AI capabilities than previous tests, helping to temper both excessive optimism and unfounded fears about current systems while establishing a clear marker for measuring future progress.

Could you pass “Humanity’s Last Exam”? Probably not, but neither can AI
