The leading AI models just failed ‘Humanity’s Last Exam’ — but could you do any better?

AI models have scored poorly on a new ultra-difficult intelligence benchmark called “Humanity’s Last Exam,” with even the most advanced systems achieving less than 10% accuracy on its challenging questions.

The benchmark’s development: Scale AI and the Center for AI Safety (CAIS) collaborated to create Humanity’s Last Exam, designed to test AI systems at the absolute limits of human expertise and knowledge.

  • The test comprises 3,000 questions contributed by experts from over 500 institutions across 50 countries (a hedged sketch for inspecting the public question set appears after this list)
  • The test was originally titled “Humanity’s Last Stand,” but the name was later softened to “Last Exam”
  • Questions span highly specialized topics requiring deep expertise in fields like biology, linguistics, and mythology
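
For readers who want to look at the questions themselves, the sketch below shows one way to inspect a benchmark’s publicly released question set using the Hugging Face datasets library. The dataset ID cais/hle and the split name are assumptions to verify against the benchmark’s official release, and access may be gated behind the dataset’s terms of use.

```python
# Hedged sketch: inspecting a benchmark's public question set with the
# Hugging Face `datasets` library. The dataset ID "cais/hle" and the "test"
# split are assumptions, not details confirmed by this article.
from datasets import load_dataset

ds = load_dataset("cais/hle", split="test")

print(f"{len(ds)} questions")   # expected to be on the order of 3,000
print(ds.column_names)          # inspect the schema rather than guessing field names
print(ds[0])                    # look at a single question record
```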

Performance metrics: Current AI models score strikingly low on this new benchmark, falling far short of their results on other standard AI tests; a brief sketch of how accuracy figures like these are computed follows the list below.

  • DeepSeek-R1 achieved the highest score at 9.4%
  • Google’s Gemini scored 6.2%
  • Claude 3.5 Sonnet reached 4.3%
  • OpenAI’s GPT-4o managed only 3.3%
  • The results show a stark contrast to AI’s typically strong performance on other benchmarks like GPQA, MATH, and MMLU
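
As for what a score such as 9.4% means mechanically, accuracy on a question-answering benchmark is simply the fraction of questions a model answers correctly. The sketch below illustrates the idea with exact-match scoring; the normalization step and the sample answers are illustrative assumptions, not Humanity’s Last Exam’s official grading harness.

```python
# Minimal sketch: how an accuracy percentage on a question-answering benchmark
# can be computed. The normalization step and the sample data are illustrative
# assumptions, not the official Humanity's Last Exam evaluation pipeline.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting quirks don't count as errors."""
    return " ".join(text.lower().split())

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)

if __name__ == "__main__":
    # Made-up model outputs and reference answers, loosely themed on the sample
    # questions below; they are NOT the real exam answers. 1 of 3 correct -> 33.3%.
    preds = ["four paired tendons", "closed syllable", "Zeus"]
    refs = ["six paired tendons", "open syllable", "Zeus"]
    print(f"Accuracy: {accuracy(preds, refs):.1%}")  # Accuracy: 33.3%
```

Exact matching is the simplest possible scheme; grading free-form expert answers fairly is considerably harder, so treat this purely as an illustration of how an accuracy percentage is derived.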

Sample questions: The exam features extremely complex questions that challenge even human experts in their respective fields.

  • One question involves detailed anatomical knowledge about hummingbird bone structure and tendon pairs
  • Another requires advanced understanding of Biblical Hebrew syllable analysis using the Tiberian pronunciation tradition
  • A third tests knowledge of Greek mythology genealogy

Current implications: The benchmark reveals significant limitations in current AI systems’ reasoning capabilities and specialized knowledge.

  • AI models struggle with questions requiring deep domain expertise
  • The gap between human expert knowledge and AI capabilities remains substantial in specialized fields
  • The test serves as a meaningful measure of AI progress in advanced reasoning tasks

Looking ahead: While current AI models cannot effectively tackle Humanity’s Last Exam, the rapidly evolving nature of AI technology suggests future improvements in performance are likely.

  • OpenAI’s recent release of Operator, its first AI agent, demonstrates ongoing advances in AI capabilities
  • The benchmark provides a clear metric for measuring progress in AI reasoning and specialized knowledge
  • The significant performance gap indicates substantial room for improvement in AI systems

Reading between the lines: This benchmark may provide a more realistic assessment of AI capabilities than previous tests, helping to temper both excessive optimism and unfounded fears about current AI systems’ abilities while establishing a clear marker for measuring future progress.

Could you pass ‘Humanity’s Last Exam’? Probably not, but neither can AI
