LLM benchmark compares Phi-4, Qwen2 VL 72B Instruct, and Aya Expanse 32B, with some unexpected results

A new round of language model benchmarking reveals updated performance metrics for several AI models including Phi-4 variants, Qwen2 VL 72B Instruct, and Aya Expanse 32B using the MMLU-Pro Computer Science benchmark.

Benchmark methodology and scope: The MMLU-Pro Computer Science benchmark evaluates AI models through 410 multiple-choice questions with 10 options each, focusing on complex reasoning rather than just factual recall.

  • Testing was conducted over 103 hours with multiple runs per model to ensure consistency and measure performance variability
  • Results are displayed with error bars showing the standard deviation across test runs (a minimal scoring sketch follows this list)
  • The benchmark was limited to computer science topics to maintain practical testing timeframes while ensuring relevance for real-world applications
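
The repeated-run scoring described above can be illustrated with a short sketch. The harness and prompt format are not shown in the source, so `query_model()` and the question dictionaries below are hypothetical placeholders; only the shape of the calculation (accuracy per run, then mean and standard deviation across runs) is implied by the article.

```python
# Minimal sketch of repeated-run scoring with error bars.
# query_model() and the question format are hypothetical placeholders;
# the actual MMLU-Pro CS harness is not described in the source.
import statistics

NUM_RUNS = 4  # multiple passes per model to capture variability

def score_single_run(query_model, questions):
    """Score one pass over the 410 ten-option multiple-choice questions."""
    correct = 0
    for q in questions:
        predicted = query_model(q["prompt"])  # model's chosen option, e.g. "C"
        if predicted == q["answer"]:
            correct += 1
    return correct / len(questions)

def score_with_error_bars(query_model, questions, runs=NUM_RUNS):
    """Repeat the benchmark and report mean accuracy with standard deviation."""
    accuracies = [score_single_run(query_model, questions) for _ in range(runs)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```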

Key findings on new models: Recent testing revealed varied performance across several new AI models, with some showing unexpected results.

  • Microsoft’s Phi-4 and its variants demonstrated comparable performance, with the GGUF version showing slightly higher accuracy
  • Temperature settings significantly impacted Phi-4’s performance, with the best results at moderate settings (see the temperature-sweep sketch after this list)
  • Qwen2 VL 72B Instruct showed lower than expected scores, suggesting room for improvement in future versions
  • Aya Expanse 32B, while scoring above 50%, ranked lowest among included models but offers valuable multilingual capabilities
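
The temperature sensitivity noted for Phi-4 suggests running the same benchmark at several settings and comparing mean accuracy. The sketch below is purely illustrative: `run_benchmark` stands in for the harness sketched earlier, the temperature values are assumptions, and the source does not state which settings were actually tested.

```python
# Hypothetical temperature sweep: re-run the benchmark at each setting
# and compare mean accuracy. The temperatures and the stand-in function
# are illustrative assumptions, not values from the source.
from typing import Callable, Dict

def sweep_temperatures(run_benchmark: Callable[[float], float],
                       temperatures=(0.0, 0.3, 0.7, 1.0)) -> Dict[float, float]:
    """Run the same benchmark at each temperature and collect mean accuracy."""
    return {t: run_benchmark(t) for t in temperatures}

if __name__ == "__main__":
    # Stand-in benchmark function for demonstration only; real accuracies
    # would come from the MMLU-Pro CS harness.
    stand_in = lambda temp: 0.70 - 0.05 * temp
    for temp, acc in sweep_temperatures(stand_in).items():
        print(f"temperature={temp:.1f}  mean accuracy={acc:.3f}")
```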

Technical implementation details: The benchmark presentation included innovative visualization techniques to better represent model characteristics.

  • Models were visualized using 3D bars showing MMLU scores, parameter counts, and memory efficiency (a plotting sketch follows this list)
  • For quantized models, bar sections were color-coded to show memory savings compared to full-precision models
  • Multiple evaluation runs were conducted for key models like Claude, Gemini-1.5-pro-002, and Athene-V2-Chat
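
One way such a chart could be built is with matplotlib's 3D bar plotting. The simplified sketch below encodes only two of the three dimensions mentioned (score and parameter count, omitting memory efficiency); the parameter counts are the public model sizes, but the accuracy values are placeholders, not the actual benchmark results, and the author's own plotting code is not given in the source.

```python
# Simplified sketch of a 3D bar chart over models, assuming matplotlib.
# Accuracy values below are illustrative placeholders only.
import matplotlib.pyplot as plt
import numpy as np

models = ["Phi-4", "Qwen2 VL 72B", "Aya Expanse 32B"]
params_b = np.array([14, 72, 32], dtype=float)   # parameters in billions
scores = np.array([0.70, 0.60, 0.55])            # placeholder MMLU-Pro CS accuracy

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

x = np.arange(len(models), dtype=float)  # one slot per model along x
y = params_b                             # parameter count along y
z = np.zeros(len(models))                # bars rise from the floor of the plot

ax.bar3d(x, y, z, dx=0.6, dy=4.0, dz=scores)
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Parameters (B)")
ax.set_zlabel("Accuracy")
plt.show()
```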

Performance nuances: Testing revealed interesting characteristics specific to certain models.

  • Phi-4 showed improved German language capabilities despite its smaller size
  • Basic prompt engineering could bypass censorship restrictions in tested models
  • Model consistency varied significantly across different temperature settings

Looking ahead: The benchmark results suggest several trends in AI model development and evaluation methodology that warrant attention in future tests.

  • Multiple test runs are increasingly important for establishing reliable performance metrics
  • Balancing comprehensive testing with practical time constraints remains a key challenge
  • Future releases, particularly in the Qwen series, may significantly alter the current performance landscape
Source: 🐺🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark
