Hugging Face just made its small AI models even smaller (and multimodal)

Hugging Face has released two new additions to the SmolVLM model family. The new compact Vision Language Models – a 256M parameter version and a 500M parameter version – are designed to deliver efficient multimodal AI capabilities while maintaining a small computational footprint.

Core innovations: The new SmolVLM models represent significant architectural improvements over their 2B parameter predecessor, introducing key optimizations for real-world applications.

  • The models now utilize a streamlined 93M parameter SigLIP vision encoder, drastically reduced from the previous 400M version
  • Support for higher-resolution image inputs enables finer-grained visual understanding
  • Tokenization optimizations improve performance in practical applications
  • The training data mixture has been refined to better handle document understanding and image captioning tasks

Technical specifications: Both new models come in multiple variants to support different use cases and deployment scenarios.

  • Four distinct checkpoints are available: base and instruction-tuned versions for both the 256M and 500M parameter models
  • The models maintain compatibility across the transformers, MLX, and ONNX frameworks
  • ColSmolVLM variants have been released specifically for multimodal retrieval applications
  • Existing SmolVLM code bases can be used for inference and fine-tuning with the new models

Deployment flexibility: The models offer multiple implementation paths to accommodate various technical requirements and use cases.

  • Ready-to-use code examples are provided for both transformers and MLX implementations
  • ONNX checkpoints enable broad platform compatibility
  • WebGPU demonstrations showcase the models’ capabilities in browser-based applications
  • Comprehensive documentation includes multimodal RAG (Retrieval-Augmented Generation) examples
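To make the transformers path above concrete, here is a minimal inference sketch. It assumes the instruction-tuned 256M checkpoint is published under the id `HuggingFaceTB/SmolVLM-256M-Instruct` and that it works with the standard `AutoProcessor` / `AutoModelForVision2Seq` classes, as earlier SmolVLM releases did; check the official release post and model card for the exact identifiers.

```python
def build_messages(question: str) -> list:
    """Build the chat-style message list SmolVLM-family processors expect:
    one user turn containing an image placeholder plus the text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


if __name__ == "__main__":
    # Model loading and generation require the transformers and Pillow
    # packages plus a network connection; the model id is an assumption.
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed Hub id
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)

    image = Image.open("example.jpg")  # any local image
    prompt = processor.apply_chat_template(
        build_messages("Describe this image."), add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because the new checkpoints reuse the existing SmolVLM architecture, the same script should only need the model id changed to swap between the 256M and 500M variants.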

Looking ahead: The release of these ultra-compact models marks a significant step toward making multimodal AI more accessible and deployable across a broader range of devices and applications, though their reduced size may present some performance trade-offs compared to larger models.

SmolVLM Grows Smaller – Introducing the 256M & 500M Models!
