| Qwen-VL | Qwen/Qwen3-VL-235B-A22B-Instruct | Alibaba’s vision-language extension of the Qwen family; variants from Qwen2.5-VL (7B and larger) through Qwen3-VL can analyze and converse about image content. | |
| DeepSeek-VL2 | deepseek-ai/deepseek-vl2 | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | |
| Janus-Pro (1B, 7B) | deepseek-ai/Janus-Pro-7B | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro decouples visual encoding into separate paths for understanding and generation, improving performance on both tasks. | |
| MiniCPM-V / MiniCPM-o | openbmb/MiniCPM-V-2_6 | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for on-device deployment on mobile and edge hardware. | |
| Llama 3.2 Vision (11B) | meta-llama/Llama-3.2-11B-Vision-Instruct | Vision-enabled variant of Llama 3.2 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | |
| LLaVA (v1.5 & v1.6) | e.g. liuhaotian/llava-v1.5-13b | Open vision-chat models that add an image encoder to a LLaMA/Vicuna backbone (e.g. LLaMA-2 13B) to follow multimodal instructions. | |
| LLaVA-NeXT (8B, 72B) | lmms-lab/llava-next-72b | Improved LLaVA models (an 8B Llama 3-based version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | |
| LLaVA-OneVision | lmms-lab/llava-onevision-qwen2-7b-ov | Enhanced LLaVA variant with Qwen2 as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format (see the query example after this table). | |
| Gemma 3 (Multimodal) | google/gemma-3-4b-it | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | |
| Kimi-VL (A3B) | moonshotai/Kimi-VL-A3B-Instruct | Moonshot AI’s multimodal model that understands images and generates text from combined image and text inputs. | |
| Mistral-Small-3.1-24B | mistralai/Mistral-Small-3.1-24B-Instruct-2503 | Mistral Small 3.1 is a multimodal model that generates text from text and/or image inputs. It also supports tool calling and structured output. | |
| Phi-4-multimodal-instruct | microsoft/Phi-4-multimodal-instruct | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. | |
| MiMo-VL (7B) | XiaomiMiMo/MiMo-VL-7B-RL | Xiaomi’s compact yet powerful vision-language model featuring a native-resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. | |
| GLM-4.5V (106B) / GLM-4.1V (9B) | zai-org/GLM-4.5V | GLM-4.5V and GLM-4.1V-Thinking: vision-language models aimed at versatile multimodal reasoning, trained with scalable reinforcement learning. | Use --chat-template glm-4v |
| GLM-OCR | zai-org/GLM-OCR | A fast and accurate general-purpose OCR model. | |
| DotsVLM (General/OCR) | rednote-hilab/dots.vlm1.inst | RedNote’s vision-language model pairing a 1.2B NaViT vision encoder (trained from scratch, with dynamic resolution support) with the DeepSeek V3 LLM; OCR capabilities are strengthened through training on structured image data. | |
| DotsVLM-OCR | rednote-hilab/dots.ocr | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Don’t use --trust-remote-code |
| NVILA (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | Efficient-Large-Model/NVILA-8B | NVILA explores the full-stack efficiency of multimodal design, achieving cheaper training, faster deployment, and better performance. | Use --chat-template chatml |
| NVIDIA Nemotron Nano 2.0 VL | nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 | NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A, and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, to achieve higher inference throughput in long-document and video scenarios. | Use --trust-remote-code. You may need to adjust --max-mamba-cache-size (default: 512) to fit memory constraints. |
| Ernie4.5-VL | baidu/ERNIE-4.5-VL-28B-A3B-PT | Baidu’s vision-language models (28B and 424B variants). They support image and video comprehension, as well as thinking mode. | |
| JetVLM | | JetVLM is a vision-language model built on Jet-Nemotron, designed for high-performance multimodal understanding and generation. | Coming soon |
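
As the LLaVA-OneVision row notes, these models are typically queried through an OpenAI Vision API-compatible chat format once served. Below is a minimal query sketch in Python; it assumes an SGLang server is already running on `localhost:30000` (e.g. launched with `python -m sglang.launch_server --model-path <model-id>`, plus any per-model flags from the notes column), and the image URL and prompt are illustrative placeholders.

```python
# Minimal sketch: query a served vision-language model through the
# OpenAI-compatible /v1 endpoint. Assumes a server is already running on
# localhost:30000; the image URL and prompt below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # the model loaded at launch is served under this name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

For models that accept multiple images or video frames (e.g. LLaVA-OneVision), additional `image_url` entries can be appended to the same `content` list.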