1. Model Introduction
GLM-4.5V is a state-of-the-art multimodal vision-language model from ZhipuAI, built on the next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It achieves SOTA performance among models of the same scale across 42 public vision-language benchmarks. Through efficient hybrid training, GLM-4.5V focuses on real-world usability and enables full-spectrum vision reasoning across diverse visual content types.

Hardware Support: NVIDIA B200/H100/H200, AMD MI300X/MI325X/MI355X

GLM-4.5V introduces several key features:
- Image Reasoning & Grounding: Scene understanding, complex multi-image analysis, and spatial recognition with precise visual element localization. Supports bounding box predictions with normalized coordinates (0-1000) for accurate object detection.
- Video Understanding: Long video segmentation and event recognition, supporting comprehensive temporal analysis across extended video sequences.
- GUI Agent Tasks: Screen reading, icon recognition, and desktop operation assistance for agent-based applications. Enables natural interaction with graphical user interfaces.
- Complex Chart & Long Document Parsing: Research report analysis and information extraction from documents with text, charts, tables, and figures. Processes up to 64K tokens of multimodal context.
- Thinking Mode Switch: Allows users to balance quick responses against deep reasoning. Chain-of-Thought reasoning can be enabled or disabled per task for improved accuracy and interpretability.
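As a concrete illustration of the normalized (0-1000) grounding convention mentioned above, the sketch below converts a predicted box to pixel coordinates; the box values and image size are made up for the example.

```python
def box_to_pixels(box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box in 0-1000 coordinates
    (GLM-4.5V's grounding convention) to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width // 1000, y1 * height // 1000,
            x2 * width // 1000, y2 * height // 1000)

# Hypothetical model prediction on a 1920x1080 image.
print(box_to_pixels([100, 250, 500, 750], 1920, 1080))  # → (192, 270, 960, 810)
```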
2. SGLang Installation
SGLang offers multiple installation methods. Choose the one that best suits your hardware platform and requirements, and refer to the official SGLang installation guide for instructions.

3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
The GLM-4.5V series offers models in various sizes and architectures, optimized for different hardware platforms. The recommended launch configuration varies by hardware and model size.

Interactive Command Generator: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command.

3.2 Configuration Tips
- TTFT Optimization: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory (proportional to image size × the number of images in currently running requests) and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`.
- TP=8 Configuration: When using tensor parallelism (TP) of 8, the vision attention's 12 heads cannot be divided evenly. Resolve this by adding `--mm-enable-dp-encoder`.
- Fast Model Loading: For large models (such as the 106B version), speed up model loading with `--model-loader-extra-config='{"enable_multithread_load": "true", "num_threads": 64}'`.
- For more detailed configuration tips, refer to GLM-4.5V/GLM-4.6V Usage.
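The tips above can be combined into a single server launch. A sketch that assembles the command in Python for clarity; the model path, TP size, memory fraction, and port are illustrative values to adjust for your setup:

```python
import os
import subprocess  # used if you uncomment the Popen call below

# CUDA IPC transport for multimodal features (improves TTFT, costs extra memory).
env = {**os.environ, "SGLANG_USE_CUDA_IPC_TRANSPORT": "1"}

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "zai-org/GLM-4.5V",   # illustrative model path
    "--tp-size", "8",
    "--mm-enable-dp-encoder",             # needed at TP=8 (12 vision heads)
    "--model-loader-extra-config",
    '{"enable_multithread_load": "true", "num_threads": 64}',
    "--mem-fraction-static", "0.8",       # illustrative; tune for your GPUs
    "--port", "30000",
]
print(" ".join(cmd))
# subprocess.Popen(cmd, env=env)  # uncomment to actually start the server
```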
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to the SGLang documentation.

4.2 Advanced Usage
4.2.1 Multi-Modal Inputs
GLM-4.5V supports both image and video inputs. Notes for multimodal requests:
- For video processing, ensure you have sufficient context length configured (up to 64K tokens).
- Video processing may require more memory; adjust `--mem-fraction-static` accordingly.
- You can also provide local file paths using the `file://` protocol.
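A minimal image-input request payload in the OpenAI-compatible chat format; the model name, image URL, and prompt are placeholders for illustration, and the payload would be POSTed to the server's `/v1/chat/completions` endpoint:

```python
import json

# OpenAI-compatible chat payload with one image part (placeholders throughout).
payload = {
    "model": "zai-org/GLM-4.5V",
    "messages": [{
        "role": "user",
        "content": [
            # A local file can be referenced as "file:///path/to/image.png".
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
}
print(json.dumps(payload, indent=2))
# POST to http://<host>:<port>/v1/chat/completions to get a completion.
```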
4.2.2 Thinking Mode
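Thinking mode can be toggled per request through the chat template. A minimal payload sketch, assuming the `enable_thinking` key from GLM-4.5's chat template (verify the key name against your model and SGLang version):

```python
import json

# Per-request thinking toggle; "enable_thinking" is assumed from GLM-4.5's
# chat template and should be checked against your deployment.
payload = {
    "model": "zai-org/GLM-4.5V",
    "messages": [{"role": "user", "content": "How many primes are below 50?"}],
    "chat_template_kwargs": {"enable_thinking": True},  # False → quick response
}
print(json.dumps(payload))
```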
GLM-4.5V supports thinking mode for enhanced reasoning. Enable thinking mode when deploying the model.

4.2.3 Tool Calling
GLM-4.5V supports tool calling. Enable the tool call parser at launch. Notes:
- The reasoning parser shows how the model decides to use a tool.
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
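The flow above can be sketched with a minimal tool-calling payload in the OpenAI function-calling format; the `get_weather` tool is a made-up example, and the model name is a placeholder:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "zai-org/GLM-4.5V",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}
print(json.dumps(payload))
# When the model returns a tool call, execute get_weather(...) and append the
# result as a {"role": "tool", ...} message to continue the conversation.
```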
5. Benchmark
5.1 Accuracy Benchmark
This section documents model accuracy on standard benchmarks.

5.1.1 MMMU Benchmark
- Benchmark Command
- Test Result
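The exact benchmark command and results are not reproduced here. As a generic illustration of how a score such as MMMU accuracy is computed, exact-match accuracy over predicted and reference answers is:

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Made-up multiple-choice outputs, for illustration only.
print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # → 0.75
```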
