1. Model Introduction
The GLM-4.6V series includes two models: GLM-4.6V (106B), a foundation model designed for cloud and high-performance cluster scenarios, and GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications. GLM-4.6V scales its context window to 128K tokens during training and achieves SoTA performance in visual understanding among models of similar parameter scales. Crucially, the GLM team has integrated native Function Calling capabilities for the first time. This effectively bridges the gap between “visual perception” and “executable action”, providing a unified technical foundation for multimodal agents in real-world business scenarios. GLM-4.6V introduces several key features:
- Native Multimodal Function Calling: Enables native vision-driven tool use. Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution.
- Interleaved Image-Text Content Generation: Supports high-quality mixed-media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context—spanning documents, user inputs, and tool-retrieved images—and synthesizes coherent, interleaved image-text content tailored to the task. During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.
- Multimodal Document Understanding: GLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without requiring prior conversion to plain text.
- Frontend Replication & Visual Editing: Reconstructs pixel-accurate HTML/CSS from UI screenshots and supports natural-language-driven edits. It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions.
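To illustrate the native multimodal function-calling feature, the sketch below builds an OpenAI-compatible request that passes a screenshot together with a tool definition. The tool name (`open_url`), the image URL, and the model path `zai-org/GLM-4.6V` are illustrative assumptions, not values taken from the official docs.

```python
import json

# Hypothetical tool the model may decide to call after inspecting the image.
tools = [{
    "type": "function",
    "function": {
        "name": "open_url",  # assumed tool name, for illustration only
        "description": "Open a URL found in the screenshot",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

# Multimodal message: the image is passed directly, with no text conversion.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/screenshot.png"}},
        {"type": "text",
         "text": "Find the documentation link in this screenshot and open it."},
    ],
}]

payload = {"model": "zai-org/GLM-4.6V", "messages": messages, "tools": tools}
print(json.dumps(payload, indent=2))
```

In practice, this payload would be sent with the `openai` Python client (`client.chat.completions.create(**payload)`) pointed at a locally served SGLang endpoint such as `http://localhost:30000/v1`.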
2. SGLang Installation
SGLang offers multiple installation methods. Choose the one that best suits your hardware platform and requirements.
2.1 Docker Installation (Recommended)
- Ready to use out of the box, no manual environment configuration needed
- Avoids dependency conflict issues
- Easy to migrate between different environments
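As a minimal sketch of a Docker launch, assuming the official `lmsysorg/sglang` image and the `zai-org/GLM-4.6V` Hugging Face model path (verify both against the current SGLang documentation):

```shell
# Pull the SGLang image and start an OpenAI-compatible server on port 30000.
# The image tag, model path, and memory settings are assumptions; adjust for
# your environment.
docker run --gpus all --ipc=host --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path zai-org/GLM-4.6V \
        --host 0.0.0.0 --port 30000
```

Mounting the Hugging Face cache directory avoids re-downloading model weights on every container restart.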
2.2 Build from Source
If you need to use the latest development version or require custom modifications, you can build from source. This path is suitable if you:
- Need to customize and modify the SGLang source code
- Want to use the latest development features
- Participate in SGLang project development
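A typical from-source install follows the upstream SGLang repository; the extras name `python[all]` reflects the upstream README at the time of writing and may change:

```shell
# Clone the SGLang repository and install it in editable mode with all extras.
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
```

An editable install lets local source changes take effect without reinstalling, which is what the customization and development use cases above require.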
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the interactive configuration generator below to customize your deployment settings. Select your hardware platform, model size, quantization method, and other options to generate the appropriate launch command.
3.2 Configuration Tips
- TTFT Optimization: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory (proportional to image size × number of images in the currently running requests) and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`.
- TP=8 Configuration: When using Tensor Parallelism (TP) of 8, the vision attention’s 12 heads cannot be evenly divided. You can resolve this by adding `--mm-enable-dp-encoder` (which the generator above handles automatically).
- Fast Model Loading: For large models (like the 106B version), you can speed up model loading by using `--model-loader-extra-config='{"enable_multithread_load": "true", "num_threads": 64}'`.
- For more detailed configuration tips, please refer to GLM-4.5V/GLM-4.6V Usage.
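Putting these tips together, a hypothetical TP=8 launch of the 106B model might look like the following; the memory fraction and thread count are illustrative values to tune for your hardware:

```shell
# Illustrative launch command combining the configuration tips above.
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.6V \
    --tp 8 \
    --mm-enable-dp-encoder \
    --mem-fraction-static 0.8 \
    --model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}' \
    --host 0.0.0.0 --port 30000
```

Note that `--mm-enable-dp-encoder` is required here because 12 vision attention heads do not divide evenly across 8 GPUs.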
