1. Model Introduction
GLM-4.6 is a powerful language model developed by Zhipu AI, featuring advanced capabilities in reasoning, function calling, and multi-modal understanding. As the latest iteration in the GLM series, GLM-4.6 achieves comprehensive enhancements across multiple domains, including real-world coding, long-context processing, reasoning, searching, writing, and agentic applications. Details are as follows:- Longer context window: The context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks.
- Superior coding performance: The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code and Kilo Code, including improvements in generating visually polished front-end pages.
- Advanced reasoning: GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability.
- More capable agents: GLM-4.6 exhibits stronger performance in tool use and search-based agents, and integrates more effectively within agent frameworks.
- Refined writing: Better aligns with human preferences in style and readability, and performs more naturally in role-playing scenarios.
2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang installation guide for installation instructions.3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities.3.2 Configuration Tips
For more detailed configuration tips, please refer to GLM-4.5/GLM-4.6 Usage.4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:4.2 Advanced Usage
4.2.1 Reasoning Parser
GLM-4.6 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and the content sections:4.2.2 Tool Calling
GLM-4.6 supports tool calling capabilities. Enable the tool call parser:- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.5.1 Speed Benchmark
Test Environment:- Hardware: NVIDIA B200 GPU (8x), AMD MI300X (8x), AMD MI325X (8x), AMD MI355X (8x)
- Model: GLM-4.6
- Tensor Parallelism: 8
- SGLang Version: 0.5.6.post1
5.1.1 Standard Test Scenarios
Three core scenarios reflect real-world usage patterns:| Scenario | Input Length | Output Length | Use Case |
|---|---|---|---|
| Chat | 1K | 1K | Most common conversational AI workload |
| Reasoning | 1K | 8K | Long-form generation, complex reasoning tasks |
| Summarization | 8K | 1K | Document summarization, RAG retrieval |
5.1.2 Concurrency Levels
Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier):- Low Concurrency:
--max-concurrency 1(Latency-optimized) - Medium Concurrency:
--max-concurrency 16(Balanced) - High Concurrency:
--max-concurrency 100(Throughput-optimized)
5.1.3 Number of Prompts
For each concurrency level, configurenum_prompts to simulate realistic user loads:
- Quick Test:
num_prompts = concurrency × 1(minimal test) - Recommended:
num_prompts = concurrency × 5(standard benchmark) - Stable Measurements:
num_prompts = concurrency × 10(production-grade)
5.1.4 Benchmark Commands
Scenario 1: Chat (1K/1K) - Most Important- Model Deployment
- Low Concurrency (Latency-Optimized)
- Medium Concurrency (Balanced)
- High Concurrency (Throughput-Optimized)
- Low Concurrency
- Medium Concurrency
- High Concurrency
- Low
- Medium Concurrency
- High Concurrency
5.1.5 Understanding the Results
Key Metrics:- Request Throughput (req/s): Number of requests processed per second
- Output Token Throughput (tok/s): Total tokens generated per second
- Mean TTFT (ms): Time to First Token - measures responsiveness
- Mean TPOT (ms): Time Per Output Token - measures generation speed
- Mean ITL (ms): Inter-Token Latency - measures streaming consistency
- 1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
- 1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
- 8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks.
- Variable Concurrency: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.
- Compare your results against baseline numbers for your hardware
- Higher throughput at same latency = better performance
- Lower TTFT = more responsive user experience
- Lower TPOT = faster generation speed
5.2 Accuracy Benchmark
Document model accuracy on standard benchmarks:5.2.1 GSM8K Benchmark
- Benchmark Command
- Test Result
