1. Model Introduction
The Qwen3 series comprises the most powerful large language models in the Qwen family to date, featuring advanced capabilities in understanding, reasoning, and agentic applications. This generation delivers comprehensive upgrades across the board:
- Stronger general intelligence: Significant improvements in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Broader multilingual knowledge: Substantial gains in long-tail knowledge coverage across multiple languages.
- More helpful & aligned responses: Markedly better alignment with user preferences in subjective and open-ended tasks, enabling higher-quality, more useful text generation.
- Extended context length: Enhanced capabilities in understanding and reasoning over 256K-token long contexts.
- Stronger agent interaction capabilities: Improved tool use and search-based agent performance.
- Flexible deployment options: Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
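As one example among those methods, a pip-based install (taken from the official guide; the `[all]` extras group pulls in the full set of optional dependencies) looks like this:

```shell
# Install SGLang with all optional dependencies via pip.
pip install --upgrade pip
pip install "sglang[all]"
```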
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Qwen3 series offers models in various sizes and architectures, optimized for different hardware platforms including NVIDIA and AMD GPUs. The recommended launch configuration varies by hardware and model size. Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities.
3.2 Configuration Tips
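As a minimal sketch (not a substitute for the generator above), a plain tensor-parallel launch of the flagship model on an 8-GPU node might look like this; the flag values are illustrative assumptions:

```shell
# Hypothetical 8-GPU tensor-parallel launch; adjust --tp to your GPU count.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tp 8 \
  --host 0.0.0.0 --port 30000
```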
- Memory Management: Set a lower `--context-length` to conserve memory. A value of 128000 is sufficient for most scenarios, down from the default 262K.
- Expert Parallelism: SGLang supports Expert Parallelism (EP) via `--ep`, allowing the experts in MoE models to be distributed across separate GPUs for better throughput. Note that for quantized models, `--ep` must satisfy `(moe_intermediate_size / moe_tp_size) % weight_block_size_n == 0`, where `moe_tp_size` equals `tp_size` divided by `ep_size`. Also note that EP may perform worse in low-concurrency scenarios due to the additional communication overhead. Check out Expert Parallelism Deployment for more details.
- Kernel Tuning: For MoE Triton kernel tuning on your specific hardware, refer to fused_moe_triton.
- Speculative Decoding: Use speculative decoding for latency-sensitive scenarios.
  - `--speculative-algorithm EAGLE3`: Speculative decoding algorithm
  - `--speculative-num-steps 3`: Number of speculative verification rounds
  - `--speculative-eagle-topk 1`: Top-k sampling for draft tokens
  - `--speculative-num-draft-tokens 4`: Number of draft tokens per step
  - `--speculative-draft-model-path`: Path to the draft model weights; either a local folder or a Hugging Face repo ID such as `lmsys/SGLang-EAGLE3-Qwen3-235B-A22B-Instruct-2507-SpecForge-Meituan`
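The EP divisibility constraint from the Expert Parallelism tip above can be sanity-checked with a short helper. The default values below (`moe_intermediate_size = 1536`, `weight_block_size_n = 128`) are assumptions for FP8 block-quantized Qwen3-235B-A22B, not values taken from this document; substitute the ones from your model's config.

```python
def ep_config_is_valid(tp_size: int, ep_size: int,
                       moe_intermediate_size: int = 1536,
                       weight_block_size_n: int = 128) -> bool:
    """Check the EP constraint for quantized MoE models:
    (moe_intermediate_size / moe_tp_size) % weight_block_size_n == 0,
    where moe_tp_size = tp_size // ep_size.
    Default sizes are assumed values for illustration."""
    if tp_size % ep_size != 0:
        return False
    moe_tp_size = tp_size // ep_size
    shard = moe_intermediate_size // moe_tp_size
    return shard % weight_block_size_n == 0

# tp=8, ep=8 -> moe_tp_size=1 -> 1536 % 128 == 0 -> valid
print(ep_config_is_valid(8, 8))  # True
# tp=8, ep=1 -> moe_tp_size=8 -> 192 % 128 != 0 -> invalid
print(ep_config_is_valid(8, 1))  # False
```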
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
Qwen3-235B-A22B supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
4.2.2 Tool Calling
Qwen3 supports tool calling. Enable the tool call parser:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
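The round-trip described above can be sketched as follows. The response fragment, the `get_weather` tool, and the field names are illustrative assumptions in the OpenAI-style shape that SGLang's reasoning and tool-call parsers emit; they are not output captured from a real run.

```python
import json

# Illustrative assistant message in OpenAI-compatible format: the reasoning
# parser has split out `reasoning_content`, and the tool-call parser has
# produced a structured `tool_calls` entry. All values here are assumed.
message = {
    "role": "assistant",
    "reasoning_content": "The user asked about weather, so I should call get_weather.",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"city": "Beijing"}',
            },
        }
    ],
}

# Hypothetical local implementation of the tool.
def get_weather(city: str) -> str:
    return f"Sunny in {city}, 25\u00b0C"

TOOLS = {"get_weather": get_weather}

# Execute each tool call and build the follow-up messages to send back
# so the model can continue the conversation with the tool result.
follow_up = []
for call in message["tool_calls"]:
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    follow_up.append(
        {"role": "tool", "tool_call_id": call["id"], "content": fn(**args)}
    )

print(follow_up[0]["content"])  # Sunny in Beijing, 25°C
```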
5. Benchmark
5.1 Speed Benchmark
Test Environment:- Hardware: NVIDIA B200 GPU (8x)
- Model: Qwen3-235B-A22B-Instruct-2507
- Tensor Parallelism: 8
- SGLang version: 0.5.6
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
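The original command block is not reproduced here; a sketch consistent with the stated test environment (8x B200, TP=8) would be:

```shell
# Hypothetical deployment command matching the test environment above.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tp 8 \
  --port 30000
```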
5.1.1.1 Low Concurrency
- Benchmark Command:
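A hedged sketch using SGLang's bundled serving benchmark is shown below; the input/output lengths and prompt counts are illustrative placeholders, not the values behind the published results.

```shell
# Hypothetical low-concurrency run with SGLang's bench_serving tool;
# lengths and counts are placeholders for illustration only.
python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 16 --max-concurrency 1
```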
- Test Results:
5.1.1.2 Medium Concurrency
- Benchmark Command:
- Test Results:
5.1.1.3 High Concurrency
- Benchmark Command:
- Test Results:
5.1.2 Reasoning Scenario Benchmark
- Model Deployment Command:
5.1.2.1 Low Concurrency
- Benchmark Command:
- Test Results:
5.1.2.2 Medium Concurrency
- Benchmark Command:
- Test Results:
5.1.2.3 High Concurrency
- Benchmark Command:
- Test Results:
5.1.3 Summarization Scenario Benchmark
5.1.3.1 Low Concurrency
- Benchmark Command:
- Test Results:
5.1.3.2 Medium Concurrency
- Benchmark Command:
- Test Results:
5.1.3.3 High Concurrency
- Benchmark Command:
- Test Results:
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
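SGLang ships a few-shot GSM8K evaluation script that runs against an already-launched server; a sketch of invoking it is below, where `--num-questions` and `--parallel` values are placeholders rather than the settings used for the results that follow.

```shell
# Hypothetical GSM8K accuracy run with SGLang's built-in few-shot script;
# question count and parallelism are illustrative placeholders.
python -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --parallel 64
```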
- Results:
- Qwen/Qwen3-235B-A22B-Instruct-2507
