1. Model Introduction
DeepSeek V3.1 is an advanced Mixture-of-Experts (MoE) large language model developed by DeepSeek, representing a major capability and usability upgrade over DeepSeek V3. As a refined iteration in the DeepSeek V3 family, DeepSeek V3.1 introduces a hybrid reasoning paradigm that supports both fast non-thinking responses and explicit multi-step reasoning, alongside significantly improved tool calling and agentic behavior. The model demonstrates strong performance across reasoning, mathematics, coding, long-context understanding, and real-world agent workflows, benefiting from continued training, alignment optimization, and inference-time refinements. DeepSeek V3.1 is designed to serve as a robust general-purpose foundation model, well suited for conversational AI, structured tool invocation, search-augmented generation, and complex multi-step tasks, while maintaining high efficiency through its sparse MoE architecture. DeepSeek-V3.1-Terminus is an experimental version designed for general conversations and long-context processing. It features hybrid thinking capabilities, allowing you to toggle between “Think” mode for deliberate reasoning and “Non-Think” mode for faster responses. Recommended for general conversations, long-context processing, and experimental use cases.
2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable one based on your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.
3.2 Configuration Tips
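As a concrete configuration sketch, a tensor-parallel launch across 8 GPUs might look like the following. The flags are standard SGLang server arguments; the model path, port, and the `--reasoning-parser` value are assumptions to adapt for your setup, not values taken from this page (`--tool-call-parser deepseekv31` is the parser named later in this guide):

```shell
# Sketch: serve DeepSeek-V3.1-Terminus with tensor parallelism across 8 GPUs.
# Adjust the model path, host, and port for your environment.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1-Terminus \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3
```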
For more detailed configuration tips, please refer to DeepSeek V3/V3.1/R1 Usage.
4. Model Invocation
4.1 Basic Usage
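As a minimal sketch, a chat request against the server's OpenAI-compatible endpoint might look like this (the host and port are assumptions based on a default local deployment):

```shell
# Sketch: basic chat completion against a locally running SGLang server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```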
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
DeepSeek-V3.1 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
4.2.2 Tool Calling
DeepSeek-V3.1 and DeepSeek-V3.1-Terminus support tool calling. To enable it, add --tool-call-parser deepseekv31 to the deployment command.
Note: DeepSeek-V3.1-Speciale does NOT support tool calling; it is designed exclusively for deep reasoning tasks.
Python Example (with Thinking Process):
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
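A sketch of what such a client could look like, using the OpenAI Python SDK against a local SGLang endpoint. The `get_weather` tool, host, and port are illustrative assumptions; the `reasoning_content` field is what the reasoning parser separates out when it is enabled at deployment:

```python
# Sketch: tool calling with the thinking process exposed, via the OpenAI SDK.
# Assumes an SGLang server launched with --tool-call-parser deepseekv31 and a
# reasoning parser enabled; the weather tool is a hypothetical example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
# Thinking section, separated out by the reasoning parser (if enabled).
print(getattr(message, "reasoning_content", None))
# Each tool call carries the function name and JSON-encoded arguments;
# execute the function and send the result back to continue the conversation.
for call in message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```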
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: 8× AMD MI300X GPUs
- Model: DeepSeek-V3.1-Terminus
- Tensor Parallelism: 8
- SGLang Version: 0.5.7
5.1.1 Standard Test Scenarios
Three core scenarios reflect real-world usage patterns:
| Scenario | Input Length | Output Length | Use Case |
|---|---|---|---|
| Chat | 1K | 1K | Most common conversational AI workload |
| Reasoning | 1K | 8K | Long-form generation, complex reasoning tasks |
| Summarization | 8K | 1K | Document summarization, RAG retrieval |
5.1.2 Concurrency Levels
Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off:
- Low Concurrency: --max-concurrency 1 (latency-optimized)
- Medium Concurrency: --max-concurrency 16 (balanced)
- High Concurrency: --max-concurrency 100 (throughput-optimized)
5.1.3 Number of Prompts
For each concurrency level, configure num_prompts to simulate realistic user loads:
- Quick Test: num_prompts = concurrency × 1 (minimal test)
- Recommended: num_prompts = concurrency × 5 (standard benchmark)
- Stable Measurements: num_prompts = concurrency × 10 (production-grade)
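The sizing rules above reduce to simple multiplication; a small helper that generates the full benchmark matrix (the names here are illustrative, not part of any SGLang API):

```python
# Sketch: compute num_prompts for each concurrency level and test profile,
# following the concurrency × multiplier rules above.
MULTIPLIERS = {"quick": 1, "recommended": 5, "stable": 10}
CONCURRENCY_LEVELS = [1, 16, 100]

def benchmark_matrix():
    """Return {(concurrency, profile): num_prompts} for all combinations."""
    return {
        (c, profile): c * m
        for c in CONCURRENCY_LEVELS
        for profile, m in MULTIPLIERS.items()
    }

matrix = benchmark_matrix()
print(matrix[(16, "recommended")])  # 16 × 5 = 80
```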
5.1.4 Benchmark Commands
Scenario 1: Chat (1K/1K) - Most Important
- Model Deployment
- Low Concurrency (Latency-Optimized)
- Medium Concurrency (Balanced)
- High Concurrency (Throughput-Optimized)
Scenario 2: Reasoning (1K/8K)
- Low Concurrency
- Medium Concurrency
- High Concurrency
Scenario 3: Summarization (8K/1K)
- Low Concurrency
- Medium Concurrency
- High Concurrency
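Each entry above maps onto an invocation of SGLang's serving benchmark. As one representative sketch, the Chat scenario at medium concurrency could be run as follows (flag names are based on SGLang's `bench_serving` script; verify them against your installed version):

```shell
# Sketch: Chat scenario (1K input / 1K output) at medium concurrency.
# num_prompts = 16 × 5 = 80, per the recommended sizing rule above.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 16 \
  --num-prompts 80
```

The other scenarios follow the same pattern, swapping the input/output lengths and concurrency values.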
5.1.5 Understanding the Results
Key Metrics:
- Request Throughput (req/s): Number of requests processed per second
- Output Token Throughput (tok/s): Total tokens generated per second
- Mean TTFT (ms): Time to First Token - measures responsiveness
- Mean TPOT (ms): Time Per Output Token - measures generation speed
- Mean ITL (ms): Inter-Token Latency - measures streaming consistency
- 1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
- 1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
- 8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks.
- Variable Concurrency: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.
- Compare your results against baseline numbers for your hardware
- Higher throughput at same latency = better performance
- Lower TTFT = more responsive user experience
- Lower TPOT = faster generation speed
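To make the metric definitions concrete, here is a small sketch that derives TTFT, TPOT, and ITL from per-token arrival timestamps. This is a simplified model of what serving benchmarks report, not SGLang's actual implementation:

```python
# Sketch: derive the three latency metrics from token arrival times.
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT, mean TPOT, and mean ITL (same time unit as the inputs)."""
    # TTFT: delay from request submission to the first token.
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT: average time per output token after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    # ITL: gaps between consecutive tokens; its mean equals TPOT here,
    # but its spread is what reveals streaming (in)consistency.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}

m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(m)  # ttft = 0.5, tpot ≈ 0.1, itl ≈ 0.1
```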
5.2 Accuracy Benchmark
Document model accuracy on standard benchmarks:
5.2.1 GSM8K Benchmark
- Benchmark Command
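One way to run this check is SGLang's built-in few-shot GSM8K script; the module path and flags below are assumptions based on the SGLang repository, so confirm them against your installed version:

```shell
# Sketch: few-shot GSM8K accuracy against a locally running SGLang server.
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --parallel 16
```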
