1. Model Introduction
DeepSeek V3.1 is an advanced Mixture-of-Experts (MoE) large language model developed by DeepSeek, representing a major capability and usability upgrade over DeepSeek V3. As a refined iteration in the DeepSeek V3 family, DeepSeek V3.1 introduces a hybrid reasoning paradigm that supports both fast non-thinking responses and explicit multi-step reasoning, alongside significantly improved tool calling and agentic behavior. The model demonstrates strong performance across reasoning, mathematics, coding, long-context understanding, and real-world agent workflows, benefiting from continued training, alignment optimization, and inference-time refinements. DeepSeek V3.1 is designed to serve as a robust general-purpose foundation model, well suited for conversational AI, structured tool invocation, search-augmented generation, and complex multi-step tasks, while maintaining high efficiency through its sparse MoE architecture. DeepSeek-V3.1-Terminus is an experimental version designed for general conversations and long-context processing. It features hybrid thinking capabilities, allowing you to toggle between “Think” mode for deliberate reasoning and “Non-Think” mode for faster responses. Recommended for general conversations, long-context processing, and experimental use cases.
2. SGLang Installation
SGLang offers multiple installation methods. You can choose the most suitable one based on your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and thinking capabilities.
3.2 Configuration Tips
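As a concrete configuration sketch, a tensor-parallel launch across 8 GPUs might look like the following. The flags are standard SGLang server arguments; the model path, port, and the `--reasoning-parser` value are assumptions to adapt for your setup, not values taken from this page (`--tool-call-parser deepseekv31` is the parser named later in this guide):

```shell
# Sketch: serve DeepSeek-V3.1-Terminus with tensor parallelism across 8 GPUs.
# Adjust the model path, host, and port for your environment.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1-Terminus \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3
```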
For more detailed configuration tips, please refer to DeepSeek V3/V3.1/R1 Usage.
4. Model Invocation
4.1 Basic Usage
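As a minimal sketch, a chat request against the server's OpenAI-compatible endpoint might look like this (the host and port are assumptions based on a default local deployment):

```shell
# Sketch: basic chat completion against a locally running SGLang server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```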
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
DeepSeek-V3.1 supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
4.2.2 Tool Calling
DeepSeek-V3.1 and DeepSeek-V3.1-Terminus support tool calling. To enable it, add --tool-call-parser deepseekv31 to the deployment command.
Note: DeepSeek-V3.1-Speciale does NOT support tool calling; it is designed exclusively for deep reasoning tasks.
Python Example (with Thinking Process):
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
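A sketch of what such a client could look like, using the OpenAI Python SDK against a local SGLang endpoint. The `get_weather` tool, host, and port are illustrative assumptions; the `reasoning_content` field is what the reasoning parser separates out when it is enabled at deployment:

```python
# Sketch: tool calling with the thinking process exposed, via the OpenAI SDK.
# Assumes an SGLang server launched with --tool-call-parser deepseekv31 and a
# reasoning parser enabled; the weather tool is a hypothetical example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
# Thinking section, separated out by the reasoning parser (if enabled).
print(getattr(message, "reasoning_content", None))
# Each tool call carries the function name and JSON-encoded arguments;
# execute the function and send the result back to continue the conversation.
for call in message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```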
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: 8× AMD MI300X GPUs
- Model: DeepSeek-V3.1-Terminus
- Tensor Parallelism: 8
- SGLang Version: 0.5.7
5.1.1 Standard Test Scenarios
Three core scenarios reflect real-world usage patterns:
| Scenario | Input Length | Output Length | Use Case |
|---|---|---|---|
| Chat | 1K | 1K | Most common conversational AI workload |
| Reasoning | 1K | 8K | Long-form generation, complex reasoning tasks |
| Summarization | 8K | 1K | Document summarization, RAG retrieval |
5.1.2 Concurrency Levels
Test each scenario at different concurrency levels to capture the throughput vs. latency trade-off:
- Low Concurrency: --max-concurrency 1 (latency-optimized)
- Medium Concurrency: --max-concurrency 16 (balanced)
- High Concurrency: --max-concurrency 100 (throughput-optimized)
5.1.3 Number of Prompts
For each concurrency level, configure num_prompts to simulate realistic user loads:
- Quick Test: num_prompts = concurrency × 1 (minimal test)
- Recommended: num_prompts = concurrency × 5 (standard benchmark)
- Stable Measurements: num_prompts = concurrency × 10 (production-grade)
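The sizing rules above reduce to simple multiplication; a small helper that generates the full benchmark matrix (the names here are illustrative, not part of any SGLang API):

```python
# Sketch: compute num_prompts for each concurrency level and test profile,
# following the concurrency × multiplier rules above.
MULTIPLIERS = {"quick": 1, "recommended": 5, "stable": 10}
CONCURRENCY_LEVELS = [1, 16, 100]

def benchmark_matrix():
    """Return {(concurrency, profile): num_prompts} for all combinations."""
    return {
        (c, profile): c * m
        for c in CONCURRENCY_LEVELS
        for profile, m in MULTIPLIERS.items()
    }

matrix = benchmark_matrix()
print(matrix[(16, "recommended")])  # 16 × 5 = 80
```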
5.1.4 Benchmark Commands
Scenario 1: Chat (1K/1K) - Most Important
- Model Deployment
- Low Concurrency (Latency-Optimized)
- Medium Concurrency (Balanced)
- High Concurrency (Throughput-Optimized)
Scenario 2: Reasoning (1K/8K)
- Low Concurrency
- Medium Concurrency
- High Concurrency
Scenario 3: Summarization (8K/1K)
- Low Concurrency
- Medium Concurrency
- High Concurrency
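Each entry above maps onto an invocation of SGLang's serving benchmark. As one representative sketch, the Chat scenario at medium concurrency could be run as follows (flag names are based on SGLang's `bench_serving` script; verify them against your installed version):

```shell
# Sketch: Chat scenario (1K input / 1K output) at medium concurrency.
# num_prompts = 16 × 5 = 80, per the recommended sizing rule above.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 16 \
  --num-prompts 80
```

The other scenarios follow the same pattern, swapping the input/output lengths and concurrency values.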
5.1.5 Understanding the Results
Key Metrics:
- Request Throughput (req/s): Number of requests processed per second
- Output Token Throughput (tok/s): Total tokens generated per second
- Mean TTFT (ms): Time to First Token - measures responsiveness
- Mean TPOT (ms): Time Per Output Token - measures generation speed
- Mean ITL (ms): Inter-Token Latency - measures streaming consistency
- 1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
- 1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
- 8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks.
- Variable Concurrency: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.
- Compare your results against baseline numbers for your hardware
- Higher throughput at same latency = better performance
- Lower TTFT = more responsive user experience
- Lower TPOT = faster generation speed
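To make the metric definitions concrete, here is a small sketch that derives TTFT, TPOT, and ITL from per-token arrival timestamps. This is a simplified model of what serving benchmarks report, not SGLang's actual implementation:

```python
# Sketch: derive the three latency metrics from token arrival times.
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT, mean TPOT, and mean ITL (same time unit as the inputs)."""
    # TTFT: delay from request submission to the first token.
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT: average time per output token after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    # ITL: gaps between consecutive tokens; its mean equals TPOT here,
    # but its spread is what reveals streaming (in)consistency.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}

m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(m)  # ttft = 0.5, tpot ≈ 0.1, itl ≈ 0.1
```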
5.2 Accuracy Benchmark
Document model accuracy on standard benchmarks:
5.2.1 GSM8K Benchmark
- Benchmark Command
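One way to run this check is SGLang's built-in few-shot GSM8K script; the module path and flags below are assumptions based on the SGLang repository, so confirm them against your installed version:

```shell
# Sketch: few-shot GSM8K accuracy against a locally running SGLang server.
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --parallel 16
```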
