GLM-4.7-Flash - SGLang Cookbook

1. Model Introduction

GLM-4.7-Flash is a lightweight, high-speed model in the GLM-4.7 series developed by Zhipu AI, with a focus on reasoning, function calling, and efficient local deployment. It is a 30B-A3B MoE model designed to balance performance and efficiency:

Lightweight Architecture: 30B total parameters with only 3B active parameters, enabling efficient inference
Enhanced Reasoning: Inherits the reasoning capabilities from GLM-4.7 with optimized performance
Superior Coding: Strong code generation and understanding capabilities
Advanced Tool Use: Robust tool calling and agent capabilities for complex workflows
Optimized for Local Deployment: Designed for single-GPU deployment scenarios

For more details, please refer to the official GLM-4.7 documentation.

Key Features:

Efficient MoE Architecture: 30B-A3B sparse activation for optimal performance/efficiency trade-off
Multiple Quantizations: BF16 and FP8 variants for different performance/memory trade-offs
Hardware Optimization: Specifically tuned for NVIDIA H100/H200/B200 GPUs
High Performance: Optimized for both throughput and latency scenarios
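
As a rough illustration of why the 30B-A3B layout suits a single GPU, the arithmetic below estimates weight memory for the two precisions listed above. It is a back-of-the-envelope sketch: the function name is hypothetical, parameter counts are rounded to 30B total and 3B active as stated above, and KV cache, activations, and runtime overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_param


TOTAL_B = 30.0   # total parameters in billions (from the model description)
ACTIVE_B = 3.0   # parameters activated per token, in billions

bf16_gb = weight_memory_gb(TOTAL_B, 2.0)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_memory_gb(TOTAL_B, 1.0)   # FP8 stores 1 byte per parameter
active_fraction = ACTIVE_B / TOTAL_B

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Active per token: {active_fraction:.0%} of all parameters")
```

All 30B parameters still have to reside in GPU memory; the 3B active figure reduces compute per token, not the weight footprint.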

Available Models:

BF16 (Full precision): zai-org/GLM-4.7-Flash

License: Please refer to the official GLM-4.7-Flash model card for license details.

2. SGLang Installation

SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements. Please refer to the official SGLang installation guide for installation instructions.

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

Interactive Command Generator: The online version of this cookbook provides a configuration selector that generates the appropriate deployment command for your hardware platform, quantization method, deployment strategy, and thinking capabilities.

3.2 Configuration Tips

For more detailed configuration tips, please refer to GLM-4.7 Usage.

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to:

SGLang Basic Usage Guide

4.2 Advanced Usage

4.2.1 Reasoning Parser

GLM-4.7-Flash supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and the content sections:

python -m sglang.launch_server \
  --model zai-org/GLM-4.7-Flash \
  --reasoning-parser glm45 \
  --attention-backend triton \
  --tp 1 \
  --host 0.0.0.0 \
  --port 8000

Streaming with Thinking Process:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

=============== Thinking =================
To solve this problem, I need to calculate 15% of 240.
Step 1: Convert 15% to decimal: 15% = 0.15
Step 2: Multiply 240 by 0.15
Step 3: 240 × 0.15 = 36
=============== Content =================

The answer is 36. To find 15% of 240, we multiply 240 by 0.15, which equals 36.

Note: The reasoning parser captures the model’s step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
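
For non-streaming requests the same split can be read from the final message. The sketch below assumes the server was launched with --reasoning-parser as shown above and that the parsed reasoning is exposed as a reasoning_content attribute on the message, mirroring the streaming delta field used earlier. split_message is a hypothetical helper name, and a stand-in object replaces the live response so the snippet runs without a server.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (thinking, answer) from a chat completion message.

    Falls back to empty strings when a field is missing or None, for
    example when the server runs without a reasoning parser.
    """
    thinking = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return thinking, answer


# Stand-in object so the helper can be tried without a running server.
# With a live server, pass response.choices[0].message instead, where
# response = client.chat.completions.create(..., stream=False).
demo = SimpleNamespace(
    reasoning_content="15% = 0.15; 240 x 0.15 = 36",
    content="The answer is 36.",
)
thinking, answer = split_message(demo)
print("Thinking:", thinking)
print("Content:", answer)
```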

4.2.2 Tool Calling

GLM-4.7-Flash supports tool calling capabilities. Enable the tool call parser:

python -m sglang.launch_server \
  --model zai-org/GLM-4.7-Flash \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --attention-backend triton \
  --tp 1 \
  --host 0.0.0.0 \
  --port 8000

Python Example (with Thinking Process):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Accumulate tool calls (tool call deltas may stream in multiple chunks)
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }

                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
if tool_calls_accumulator:
    print("\n=============== Tool Calls =================", flush=True)
    for index, tool_call in sorted(tool_calls_accumulator.items()):
        print(f"Tool Call: {tool_call['name']}")
        print(f"   Arguments: {tool_call['arguments']}")

print()

Output Example:

=============== Thinking =================
The user is asking for the weather in Beijing. I have the get_weather function available which can provide weather information for a location. The required parameter is "location" and the user has provided "Beijing". There's an optional parameter "unit" for temperature unit, but the user hasn't specified which unit they prefer, and since it's optional, I should not ask about it or make up a value for it. I'll call the function with just the location parameter.I'll check the current weather in Beijing for you.
=============== Tool Calls =================
Tool Call: get_weather
   Arguments: {"location": "Beijing"}

Note:

The reasoning parser shows how the model decides to use a tool
Tool calls are clearly marked with the function name and arguments
You can then execute the function and send the result back to continue the conversation

Handling Tool Call Results:

# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
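
The example above hard-codes the assistant tool call. In practice the name and arguments come from the streamed accumulator, so a small dispatcher can parse the JSON arguments and build the tool messages. This is a minimal sketch, not part of SGLang: TOOL_REGISTRY and build_tool_messages are hypothetical names, and it assumes each accumulated entry also stores the call id taken from the streamed tool_call.id (the accumulator shown earlier keeps only name and arguments).

```python
import json


def get_weather(location, unit="celsius"):
    # Placeholder implementation; replace with a real weather API call.
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def build_tool_messages(accumulated_calls):
    """Execute accumulated tool calls and return 'tool' role messages."""
    messages = []
    for _, call in sorted(accumulated_calls.items()):
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            result = f"Error: unknown tool {call['name']!r}"
        else:
            try:
                # Arguments arrive as a JSON string assembled from stream chunks.
                kwargs = json.loads(call["arguments"] or "{}")
                result = func(**kwargs)
            except (json.JSONDecodeError, TypeError) as exc:
                result = f"Error: invalid arguments ({exc})"
        messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
        )
    return messages


calls = {
    0: {"id": "call_123", "name": "get_weather", "arguments": '{"location": "Beijing"}'}
}
tool_messages = build_tool_messages(calls)
print(tool_messages[0]["content"])
```

Append the returned messages after the assistant message that contains the tool calls, then call client.chat.completions.create again as in the example above.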

5. Benchmark

This section uses industry-standard configurations for comparable benchmark results.

5.1 Speed Benchmark

Test Environment:

Hardware: NVIDIA B200 (1x)
Model: GLM-4.7-Flash
Tensor Parallelism: 1
SGLang Version: 0.5.7

Benchmark Methodology: We use industry-standard benchmark configurations to ensure results are comparable across frameworks and hardware platforms.

5.1.1 Standard Test Scenarios

Three core scenarios reflect real-world usage patterns:

Scenario	Input Length	Output Length	Use Case
Chat	1K	1K	Most common conversational AI workload
Reasoning	1K	8K	Long-form generation, complex reasoning tasks
Summarization	8K	1K	Document summarization, RAG retrieval

5.1.2 Concurrency Levels

Test each scenario at three concurrency levels to capture the throughput vs. latency tradeoff (Pareto frontier):

Low Concurrency: --max-concurrency 1 (Latency-optimized)
Medium Concurrency: --max-concurrency 16 (Balanced)
High Concurrency: --max-concurrency 100 (Throughput-optimized; the Reasoning and Summarization runs below use 64)

5.1.3 Number of Prompts

For each concurrency level, configure num_prompts to simulate realistic user loads:

Quick Test: num_prompts = concurrency × 1 (minimal test)
Recommended: num_prompts = concurrency × 5 (standard benchmark)
Stable Measurements: num_prompts = concurrency × 10 (production-grade)
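
The scenario table, concurrency levels, and prompt multipliers above combine into a small test matrix. The sketch below generates the corresponding sglang.bench_serving commands using only the flags that appear in the benchmark commands in this section; the helper name and the SCENARIOS dictionary are illustrative.

```python
MODEL = "zai-org/GLM-4.7-Flash"

# (input length, output length) per scenario, following the table above.
SCENARIOS = {
    "chat": (1000, 1000),
    "reasoning": (1000, 8000),
    "summarization": (8000, 1000),
}


def bench_command(scenario: str, concurrency: int, multiplier: int = 5) -> str:
    """Build a bench_serving command with num_prompts = concurrency x multiplier."""
    input_len, output_len = SCENARIOS[scenario]
    args = [
        "python -m sglang.bench_serving",
        "--backend sglang",
        f"--model {MODEL}",
        "--dataset-name random",
        f"--random-input-len {input_len}",
        f"--random-output-len {output_len}",
        f"--num-prompts {concurrency * multiplier}",
        f"--max-concurrency {concurrency}",
        "--request-rate inf",
    ]
    # Join with shell line continuations, matching the command style used here.
    return " \\\n  ".join(args)


for name in SCENARIOS:
    for level in (1, 16, 100):
        print(f"# {name}, concurrency {level}")
        print(bench_command(name, level))
        print()
```

The recorded runs in the next subsection use a multiplier of 10 at concurrency 1 and a multiplier of 5 at the higher levels.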

5.1.4 Benchmark Commands

Scenario 1: Chat (1K/1K) - Most Important

Model Deployment

python -m sglang.launch_server \
  --model zai-org/GLM-4.7-Flash \
  --attention-backend triton \
  --tp 1

Low Concurrency (Latency-Optimized)

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  38.94
Total input tokens:                      6101
Total input text tokens:                 6101
Total generated tokens:                  4220
Total generated tokens (retokenized):    4220
Request throughput (req/s):              0.26
Input token throughput (tok/s):          156.67
Output token throughput (tok/s):         108.37
Peak output token throughput (tok/s):    125.00
Peak concurrent requests:                2
Total token throughput (tok/s):          265.03
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3891.12
Median E2E Latency (ms):                 3061.48
P90 E2E Latency (ms):                    7172.25
P99 E2E Latency (ms):                    9042.62
---------------Time to First Token----------------
Mean TTFT (ms):                          131.36
Median TTFT (ms):                        94.55
P99 TTFT (ms):                           435.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.75
Median TPOT (ms):                        8.82
P99 TPOT (ms):                           9.39
---------------Inter-Token Latency----------------
Mean ITL (ms):                           8.93
Median ITL (ms):                         8.98
P95 ITL (ms):                            9.83
P99 ITL (ms):                            10.20
Max ITL (ms):                            18.50
==================================================

Medium Concurrency (Balanced)

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  52.73
Total input tokens:                      39668
Total input text tokens:                 39668
Total generated tokens:                  40805
Total generated tokens (retokenized):    40775
Request throughput (req/s):              1.52
Input token throughput (tok/s):          752.27
Output token throughput (tok/s):         773.83
Peak output token throughput (tok/s):    1040.00
Peak concurrent requests:                21
Total token throughput (tok/s):          1526.10
Concurrency:                             13.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9217.90
Median E2E Latency (ms):                 9642.50
P90 E2E Latency (ms):                    15147.02
P99 E2E Latency (ms):                    18237.06
---------------Time to First Token----------------
Mean TTFT (ms):                          299.02
Median TTFT (ms):                        105.98
P99 TTFT (ms):                           1109.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.03
Median TPOT (ms):                        18.00
P99 TPOT (ms):                           26.51
---------------Inter-Token Latency----------------
Mean ITL (ms):                           17.52
Median ITL (ms):                         16.07
P95 ITL (ms):                            18.14
P99 ITL (ms):                            89.43
Max ITL (ms):                            763.13
==================================================

High Concurrency (Throughput-Optimized)

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 500 \
  --max-concurrency 100 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     500
Benchmark duration (s):                  91.48
Total input tokens:                      249831
Total input text tokens:                 249831
Total generated tokens:                  252662
Total generated tokens (retokenized):    250941
Request throughput (req/s):              5.47
Input token throughput (tok/s):          2730.87
Output token throughput (tok/s):         2761.82
Peak output token throughput (tok/s):    4199.00
Peak concurrent requests:                109
Total token throughput (tok/s):          5492.69
Concurrency:                             90.54
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16566.04
Median E2E Latency (ms):                 16134.36
P90 E2E Latency (ms):                    30167.60
P99 E2E Latency (ms):                    34034.04
---------------Time to First Token----------------
Mean TTFT (ms):                          433.94
Median TTFT (ms):                        123.26
P99 TTFT (ms):                           1760.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.26
Median TPOT (ms):                        33.56
P99 TPOT (ms):                           38.78
---------------Inter-Token Latency----------------
Mean ITL (ms):                           31.99
Median ITL (ms):                         24.06
P95 ITL (ms):                            79.62
P99 ITL (ms):                            103.03
Max ITL (ms):                            1369.20
==================================================

Scenario 2: Reasoning (1K/8K)

Low Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  525.43
Total input tokens:                      6101
Total input text tokens:                 6101
Total generated tokens:                  44462
Total generated tokens (retokenized):    44451
Request throughput (req/s):              0.02
Input token throughput (tok/s):          11.61
Output token throughput (tok/s):         84.62
Peak output token throughput (tok/s):    125.00
Peak concurrent requests:                2
Total token throughput (tok/s):          96.23
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   52540.19
Median E2E Latency (ms):                 53694.45
P90 E2E Latency (ms):                    94742.08
P99 E2E Latency (ms):                    101224.18
---------------Time to First Token----------------
Mean TTFT (ms):                          97.45
Median TTFT (ms):                        95.28
P99 TTFT (ms):                           105.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.94
Median TPOT (ms):                        11.25
P99 TPOT (ms):                           13.09
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.80
Median ITL (ms):                         11.51
P95 ITL (ms):                            15.83
P99 ITL (ms):                            16.86
Max ITL (ms):                            19.96
==================================================

Medium Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  473.92
Total input tokens:                      39668
Total input text tokens:                 39668
Total generated tokens:                  318306
Total generated tokens (retokenized):    317860
Request throughput (req/s):              0.17
Input token throughput (tok/s):          83.70
Output token throughput (tok/s):         671.65
Peak output token throughput (tok/s):    1040.00
Peak concurrent requests:                19
Total token throughput (tok/s):          755.35
Concurrency:                             13.80
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   81746.73
Median E2E Latency (ms):                 78508.54
P90 E2E Latency (ms):                    155292.49
P99 E2E Latency (ms):                    166769.99
---------------Time to First Token----------------
Mean TTFT (ms):                          117.50
Median TTFT (ms):                        101.97
P99 TTFT (ms):                           182.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.36
Median TPOT (ms):                        20.48
P99 TPOT (ms):                           22.63
---------------Inter-Token Latency----------------
Mean ITL (ms):                           20.52
Median ITL (ms):                         20.42
P95 ITL (ms):                            23.41
P99 ITL (ms):                            26.29
Max ITL (ms):                            90.48
==================================================

High Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 64
Successful requests:                     320
Benchmark duration (s):                  714.72
Total input tokens:                      158939
Total input text tokens:                 158939
Total generated tokens:                  1301025
Total generated tokens (retokenized):    1289431
Request throughput (req/s):              0.45
Input token throughput (tok/s):          222.38
Output token throughput (tok/s):         1820.33
Peak output token throughput (tok/s):    3200.00
Peak concurrent requests:                68
Total token throughput (tok/s):          2042.71
Concurrency:                             55.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   124364.58
Median E2E Latency (ms):                 129250.98
P90 E2E Latency (ms):                    219175.80
P99 E2E Latency (ms):                    247741.77
---------------Time to First Token----------------
Mean TTFT (ms):                          149.40
Median TTFT (ms):                        114.78
P99 TTFT (ms):                           288.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.51
Median TPOT (ms):                        31.75
P99 TPOT (ms):                           33.32
---------------Inter-Token Latency----------------
Mean ITL (ms):                           30.56
Median ITL (ms):                         30.82
P95 ITL (ms):                            33.20
P99 ITL (ms):                            80.54
Max ITL (ms):                            117.72
==================================================

Scenario 3: Summarization (8K/1K)

Low Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  58.27
Total input tokens:                      41941
Total input text tokens:                 41941
Total generated tokens:                  4220
Total generated tokens (retokenized):    4220
Request throughput (req/s):              0.17
Input token throughput (tok/s):          719.73
Output token throughput (tok/s):         72.42
Peak output token throughput (tok/s):    112.00
Peak concurrent requests:                2
Total token throughput (tok/s):          792.15
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5825.08
Median E2E Latency (ms):                 4624.26
P90 E2E Latency (ms):                    12690.22
P99 E2E Latency (ms):                    13177.96
---------------Time to First Token----------------
Mean TTFT (ms):                          296.01
Median TTFT (ms):                        195.59
P99 TTFT (ms):                           717.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.63
Median TPOT (ms):                        13.07
P99 TPOT (ms):                           16.68
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.13
Median ITL (ms):                         13.17
P95 ITL (ms):                            17.02
P99 ITL (ms):                            17.47
Max ITL (ms):                            19.84
==================================================

Medium Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  89.59
Total input tokens:                      300020
Total input text tokens:                 300020
Total generated tokens:                  41669
Total generated tokens (retokenized):    41656
Request throughput (req/s):              0.89
Input token throughput (tok/s):          3348.77
Output token throughput (tok/s):         465.10
Peak output token throughput (tok/s):    752.00
Peak concurrent requests:                19
Total token throughput (tok/s):          3813.87
Concurrency:                             14.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16120.74
Median E2E Latency (ms):                 16246.55
P90 E2E Latency (ms):                    27279.72
P99 E2E Latency (ms):                    34577.93
---------------Time to First Token----------------
Mean TTFT (ms):                          1943.94
Median TTFT (ms):                        382.19
P99 TTFT (ms):                           8980.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.87
Median TPOT (ms):                        28.26
P99 TPOT (ms):                           40.55
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.27
Median ITL (ms):                         21.74
P95 ITL (ms):                            23.32
P99 ITL (ms):                            232.65
Max ITL (ms):                            4282.01
==================================================

High Concurrency

python -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 64
Successful requests:                     320
Benchmark duration (s):                  167.01
Total input tokens:                      1273893
Total input text tokens:                 1273893
Total generated tokens:                  170000
Total generated tokens (retokenized):    169226
Request throughput (req/s):              1.92
Input token throughput (tok/s):          7627.82
Output token throughput (tok/s):         1017.93
Peak output token throughput (tok/s):    1984.00
Peak concurrent requests:                69
Total token throughput (tok/s):          8645.75
Concurrency:                             59.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   31147.52
Median E2E Latency (ms):                 30603.34
P90 E2E Latency (ms):                    54889.44
P99 E2E Latency (ms):                    67665.30
---------------Time to First Token----------------
Mean TTFT (ms):                          428.87
Median TTFT (ms):                        441.69
P99 TTFT (ms):                           1232.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.06
Median TPOT (ms):                        62.79
P99 TPOT (ms):                           82.23
---------------Inter-Token Latency----------------
Mean ITL (ms):                           57.93
Median ITL (ms):                         33.30
P95 ITL (ms):                            247.98
P99 ITL (ms):                            409.63
Max ITL (ms):                            1421.21
==================================================

5.1.5 Understanding the Results

Key Metrics:

Request Throughput (req/s): Number of requests processed per second
Output Token Throughput (tok/s): Total tokens generated per second
Mean TTFT (ms): Time to First Token - measures responsiveness
Mean TPOT (ms): Time Per Output Token - measures generation speed
Mean ITL (ms): Inter-Token Latency - measures streaming consistency
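
These metrics are related, which gives a quick way to sanity-check a run. The sketch below recomputes output throughput as generated tokens divided by benchmark duration, and average concurrency via Little's law (request rate times mean end-to-end latency), using the numbers reported for the Chat scenario at low and medium concurrency. The function names are illustrative, and the formulas are inferred from the reported fields rather than taken from the bench_serving source.

```python
def output_throughput(generated_tokens: int, duration_s: float) -> float:
    """Generated tokens per second over the whole benchmark."""
    return generated_tokens / duration_s


def average_concurrency(requests: int, duration_s: float, mean_e2e_ms: float) -> float:
    """Little's law: mean requests in flight = arrival rate x mean time in system."""
    return (requests / duration_s) * (mean_e2e_ms / 1000.0)


# Values copied from the Chat (1K/1K) results in section 5.1.4.
runs = {
    "low": dict(requests=10, duration_s=38.94, tokens=4220, mean_e2e_ms=3891.12),
    "medium": dict(requests=80, duration_s=52.73, tokens=40805, mean_e2e_ms=9217.90),
}

derived = {}
for name, r in runs.items():
    tput = output_throughput(r["tokens"], r["duration_s"])
    conc = average_concurrency(r["requests"], r["duration_s"], r["mean_e2e_ms"])
    derived[name] = (tput, conc)
    print(f"{name}: output throughput ~{tput:.2f} tok/s, concurrency ~{conc:.2f}")
```

The recomputed values should land close to the reported output throughput (108.37 and 773.83 tok/s) and Concurrency (1.00 and 13.98); small differences come from rounding in the printed duration.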

Why These Configurations Matter:

1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks.
Variable Concurrency: Captures the Pareto frontier - the optimal tradeoff between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.

Interpreting Results:

Compare your results against baseline numbers for your hardware
Higher throughput at same latency = better performance
Lower TTFT = more responsive user experience
Lower TPOT = faster generation speed
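
To make the throughput versus latency tradeoff concrete, the sketch below compares the three Chat (1K/1K) runs recorded in section 5.1.4. It reports how much aggregate output throughput grows and how much per-token latency (mean TPOT) degrades relative to concurrency 1. The per-request decode speed is approximated as 1000 / TPOT, an assumption that ignores time to first token.

```python
# (max concurrency, output throughput tok/s, mean TPOT ms) from the Chat runs.
chat_runs = [
    (1, 108.37, 8.75),
    (16, 773.83, 18.03),
    (100, 2761.82, 32.26),
]

_, base_tput, base_tpot = chat_runs[0]

rows = []
for conc, tput, tpot in chat_runs:
    rows.append({
        "concurrency": conc,
        "throughput_gain": tput / base_tput,   # aggregate speedup vs. concurrency 1
        "tpot_slowdown": tpot / base_tpot,     # per-token latency penalty
        "per_request_tok_s": 1000.0 / tpot,    # approximate decode speed per request
    })

for row in rows:
    print(
        f"concurrency {row['concurrency']:>3}: "
        f"{row['throughput_gain']:.1f}x throughput, "
        f"{row['tpot_slowdown']:.1f}x TPOT, "
        f"~{row['per_request_tok_s']:.0f} tok/s per request"
    )
```

On these numbers, going from concurrency 1 to 100 raises aggregate output throughput by roughly 25x while mean TPOT grows by less than 4x.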

5.2 Accuracy Benchmark

Model accuracy on standard benchmarks:

5.2.1 GSM8K Benchmark

Benchmark Command

python -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --port 30000

Result

Accuracy: 0.845
Invalid: 0.000
Latency: 8.431 s
Output throughput: 2195.387 token/s
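
With 200 questions the accuracy estimate carries noticeable sampling noise. As a rough guide, the sketch below computes a normal-approximation 95% confidence interval for the reported accuracy, assuming each question is an independent pass/fail trial. This is a standard binomial approximation, not something the benchmark script reports.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * std_err, accuracy + z * std_err, std_err


# Accuracy and question count from the GSM8K run above.
low, high, std_err = accuracy_interval(0.845, 200)
print(f"standard error: {std_err:.4f}")
print(f"95% interval:   {low:.3f} to {high:.3f}")
```

The interval spans roughly five points on either side, so differences of a couple of points between runs are within noise; a larger --num-questions value gives a tighter estimate.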
