1. Model Introduction

Kimi-K2 is a state-of-the-art Mixture-of-Experts (MoE) language model by Moonshot AI with 32B activated parameters and 1T total parameters.
Model Variants:
  • Kimi-K2-Instruct: Post-trained model optimized for general-purpose chat and agentic tasks. Compatible with vLLM, SGLang, KTransformers, and TensorRT-LLM.
  • Kimi-K2-Thinking: Advanced thinking model with step-by-step reasoning and tool calling. Native INT4 quantization with 256k context window. Ideal for complex reasoning and multi-step tool use.
For details, see the official documentation and the technical report.

2. SGLang Installation

Refer to the official SGLang installation guide.

3. Model Deployment

This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.

3.1 Basic Configuration

Interactive Command Generator: The online version of this guide provides a configuration selector that generates the appropriate deployment command for your hardware platform, model variant, deployment strategy, and capabilities.

3.2 Configuration Tips

  • Memory: Requires 8 GPUs with ≥140GB each (H200/B200). Use --context-length 128000 to conserve memory.
  • Expert Parallelism (EP): Set --ep (for example, --ep 4) for better MoE throughput. See EP docs.
  • Data Parallel (DP): Enable with --dp 4 --enable-dp-attention for production throughput.
  • KV Cache: Use --kv-cache-dtype fp8_e4m3 to roughly halve KV cache memory compared with a 16-bit cache (requires CUDA 11.8+).
  • Reasoning Parser: Add --reasoning-parser kimi_k2 for Kimi-K2-Thinking to separate thinking and content.
  • Tool Call Parser: Add --tool-call-parser kimi_k2 for structured tool calls.
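The tips above can be combined into a single launch command. The following is a minimal sketch, not part of SGLang: the helper name `build_launch_command` and its parameters are hypothetical, and it only emits flags that already appear in this guide.

```python
import shlex


def build_launch_command(
    model_path,
    tp=8,
    dp=None,
    dp_attention=True,
    ep=None,
    context_length=None,
    kv_cache_dtype=None,
    reasoning_parser=None,
    tool_call_parser=None,
    host="0.0.0.0",
    port=8000,
):
    """Hypothetical helper: return the argv list for sglang.launch_server."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--trust-remote-code",
    ]
    if dp is not None:
        argv += ["--dp", str(dp)]
        if dp_attention:
            argv.append("--enable-dp-attention")
    if ep is not None:
        argv += ["--ep", str(ep)]
    if context_length is not None:
        argv += ["--context-length", str(context_length)]
    if kv_cache_dtype is not None:
        argv += ["--kv-cache-dtype", kv_cache_dtype]
    if reasoning_parser is not None:
        argv += ["--reasoning-parser", reasoning_parser]
    if tool_call_parser is not None:
        argv += ["--tool-call-parser", tool_call_parser]
    argv += ["--host", host, "--port", str(port)]
    return argv


# Example: Kimi-K2-Thinking with a reduced context window, FP8 KV cache, and both parsers.
cmd = build_launch_command(
    "moonshotai/Kimi-K2-Thinking",
    context_length=128000,
    kv_cache_dtype="fp8_e4m3",
    reasoning_parser="kimi_k2",
    tool_call_parser="kimi_k2",
)
print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` avoids quoting mistakes if it is later passed to a shell or to `subprocess.run`.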

4. Model Invocation

4.1 Basic Usage

See Basic API Usage.
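As a rough illustration, a non-streaming request to the server's OpenAI-compatible endpoint might look like the sketch below. It assumes a server listening on localhost:8000, as in the launch commands in this guide, and uses only the Python standard library. The helper names `build_chat_payload` and `send_chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(model, user_text, temperature=0.6, max_tokens=256):
    """Hypothetical helper: build the JSON body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def send_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload and return the assistant's text (needs a running server)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload("moonshotai/Kimi-K2-Instruct", "Say hello in one sentence.")
# send_chat(payload) is not called here because it requires a running server.
```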

4.2 Advanced Usage

4.2.1 Reasoning Parser

Enable reasoning parser for Kimi-K2-Thinking:
python -m sglang.launch_server \
  --model moonshotai/Kimi-K2-Thinking \
  --reasoning-parser kimi_k2 \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
Example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.6,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Print the thinking process (populated only when the reasoning parser is enabled)
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        if not has_thinking:
            print("=============== Thinking =================", flush=True)
            has_thinking = True
        print(reasoning, end="", flush=True)

    # Print answer content
    if delta.content:
        # Close the thinking section and add a content header
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()
Output Example:
=============== Thinking =================
  The user asks: "What is 15% of 240?" This is a straightforward percentage calculation problem. I need to solve it step by step.

Step 1: Understand what "percent" means.
- "Percent" means "per hundred". So 15% means 15 per 100, or 15/100, or 0.15.

Step 2: Convert the percentage to a decimal.
- 15% = 15 / 100 = 0.15

Step 3: Multiply the decimal by the number.
- 0.15 * 240

Step 4: Perform the multiplication.
- 0.15 * 240 = (15/100) * 240
- = 15 * 240 / 100
- = 3600 / 100
- = 36

Alternatively, I can calculate it directly:
- 0.15 * 240
- 15 * 240 = 3600
- 3600 / 100 = 36

Or, break it down:
- 10% of 240 = 24
- 5% of 240 = half of 10% = 12
- 15% of 240 = 10% + 5% = 24 + 12 = 36

I should present the solution clearly with steps. The most standard method is converting to decimal and multiplying.

Let me structure the answer:
1. Convert the percentage to a decimal.
2. Multiply the decimal by the number.
3. Show the calculation.
4. State the final answer.

This is simple and easy to follow.
=============== Content =================
 Here is the step-by-step solution:

**Step 1: Convert the percentage to a decimal**
15% means 15 per 100, which is 15 ÷ 100 = **0.15**

**Step 2: Multiply the decimal by the number**
0.15 × 240

**Step 3: Calculate the result**
0.15 × 240 = **36**

**Answer:** 15% of 240 is **36**.
Note: The reasoning parser captures the model’s step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
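When the thinking and the answer need to be stored rather than printed, the same stream can be collected into two strings. The sketch below shows one way to do this; `split_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the shape of the streamed objects so that the logic can be exercised without a server.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Return (reasoning_text, answer_text) accumulated from streamed chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields):
    """Build a stand-in object shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk(reasoning_content="10% of 240 is 24; ", content=None),
    fake_chunk(reasoning_content="5% is 12.", content=None),
    SimpleNamespace(choices=[]),  # chunks without choices are skipped
    fake_chunk(content="15% of 240 is 36."),
]
reasoning_text, answer_text = split_stream(demo_chunks)
print(reasoning_text)
print(answer_text)
```

With a live server, the `response` iterator returned by `client.chat.completions.create(..., stream=True)` can be passed to `split_stream` in place of `demo_chunks`.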

4.2.2 Tool Calling

Kimi-K2-Instruct and Kimi-K2-Thinking both support tool calling. Enable the tool call parser during deployment. The Python example below targets Kimi-K2-Thinking and prints its reasoning, so the command also enables the reasoning parser.
Deployment Command:
python -m sglang.launch_server \
  --model moonshotai/Kimi-K2-Thinking \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
Python Example (with Thinking Process):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }

                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()
Output Example:
=============== Thinking =================
  The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. Beijing is a major city in China, so I should be able to get weather data for it. The location parameter is required, but the unit parameter is optional. Since the user didn't specify a temperature unit, I can just provide the location and let the function use its default. I'll check the weather in Beijing for you.
=============== Content =================

  🔧 Tool Call: get_weather
   Arguments: {"location":"Beijing"}
Note:
  • The reasoning parser shows how the model decides to use a tool
  • Tool calls are clearly marked with the function name and arguments
  • You can then execute the function and send the result back to continue the conversation
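To send a result back, the tool call id is needed in addition to the name and arguments. The sketch below extends the accumulation step to keep it, assuming the server follows the OpenAI streaming format, in which each fragment carries an `index` and the id arrives on the first fragment of a call. `accumulate_tool_calls` is a hypothetical helper, and the fragments (including the id `call_abc`) are invented for illustration.

```python
import json
from types import SimpleNamespace


def accumulate_tool_calls(fragments):
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls = {}
    for fragment in fragments:
        entry = calls.setdefault(
            fragment.index, {"id": None, "name": None, "arguments": ""}
        )
        if getattr(fragment, "id", None):
            entry["id"] = fragment.id
        function = getattr(fragment, "function", None)
        if function is not None:
            if getattr(function, "name", None):
                entry["name"] = function.name
            if getattr(function, "arguments", None):
                entry["arguments"] += function.arguments
    return [calls[index] for index in sorted(calls)]


# Invented fragments shaped like the entries of delta.tool_calls.
fragments = [
    SimpleNamespace(index=0, id="call_abc",
                    function=SimpleNamespace(name="get_weather", arguments="")),
    SimpleNamespace(index=0, id=None,
                    function=SimpleNamespace(name=None, arguments='{"location":')),
    SimpleNamespace(index=0, id=None,
                    function=SimpleNamespace(name=None, arguments='"Beijing"}')),
]
tool_calls = accumulate_tool_calls(fragments)
parsed_arguments = json.loads(tool_calls[0]["arguments"])
print(tool_calls[0]["id"], tool_calls[0]["name"], parsed_arguments)
```

The arguments string is only valid JSON once the stream has finished, so it is parsed after accumulation rather than per fragment.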
Handling Tool Call Results:
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",  # Placeholder: use the id returned with the model's tool call
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
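The message list above is written out by hand. As a sketch of how it could be built from the model's tool calls instead, the helper below looks up each requested function in a registry, runs it with the parsed arguments, and appends one `tool` message per call. `TOOL_REGISTRY` and `build_followup_messages` are hypothetical names, and `get_weather` repeats the stub from the example above.

```python
import json


def get_weather(location, unit="celsius"):
    # Stand-in for a real weather lookup.
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def build_followup_messages(user_text, tool_calls, registry=TOOL_REGISTRY):
    """Return the messages to send after executing the requested tool calls.

    Each item of tool_calls is a dict with "id", "name", and "arguments"
    (a JSON string), as collected from the model's response.
    """
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call["id"],
                "type": "function",
                "function": {"name": call["name"], "arguments": call["arguments"]},
            }
            for call in tool_calls
        ],
    }
    messages = [{"role": "user", "content": user_text}, assistant_message]
    for call in tool_calls:
        arguments = json.loads(call["arguments"] or "{}")
        result = registry[call["name"]](**arguments)
        messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
        )
    return messages


followup = build_followup_messages(
    "What's the weather in Beijing?",
    [{"id": "call_123", "name": "get_weather", "arguments": '{"location": "Beijing"}'}],
)
print(followup[-1]["content"])
```

In real use, a tool name missing from the registry or malformed JSON arguments should be handled explicitly, for example by returning an error string as the tool result.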

5. Benchmark

5.1 Speed Benchmark

Test Environment:
  • Hardware: NVIDIA B200 GPU (8x)
  • Model: Kimi-K2-Instruct
  • sglang version: 0.5.6.post1
We evaluate performance with SGLang's built-in benchmarking tool (sglang.bench_serving) on the ShareGPT_Vicuna_unfiltered dataset, which contains real conversation data and therefore better reflects real-world usage.

5.1.1 Latency-Sensitive Benchmark

  • Model Deployment Command:
python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --enable-dp-attention \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
  • Benchmark Command:
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model moonshotai/Kimi-K2-Instruct \
  --num-prompts 10 \
  --max-concurrency 1
  • Test Results:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  44.93
Total input tokens:                      1951
Total input text tokens:                 1951
Total input vision tokens:               0
Total generated tokens:                  2755
Total generated tokens (retokenized):    2748
Request throughput (req/s):              0.22
Input token throughput (tok/s):          43.42
Output token throughput (tok/s):         61.32
Peak output token throughput (tok/s):    64.00
Peak concurrent requests:                3
Total token throughput (tok/s):          104.74
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4489.56
Median E2E Latency (ms):                 4994.53
---------------Time to First Token----------------
Mean TTFT (ms):                          141.22
Median TTFT (ms):                        158.28
P99 TTFT (ms):                           166.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.40
Median TPOT (ms):                        15.63
P99 TPOT (ms):                           39.88
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.78
Median ITL (ms):                         15.76
P95 ITL (ms):                            16.36
P99 ITL (ms):                            16.59
Max ITL (ms):                            19.94
==================================================
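The throughput figures in this report follow from the token totals and the benchmark duration. The short check below recomputes them from the rounded values printed above, so agreement is expected only to within rounding; the variable names are illustrative.

```python
# Values copied from the latency-sensitive benchmark report.
duration_s = 44.93
successful_requests = 10
total_input_tokens = 1951
total_generated_tokens = 2755

# Each throughput is a total divided by the wall-clock duration.
request_throughput = successful_requests / duration_s
input_token_throughput = total_input_tokens / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"Request throughput (req/s):     {request_throughput:.2f}")
print(f"Input token throughput (tok/s): {input_token_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s): {total_token_throughput:.2f}")
```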

5.1.2 Throughput-Sensitive Benchmark

  • Model Deployment Command:
python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --ep 4 \
    --enable-dp-attention \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
  • Benchmark Command:
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model moonshotai/Kimi-K2-Instruct \
  --num-prompts 1000 \
  --max-concurrency 100
  • Test Results:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  174.11
Total input tokens:                      296642
Total input text tokens:                 296642
Total input vision tokens:               0
Total generated tokens:                  193831
Total generated tokens (retokenized):    168687
Request throughput (req/s):              5.74
Input token throughput (tok/s):          1703.73
Output token throughput (tok/s):         1113.25
Peak output token throughput (tok/s):    2383.00
Peak concurrent requests:                112
Total token throughput (tok/s):          2816.97
Concurrency:                             89.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15601.09
Median E2E Latency (ms):                 10780.52
---------------Time to First Token----------------
Mean TTFT (ms):                          457.42
Median TTFT (ms):                        221.62
P99 TTFT (ms):                           2475.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.23
Median TPOT (ms):                        85.61
P99 TPOT (ms):                           435.95
---------------Inter-Token Latency----------------
Mean ITL (ms):                           78.61
Median ITL (ms):                         43.66
P95 ITL (ms):                            169.53
P99 ITL (ms):                            260.91
Max ITL (ms):                            1703.21
==================================================
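The reported "Concurrency" value can be cross-checked against the other figures: the average number of in-flight requests is roughly the total time spent in requests divided by the benchmark duration (Little's law). The sketch below applies this to both runs using the rounded values from the reports; `average_concurrency` is a hypothetical helper, and small differences from the reported values are expected from rounding.

```python
def average_concurrency(successful_requests, mean_e2e_latency_ms, duration_s):
    """Average in-flight requests: total request time divided by wall-clock time."""
    total_request_time_s = successful_requests * mean_e2e_latency_ms / 1000.0
    return total_request_time_s / duration_s


# Latency-sensitive run: 10 requests, mean E2E 4489.56 ms, 44.93 s duration.
latency_run = average_concurrency(10, 4489.56, 44.93)

# Throughput-sensitive run: 1000 requests, mean E2E 15601.09 ms, 174.11 s duration.
throughput_run = average_concurrency(1000, 15601.09, 174.11)

print(f"Latency-sensitive run:    {latency_run:.2f}")
print(f"Throughput-sensitive run: {throughput_run:.2f}")
```

The throughput-sensitive run averages below its limit of 100 concurrent requests, which is consistent with the request queue draining toward the end of the benchmark.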

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Server Command:
python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
  • Benchmark Command:
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000
  • Result:
Accuracy: 0.960
Invalid: 0.000
Latency: 15.956 s
Output throughput: 1231.699 token/s
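With only 200 questions, the accuracy figure carries noticeable sampling uncertainty. The sketch below estimates it with a normal approximation to the binomial distribution, under the assumption that the questions behave like independent samples. It also recovers the approximate number of generated tokens from the reported throughput and latency.

```python
import math

# Values copied from the GSM8K result above.
accuracy = 0.960
num_questions = 200
latency_s = 15.956
output_throughput_tok_s = 1231.699

# Number of correctly answered questions implied by the accuracy.
correct_answers = round(accuracy * num_questions)

# Standard error of a proportion and an approximate 95% margin.
standard_error = math.sqrt(accuracy * (1 - accuracy) / num_questions)
margin_95 = 1.96 * standard_error

# Tokens generated during the run, from throughput times latency.
approx_output_tokens = output_throughput_tok_s * latency_s

print(f"Correct answers: {correct_answers} of {num_questions}")
print(f"Accuracy: {accuracy:.3f} +/- {margin_95:.3f} (approx. 95% interval)")
print(f"Approximate output tokens: {approx_output_tokens:.0f}")
```

Under these assumptions the margin is a few percentage points, so small differences in accuracy between runs of this size are not necessarily meaningful.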