Kimi-K2

1. Model Introduction

Kimi-K2 is a state-of-the-art MoE language model by Moonshot AI with 32B activated parameters and 1T total parameters.

Model Variants:

Kimi-K2-Instruct: Post-trained model optimized for general-purpose chat and agentic tasks. Compatible with vLLM, SGLang, KTransformers, and TensorRT-LLM.
Kimi-K2-Thinking: Advanced thinking model with step-by-step reasoning and tool calling. Native INT4 quantization with 256k context window. Ideal for complex reasoning and multi-step tool use.

For details, see the official documentation and the technical report.

2. SGLang Installation

Refer to the official SGLang installation guide.

3. Model Deployment

This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and capabilities.

3.2 Configuration Tips

Memory: Requires 8 GPUs with at least 140 GB of memory each (H200/B200). Pass --context-length 128000 to cap the context window and conserve memory.
Expert Parallelism (EP): Use --ep for better MoE throughput. See EP docs.
Data Parallel (DP): Enable with --dp 4 --enable-dp-attention for production throughput.
KV Cache: Use --kv-cache-dtype fp8_e4m3 to roughly halve KV cache memory compared with a 16-bit cache (requires CUDA 11.8+).
Reasoning Parser: Add --reasoning-parser kimi_k2 for Kimi-K2-Thinking to separate thinking and content.
Tool Call Parser: Add --tool-call-parser kimi_k2 for structured tool calls.
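
The flags above can be combined into a single launch command. The following is a minimal sketch, not part of SGLang: the helper `build_launch_command` and its parameter names are hypothetical, and it only assembles flags that appear in this guide. Check the resulting command against the SGLang server arguments documentation before relying on it.

```python
import shlex


def build_launch_command(
    model,
    tp=8,
    dp=None,
    ep=None,
    context_length=None,
    kv_cache_dtype=None,
    reasoning_parser=None,
    tool_call_parser=None,
    host="0.0.0.0",
    port=8000,
):
    """Assemble an sglang.launch_server command from the tips in this section.

    Hypothetical helper: only flags mentioned in this guide are emitted.
    """
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model,
        "--tp", str(tp),
        "--trust-remote-code",
    ]
    if dp is not None:
        # This guide pairs --dp with --enable-dp-attention.
        args += ["--dp", str(dp), "--enable-dp-attention"]
    if ep is not None:
        args += ["--ep", str(ep)]
    if context_length is not None:
        args += ["--context-length", str(context_length)]
    if kv_cache_dtype is not None:
        args += ["--kv-cache-dtype", kv_cache_dtype]
    if reasoning_parser is not None:
        args += ["--reasoning-parser", reasoning_parser]
    if tool_call_parser is not None:
        args += ["--tool-call-parser", tool_call_parser]
    args += ["--host", host, "--port", str(port)]
    # shlex.join quotes arguments where the shell would need it
    return shlex.join(args)


command = build_launch_command(
    "moonshotai/Kimi-K2-Thinking",
    context_length=128000,
    kv_cache_dtype="fp8_e4m3",
    reasoning_parser="kimi_k2",
    tool_call_parser="kimi_k2",
)
print(command)
```

Keeping the optional flags as keyword arguments makes it easy to start from the smallest working command and add one optimization at a time.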

4. Model Invocation

4.1 Basic Usage

See Basic API Usage.

4.2 Advanced Usage

4.2.1 Reasoning Parser

Enable reasoning parser for Kimi-K2-Thinking:

python -m sglang.launch_server \
  --model moonshotai/Kimi-K2-Thinking \
  --reasoning-parser kimi_k2 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 8000

Example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.6,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

=============== Thinking =================
  The user asks: "What is 15% of 240?" This is a straightforward percentage calculation problem. I need to solve it step by step.

Step 1: Understand what "percent" means.
- "Percent" means "per hundred". So 15% means 15 per 100, or 15/100, or 0.15.

Step 2: Convert the percentage to a decimal.
- 15% = 15 / 100 = 0.15

Step 3: Multiply the decimal by the number.
- 0.15 * 240

Step 4: Perform the multiplication.
- 0.15 * 240 = (15/100) * 240
- = 15 * 240 / 100
- = 3600 / 100
- = 36

Alternatively, I can calculate it directly:
- 0.15 * 240
- 15 * 240 = 3600
- 3600 / 100 = 36

Or, break it down:
- 10% of 240 = 24
- 5% of 240 = half of 10% = 12
- 15% of 240 = 10% + 5% = 24 + 12 = 36

I should present the solution clearly with steps. The most standard method is converting to decimal and multiplying.

Let me structure the answer:
1. Convert the percentage to a decimal.
2. Multiply the decimal by the number.
3. Show the calculation.
4. State the final answer.

This is simple and easy to follow.
=============== Content =================
 Here is the step-by-step solution:

**Step 1: Convert the percentage to a decimal**
15% means 15 per 100, which is 15 ÷ 100 = **0.15**

**Step 2: Multiply the decimal by the number**
0.15 × 240

**Step 3: Calculate the result**
0.15 × 240 = **36**

**Answer:** 15% of 240 is **36**.

Note: The reasoning parser captures the model’s step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
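
The streaming loop above can also be factored into a reusable helper that separates reasoning text from answer text. The sketch below is illustrative and not part of the SGLang or OpenAI client APIs: `split_stream` and `fake_chunk` are hypothetical names, and the `SimpleNamespace` objects are stand-ins that mimic only the chunk attributes used above (`choices`, `delta`, `reasoning_content`, `content`), so the logic can be tried without a running server.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Collect reasoning text and answer text from streamed chat chunks."""
    reasoning_parts = []
    content_parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # reasoning_content may be absent on some chunks, so read it defensively
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def fake_chunk(**delta_fields):
    """Build a stand-in chunk carrying only the fields this sketch reads."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [
    fake_chunk(reasoning_content="10% of 240 is 24; "),
    fake_chunk(reasoning_content="5% is 12; total 36.", content=None),
    SimpleNamespace(choices=[]),  # a chunk without choices is skipped
    fake_chunk(content="15% of 240 is "),
    fake_chunk(content="36."),
]

reasoning, answer = split_stream(chunks)
print("Thinking:", reasoning)
print("Answer:", answer)
```

With a live server, the `response` iterator from the example above can be passed to `split_stream` in place of the stand-in chunks.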

4.2.2 Tool Calling

Kimi-K2-Instruct and Kimi-K2-Thinking both support tool calling. Enable the tool call parser during deployment.

Deployment Command:

python -m sglang.launch_server \
  --model moonshotai/Kimi-K2-Thinking \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Python Example (with Thinking Process):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
            has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }

                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()

Output Example:

=============== Thinking =================
  The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. Beijing is a major city in China, so I should be able to get weather data for it. The location parameter is required, but the unit parameter is optional. Since the user didn't specify a temperature unit, I can just provide the location and let the function use its default. I'll check the weather in Beijing for you.
=============== Content =================

  🔧 Tool Call: get_weather
   Arguments: {"location":"Beijing"}

Note:

The reasoning parser shows how the model decides to use a tool
Tool calls are clearly marked with the function name and arguments
You can then execute the function and send the result back to continue the conversation

Handling Tool Call Results:

# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
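
In practice the tool name and arguments come from the streamed response rather than being hard-coded. The following sketch shows one way to turn a dictionary shaped like `tool_calls_accumulator` from the streaming example into follow-up messages. The names `TOOL_REGISTRY` and `build_tool_messages` and the generated `call_<index>` ids are assumptions made for illustration; a real client would more likely record the `id` field of each streamed tool call and reuse it.

```python
import json


def get_weather(location, unit="celsius"):
    # Placeholder implementation, as in the example above
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."


# Hypothetical registry mapping tool names to local Python functions
TOOL_REGISTRY = {"get_weather": get_weather}


def build_tool_messages(accumulated_calls):
    """Execute accumulated tool calls and return assistant + tool messages."""
    assistant_calls = []
    tool_messages = []
    for index, call in sorted(accumulated_calls.items()):
        call_id = f"call_{index}"  # illustrative id; prefer the id sent by the server
        # Streamed argument fragments concatenate into one JSON string
        arguments = json.loads(call["arguments"] or "{}")
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            result = f"Unknown tool: {call['name']}"
        else:
            result = func(**arguments)
        assistant_calls.append({
            "id": call_id,
            "type": "function",
            "function": {"name": call["name"], "arguments": call["arguments"]},
        })
        tool_messages.append(
            {"role": "tool", "tool_call_id": call_id, "content": str(result)}
        )
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": assistant_calls,
    }
    return [assistant_message] + tool_messages


accumulated = {0: {"name": "get_weather", "arguments": '{"location":"Beijing"}'}}
follow_up = build_tool_messages(accumulated)
print(json.dumps(follow_up, indent=2, ensure_ascii=False))
```

The returned list can be appended to the original user message and sent back through `client.chat.completions.create`, as in the hand-written example above.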

5. Benchmark

5.1 Speed Benchmark

Test Environment:

Hardware: NVIDIA B200 GPU (8x)
Model: Kimi-K2-Instruct
SGLang version: 0.5.6.post1

We evaluate performance with SGLang's built-in benchmarking tool on the ShareGPT_Vicuna_unfiltered dataset. Because this dataset contains real conversations, it better reflects performance in real-world use.

5.1.1 Latency-Sensitive Benchmark

Model Deployment Command:

python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --enable-dp-attention \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Benchmark Command:

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model moonshotai/Kimi-K2-Instruct \
  --num-prompts 10 \
  --max-concurrency 1

Test Results:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  44.93
Total input tokens:                      1951
Total input text tokens:                 1951
Total input vision tokens:               0
Total generated tokens:                  2755
Total generated tokens (retokenized):    2748
Request throughput (req/s):              0.22
Input token throughput (tok/s):          43.42
Output token throughput (tok/s):         61.32
Peak output token throughput (tok/s):    64.00
Peak concurrent requests:                3
Total token throughput (tok/s):          104.74
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4489.56
Median E2E Latency (ms):                 4994.53
---------------Time to First Token----------------
Mean TTFT (ms):                          141.22
Median TTFT (ms):                        158.28
P99 TTFT (ms):                           166.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.40
Median TPOT (ms):                        15.63
P99 TPOT (ms):                           39.88
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.78
Median ITL (ms):                         15.76
P95 ITL (ms):                            16.36
P99 ITL (ms):                            16.59
Max ITL (ms):                            19.94
==================================================

5.1.2 Throughput-Sensitive Benchmark

Model Deployment Command:

python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --ep 4 \
    --enable-dp-attention \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Benchmark Command:

python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 8000 \
  --model moonshotai/Kimi-K2-Instruct \
  --num-prompts 1000 \
  --max-concurrency 100

Test Results:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  174.11
Total input tokens:                      296642
Total input text tokens:                 296642
Total input vision tokens:               0
Total generated tokens:                  193831
Total generated tokens (retokenized):    168687
Request throughput (req/s):              5.74
Input token throughput (tok/s):          1703.73
Output token throughput (tok/s):         1113.25
Peak output token throughput (tok/s):    2383.00
Peak concurrent requests:                112
Total token throughput (tok/s):          2816.97
Concurrency:                             89.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15601.09
Median E2E Latency (ms):                 10780.52
---------------Time to First Token----------------
Mean TTFT (ms):                          457.42
Median TTFT (ms):                        221.62
P99 TTFT (ms):                           2475.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.23
Median TPOT (ms):                        85.61
P99 TPOT (ms):                           435.95
---------------Inter-Token Latency----------------
Mean ITL (ms):                           78.61
Median ITL (ms):                         43.66
P95 ITL (ms):                            169.53
P99 ITL (ms):                            260.91
Max ITL (ms):                            1703.21
==================================================
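
Several of the reported figures can be cross-checked against the others, which is a useful sanity check when reading benchmark logs. The sketch below recomputes output throughput as generated tokens divided by benchmark duration, and average concurrency as the total time spent in requests divided by duration. The second formula is an assumption about how the Concurrency line is derived, although it reproduces both reported values. The helper name `derived_metrics` is hypothetical, and all inputs are copied from the two result tables above.

```python
def derived_metrics(requests, duration_s, generated_tokens, mean_e2e_ms):
    """Recompute throughput and average concurrency from reported totals."""
    output_tok_per_s = generated_tokens / duration_s
    # Assumed definition: total time spent in requests / wall-clock duration
    concurrency = requests * (mean_e2e_ms / 1000.0) / duration_s
    return output_tok_per_s, concurrency


# Values copied from the latency-sensitive run (max concurrency 1)
latency_run = derived_metrics(10, 44.93, 2755, 4489.56)
# Values copied from the throughput-sensitive run (max concurrency 100)
throughput_run = derived_metrics(1000, 174.11, 193831, 15601.09)

print(f"latency run:    {latency_run[0]:.2f} tok/s, concurrency {latency_run[1]:.2f}")
print(f"throughput run: {throughput_run[0]:.2f} tok/s, concurrency {throughput_run[1]:.2f}")

# Trade-off between the two configurations, using the reported means
throughput_gain = 1113.25 / 61.32
tpot_slowdown = 97.23 / 18.40
print(f"output throughput gain: {throughput_gain:.1f}x")
print(f"mean TPOT slowdown:     {tpot_slowdown:.1f}x")
```

The recomputed values agree with the reported ones to within rounding. Comparing the two runs, the throughput-oriented configuration delivers roughly 18 times the aggregate output throughput, while each request sees a mean time per output token about 5 times higher.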

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

Server Command

python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2-Instruct \
    --tp 8 \
    --dp 4 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Benchmark Command

python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000

Result:

Accuracy: 0.960
Invalid: 0.000
Latency: 15.956 s
Output throughput: 1231.699 token/s
