Support Matrix
The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the SGLang documentation on DeepSeek MLA and the original DeepSeek MLA paper.

MHA Backends
| Backend | Page Size > 1 (native) | FP8 KV Cache | FP4 KV Cache | Spec topk=1 | Spec topk>1 | Sliding Window | MultiModal |
|---|---|---|---|---|---|---|---|
| FlashInfer | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| FA4 (FlashAttention 4) | 128 | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Triton | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Torch Native (SDPA) | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| FlexAttention (PyTorch) | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| TRTLLM MHA | 16, 32 or 64 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| Dual Chunk FlashAttention | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AITER (ROCm) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Wave (ROCm) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Ascend (NPU) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Intel XPU | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Intel AMX (CPU) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
MLA Backends
| Backend | Native Page Sizes | FP8 KV Cache | FP4 KV Cache | Chunked Prefix Cache | Spec topk=1 | Spec topk>1 |
|---|---|---|---|---|---|---|
| FlashInfer MLA | 1 | ❌ | ✅ | ✅ | ✅ | ❌ |
| FlashMLA | 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| Cutlass MLA | 128 | ✅ | ✅ | ✅ | ✅ | ❌ |
| TRTLLM MLA (Blackwell) | 32 or 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | n/a | ❌ | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
| Triton | n/a | ❌ | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
| FA4 | 1 | ❌ | ✅ | ❌ | ❌ | ❌ |
| Ascend MLA (NPU) | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
Multimodal attention is selected by --mm-attention-backend. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
- FlashAttention 4 is prefill-only for now.
- NSA is specifically designed for DeepSeek V3.2 DSA.
For the FP4 KV cache (KV4) scenario, FA4 requires a different --decode-attention-backend to run. Except for trtllm_mha, which is incompatible with FA4, all other decode backends behave as shown in the table.
Most backends emulate page_size > 1 at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes that cannot be reduced or emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
MLA page-size constraints:
- FlashInfer MLA: page_size = 1.
- FlashMLA: page_size = 64.
- Cutlass MLA: page_size = 128.
- TRTLLM MLA: page_size ∈ {32, 64}.
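The wrapper-layer emulation described above (expanding page tables to per-token indices) can be sketched as follows. The helper name and layout are illustrative, not SGLang's actual implementation:

```python
def expand_page_table(page_indices, page_size, seq_len):
    """Expand page-level KV cache indices into per-token slot indices.

    Backends without native paging receive one slot index per token;
    page p covers slots [p * page_size, (p + 1) * page_size).
    """
    token_indices = []
    for page in page_indices:
        base = page * page_size
        token_indices.extend(range(base, base + page_size))
    # Trim the last, possibly partially filled page to the true length.
    return token_indices[:seq_len]

# A 2-page table with page_size=4 and a 6-token sequence:
# pages [3, 7] -> slots [12, 13, 14, 15, 28, 29]
print(expand_page_table([3, 7], 4, 6))
```

This is why a backend that only supports page_size = 1 can still serve a server configured with a larger page size, at the cost of larger index tensors.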
Hybrid attention (different backends for prefill vs decode) (Experimental)
You can mix and match attention backends for prefill and decode. This is useful when one backend excels at prefill and another excels at decode. For implementation details, please see python/sglang/srt/layers/attention/hybrid_attn_backend.py.
Speculative decoding with hybrid attention
Hybrid attention also works with speculative decoding. The backend used for draft decoding and target verification depends on --speculative-attention-mode:
- --speculative-attention-mode decode (recommended): draft/verify use the decode backend.
- --speculative-attention-mode prefill (default): draft/verify use the prefill backend.
- If any attention backend is trtllm_mha, speculative decoding supports only --speculative-eagle-topk 1.
- For paged MHA backends with --page-size > 1 and --speculative-eagle-topk > 1, only flashinfer is supported.
- CUDA Graph: the decode backend is always captured; the prefill backend is captured only when --speculative-attention-mode prefill.
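The compatibility rules above can be condensed into a quick configuration check. The function and argument names here are hypothetical, not SGLang's API; it only encodes the constraints stated in this section:

```python
def check_spec_decoding_config(prefill_backend, decode_backend,
                               page_size, eagle_topk):
    """Validate a hybrid-attention + speculative-decoding configuration
    against the constraints above. Raises ValueError when unsupported."""
    backends = {prefill_backend, decode_backend}
    # trtllm_mha only supports speculative decoding with topk == 1.
    if "trtllm_mha" in backends and eagle_topk != 1:
        raise ValueError("trtllm_mha requires --speculative-eagle-topk 1")
    # With page_size > 1 and topk > 1, only flashinfer is supported.
    if page_size > 1 and eagle_topk > 1 and backends != {"flashinfer"}:
        raise ValueError("page_size > 1 with topk > 1 requires flashinfer")
```

For example, `check_spec_decoding_config("trtllm_mha", "trtllm_mha", 1, 2)` raises, while an all-flashinfer configuration with `page_size=16, eagle_topk=4` passes.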
Attention Backend Selection Guide (CUDA)
If the --attention-backend argument is not specified, SGLang automatically selects the best backend based on the hardware (CUDA) and the model architecture.
Automatic Selection Logic
1. MHA Models (e.g., Llama, Qwen)
   - Hopper (e.g., H100, H200): Defaults to fa3 if using CUDA 12.3+ and the model configuration is supported.
   - Blackwell (e.g., B200): Defaults to trtllm_mha, unless using speculative decoding with topk > 1.
   - Other Architectures (Ampere, Ada, etc.): Defaults to flashinfer if available; otherwise falls back to triton.
2. MLA Models (e.g., DeepSeek)
   - Hopper: Defaults to fa3 (requires CUDA 12.3+).
   - Blackwell: Defaults to trtllm_mla.
   - Other Architectures: Defaults to triton.
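The selection rules above can be mirrored in a small sketch. This is illustrative only and simplifies SGLang's real dispatch; the function names, the architecture strings, and the fallback for Blackwell with topk > 1 (which falls through to the generic default here) are assumptions:

```python
def select_mha_backend(arch, cuda_version, flashinfer_available,
                       spec_topk=1):
    """Mirror the automatic MHA selection rules above (illustrative)."""
    if arch == "hopper" and cuda_version >= (12, 3):
        return "fa3"
    # trtllm_mha is skipped for speculative decoding with topk > 1.
    if arch == "blackwell" and spec_topk <= 1:
        return "trtllm_mha"
    # Ampere, Ada, etc. fall back here.
    return "flashinfer" if flashinfer_available else "triton"

def select_mla_backend(arch, cuda_version):
    """Mirror the automatic MLA selection rules above (illustrative)."""
    if arch == "hopper" and cuda_version >= (12, 3):
        return "fa3"
    if arch == "blackwell":
        return "trtllm_mla"
    return "triton"
```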
User Guide
Launch Command for Different Attention Backends
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
- FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20)
- Triton
- FlashMLA
- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
- TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint)
- FlashAttention 4 (MHA & MLA)
- Cutlass MLA
- Ascend
- Intel XPU
- Wave
- FlexAttention
- Dual Chunk FlashAttention
- Torch Native
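The entries above select a backend via --attention-backend (or, for hybrid attention, the separate prefill/decode flags). A representative launch command, shown here as a Python argv for clarity; the model paths are placeholders and the hybrid flag names are taken from the hybrid-attention section above:

```python
import shlex

# Single-backend launch (e.g., FlashInfer on an A100):
argv = shlex.split(
    "python3 -m sglang.launch_server"
    " --model-path meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    " --attention-backend flashinfer"
)

# Hybrid prefill/decode selection uses the dedicated flags instead:
hybrid = shlex.split(
    "python3 -m sglang.launch_server"
    " --model-path deepseek-ai/DeepSeek-V3"  # placeholder model
    " --prefill-attention-backend fa3"
    " --decode-attention-backend flashinfer"
)
```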
Steps to add a new attention backend
To add a new attention backend, you can learn from the existing backends (python/sglang/srt/layers/attention/triton_backend.py, python/sglang/srt/layers/attention/flashattention_backend.py)
and follow the steps below.
- Run without cuda graph. Support the following functions:
  - forward_extend
    - Will be used for prefill, prefill with KV cache, and target verification
    - It will be called once per layer
  - forward_decode
    - Will be used for normal decode and draft decode
    - It will be called once per layer
  - init_forward_metadata
    - Initialize the class and common metadata shared by all layers
    - Call the plan function for optimizations like split_kv
    - It will be called once per forward pass
- Run with cuda graph. It has two phases (capture and replay) and you need to implement three functions:
  - init_cuda_graph_state
    - It will be called once during the lifetime of the backend
    - Create all common shared buffers
  - init_forward_metadata_capture_cuda_graph
    - It will be called before capturing a cuda graph
    - It is similar to init_forward_metadata but writes the metadata to some pre-defined buffers
  - init_forward_metadata_replay_cuda_graph
    - It will be called before replaying a cuda graph
    - This function is on the critical path and needs to be fast
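The steps above can be sketched as a skeleton class. The method names follow the text, but the signatures are simplified assumptions, not SGLang's exact base-class interface; consult the referenced backend files for the real signatures:

```python
class MyAttentionBackend:
    """Skeleton of a new attention backend (simplified signatures)."""

    def init_forward_metadata(self, forward_batch):
        # Called once per forward pass: build metadata shared by all
        # layers and run any planning step (e.g., split_kv).
        raise NotImplementedError

    def forward_extend(self, q, k, v, layer, forward_batch):
        # Prefill, prefill with KV cache, and target verification.
        # Called once per layer.
        raise NotImplementedError

    def forward_decode(self, q, k, v, layer, forward_batch):
        # Normal decode and draft decode. Called once per layer.
        raise NotImplementedError

    # --- CUDA graph support (capture and replay phases) ---
    def init_cuda_graph_state(self, max_bs):
        # Called once in the backend's lifetime: allocate the shared
        # buffers reused across all captures and replays.
        raise NotImplementedError

    def init_forward_metadata_capture_cuda_graph(self, bs):
        # Called before capturing a graph: like init_forward_metadata,
        # but writes the metadata into the pre-allocated buffers.
        raise NotImplementedError

    def init_forward_metadata_replay_cuda_graph(self, bs):
        # Called before replaying a graph; on the critical path, so it
        # must be fast (in-place buffer updates only).
        raise NotImplementedError
```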
