Support Matrix
The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the SGLang documentation on DeepSeek MLA and the original DeepSeek MLA paper.

MHA Backends
| Backend | Page Size > 1 (native) | FP8 KV Cache | FP4 KV Cache | Spec topk=1 | Spec topk>1 | Sliding Window | MultiModal |
|---|---|---|---|---|---|---|---|
| FlashInfer | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| FA4 (FlashAttention 4) | 128 | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Triton | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Torch Native (SDPA) | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| FlexAttention (PyTorch) | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| TRTLLM MHA | 16, 32 or 64 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| Dual Chunk FlashAttention | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AITER (ROCm) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Wave (ROCm) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Ascend (NPU) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Intel XPU | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Intel AMX (CPU) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
MLA Backends
| Backend | Native Page Sizes | FP8 KV Cache | FP4 KV Cache | Chunked Prefix Cache | Spec topk=1 | Spec topk>1 |
|---|---|---|---|---|---|---|
| FlashInfer MLA | 1 | ❌ | ✅ | ✅ | ✅ | ❌ |
| FlashMLA | 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| Cutlass MLA | 128 | ✅ | ✅ | ✅ | ✅ | ❌ |
| TRTLLM MLA (Blackwell) | 32 or 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | n/a | ❌ | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
| Triton | n/a | ❌ | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
| FA4 | 1 | ❌ | ✅ | ❌ | ❌ | ❌ |
| Ascend MLA (NPU) | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
Multimodal attention is selected by --mm-attention-backend. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
- FlashAttention 4 is prefill-only for now.
- NSA is specifically designed for DeepSeek V3.2 DSA.
For the FP4 KV cache (KV4) scenario, FA4 requires a different --decode-attention-backend to run. Except for trtllm_mha, which is incompatible with FA4, all other decode backends behave as shown in the table.
Most backends emulate page_size > 1 at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes that cannot be reduced or emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
MLA page-size constraints:
- FlashInfer MLA: page_size = 1.
- FlashMLA: page_size = 64.
- Cutlass MLA: page_size = 128.
- TRTLLM MLA: page_size ∈ {32, 64}.
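The wrapper-layer emulation described above (expanding page tables to per-token indices) can be sketched as follows. The helper name and layout are illustrative, not SGLang's actual implementation:

```python
def expand_page_table(page_indices, page_size, seq_len):
    """Expand page-level KV cache indices into per-token slot indices.

    Backends without native paging receive one slot index per token;
    page p covers slots [p * page_size, (p + 1) * page_size).
    """
    token_indices = []
    for page in page_indices:
        base = page * page_size
        token_indices.extend(range(base, base + page_size))
    # Trim the last, possibly partially filled page to the true length.
    return token_indices[:seq_len]

# A 2-page table with page_size=4 and a 6-token sequence:
# pages [3, 7] -> slots [12, 13, 14, 15, 28, 29]
print(expand_page_table([3, 7], 4, 6))
```

This is why a backend that only supports page_size = 1 can still serve a server configured with a larger page size, at the cost of larger index tensors.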
Hybrid attention (different backends for prefill vs decode) (Experimental)
You can mix and match attention backends for prefill and decode. This is useful when one backend excels at prefill and another excels at decode. For implementation details, please see python/sglang/srt/layers/attention/hybrid_attn_backend.py.
Speculative decoding with hybrid attention
Hybrid attention also works with speculative decoding. The backend used for draft decoding and target verification depends on --speculative-attention-mode:
- --speculative-attention-mode decode (recommended): draft/verify use the decode backend.
- --speculative-attention-mode prefill (default): draft/verify use the prefill backend.
- If any attention backend is trtllm_mha, speculative decoding supports only --speculative-eagle-topk 1.
- For paged MHA backends with --page-size > 1 and --speculative-eagle-topk > 1, only flashinfer is supported.
- CUDA Graph: the decode backend is always captured; the prefill backend is captured only when --speculative-attention-mode prefill.
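The compatibility rules above can be condensed into a quick configuration check. The function and argument names here are hypothetical, not SGLang's API; it only encodes the constraints stated in this section:

```python
def check_spec_decoding_config(prefill_backend, decode_backend,
                               page_size, eagle_topk):
    """Validate a hybrid-attention + speculative-decoding configuration
    against the constraints above. Raises ValueError when unsupported."""
    backends = {prefill_backend, decode_backend}
    # trtllm_mha only supports speculative decoding with topk == 1.
    if "trtllm_mha" in backends and eagle_topk != 1:
        raise ValueError("trtllm_mha requires --speculative-eagle-topk 1")
    # With page_size > 1 and topk > 1, only flashinfer is supported.
    if page_size > 1 and eagle_topk > 1 and backends != {"flashinfer"}:
        raise ValueError("page_size > 1 with topk > 1 requires flashinfer")
```

For example, `check_spec_decoding_config("trtllm_mha", "trtllm_mha", 1, 2)` raises, while an all-flashinfer configuration with `page_size=16, eagle_topk=4` passes.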
Attention Backend Selection Guide (CUDA)
If the --attention-backend argument is not specified, SGLang automatically selects the best backend based on the hardware (CUDA) and the model architecture.
Automatic Selection Logic
1. MHA Models (e.g., Llama, Qwen)
   - Hopper (e.g., H100, H200): Defaults to fa3 if using CUDA 12.3+ and the model configuration is supported.
   - Blackwell (e.g., B200): Defaults to trtllm_mha, unless using speculative decoding with topk > 1.
   - Other Architectures (Ampere, Ada, etc.): Defaults to flashinfer if available; otherwise falls back to triton.
2. MLA Models (e.g., DeepSeek)
   - Hopper: Defaults to fa3 (requires CUDA 12.3+).
   - Blackwell: Defaults to trtllm_mla.
   - Other Architectures: Defaults to triton.
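The selection rules above can be mirrored in a small sketch. This is illustrative only and simplifies SGLang's real dispatch; the function names, the architecture strings, and the fallback for Blackwell with topk > 1 (which falls through to the generic default here) are assumptions:

```python
def select_mha_backend(arch, cuda_version, flashinfer_available,
                       spec_topk=1):
    """Mirror the automatic MHA selection rules above (illustrative)."""
    if arch == "hopper" and cuda_version >= (12, 3):
        return "fa3"
    # trtllm_mha is skipped for speculative decoding with topk > 1.
    if arch == "blackwell" and spec_topk <= 1:
        return "trtllm_mha"
    # Ampere, Ada, etc. fall back here.
    return "flashinfer" if flashinfer_available else "triton"

def select_mla_backend(arch, cuda_version):
    """Mirror the automatic MLA selection rules above (illustrative)."""
    if arch == "hopper" and cuda_version >= (12, 3):
        return "fa3"
    if arch == "blackwell":
        return "trtllm_mla"
    return "triton"
```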
User Guide
Launch Command for Different Attention Backends
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
- FlashAttention 3 (Default for Hopper Machines, e.g., H100, H200, H20)
- Triton
- FlashMLA
- TRTLLM MLA (Optimized for Blackwell Architecture, e.g., B200)
- TRTLLM MLA with FP8 KV Cache (Higher concurrency, lower memory footprint)
- FlashAttention 4 (MHA & MLA)
- Cutlass MLA
- Ascend
- Intel XPU
- Wave
- FlexAttention
- Dual Chunk FlashAttention
- Torch Native
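The entries above select a backend via --attention-backend (or, for hybrid attention, the separate prefill/decode flags). A representative launch command, shown here as a Python argv for clarity; the model paths are placeholders and the hybrid flag names are taken from the hybrid-attention section above:

```python
import shlex

# Single-backend launch (e.g., FlashInfer on an A100):
argv = shlex.split(
    "python3 -m sglang.launch_server"
    " --model-path meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    " --attention-backend flashinfer"
)

# Hybrid prefill/decode selection uses the dedicated flags instead:
hybrid = shlex.split(
    "python3 -m sglang.launch_server"
    " --model-path deepseek-ai/DeepSeek-V3"  # placeholder model
    " --prefill-attention-backend fa3"
    " --decode-attention-backend flashinfer"
)
```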
Steps to add a new attention backend
To add a new attention backend, you can learn from the existing backends (python/sglang/srt/layers/attention/triton_backend.py, python/sglang/srt/layers/attention/flashattention_backend.py)
and follow the steps below.
- Run without cuda graph. Support the following functions:
  - forward_extend
    - Will be used for prefill, prefill with KV cache, and target verification
    - It will be called once per layer
  - forward_decode
    - Will be used for normal decode and draft decode
    - It will be called once per layer
  - init_forward_metadata
    - Initialize the class and common metadata shared by all layers
    - Call the plan function for optimizations like split_kv
    - It will be called once per forward pass
- Run with cuda graph. It has two phases (capture and replay) and you need to implement three functions:
  - init_cuda_graph_state
    - It will be called once during the lifetime of the backend
    - Create all common shared buffers
  - init_forward_metadata_capture_cuda_graph
    - It will be called before capturing a cuda graph
    - It is similar to init_forward_metadata but writes the metadata to some pre-defined buffers
  - init_forward_metadata_replay_cuda_graph
    - It will be called before replaying a cuda graph
    - This function is on the critical path and needs to be fast
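The steps above can be sketched as a skeleton class. The method names follow the text, but the signatures are simplified assumptions, not SGLang's exact base-class interface; consult the referenced backend files for the real signatures:

```python
class MyAttentionBackend:
    """Skeleton of a new attention backend (simplified signatures)."""

    def init_forward_metadata(self, forward_batch):
        # Called once per forward pass: build metadata shared by all
        # layers and run any planning step (e.g., split_kv).
        raise NotImplementedError

    def forward_extend(self, q, k, v, layer, forward_batch):
        # Prefill, prefill with KV cache, and target verification.
        # Called once per layer.
        raise NotImplementedError

    def forward_decode(self, q, k, v, layer, forward_batch):
        # Normal decode and draft decode. Called once per layer.
        raise NotImplementedError

    # --- CUDA graph support (capture and replay phases) ---
    def init_cuda_graph_state(self, max_bs):
        # Called once in the backend's lifetime: allocate the shared
        # buffers reused across all captures and replays.
        raise NotImplementedError

    def init_forward_metadata_capture_cuda_graph(self, bs):
        # Called before capturing a graph: like init_forward_metadata,
        # but writes the metadata into the pre-allocated buffers.
        raise NotImplementedError

    def init_forward_metadata_replay_cuda_graph(self, bs):
        # Called before replaying a graph; on the critical path, so it
        # must be fast (in-place buffer updates only).
        raise NotImplementedError
```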
