Launch Llama 4 with SGLang
To serve Llama 4 models on 8xH100/H200 GPUs, launch an SGLang server with tensor parallelism across all eight GPUs (`--tp 8`).

Configuration Tips
- OOM Mitigation: Adjust `--context-length` to avoid GPU out-of-memory errors. For the Scout model, we recommend setting this value up to 1M on 8*H100 and up to 2.5M on 8*H200. For the Maverick model, no context length needs to be set on 8*H200. When the hybrid KV cache is enabled, `--context-length` can be set up to 5M on 8*H100 and up to 10M on 8*H200 for the Scout model.
- Attention Backend Auto-Selection: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware, so you typically don't need to specify `--attention-backend` manually:
  - Blackwell GPUs (B200/GB200): `trtllm_mha`
  - Hopper GPUs (H100/H200): `fa3`
  - AMD GPUs: `aiter`
  - Intel XPU: `intel_xpu`
  - Other platforms: `triton` (fallback)

  To override the automatic selection, set `--attention-backend` to one of the supported backends: `fa3`, `aiter`, `triton`, `trtllm_mha`, or `intel_xpu`.
- Chat Template: Add `--chat-template llama-4` for chat completion tasks.
- Enable Multi-Modal: Add `--enable-multimodal` for multi-modal capabilities.
- Enable Hybrid KV Cache: Set `--swa-full-tokens-ratio` to adjust the ratio of SWA-layer KV tokens (for Llama 4, the local attention layers) to full-layer KV tokens (default: 0.8, range: 0-1).
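Putting the tips above together, a launch command for the Scout model might look like the following sketch; the model path and context length are illustrative, so adjust them for your hardware:

```shell
# Serve Llama 4 Scout on 8 GPUs with tensor parallelism.
# The server exposes an OpenAI-compatible API (default port 30000).
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000 \
  --chat-template llama-4 \
  --enable-multimodal
```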
EAGLE Speculative Decoding
Description: SGLang supports Llama 4 Maverick (400B) with EAGLE speculative decoding.

Usage: Add the arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:
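A sketch of such a launch command for Maverick on 8 GPUs; the speculative step, top-k, and draft-token values here are illustrative choices, not tuned recommendations, and the algorithm name assumes the EAGLE3 variant matching the EAGLE3 draft model:

```shell
# Launch Maverick with EAGLE3 speculative decoding (values are illustrative).
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --tp 8 \
  --chat-template llama-4
```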
- Note: The Llama 4 draft model nvidia/Llama-4-Maverick-17B-128E-Eagle3 can only recognize conversations in chat mode.
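In practice this means querying the server through the OpenAI-compatible chat endpoint rather than the plain completions endpoint. A minimal sketch of such a request, assuming a server on the default port (host, port, and prompt are illustrative):

```shell
# Send a chat-mode request to the running SGLang server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "messages": [{"role": "user", "content": "Summarize EAGLE speculative decoding in one sentence."}]
      }'
```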
Benchmarking Results
Accuracy Test with lm_eval
The accuracy on SGLang for both Llama 4 Scout and Llama 4 Maverick matches the official benchmark numbers.

Benchmark results on the MMLU Pro dataset with 8*H100:
| | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
|---|---|---|
| Official Benchmark | 74.3 | 80.5 |
| SGLang | 75.2 | 80.7 |
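As a rough sketch, such numbers can be reproduced with lm_eval against a running SGLang server through its OpenAI-compatible endpoint; the model name, URL, and concurrency below are illustrative assumptions, not the exact invocation used for the table:

```shell
# Evaluate a served model on MMLU Pro via the OpenAI-compatible API
# (assumes an SGLang server is already running on localhost:30000).
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=64 \
  --tasks mmlu_pro \
  --apply_chat_template
```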
