Motivation

In multimodal inference services, the visual encoder (ViT / Vision Transformer) typically has a few characteristic traits:
  • Many layers, fragmented operators: each layer includes LN, QKV projections, attention, MLP, residual connections, etc., resulting in extremely frequent kernel launches.
  • Server-side "small batch / low latency" workloads are common: the batch size is very small (sometimes effectively 1 after the batch is "flattened"), so kernel launch overhead accounts for a large portion of end-to-end latency.
  • The input token count (number of patches) varies frequently: different image/video resolutions and different batch compositions lead to different sequence lengths S, and this is precisely the biggest obstacle for CUDA Graph (unstable shapes).
The value of CUDA Graph: it captures a long sequence of GPU kernels with fixed shapes and fixed memory addresses into a graph; later, for the same shapes, it can replay the graph directly, dramatically reducing launch overhead and making GPU scheduling more compact. This led us to add a CUDA Graph enabled path for ViT in order to improve ViT performance.

Design and Restrictions

The new CUDA Graph enabled ViT logic is built on ViTCudaGraphRunner. This runner captures the "blocks + merger + deepstack merger (optional)" part of a vision transformer into a CUDA graph and replays it for identical shapes. See the following design considerations and restrictions for more details.

Fitting dynamic inputs to the static constraints of CUDA Graph

Variable sequence length S is very common in ViT, while CUDA Graph requires fixed shapes. The solution is to build a graph cache keyed by S (e.g., graph_key = S): the first time a new S is seen, capture a graph; afterwards, replay it. If there are many distinct S values, VRAM usage increases, because each graph keeps its own graph-private memory pool.
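A minimal sketch of such a capture-and-replay cache, written against plain PyTorch CUDA graph APIs; ViTGraphCache, run_vit_blocks, and the buffer names are illustrative placeholders, not the actual ViTCudaGraphRunner interface:

import torch

class ViTGraphCache:
    """Capture once per sequence length S, then replay for identical shapes."""

    def __init__(self, run_vit_blocks):
        self.run = run_vit_blocks     # e.g. blocks + merger forward pass
        self.graphs = {}              # graph_key = S -> (graph, static_in, static_out)

    def forward(self, x):             # x: [S, hidden_size]
        s = x.shape[0]                # graph_key = S
        if s not in self.graphs:
            static_in = x.clone()     # static buffer with a stable address
            self.run(static_in)       # warm-up launch before capture
            torch.cuda.synchronize()
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                static_out = self.run(static_in)
            self.graphs[s] = (g, static_in, static_out)
        g, static_in, static_out = self.graphs[s]
        static_in.copy_(x)            # only contents may change, never addresses
        g.replay()
        return static_out

Each captured graph holds on to its own workspace, which is why many distinct S values translate directly into extra VRAM.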

Stable addresses

Everything “parameter-like” becomes a static buffer:
  • block_input / block_ws / block_output
  • cu_full_len / cu_window_len and their kk variants
  • sin_cos_ws
This satisfies the underlying CUDA Graph requirement: during replay you are not allowed to swap tensors for new ones; you may only modify the contents of the captured tensors in place.
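A tiny, self-contained illustration of that rule (plain PyTorch, not SGLang code): rebinding a Python name to a new tensor moves the address, and the captured graph silently keeps reading the old buffer, whereas an in-place copy_ is visible to the replay.

import torch

static_x = torch.zeros(8, device="cuda")
_ = static_x * 2                      # warm-up launch before capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = static_x * 2           # the graph records the address of static_x

static_x.copy_(torch.arange(8.0, device="cuda"))   # OK: contents change in place
g.replay()
print(static_y)                       # doubled values, as expected

static_x = torch.arange(8.0, device="cuda") + 1    # NOT OK: rebinds to a new address
g.replay()
print(static_y)                       # unchanged: the graph still reads the old buffer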

Attention backend arguments

Attention backend arguments are fixed inside the graph:
  • TritonAttn expects [cu_seqlens, cu_seqlens_kk, max_len]
  • FA3 expects [cu_seqlens, max_len]
max_len is frozen as an int constant. cu_seqlens is cached into a dict during create_graph(), and its contents are not updated during subsequent replays. Therefore, for the same graph_key = S, not only must the input shape match, but the segmentation pattern in cu_seqlens (and the window seqlens) must also be identical; otherwise, attention will segment the sequence incorrectly.
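Purely as an illustration (this helper is hypothetical, not part of the runner), a replay-safety check for that restriction could look like:

import torch

def segmentation_matches(cached_cu_seqlens: torch.Tensor,
                         new_cu_seqlens: torch.Tensor) -> bool:
    # Same total length (same graph_key = S) is not enough: the per-image
    # boundaries baked into the captured graph must match exactly.
    return (cached_cu_seqlens.shape == new_cu_seqlens.shape
            and torch.equal(cached_cu_seqlens, new_cu_seqlens))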

Rotary buffer management

The feature reallocates a larger sin_cos_ws when seq_len increases. max_content_len caps the maximum size of the allocated rotary buffer.
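A minimal sketch of this grow-only policy, assuming the cap works as described above; the function name and signature are illustrative, and the real allocation logic lives inside the runner:

import torch

def ensure_sin_cos_ws(sin_cos_ws, seq_len, rotary_dim, max_content_len,
                      device="cuda", dtype=torch.float32):
    """Grow the rotary sin/cos workspace only when a longer seq_len is seen."""
    target = min(seq_len, max_content_len)          # max_content_len caps the buffer
    if sin_cos_ws is None or sin_cos_ws.shape[0] < target:
        sin_cos_ws = torch.empty(target, rotary_dim, device=device, dtype=dtype)
    return sin_cos_ws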

Command Example

You can enable CUDA Graph for ViT by setting the environment variable SGLANG_VIT_ENABLE_CUDA_GRAPH=1, for example:
SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-VL-8B-Instruct
Or you can run CUDA Graph for ViT together with the Piecewise CUDA Graph feature by setting both the environment variable SGLANG_VIT_ENABLE_CUDA_GRAPH=1 and the --enable-piecewise-cuda-graph flag, for example:
SGLANG_VIT_ENABLE_CUDA_GRAPH=1 \
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-VL-8B-Instruct \
  --piecewise-cuda-graph-max-tokens 4096 \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler eager

Known supported models