SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels from sgl-kernel and an efficient scheduler loop.

Key Features

  • Broad Model Support: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
  • Fast Inference: Optimized kernels from sgl-kernel, efficient scheduler loop, and Cache-DiT acceleration
  • Ease of Use: OpenAI-compatible API, CLI, and Python SDK
  • Multi-Platform: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)

Install SGLang-diffusion

You can install sglang-diffusion using one of the methods below. This page primarily applies to common NVIDIA GPU platforms. For AMD Instinct/ROCm environments, see the dedicated ROCm quickstart, which lists the exact steps (including kernel builds) we used to validate sgl-diffusion on MI300X.

Method 1: With pip or uv

It is recommended to use uv for a faster installation:
pip install --upgrade pip
pip install uv
uv pip install "sglang[diffusion]" --prerelease=allow
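
To verify the installation, you can run a small generation task (the model and prompt here are only examples; the first run downloads weights from Hugging Face):
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output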

Method 2: From source

# Use the latest release branch
git clone https://github.com/sgl-project/sglang.git
cd sglang

# Install the Python packages
pip install --upgrade pip
pip install -e "python[diffusion]"

# With uv
uv pip install -e "python[diffusion]" --prerelease=allow

Method 3: Using Docker

The Docker images are available on Docker Hub at lmsysorg/sglang, built from the Dockerfile. Replace <secret> below with your HuggingFace Hub token.
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:dev \
    sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output

ROCm quickstart for sgl-diffusion

docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env HF_TOKEN=<secret> \
  lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
  sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output

Compatibility Matrix

The tables below list every supported model and the optimizations each supports. The symbols used have the following meanings:
  • ✅ = Full compatibility
  • ❌ = No compatibility
  • ⭕ = Does not apply to this model

Models x Optimization

The HuggingFace Model ID can be passed directly to from_pretrained() methods, and sglang-diffusion will pick optimal default parameters when initializing the pipeline and generating outputs.

Video Generation Models

| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) |
|---|---|---|---|---|---|---|
| FastWan2.1 T2V 1.3B | FastVideo/FastWan2.1-T2V-1.3B-Diffusers | 480p | | | | |
| FastWan2.2 TI2V 5B Full Attn | FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers | 720p | | | | |
| Wan2.2 TI2V 5B | Wan-AI/Wan2.2-TI2V-5B-Diffusers | 720p | | | | |
| Wan2.2 T2V A14B | Wan-AI/Wan2.2-T2V-A14B-Diffusers | 480p, 720p | | | | |
| Wan2.2 I2V A14B | Wan-AI/Wan2.2-I2V-A14B-Diffusers | 480p, 720p | | | | |
| HunyuanVideo | hunyuanvideo-community/HunyuanVideo | 720×1280, 544×960 | | | | |
| FastHunyuan | FastVideo/FastHunyuan-diffusers | 720×1280, 544×960 | | | | |
| Wan2.1 T2V 1.3B | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 480p | | | | |
| Wan2.1 T2V 14B | Wan-AI/Wan2.1-T2V-14B-Diffusers | 480p, 720p | | | | |
| Wan2.1 I2V 480P | Wan-AI/Wan2.1-I2V-14B-480P-Diffusers | 480p | | | | |
| Wan2.1 I2V 720P | Wan-AI/Wan2.1-I2V-14B-720P-Diffusers | 720p | | | | |
Note: Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.

Image Generation Models

| Model Name | HuggingFace Model ID | Resolutions |
|---|---|---|
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | Any resolution |
| FLUX.2-dev | black-forest-labs/FLUX.2-dev | Any resolution |
| FLUX.2-Klein | black-forest-labs/FLUX.2-klein-4B | Any resolution |
| Z-Image-Turbo | Tongyi-MAI/Z-Image-Turbo | Any resolution |
| GLM-Image | zai-org/GLM-Image | Any resolution |
| Qwen Image | Qwen/Qwen-Image | Any resolution |
| Qwen Image 2512 | Qwen/Qwen-Image-2512 | Any resolution |
| Qwen Image Edit | Qwen/Qwen-Image-Edit | Any resolution |

Verified LoRA Examples

This section lists example LoRAs that have been explicitly tested and verified with each base model in the SGLang Diffusion pipeline.
Important:
LoRAs that are not listed here are not necessarily incompatible. In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. The entries below simply reflect configurations that have been manually validated by the SGLang team.

Verified LoRAs by Base Model

| Base Model | Supported LoRAs |
|---|---|
| Wan2.2 | lightx2v/Wan2.2-Distill-Loras, Cseti/wan2.2-14B-Arcane_Jinx-lora-v1 |
| Wan2.1 | lightx2v/Wan2.1-Distill-Loras |
| Z-Image-Turbo | tarn59/pixel_art_style_lora_z_image_turbo, wcde/Z-Image-Turbo-DeJPEG-Lora |
| Qwen-Image | lightx2v/Qwen-Image-Lightning, flymy-ai/qwen-image-realism-lora, prithivMLmods/Qwen-Image-HeadshotX, starsfriday/Qwen-Image-EVA-LoRA |
| Qwen-Image-Edit | ostris/qwen_image_edit_inpainting, lightx2v/Qwen-Image-Edit-2511-Lightning |
| Flux | dvyio/flux-lora-simple-illustration, XLabs-AI/flux-furry-lora, XLabs-AI/flux-RealismLora |
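
For example, a verified LoRA can be applied at generation time via the --lora-path flag documented in the CLI section below (a sketch; any base model/LoRA pair from the table works the same way):
sglang generate --model-path Qwen/Qwen-Image \
    --lora-path lightx2v/Qwen-Image-Lightning \
    --prompt "A curious raccoon" \
    --save-output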

Special Requirements

Sliding Tile Attention: Currently, only Hopper GPUs (H100s) are supported.

SGLang diffusion CLI Inference

The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation.

Prerequisites

  • A working SGLang diffusion installation and the sglang CLI available in $PATH.
  • Python 3.11+ if you plan to use the OpenAI Python SDK.

Supported Arguments

Server Arguments

  • --model-path {MODEL_PATH}: Path to the model or model ID
  • --vae-path {VAE_PATH}: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path.
  • --lora-path {LORA_PATH}: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
  • --lora-nickname {NAME}: Nickname for the LoRA adapter. (default: default).
  • --num-gpus {NUM_GPUS}: Number of GPUs to use
  • --tp-size {TP_SIZE}: Tensor parallelism size (applies only to the text encoder; keep it at 1 when text-encoder offload is enabled, since layer-wise offload with prefetch is faster)
  • --sp-degree {SP_SIZE}: Sequence parallelism size (typically should match the number of GPUs)
  • --ulysses-degree {ULYSSES_DEGREE}: The degree of DeepSpeed-Ulysses-style SP in USP
  • --ring-degree {RING_DEGREE}: The degree of ring attention-style SP in USP
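
For example, the parallelism flags compose as in the serve examples later on this page; this sketch assumes 4 GPUs, with --ulysses-degree × --ring-degree covering the sequence-parallel group:
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --save-output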

Sampling Parameters

  • --prompt {PROMPT}: Text description of the image or video you want to generate
  • --num-inference-steps {STEPS}: Number of denoising steps
  • --negative-prompt {PROMPT}: Negative prompt to guide generation away from certain concepts
  • --seed {SEED}: Random seed for reproducible generation

Image/Video Configuration

  • --height {HEIGHT}: Height of the generated output
  • --width {WIDTH}: Width of the generated output
  • --num-frames {NUM_FRAMES}: Number of frames to generate
  • --fps {FPS}: Frames per second for the saved output, if this is a video-generation task

Output Options

  • --output-path {PATH}: Directory in which to save the generated image or video
  • --save-output: Whether to save the image/video to disk
  • --return-frames: Whether to return the raw frames
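
Putting the sampling, image/video, and output options together, a typical single invocation looks like this (the resolution, frame count, and fps are illustrative values, not tuned recommendations):
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --negative-prompt "blurry, low quality" \
  --num-inference-steps 50 \
  --seed 42 \
  --height 480 \
  --width 832 \
  --num-frames 81 \
  --fps 16 \
  --save-output \
  --output-path outputs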

Using Configuration Files

Instead of specifying all parameters on the command line, you can use a configuration file:
sglang generate --config {CONFIG_FILE_PATH}
The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file. Example configuration file (config.json):
{
    "model_path": "FastVideo/FastHunyuan-diffusers",
    "prompt": "A beautiful woman in a red dress walking down a street",
    "output_path": "outputs/",
    "num_gpus": 2,
    "sp_size": 2,
    "tp_size": 1,
    "num_frames": 45,
    "height": 720,
    "width": 1280,
    "num_inference_steps": 6,
    "seed": 1024,
    "fps": 24,
    "precision": "bf16",
    "vae_precision": "fp16",
    "vae_tiling": true,
    "vae_sp": true,
    "vae_config": {
        "load_encoder": false,
        "load_decoder": true,
        "tile_sample_min_height": 256,
        "tile_sample_min_width": 256
    },
    "text_encoder_precisions": [
        "fp16",
        "fp16"
    ],
    "mask_strategy_file_path": null,
    "enable_torch_compile": false
}
Or using YAML format (config.yaml):
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
To see all the options, you can use the --help flag:
sglang generate --help

Serve

Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.

Start the server

Use the following command to launch the server:
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
  • --model-path: Which model to load. The example uses Wan-AI/Wan2.1-T2V-1.3B-Diffusers.
  • --port: HTTP port to listen on (default: 30000).
For detailed API usage, including Image, Video Generation and LoRA management, please refer to the OpenAI API Documentation.
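
Once the server is up, a quick readiness check is to query the /models endpoint documented in the OpenAI API section (adjust the port if you passed --port):
curl -sS -X GET "http://localhost:30000/models"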

Generate

Run a one-off generation task without launching a persistent server. To use it, pass both server arguments and sampling parameters in one command, after the generate subcommand, for example:
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"

# Alternatively, set the SGLANG_CACHE_DIT_ENABLED env var to true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
Once the generation task has finished, the server will shut down automatically.
The HTTP server-related arguments are ignored in this subcommand.
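
Since generate shuts down after each task, sweeping a parameter such as the seed is just a shell loop over the same arguments (a sketch; each iteration pays the full model startup cost):
for seed in 0 1 2; do
  sglang generate "${SERVER_ARGS[@]}" \
    --prompt "A curious raccoon" \
    --seed "$seed" \
    --save-output \
    --output-path outputs \
    --output-file-name "raccoon_seed_${seed}.mp4"
done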

Diffusers Backend

SGLang diffusion supports a diffusers backend that allows you to run any diffusers-compatible model through SGLang’s infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.

Arguments

| Argument | Values | Description |
|---|---|---|
| --backend | auto (default), sglang, diffusers | auto: prefer native SGLang, fall back to diffusers. sglang: force native (fails if unavailable). diffusers: force vanilla diffusers pipeline. |
| --diffusers-attention-backend | flash, _flash_3_hub, sage, xformers, native | Attention backend for diffusers pipelines. See diffusers attention backends. |
| --trust-remote-code | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| --vae-tiling | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| --vae-slicing | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| --dit-precision | fp16, bf16, fp32 | Precision for the diffusion transformer. |
| --vae-precision | fp16, bf16, fp32 | Precision for the VAE. |

Example: Running Ovis-Image-7B

Ovis-Image-7B is a 7B text-to-image model optimized for high-quality text rendering.
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png

Extra Diffusers Arguments

For pipeline-specific parameters not exposed via CLI, use diffusers_kwargs in a config file:
{
    "model_path": "AIDC-AI/Ovis-Image-7B",
    "backend": "diffusers",
    "prompt": "A beautiful landscape",
    "diffusers_kwargs": {
        "cross_attention_kwargs": {"scale": 0.5}
    }
}
sglang generate --config config.json

SGLang Diffusion OpenAI API

The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.

Serve

Launch the server using the sglang serve command.

Start the server

SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
  --port 30010
)

sglang serve "${SERVER_ARGS[@]}"
  • --model-path: Path to the model or model ID.
  • --port: HTTP port to listen on (default: 30000).

Get Model Information

Endpoint: GET /models

Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings.

Curl Example:
curl -sS -X GET "http://localhost:30010/models"
Response Example:
{
  "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "task_type": "T2V",
  "pipeline_name": "wan_pipeline",
  "pipeline_class": "WanPipeline",
  "num_gpus": 4,
  "dit_precision": "bf16",
  "vae_precision": "fp16"
}

Endpoints

Image Generation

The server implements an OpenAI-compatible Images API under the /v1/images namespace.

Create an image

Endpoint: POST /v1/images/generations

Python Example (b64_json response):
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

img = client.images.generate(
    prompt="A calico cat playing a piano on stage",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

image_bytes = base64.b64decode(img.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json"
      }'
Note: The response_format=url option is not supported for POST /v1/images/generations and will return a 400 error.

Edit an image

Endpoint: POST /v1/images/edits

This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image.

Curl Example (b64_json response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=b64_json"
Curl Example (URL response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=url"

Download image content

When response_format=url is used with POST /v1/images/edits, the API returns a relative URL like /v1/images/<IMAGE_ID>/content.

Endpoint: GET /v1/images/{image_id}/content

Curl Example:
curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.png

Video Generation

The server implements a subset of the OpenAI Videos API under the /v1/videos namespace.

Create a video

Endpoint: POST /v1/videos

Python Example:
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

video = client.videos.create(
    prompt="A calico cat playing a piano on stage",
    size="1280x720"
)
print(f"Video ID: {video.id}, Status: {video.status}")
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1280x720"
      }'

List videos

Endpoint: GET /v1/videos

Python Example:
videos = client.videos.list()
for item in videos.data:
    print(item.id, item.status)
Curl Example:
curl -sS -X GET "http://localhost:30010/v1/videos" \
  -H "Authorization: Bearer sk-proj-1234567890"

Download video content

Endpoint: GET /v1/videos/{video_id}/content

Python Example:
import time

video_id = video.id  # ID returned by the create example above

# Poll for completion
while True:
    page = client.videos.list()
    item = next((v for v in page.data if v.id == video_id), None)
    if item and item.status == "completed":
        break
    time.sleep(5)

# Download content
resp = client.videos.download_content(video_id=video_id)
with open("output.mp4", "wb") as f:
    f.write(resp.read())
Curl Example:
curl -sS -L "http://localhost:30010/v1/videos/<VIDEO_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.mp4

LoRA Management

The server supports dynamic loading, merging, and unmerging of LoRA adapters.

Important Notes:
  • Mutual Exclusion: Only one LoRA can be merged (active) at a time
  • Switching: To switch LoRAs, you must first unmerge the current one, then set the new one
  • Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost

Set LoRA Adapter

Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.

Endpoint: POST /v1/set_lora

Parameters:
  • lora_nickname (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs
  • lora_path (string or list of strings/None, optional): Path to the .safetensors file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of lora_nickname
  • target (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of lora_nickname. Valid values:
    • "all" (default): Apply to all transformers
    • "transformer": Apply only to the primary transformer (high noise for Wan2.2)
    • "transformer_2": Apply only to transformer_2 (low noise for Wan2.2)
    • "critic": Apply only to the critic model
  • strength (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of lora_nickname. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
Single LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": "lora_name",
        "lora_path": "/path/to/lora.safetensors",
        "target": "all",
        "strength": 0.8
      }'
Multiple LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["lora_1", "lora_2"],
        "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"],
        "target": ["transformer", "transformer_2"],
        "strength": [0.8, 1.0]
      }'
Multiple LoRA with Same Target:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["style_lora", "character_lora"],
        "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"],
        "target": "all",
        "strength": [0.7, 0.9]
      }'
When using multiple LoRAs:
  • All list parameters (lora_nickname, lora_path, target, strength) must have the same length
  • If target or strength is a single value, it will be applied to all LoRAs
  • Multiple LoRAs applied to the same target will be merged in order

Merge LoRA Weights

Manually merges the currently set LoRA weights into the base model.
set_lora automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling set_lora again.
Endpoint: POST /v1/merge_lora_weights

Parameters:
  • target (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic"
  • strength (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
Curl Example:
curl -X POST http://localhost:30010/v1/merge_lora_weights \
  -H "Content-Type: application/json" \
  -d '{"strength": 0.8}'

Unmerge LoRA Weights

Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This must be called before setting a different LoRA.

Endpoint: POST /v1/unmerge_lora_weights

Curl Example:
curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
  -H "Content-Type: application/json"

List LoRA Adapters

Returns loaded LoRA adapters and current application status per module.

Endpoint: GET /v1/list_loras

Curl Example:
curl -sS -X GET "http://localhost:30010/v1/list_loras"
Response Example:
{
  "loaded_adapters": [
    { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" },
    { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" }
  ],
  "active": {
    "transformer": [
      {
        "nickname": "lora2",
        "path": "tarn59/pixel_art_style_lora_z_image_turbo",
        "merged": true,
        "strength": 1.0
      }
    ]
  }
}
Notes:
  • If LoRA is not enabled for the current pipeline, the server will return an error.
  • num_lora_layers_with_weights counts only layers that have LoRA weights applied for the active adapter.

Example: Switching LoRAs

  1. Set LoRA A:
    curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}'
    
  2. Generate with LoRA A…
  3. Unmerge LoRA A:
    curl -X POST http://localhost:30010/v1/unmerge_lora_weights
    
  4. Set LoRA B:
    curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
    
  5. Generate with LoRA B…

Attention Backends

This document describes the attention backends available in SGLang diffusion (sglang.multimodal_gen) and how to select them.

Overview

Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend. Backend selection is performed by the shared attention layers (e.g., LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer), so it applies to any model component using these layers (e.g., the diffusion transformer / DiT and the encoders). Platform defaults:
  • CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
  • ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
  • MPS: always uses PyTorch SDPA.

Backend options

The CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.
| CLI value | Enum value | Notes |
|---|---|---|
| fa / fa3 / fa4 | FA | FlashAttention. fa3/fa4 are normalized to fa during argument parsing (ServerArgs.__post_init__). |
| torch_sdpa | TORCH_SDPA | PyTorch scaled_dot_product_attention. |
| sliding_tile_attn | SLIDING_TILE_ATTN | Sliding Tile Attention (STA). Requires st_attn and a mask-strategy config file set via the SGLANG_DIFFUSION_ATTENTION_CONFIG environment variable. |
| sage_attn | SAGE_ATTN | Requires sageattention. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream setup.py: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| sage_attn_3 | SAGE_ATTN_3 | Requires SageAttention3 installed per upstream instructions. |
| video_sparse_attn | VIDEO_SPARSE_ATTN | Requires vsa. |
| vmoba_attn | VMOBA_ATTN | Requires kernel.attn.vmoba_attn.vmoba. |
| aiter | AITER | Requires aiter. |

Selection priority

The selection order in runtime/layers/attention/selector.py is:
  1. global_force_attn_backend(...) / global_force_attn_backend_context_manager(...)
  2. CLI --attention-backend (ServerArgs.attention_backend)
  3. Auto selection (platform capability, dtype, and installed packages)

Platform support matrix

| Backend | CUDA | ROCm | MPS | Notes |
|---|---|---|---|---|
| fa | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. |
| torch_sdpa | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| sliding_tile_attn | ✅ | ❌ | ❌ | CUDA-only. Requires st_attn and SGLANG_DIFFUSION_ATTENTION_CONFIG. |
| sage_attn | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| sage_attn_3 | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| video_sparse_attn | ✅ | ❌ | ❌ | CUDA-only. Requires vsa. |
| vmoba_attn | ✅ | ❌ | ❌ | CUDA-only. Requires kernel.attn.vmoba_attn.vmoba. |
| aiter | ❌ | ✅ | ❌ | Requires aiter. |

Usage

Select a backend via CLI

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa

Using Sliding Tile Attention (STA)

export SGLANG_DIFFUSION_ATTENTION_CONFIG=/abs/path/to/mask_strategy.json

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn

Notes for ROCm / MPS

  • ROCm: use --attention-backend torch_sdpa or fa depending on what is available in your environment.
  • MPS: the platform implementation always uses torch_sdpa.

Cache-DiT Acceleration

SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss.

Overview

Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
  • DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
  • TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
  • SCM (Step Computation Masking): Step-level caching control for additional speedup

Basic Usage

Enable Cache-DiT by exporting the environment variable and using sglang generate or sglang serve:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Advanced Configuration

DBCache Parameters

DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |

TaylorSeer Configuration

TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |

Combined Configuration Example

DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters simultaneously:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A curious raccoon in a forest"

SCM (Step Computation Masking)

SCM provides step-level caching control for additional speedup. It decides which denoising steps are computed fully and which reuse cached results.

SCM Presets

SCM is configured with presets:

| Preset | Compute Ratio | Speed | Quality |
|---|---|---|---|
| none | 100% | Baseline | Best |
| slow | ~75% | ~1.3x | High |
| medium | ~50% | ~2x | Good |
| fast | ~35% | ~3x | Acceptable |
| ultra | ~25% | ~4x | Lower |
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"

Custom SCM Bins

For fine-grained control over which steps to compute vs cache:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"

SCM Policy

| Policy | Env Variable | Description |
|---|---|---|
| dynamic | SGLANG_CACHE_DIT_SCM_POLICY=dynamic | Adaptive caching based on content (default) |
| static | SGLANG_CACHE_DIT_SCM_POLICY=static | Fixed caching pattern |
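
For example, to pin a fixed caching pattern, a preset can be combined with the static policy (a sketch; the preset choice is illustrative):
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=fast \
SGLANG_CACHE_DIT_SCM_POLICY=static \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"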

Environment Variables

All Cache-DiT parameters can be set via the following environment variables:
| Environment Variable | Default | Description |
|---|---|---|
| SGLANG_CACHE_DIT_ENABLED | false | Enable Cache-DiT acceleration |
| SGLANG_CACHE_DIT_FN | 1 | First N blocks to always compute |
| SGLANG_CACHE_DIT_BN | 0 | Last N blocks to always compute |
| SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching |
| SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| SGLANG_CACHE_DIT_MC | 3 | Max continuous cached steps |
| SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| SGLANG_CACHE_DIT_TS_ORDER | 1 | TaylorSeer order (1 or 2) |
| SGLANG_CACHE_DIT_SCM_PRESET | none | SCM preset (none/slow/medium/fast/ultra) |
| SGLANG_CACHE_DIT_SCM_POLICY | dynamic | SCM caching policy |
| SGLANG_CACHE_DIT_SCM_COMPUTE_BINS | not set | Custom SCM compute bins |
| SGLANG_CACHE_DIT_SCM_CACHE_BINS | not set | Custom SCM cache bins |

Supported Models

Cache-DiT supports almost all models natively supported by SGLang Diffusion:
| Model Family | Example Models |
|---|---|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev, FLUX.2-Klein |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| GLM | GLM-Image |
| Hunyuan | HunyuanVideo |

Performance Tips

  1. Start with defaults: The default parameters work well for most models
  2. Use TaylorSeer: It typically improves both speed and quality
  3. Tune R threshold: Lower values = better quality, higher values = faster
  4. SCM for extra speed: Use medium preset for good speed/quality balance
  5. Warmup matters: Higher warmup = more stable caching decisions
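
For instance, a quality-leaning configuration following these tips might lower the residual threshold, increase warmup, and enable TaylorSeer (illustrative values, not a tuned recommendation):
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_RDT=0.12 \
SGLANG_CACHE_DIT_WARMUP=8 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"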

Limitations

  • Single GPU only: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1
  • SCM minimum steps: SCM requires >= 8 inference steps to be effective
  • Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported

Troubleshooting

Distributed environment warning

WARNING: cache-dit is disabled in distributed environment (world_size=N)
This is expected behavior. Cache-DiT currently only supports single-GPU inference.

SCM disabled for low step count

For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.

Profiling Multimodal Generation

This guide covers profiling techniques for multimodal generation pipelines in SGLang.

PyTorch Profiler

PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics.

Denoising Stage Profiling

Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile
Parameters:
  • --profile: Enable profiling for the denoising stage
  • --num-profiled-timesteps N: Number of timesteps to profile after warmup (default: 5)
    • Smaller values reduce trace file size
    • Example: --num-profiled-timesteps 10 profiles 10 steps after 1 warmup step
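
For example, to capture a longer window of the denoising loop (at the cost of a larger trace file):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile \
  --num-profiled-timesteps 10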

Full Pipeline Profiling

Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile \
  --profile-all-stages
Parameters:
  • --profile-all-stages: Used with --profile, profile all pipeline stages instead of just denoising

Output Location

By default, trace files are saved in the ./logs/ directory. The exact output file path will be shown in the console output, for example:
[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz

View Traces

Load and visualize the trace files in a trace viewer. For large trace files, reduce --num-profiled-timesteps or avoid using --profile-all-stages.

--perf-dump-path (Stage/Step Timing Dump)

Besides profiler traces, you can also dump a lightweight JSON report that contains:
  • stage-level timing breakdown for the full pipeline
  • step-level timing breakdown for the denoising stage (per diffusion step)
This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike). The dumped JSON contains a denoise_steps_ms field formatted as an array of objects, each with a step key (the step index) and a duration_ms key. Example:
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "<PROMPT>" \
  --perf-dump-path perf.json
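
To quickly spot an abnormal step in the dump, the per-step timings can be printed with jq (assumed to be installed; the field names follow the format described above):
jq -r '.denoise_steps_ms[] | "step \(.step): \(.duration_ms) ms"' perf.json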

Nsight Systems

Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns.

Installation

See the SGLang profiling guide for installation instructions.

Basic Profiling

Profile the entire pipeline execution:
nsys profile \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  --force-overwrite=true \
  -o QwenImage \
  sglang generate \
    --model-path Qwen/Qwen-Image \
    --prompt "A Logo With Bold Large Text: SGL Diffusion" \
    --seed 0

Targeted Stage Profiling

Use --delay and --duration to capture specific stages and reduce file size:
nsys profile \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  --force-overwrite=true \
  --delay 10 \
  --duration 30 \
  -o QwenImage_denoising \
  sglang generate \
    --model-path Qwen/Qwen-Image \
    --prompt "A Logo With Bold Large Text: SGL Diffusion" \
    --seed 0
Parameters:
  • --delay N: Wait N seconds before starting capture (skip initialization overhead)
  • --duration N: Capture for N seconds (focus on specific stages)
  • --force-overwrite: Overwrite existing output files

Notes

  • Reduce trace size: Use --num-profiled-timesteps with smaller values or --delay/--duration with Nsight Systems
  • Stage-specific analysis: Use --profile alone for denoising stage, add --profile-all-stages for full pipeline
  • Multiple runs: Profile with different prompts and resolutions to identify bottlenecks across workloads

FAQ

  • If you profile sglang generate with Nsight Systems and the resulting profile does not capture any CUDA kernels, increase the model's inference steps to extend the execution time.

Contributing to SGLang Diffusion

This guide outlines the requirements for contributing to the SGLang Diffusion module (sglang.multimodal_gen).

1. Commit Message Convention

We follow a structured commit message format to maintain a clean history. Format:
[diffusion] <scope>: <subject>
Examples:
  • [diffusion] cli: add --perf-dump-path argument
  • [diffusion] scheduler: fix deadlock in batch processing
  • [diffusion] model: support Stable Diffusion 3.5
Rules:
  • Prefix: Always start with [diffusion].
  • Scope (Optional): cli, scheduler, model, pipeline, docs, etc.
  • Subject: Imperative mood, short and clear (e.g., “add feature” not “added feature”).

2. Performance Reporting

For PRs that impact latency, throughput, or memory usage, you should provide a performance comparison report.

How to Generate a Report

  1. Baseline: run the benchmark (for a single generation task)
    $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path baseline.json
    
  2. New: check out your change and run the same benchmark, without modifying any server_args or sampling_params
    $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path new.json
    
  3. Compare: run the compare script, which will print a Markdown table to the console
    $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...]
    ### Performance Comparison Report
    ...
    
  4. Paste: paste the table into the PR description

3. CI-Based Change Protection

Consider adding tests to the pr-test or nightly-test suites to safeguard your changes, especially for PRs that:
  1. support a new model
  2. support or fix important features
  3. significantly improve performance
See the existing test suites for examples.

How to Support New Diffusion Models

SGLang diffusion uses a modular pipeline architecture built around two key concepts:
  • ComposedPipeline: Orchestrates PipelineStages to define the complete generation process
  • PipelineStage: Modular components (prompt encoding, denoising loop, VAE decoding, etc.)
To add a new model, you’ll need to define:
  1. PipelineConfig: Static model configurations (paths, precision settings)
  2. SamplingParams: Runtime generation parameters (prompt, guidance_scale, steps)
  3. ComposedPipeline: Chain together pipeline stages
  4. Modules: Model components (text_encoder, transformer, vae, scheduler)
For the complete implementation guide with examples, see: How to Support New Diffusion Models
