SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels from sgl-kernel and an efficient scheduler loop.

Key Features

  • Broad Model Support: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
  • Fast Inference: Optimized kernels from sgl-kernel, efficient scheduler loop, and Cache-DiT acceleration
  • Ease of Use: OpenAI-compatible API, CLI, and Python SDK
  • Multi-Platform: NVIDIA GPUs (H100, H200, A100, B200, 4090) and AMD GPUs (MI300X, MI325X)

Install SGLang-diffusion

You can install sglang-diffusion using one of the methods below. This page primarily applies to common NVIDIA GPU platforms. For AMD Instinct/ROCm environments, see the dedicated ROCm quickstart, which lists the exact steps (including kernel builds) we used to validate sgl-diffusion on MI300X.

Method 1: With pip or uv

It is recommended to use uv for a faster installation:
pip install --upgrade pip
pip install uv
uv pip install "sglang[diffusion]" --prerelease=allow
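
To verify the installation, you can run a small generation task (the model and prompt here are only examples; the first run downloads weights from Hugging Face):
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output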

Method 2: From source

# Use the latest release branch
git clone https://github.com/sgl-project/sglang.git
cd sglang

# Install the Python packages
pip install --upgrade pip
pip install -e "python[diffusion]"

# With uv
uv pip install -e "python[diffusion]" --prerelease=allow

Method 3: Using Docker

The Docker images are available on Docker Hub at lmsysorg/sglang, built from the Dockerfile. Replace <secret> below with your HuggingFace Hub token.
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:dev \
    sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output

ROCm quickstart for sgl-diffusion

docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env HF_TOKEN=<secret> \
  lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
  sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A logo With Bold Large text: SGL Diffusion" \
    --save-output

Compatibility Matrix

The tables below list every supported model and the optimizations each supports. The symbols used have the following meanings:
  • ✅ = Full compatibility
  • ❌ = No compatibility
  • ⭕ = Does not apply to this model

Models x Optimization

The HuggingFace Model ID can be passed directly to from_pretrained() methods, and sglang-diffusion will pick optimal default parameters when initializing the pipeline and generating outputs.

Video Generation Models

| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) |
|---|---|---|---|---|---|---|
| FastWan2.1 T2V 1.3B | FastVideo/FastWan2.1-T2V-1.3B-Diffusers | 480p | | | | |
| FastWan2.2 TI2V 5B Full Attn | FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers | 720p | | | | |
| Wan2.2 TI2V 5B | Wan-AI/Wan2.2-TI2V-5B-Diffusers | 720p | | | | |
| Wan2.2 T2V A14B | Wan-AI/Wan2.2-T2V-A14B-Diffusers | 480p, 720p | | | | |
| Wan2.2 I2V A14B | Wan-AI/Wan2.2-I2V-A14B-Diffusers | 480p, 720p | | | | |
| HunyuanVideo | hunyuanvideo-community/HunyuanVideo | 720×1280, 544×960 | | | | |
| FastHunyuan | FastVideo/FastHunyuan-diffusers | 720×1280, 544×960 | | | | |
| Wan2.1 T2V 1.3B | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 480p | | | | |
| Wan2.1 T2V 14B | Wan-AI/Wan2.1-T2V-14B-Diffusers | 480p, 720p | | | | |
| Wan2.1 I2V 480P | Wan-AI/Wan2.1-I2V-14B-480P-Diffusers | 480p | | | | |
| Wan2.1 I2V 720P | Wan-AI/Wan2.1-I2V-14B-720P-Diffusers | 720p | | | | |
Note: Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.

Image Generation Models

| Model Name | HuggingFace Model ID | Resolutions |
|---|---|---|
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | Any resolution |
| FLUX.2-dev | black-forest-labs/FLUX.2-dev | Any resolution |
| FLUX.2-Klein | black-forest-labs/FLUX.2-klein-4B | Any resolution |
| Z-Image-Turbo | Tongyi-MAI/Z-Image-Turbo | Any resolution |
| GLM-Image | zai-org/GLM-Image | Any resolution |
| Qwen Image | Qwen/Qwen-Image | Any resolution |
| Qwen Image 2512 | Qwen/Qwen-Image-2512 | Any resolution |
| Qwen Image Edit | Qwen/Qwen-Image-Edit | Any resolution |

Verified LoRA Examples

This section lists example LoRAs that have been explicitly tested and verified with each base model in the SGLang Diffusion pipeline.
Important:
LoRAs that are not listed here are not necessarily incompatible. In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions. The entries below simply reflect configurations that have been manually validated by the SGLang team.

Verified LoRAs by Base Model

| Base Model | Supported LoRAs |
|---|---|
| Wan2.2 | lightx2v/Wan2.2-Distill-Loras, Cseti/wan2.2-14B-Arcane_Jinx-lora-v1 |
| Wan2.1 | lightx2v/Wan2.1-Distill-Loras |
| Z-Image-Turbo | tarn59/pixel_art_style_lora_z_image_turbo, wcde/Z-Image-Turbo-DeJPEG-Lora |
| Qwen-Image | lightx2v/Qwen-Image-Lightning, flymy-ai/qwen-image-realism-lora, prithivMLmods/Qwen-Image-HeadshotX, starsfriday/Qwen-Image-EVA-LoRA |
| Qwen-Image-Edit | ostris/qwen_image_edit_inpainting, lightx2v/Qwen-Image-Edit-2511-Lightning |
| Flux | dvyio/flux-lora-simple-illustration, XLabs-AI/flux-furry-lora, XLabs-AI/flux-RealismLora |
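
For example, a verified LoRA can be applied at generation time via the --lora-path flag documented in the CLI section below (a sketch; any base model/LoRA pair from the table works the same way):
sglang generate --model-path Qwen/Qwen-Image \
    --lora-path lightx2v/Qwen-Image-Lightning \
    --prompt "A curious raccoon" \
    --save-output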

Special Requirements

Sliding Tile Attention: Currently, only Hopper GPUs (H100s) are supported.

SGLang diffusion CLI Inference

The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation.

Prerequisites

  • A working SGLang diffusion installation and the sglang CLI available in $PATH.
  • Python 3.11+ if you plan to use the OpenAI Python SDK.

Supported Arguments

Server Arguments

  • --model-path {MODEL_PATH}: Path to the model or model ID
  • --vae-path {VAE_PATH}: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path.
  • --lora-path {LORA_PATH}: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
  • --lora-nickname {NAME}: Nickname for the LoRA adapter. (default: default).
  • --num-gpus {NUM_GPUS}: Number of GPUs to use
  • --tp-size {TP_SIZE}: Tensor parallelism size (applies only to the text encoder; keep it at 1 when text-encoder offload is enabled, since layer-wise offload with prefetch is faster)
  • --sp-degree {SP_SIZE}: Sequence parallelism size (typically should match the number of GPUs)
  • --ulysses-degree {ULYSSES_DEGREE}: The degree of DeepSpeed-Ulysses-style SP in USP
  • --ring-degree {RING_DEGREE}: The degree of ring attention-style SP in USP
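
For example, the parallelism flags compose as in the serve examples later on this page; this sketch assumes 4 GPUs, with --ulysses-degree × --ring-degree covering the sequence-parallel group:
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --save-output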

Sampling Parameters

  • --prompt {PROMPT}: Text description of the image or video you want to generate
  • --num-inference-steps {STEPS}: Number of denoising steps
  • --negative-prompt {PROMPT}: Negative prompt to guide generation away from certain concepts
  • --seed {SEED}: Random seed for reproducible generation

Image/Video Configuration

  • --height {HEIGHT}: Height of the generated output
  • --width {WIDTH}: Width of the generated output
  • --num-frames {NUM_FRAMES}: Number of frames to generate
  • --fps {FPS}: Frames per second for the saved output, if this is a video-generation task

Output Options

  • --output-path {PATH}: Directory in which to save the generated image or video
  • --save-output: Whether to save the image/video to disk
  • --return-frames: Whether to return the raw frames
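
Putting the sampling, image/video, and output options together, a typical single invocation looks like this (the resolution, frame count, and fps are illustrative values, not tuned recommendations):
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --negative-prompt "blurry, low quality" \
  --num-inference-steps 50 \
  --seed 42 \
  --height 480 \
  --width 832 \
  --num-frames 81 \
  --fps 16 \
  --save-output \
  --output-path outputs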

Using Configuration Files

Instead of specifying all parameters on the command line, you can use a configuration file:
sglang generate --config {CONFIG_FILE_PATH}
The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file. Example configuration file (config.json):
{
    "model_path": "FastVideo/FastHunyuan-diffusers",
    "prompt": "A beautiful woman in a red dress walking down a street",
    "output_path": "outputs/",
    "num_gpus": 2,
    "sp_size": 2,
    "tp_size": 1,
    "num_frames": 45,
    "height": 720,
    "width": 1280,
    "num_inference_steps": 6,
    "seed": 1024,
    "fps": 24,
    "precision": "bf16",
    "vae_precision": "fp16",
    "vae_tiling": true,
    "vae_sp": true,
    "vae_config": {
        "load_encoder": false,
        "load_decoder": true,
        "tile_sample_min_height": 256,
        "tile_sample_min_width": 256
    },
    "text_encoder_precisions": [
        "fp16",
        "fp16"
    ],
    "mask_strategy_file_path": null,
    "enable_torch_compile": false
}
Or using YAML format (config.yaml):
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
To see all the options, you can use the --help flag:
sglang generate --help

Serve

Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.

Start the server

Use the following command to launch the server:
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
  • --model-path: Which model to load. The example uses Wan-AI/Wan2.1-T2V-1.3B-Diffusers.
  • --port: HTTP port to listen on (default: 30000).
For detailed API usage, including Image, Video Generation and LoRA management, please refer to the OpenAI API Documentation.
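
Once the server is up, a quick readiness check is to query the /models endpoint documented in the OpenAI API section (adjust the port if you passed --port):
curl -sS -X GET "http://localhost:30000/models"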

Generate

Run a one-off generation task without launching a persistent server. To use it, pass both server arguments and sampling parameters in one command, after the generate subcommand, for example:
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"

# Alternatively, set the SGLANG_CACHE_DIT_ENABLED env var to true to enable cache acceleration
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
Once the generation task has finished, the server will shut down automatically.
The HTTP server-related arguments are ignored in this subcommand.
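
Since generate shuts down after each task, sweeping a parameter such as the seed is just a shell loop over the same arguments (a sketch; each iteration pays the full model startup cost):
for seed in 0 1 2; do
  sglang generate "${SERVER_ARGS[@]}" \
    --prompt "A curious raccoon" \
    --seed "$seed" \
    --save-output \
    --output-path outputs \
    --output-file-name "raccoon_seed_${seed}.mp4"
done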

Diffusers Backend

SGLang diffusion supports a diffusers backend that allows you to run any diffusers-compatible model through SGLang’s infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.

Arguments

| Argument | Values | Description |
|---|---|---|
| --backend | auto (default), sglang, diffusers | auto: prefer native SGLang, fall back to diffusers. sglang: force native (fails if unavailable). diffusers: force vanilla diffusers pipeline. |
| --diffusers-attention-backend | flash, _flash_3_hub, sage, xformers, native | Attention backend for diffusers pipelines. See diffusers attention backends. |
| --trust-remote-code | flag | Required for models with custom pipeline classes (e.g., Ovis). |
| --vae-tiling | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
| --vae-slicing | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
| --dit-precision | fp16, bf16, fp32 | Precision for the diffusion transformer. |
| --vae-precision | fp16, bf16, fp32 | Precision for the VAE. |

Example: Running Ovis-Image-7B

Ovis-Image-7B is a 7B text-to-image model optimized for high-quality text rendering.
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png

Extra Diffusers Arguments

For pipeline-specific parameters not exposed via CLI, use diffusers_kwargs in a config file:
{
    "model_path": "AIDC-AI/Ovis-Image-7B",
    "backend": "diffusers",
    "prompt": "A beautiful landscape",
    "diffusers_kwargs": {
        "cross_attention_kwargs": {"scale": 0.5}
    }
}
sglang generate --config config.json

SGLang Diffusion OpenAI API

The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.

Serve

Launch the server using the sglang serve command.

Start the server

SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
  --port 30010
)

sglang serve "${SERVER_ARGS[@]}"
  • --model-path: Path to the model or model ID.
  • --port: HTTP port to listen on (default: 30000).

Get Model Information

Endpoint: GET /models

Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings.

Curl Example:
curl -sS -X GET "http://localhost:30010/models"
Response Example:
{
  "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "task_type": "T2V",
  "pipeline_name": "wan_pipeline",
  "pipeline_class": "WanPipeline",
  "num_gpus": 4,
  "dit_precision": "bf16",
  "vae_precision": "fp16"
}

Endpoints

Image Generation

The server implements an OpenAI-compatible Images API under the /v1/images namespace.

Create an image

Endpoint: POST /v1/images/generations

Python Example (b64_json response):
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

img = client.images.generate(
    prompt="A calico cat playing a piano on stage",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

image_bytes = base64.b64decode(img.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json"
      }'
Note: The response_format=url option is not supported for POST /v1/images/generations and will return a 400 error.

Edit an image

Endpoint: POST /v1/images/edits

This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image.

Curl Example (b64_json response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=b64_json"
Curl Example (URL response):
curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=url"

Download image content

When response_format=url is used with POST /v1/images/edits, the API returns a relative URL like /v1/images/<IMAGE_ID>/content.

Endpoint: GET /v1/images/{image_id}/content

Curl Example:
curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.png

Video Generation

The server implements a subset of the OpenAI Videos API under the /v1/videos namespace.

Create a video

Endpoint: POST /v1/videos

Python Example:
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

video = client.videos.create(
    prompt="A calico cat playing a piano on stage",
    size="1280x720"
)
print(f"Video ID: {video.id}, Status: {video.status}")
Curl Example:
curl -sS -X POST "http://localhost:30010/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1280x720"
      }'

List videos

Endpoint: GET /v1/videos

Python Example:
videos = client.videos.list()
for item in videos.data:
    print(item.id, item.status)
Curl Example:
curl -sS -X GET "http://localhost:30010/v1/videos" \
  -H "Authorization: Bearer sk-proj-1234567890"

Download video content

Endpoint: GET /v1/videos/{video_id}/content

Python Example:
import time

video_id = video.id  # ID returned by the create example above

# Poll for completion
while True:
    page = client.videos.list()
    item = next((v for v in page.data if v.id == video_id), None)
    if item and item.status == "completed":
        break
    time.sleep(5)

# Download content
resp = client.videos.download_content(video_id=video_id)
with open("output.mp4", "wb") as f:
    f.write(resp.read())
Curl Example:
curl -sS -L "http://localhost:30010/v1/videos/<VIDEO_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.mp4

LoRA Management

The server supports dynamic loading, merging, and unmerging of LoRA adapters.

Important Notes:
  • Mutual Exclusion: Only one LoRA can be merged (active) at a time
  • Switching: To switch LoRAs, you must first unmerge the current one, then set the new one
  • Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost

Set LoRA Adapter

Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.

Endpoint: POST /v1/set_lora

Parameters:
  • lora_nickname (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs
  • lora_path (string or list of strings/None, optional): Path to the .safetensors file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of lora_nickname
  • target (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of lora_nickname. Valid values:
    • "all" (default): Apply to all transformers
    • "transformer": Apply only to the primary transformer (high noise for Wan2.2)
    • "transformer_2": Apply only to transformer_2 (low noise for Wan2.2)
    • "critic": Apply only to the critic model
  • strength (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of lora_nickname. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
Single LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": "lora_name",
        "lora_path": "/path/to/lora.safetensors",
        "target": "all",
        "strength": 0.8
      }'
Multiple LoRA Example:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["lora_1", "lora_2"],
        "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"],
        "target": ["transformer", "transformer_2"],
        "strength": [0.8, 1.0]
      }'
Multiple LoRA with Same Target:
curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["style_lora", "character_lora"],
        "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"],
        "target": "all",
        "strength": [0.7, 0.9]
      }'
When using multiple LoRAs:
  • All list parameters (lora_nickname, lora_path, target, strength) must have the same length
  • If target or strength is a single value, it will be applied to all LoRAs
  • Multiple LoRAs applied to the same target will be merged in order

Merge LoRA Weights

Manually merges the currently set LoRA weights into the base model.
set_lora automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling set_lora again.
Endpoint: POST /v1/merge_lora_weights

Parameters:
  • target (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic"
  • strength (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
Curl Example:
curl -X POST http://localhost:30010/v1/merge_lora_weights \
  -H "Content-Type: application/json" \
  -d '{"strength": 0.8}'

Unmerge LoRA Weights

Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This must be called before setting a different LoRA.

Endpoint: POST /v1/unmerge_lora_weights

Curl Example:
curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
  -H "Content-Type: application/json"

List LoRA Adapters

Returns loaded LoRA adapters and current application status per module.

Endpoint: GET /v1/list_loras

Curl Example:
curl -sS -X GET "http://localhost:30010/v1/list_loras"
Response Example:
{
  "loaded_adapters": [
    { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" },
    { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" }
  ],
  "active": {
    "transformer": [
      {
        "nickname": "lora2",
        "path": "tarn59/pixel_art_style_lora_z_image_turbo",
        "merged": true,
        "strength": 1.0
      }
    ]
  }
}
Notes:
  • If LoRA is not enabled for the current pipeline, the server will return an error.
  • num_lora_layers_with_weights counts only layers that have LoRA weights applied for the active adapter.

Example: Switching LoRAs

  1. Set LoRA A:
    curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}'
    
  2. Generate with LoRA A…
  3. Unmerge LoRA A:
    curl -X POST http://localhost:30010/v1/unmerge_lora_weights
    
  4. Set LoRA B:
    curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
    
  5. Generate with LoRA B…

Attention Backends

This document describes the attention backends available in SGLang diffusion (sglang.multimodal_gen) and how to select them.

Overview

Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend. Backend selection is performed by the shared attention layers (e.g., LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer), so it applies to any model component using these layers (e.g., the diffusion transformer / DiT and the encoders). Platform defaults:
  • CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
  • ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
  • MPS: always uses PyTorch SDPA.

Backend options

The CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.
| CLI value | Enum value | Notes |
|---|---|---|
| fa / fa3 / fa4 | FA | FlashAttention. fa3/fa4 are normalized to fa during argument parsing (ServerArgs.__post_init__). |
| torch_sdpa | TORCH_SDPA | PyTorch scaled_dot_product_attention. |
| sliding_tile_attn | SLIDING_TILE_ATTN | Sliding Tile Attention (STA). Requires st_attn and a mask-strategy config file set via the SGLANG_DIFFUSION_ATTENTION_CONFIG environment variable. |
| sage_attn | SAGE_ATTN | Requires sageattention. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream setup.py: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| sage_attn_3 | SAGE_ATTN_3 | Requires SageAttention3 installed per upstream instructions. |
| video_sparse_attn | VIDEO_SPARSE_ATTN | Requires vsa. |
| vmoba_attn | VMOBA_ATTN | Requires kernel.attn.vmoba_attn.vmoba. |
| aiter | AITER | Requires aiter. |

Selection priority

The selection order in runtime/layers/attention/selector.py is:
  1. global_force_attn_backend(...) / global_force_attn_backend_context_manager(...)
  2. CLI --attention-backend (ServerArgs.attention_backend)
  3. Auto selection (platform capability, dtype, and installed packages)

Platform support matrix

| Backend | CUDA | ROCm | MPS | Notes |
|---|---|---|---|---|
| fa | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. |
| torch_sdpa | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| sliding_tile_attn | ✅ | ❌ | ❌ | CUDA-only. Requires st_attn and SGLANG_DIFFUSION_ATTENTION_CONFIG. |
| sage_attn | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| sage_attn_3 | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| video_sparse_attn | ✅ | ❌ | ❌ | CUDA-only. Requires vsa. |
| vmoba_attn | ✅ | ❌ | ❌ | CUDA-only. Requires kernel.attn.vmoba_attn.vmoba. |
| aiter | ❌ | ✅ | ❌ | Requires aiter. |

Usage

Select a backend via CLI

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa

Using Sliding Tile Attention (STA)

export SGLANG_DIFFUSION_ATTENTION_CONFIG=/abs/path/to/mask_strategy.json

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn

Notes for ROCm / MPS

  • ROCm: use --attention-backend torch_sdpa or fa depending on what is available in your environment.
  • MPS: the platform implementation always uses torch_sdpa.

Cache-DiT Acceleration

SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss.

Overview

Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
  • DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
  • TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
  • SCM (Step Computation Masking): Step-level caching control for additional speedup

Basic Usage

Enable Cache-DiT by exporting the environment variable and using sglang generate or sglang serve:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Advanced Configuration

DBCache Parameters

DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |

TaylorSeer Configuration

TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |

Combined Configuration Example

DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters simultaneously:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A curious raccoon in a forest"

SCM (Step Computation Masking)

SCM provides step-level caching control for additional speedup. It decides which denoising steps are computed fully and which reuse cached results.

SCM Presets

SCM is configured with presets:

| Preset | Compute Ratio | Speed | Quality |
|---|---|---|---|
| none | 100% | Baseline | Best |
| slow | ~75% | ~1.3x | High |
| medium | ~50% | ~2x | Good |
| fast | ~35% | ~3x | Acceptable |
| ultra | ~25% | ~4x | Lower |
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"

Custom SCM Bins

For fine-grained control over which steps to compute vs cache:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"

SCM Policy

| Policy | Env Variable | Description |
|---|---|---|
| dynamic | SGLANG_CACHE_DIT_SCM_POLICY=dynamic | Adaptive caching based on content (default) |
| static | SGLANG_CACHE_DIT_SCM_POLICY=static | Fixed caching pattern |
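
For example, to pin a fixed caching pattern, a preset can be combined with the static policy (a sketch; the preset choice is illustrative):
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=fast \
SGLANG_CACHE_DIT_SCM_POLICY=static \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"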

Environment Variables

All Cache-DiT parameters can be set via the following environment variables:
| Environment Variable | Default | Description |
|---|---|---|
| SGLANG_CACHE_DIT_ENABLED | false | Enable Cache-DiT acceleration |
| SGLANG_CACHE_DIT_FN | 1 | First N blocks to always compute |
| SGLANG_CACHE_DIT_BN | 0 | Last N blocks to always compute |
| SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching |
| SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| SGLANG_CACHE_DIT_MC | 3 | Max continuous cached steps |
| SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| SGLANG_CACHE_DIT_TS_ORDER | 1 | TaylorSeer order (1 or 2) |
| SGLANG_CACHE_DIT_SCM_PRESET | none | SCM preset (none/slow/medium/fast/ultra) |
| SGLANG_CACHE_DIT_SCM_POLICY | dynamic | SCM caching policy |
| SGLANG_CACHE_DIT_SCM_COMPUTE_BINS | not set | Custom SCM compute bins |
| SGLANG_CACHE_DIT_SCM_CACHE_BINS | not set | Custom SCM cache bins |

Supported Models

Cache-DiT supports almost all models natively supported by SGLang Diffusion:
| Model Family | Example Models |
|---|---|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev, FLUX.2-Klein |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| GLM | GLM-Image |
| Hunyuan | HunyuanVideo |

Performance Tips

  1. Start with defaults: The default parameters work well for most models
  2. Use TaylorSeer: It typically improves both speed and quality
  3. Tune R threshold: Lower values = better quality, higher values = faster
  4. SCM for extra speed: Use medium preset for good speed/quality balance
  5. Warmup matters: Higher warmup = more stable caching decisions
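
For instance, a quality-leaning configuration following these tips might lower the residual threshold, increase warmup, and enable TaylorSeer (illustrative values, not a tuned recommendation):
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_RDT=0.12 \
SGLANG_CACHE_DIT_WARMUP=8 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"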

Limitations

  • Single GPU only: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1
  • SCM minimum steps: SCM requires >= 8 inference steps to be effective
  • Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported

Troubleshooting

Distributed environment warning

WARNING: cache-dit is disabled in distributed environment (world_size=N)
This is expected behavior. Cache-DiT currently only supports single-GPU inference.

SCM disabled for low step count

For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.

Profiling Multimodal Generation

This guide covers profiling techniques for multimodal generation pipelines in SGLang.

PyTorch Profiler

PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics.

Denoising Stage Profiling

Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile
Parameters:
  • --profile: Enable profiling for the denoising stage
  • --num-profiled-timesteps N: Number of timesteps to profile after warmup (default: 5)
    • Smaller values reduce trace file size
    • Example: --num-profiled-timesteps 10 profiles 10 steps after 1 warmup step
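
For example, to capture a longer window of the denoising loop (at the cost of a larger trace file):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile \
  --num-profiled-timesteps 10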

Full Pipeline Profiling

Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.):
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --seed 0 \
  --profile \
  --profile-all-stages
Parameters:
  • --profile-all-stages: Used with --profile, profile all pipeline stages instead of just denoising

Output Location

By default, trace files are saved in the ./logs/ directory. The exact output file path will be shown in the console output, for example:
[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz

View Traces

Load and visualize the trace files in a trace viewer. For large trace files, reduce --num-profiled-timesteps or avoid using --profile-all-stages.

--perf-dump-path (Stage/Step Timing Dump)

Besides profiler traces, you can also dump a lightweight JSON report that contains:
  • stage-level timing breakdown for the full pipeline
  • step-level timing breakdown for the denoising stage (per diffusion step)
This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike). The dumped JSON contains a denoise_steps_ms field formatted as an array of objects, each with a step key (the step index) and a duration_ms key. Example:
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "<PROMPT>" \
  --perf-dump-path perf.json
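
To quickly spot an abnormal step in the dump, the per-step timings can be printed with jq (assumed to be installed; the field names follow the format described above):
jq -r '.denoise_steps_ms[] | "step \(.step): \(.duration_ms) ms"' perf.json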

Nsight Systems

Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns.

Installation

See the SGLang profiling guide for installation instructions.

Basic Profiling

Profile the entire pipeline execution:
nsys profile \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  --force-overwrite=true \
  -o QwenImage \
  sglang generate \
    --model-path Qwen/Qwen-Image \
    --prompt "A Logo With Bold Large Text: SGL Diffusion" \
    --seed 0

Targeted Stage Profiling

Use --delay and --duration to capture specific stages and reduce file size:
nsys profile \
  --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  --force-overwrite=true \
  --delay 10 \
  --duration 30 \
  -o QwenImage_denoising \
  sglang generate \
    --model-path Qwen/Qwen-Image \
    --prompt "A Logo With Bold Large Text: SGL Diffusion" \
    --seed 0
Parameters:
  • --delay N: Wait N seconds before starting capture (skip initialization overhead)
  • --duration N: Capture for N seconds (focus on specific stages)
  • --force-overwrite: Overwrite existing output files

Notes

  • Reduce trace size: Use --num-profiled-timesteps with smaller values or --delay/--duration with Nsight Systems
  • Stage-specific analysis: Use --profile alone for denoising stage, add --profile-all-stages for full pipeline
  • Multiple runs: Profile with different prompts and resolutions to identify bottlenecks across workloads

FAQ

  • If you profile sglang generate with Nsight Systems and the resulting profile does not capture any CUDA kernels, increase the model's inference steps to extend the execution time.

Contributing to SGLang Diffusion

This guide outlines the requirements for contributing to the SGLang Diffusion module (sglang.multimodal_gen).

1. Commit Message Convention

We follow a structured commit message format to maintain a clean history. Format:
[diffusion] <scope>: <subject>
Examples:
  • [diffusion] cli: add --perf-dump-path argument
  • [diffusion] scheduler: fix deadlock in batch processing
  • [diffusion] model: support Stable Diffusion 3.5
Rules:
  • Prefix: Always start with [diffusion].
  • Scope (Optional): cli, scheduler, model, pipeline, docs, etc.
  • Subject: Imperative mood, short and clear (e.g., “add feature” not “added feature”).

2. Performance Reporting

For PRs that impact latency, throughput, or memory usage, you should provide a performance comparison report.

How to Generate a Report

  1. Baseline: run the benchmark (for a single generation task)
    $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path baseline.json
    
  2. New: check out your change and run the same benchmark, without modifying any server_args or sampling_params
    $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path new.json
    
  3. Compare: run the compare script, which will print a Markdown table to the console
    $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...]
    ### Performance Comparison Report
    ...
    
  4. Paste: paste the table into the PR description

3. CI-Based Change Protection

Consider adding tests to the pr-test or nightly-test suites to safeguard your changes, especially for PRs that:
  1. support a new model
  2. support or fix important features
  3. significantly improve performance
See the existing test suites for examples.

How to Support New Diffusion Models

SGLang diffusion uses a modular pipeline architecture built around two key concepts:
  • ComposedPipeline: Orchestrates PipelineStages to define the complete generation process
  • PipelineStage: Modular components (prompt encoding, denoising loop, VAE decoding, etc.)
To add a new model, you’ll need to define:
  1. PipelineConfig: Static model configurations (paths, precision settings)
  2. SamplingParams: Runtime generation parameters (prompt, guidance_scale, steps)
  3. ComposedPipeline: Chain together pipeline stages
  4. Modules: Model components (text_encoder, transformer, vae, scheduler)
For the complete implementation guide with examples, see: How to Support New Diffusion Models
