1. Model Introduction
The Wan2.2 series is a family of open, advanced large-scale video generative models. This generation delivers comprehensive upgrades across the board:
- Effective MoE Architecture: Introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By splitting the denoising process across timesteps among specialized expert models, it enlarges overall model capacity while keeping the computational cost the same.
- Cinematic-level Aesthetics: Incorporates meticulously curated aesthetic data, complete with detailed labels for lighting, composition, contrast, color tone, and more. This allows for more precise and controllable cinematic style generation, facilitating the creation of videos with customizable aesthetic preferences.
- Complex Motion Generation: Trained on significantly more data, with 65.6% more images and 83.2% more videos. This expansion notably enhances the model’s generalization across multiple dimensions such as motion, semantics, and aesthetics, achieving top performance among open-source and closed-source models.
- Efficient High-Definition Hybrid TI2V: Open-sources a 5B model built with our advanced Wan2.2-VAE, which achieves a compression ratio of 16×16×4. This model supports both text-to-video and image-to-video generation at 720P resolution and 24fps, and can also run on consumer-grade graphics cards such as the 4090. It is one of the fastest 720P@24fps models currently available, capable of serving both the industrial and academic sectors.
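To make the 16×16×4 compression ratio concrete, the following minimal sketch computes the latent grid for a 720P clip. It assumes the ratio means 16× along height, 16× along width, and 4× along time, and that partial blocks are rounded up; the actual Wan2.2-VAE may pad or treat the first frame differently. The function name `latent_grid` is hypothetical.

```python
import math

# Compression ratio quoted in the text: 16 (height) x 16 (width) x 4 (time).
SPATIAL_FACTOR = 16
TEMPORAL_FACTOR = 4


def latent_grid(num_frames: int, height: int, width: int) -> tuple:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: partial blocks are rounded up. The real VAE may pad
    inputs or handle the first frame differently.
    """
    return (
        math.ceil(num_frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


# 5 seconds of 720P video at 24 fps.
frames, height, width = 5 * 24, 720, 1280
grid = latent_grid(frames, height, width)
pixels = frames * height * width
latents = grid[0] * grid[1] * grid[2]

print(grid)               # (30, 45, 80)
print(pixels // latents)  # 1024 pixel positions per latent position
```

Under these assumptions the model works on roughly a thousand times fewer positions than the raw video, which is consistent with the claim that 720P@24fps generation fits on a consumer-grade card.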
2. SGLang-diffusion Installation
SGLang-diffusion offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Refer to the official SGLang-diffusion installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Wan2.2 series offers models in various sizes, architectures, and input types, optimized for different hardware platforms. The recommended launch configuration varies by hardware and model size. Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model size.
3.2 Configuration Tips
All currently supported optimization options are listed here:
- --vae-path: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE is loaded from the main model path.
- --num-gpus {NUM_GPUS}: Number of GPUs to use.
- --tp-size {TP_SIZE}: Tensor parallelism size (applies only to the encoder; should not be larger than 1 if text encoder offload is enabled, because layer-wise offload plus prefetch is faster).
- --sp-degree {SP_SIZE}: Sequence parallelism size (typically should match the number of GPUs).
- --ulysses-degree {ULYSSES_DEGREE}: The degree of DeepSpeed-Ulysses-style SP in USP.
- --ring-degree {RING_DEGREE}: The degree of ring attention-style SP in USP.
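As a rough illustration of how the parallelism flags above fit together, the sketch below assembles them into an argument list. The helper `build_parallel_args` is hypothetical, and the check that `ulysses_degree * ring_degree` equals the SP degree is an assumption based on the usual USP convention; the text above does not state it.

```python
def build_parallel_args(num_gpus, tp_size=1, sp_degree=None,
                        ulysses_degree=None, ring_degree=None):
    """Assemble the parallelism flags listed above as a list of CLI arguments."""
    if sp_degree is None:
        # Tip from the text: SP size typically matches the number of GPUs.
        sp_degree = num_gpus

    args = [
        "--num-gpus", str(num_gpus),
        "--tp-size", str(tp_size),
        "--sp-degree", str(sp_degree),
    ]

    if ulysses_degree is not None or ring_degree is not None:
        ulysses_degree = ulysses_degree or 1
        ring_degree = ring_degree or 1
        # Assumption (USP convention, not stated in the text):
        # the two degrees must multiply to the overall SP degree.
        if ulysses_degree * ring_degree != sp_degree:
            raise ValueError(
                f"ulysses_degree * ring_degree = {ulysses_degree * ring_degree}, "
                f"expected sp_degree = {sp_degree}"
            )
        args += [
            "--ulysses-degree", str(ulysses_degree),
            "--ring-degree", str(ring_degree),
        ]
    return args


print(" ".join(build_parallel_args(4, ulysses_degree=2, ring_degree=2)))
# --num-gpus 4 --tp-size 1 --sp-degree 4 --ulysses-degree 2 --ring-degree 2
```

The resulting list can be appended to whatever launch command the installation guide recommends.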
4. Model Invocation
4.1 Basic Usage
For more API usage and request examples, please refer to: SGLang Diffusion OpenAI API
4.1.1 Launch a server and then send requests
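Once a server is running, a request might be prepared along the following lines. This is only a sketch: the `/v1/videos` path, the `prompt` and `size` fields, the base URL, and the helper name `build_video_request` are all assumptions for illustration, so check the SGLang Diffusion OpenAI API documentation for the actual schema. The code builds the request without sending it.

```python
import json
import urllib.request


def build_video_request(prompt, base_url="http://localhost:30000", size="1280x720"):
    """Build (but do not send) a POST request for a text-to-video job.

    Assumptions: the endpoint path, the payload fields, and the base URL
    are illustrative placeholders, not confirmed by this document.
    """
    payload = {"prompt": prompt, "size": size}
    return urllib.request.Request(
        url=f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_video_request("A curious raccoon exploring a neon-lit alley")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```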
4.1.2 Generate a video without launching a server
4.2 Advanced Usage
4.2.1 Cache-DiT Acceleration
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. Set SGLANG_CACHE_DIT_ENABLED=True to enable it. For more details, refer to the SGLang Cache-DiT documentation.
Basic Usage
- DBCache Parameters: DBCache controls block-level caching behavior:

  | Parameter | Env Variable | Default | Description |
  |---|---|---|---|
  | Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
  | Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
  | W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
  | R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
  | MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |

- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion:

  | Parameter | Env Variable | Default | Description |
  |---|---|---|---|
  | Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
  | Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |

Combined Configuration Example:
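A combined Cache-DiT configuration can be sketched as a dictionary of the environment variables documented in this section. The helper `cache_dit_env` is hypothetical, the defaults mirror the documented ones, and the string spelling of boolean values ("True", "true", "false") simply follows the spellings used in this section; the accepted forms are an assumption.

```python
import os


def cache_dit_env(fn=1, bn=0, warmup=4, rdt=0.24, mc=3,
                  taylorseer=False, ts_order=1):
    """Return Cache-DiT environment variables as a dict of strings."""
    if ts_order not in (1, 2):
        raise ValueError("Taylor expansion order must be 1 or 2")
    return {
        "SGLANG_CACHE_DIT_ENABLED": "True",
        # DBCache parameters
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        # TaylorSeer parameters
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


# Merge with the current environment; pass `env` to subprocess.run(...)
# when launching the server or the generation command.
cfg = cache_dit_env(taylorseer=True, ts_order=2)
env = {**os.environ, **cfg}

for name, value in cfg.items():
    print(f"{name}={value}")
```

Keeping the settings in one place makes it easier to compare runs that differ only in, say, the residual difference threshold.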
4.2.2 GPU Optimization
- --dit-cpu-offload: Use CPU offload for DiT inference. Enable it if you run out of memory with FSDP.
- --text-encoder-cpu-offload: Use CPU offload for text encoder inference. Enable it if you run out of memory with FSDP.
- --image-encoder-cpu-offload: Use CPU offload for image encoder inference. Enable it if you run out of memory with FSDP.
- --vae-cpu-offload: Use CPU offload for the VAE. Enable it if you run out of memory.
- --pin-cpu-memory: Pin memory for CPU offload. Intended only as a temporary workaround when a “CUDA error: invalid argument” is thrown.
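The offload options above can be selected per component, as in this minimal sketch. The helper `offload_args` and the component keys are hypothetical, and the flags are assumed to be bare switches that take no value.

```python
# Flags as listed above, keyed by a hypothetical component name.
OFFLOAD_FLAGS = {
    "dit": "--dit-cpu-offload",
    "text_encoder": "--text-encoder-cpu-offload",
    "image_encoder": "--image-encoder-cpu-offload",
    "vae": "--vae-cpu-offload",
}


def offload_args(components, pin_cpu_memory=False):
    """Return the CPU-offload flags for the requested components."""
    unknown = set(components) - set(OFFLOAD_FLAGS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")

    # Iterate over OFFLOAD_FLAGS so the output order is stable.
    args = [flag for name, flag in OFFLOAD_FLAGS.items() if name in components]
    if pin_cpu_memory:
        # Only a temporary workaround for "CUDA error: invalid argument".
        args.append("--pin-cpu-memory")
    return args


print(offload_args(["vae", "dit"]))
# ['--dit-cpu-offload', '--vae-cpu-offload']
```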
4.2.3 Supported LoRA Registry
| Original model | Supported LoRA |
|---|---|
| Wan-AI/Wan2.2-I2V-A14B-Diffusers | lightx2v/Wan2.2-Distill-Loras |
| Wan-AI/Wan2.2-T2V-A14B-Diffusers | Cseti/wan2.2-14B-Arcane_Jinx-lora-v1 |

Example:
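As a small sketch, the registry above can be expressed as a lookup that rejects model and LoRA pairs that are not listed. The names `SUPPORTED_LORAS` and `check_lora` are hypothetical, and how the LoRA is then passed to SGLang-diffusion is not shown here.

```python
# Registry copied from the table above.
SUPPORTED_LORAS = {
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers": ["lightx2v/Wan2.2-Distill-Loras"],
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers": ["Cseti/wan2.2-14B-Arcane_Jinx-lora-v1"],
}


def check_lora(model, lora):
    """Return True if `lora` is listed as supported for `model`."""
    return lora in SUPPORTED_LORAS.get(model, [])


model = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
print(check_lora(model, "Cseti/wan2.2-14B-Arcane_Jinx-lora-v1"))  # True
print(check_lora(model, "lightx2v/Wan2.2-Distill-Loras"))         # False
```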
5. Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU (1x)
- Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers
- SGLang-diffusion version: 0.5.6.post2
