Launch commands for SGLang
Below are suggested launch commands, tailored for different hardware and precision modes.

FP8 (quantised) mode
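As a sketch, assuming the standard `sglang.launch_server` entry point, an FP8 launch looks roughly like the following; the model path, tensor-parallel size, and port are placeholders, not recommendations:

```shell
# Hypothetical FP8 launch sketch -- substitute your own model path and sizes.
# If the checkpoint is already stored in FP8, --quantization fp8 may be redundant.
python -m sglang.launch_server \
  --model-path <org>/<model>-FP8 \
  --quantization fp8 \
  --tp 8 \
  --host 0.0.0.0 \
  --port 30000
```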
This mode targets memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where an FP8 checkpoint is supported.

Non-FP8 (BF16 / full precision) mode
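For BF16, a launch command can be sketched along the same lines (again assuming the standard `sglang.launch_server` entry point; model path, tensor-parallel size, and flag values are placeholders):

```shell
# Hypothetical BF16 launch sketch -- substitute your own model path and sizes.
# --mm-max-concurrent-calls caps parallel multimodal processing to bound GPU memory.
python -m sglang.launch_server \
  --model-path <org>/<model> \
  --dtype bfloat16 \
  --tp 8 \
  --mm-max-concurrent-calls 4 \
  --host 0.0.0.0 \
  --port 30000
```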
This mode suits deployments on A100/H100 that use BF16 (or where the FP8 checkpoint is not used).

Hardware-specific notes / recommendations
- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency.
- On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
- On H200 & B200: The model can be run “out of the box”, supporting the full context length plus concurrent image + video processing.
Sending Image/Video Requests
Image input:
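As a sketch, assuming the server exposes the usual OpenAI-compatible `/v1/chat/completions` endpoint on its default port, an image request can be built and sent like this (the URL, prompt, and model name are placeholders):

```python
# Hypothetical image-request sketch against an OpenAI-compatible endpoint.
import json
import urllib.request

def build_image_request(image_url: str, prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 128,
    }

def send(payload: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the payload to the chat completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```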
Video input:
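Video requests follow the same shape. Note the `video_url` content type here is an assumption based on common VLM-serving conventions, not confirmed by this document; check your model's chat template for the exact field name. URL, prompt, and model name are placeholders:

```python
# Hypothetical video-request sketch; the "video_url" content type is assumed.
import json
import urllib.request

def build_video_request(video_url: str, prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style chat payload with one video part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 256,
    }

def send(payload: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the payload to the chat completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Large videos may take long to preprocess, so pair this with a generous `--mm-per-request-timeout` on the server side.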
Important Server Parameters and Flags
When launching the model server for multimodal support, you can use the following command-line arguments to fine-tune performance and behavior:

- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (FlashAttention 3).
- `--mm-max-concurrent-calls <value>`: Specifies the maximum number of concurrent asynchronous multimodal data processing calls allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference.
- `--mm-per-request-timeout <seconds>`: Defines the timeout duration (in seconds) for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it will be automatically terminated.
- `--keep-mm-feature-on-device`: Instructs the server to retain multimodal feature tensors on the GPU after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
- `SGLANG_USE_CUDA_IPC_TRANSPORT=1` (environment variable): Enables a shared-memory-pool-based CUDA IPC transport for multimodal data, which can significantly improve end-to-end latency.
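Taken together, a multimodal launch using these flags might be sketched as follows; the model path and all flag values are illustrative placeholders, not tuned recommendations:

```shell
# Hypothetical sketch combining the multimodal flags above.
# Values are placeholders -- tune them for your hardware and workload.
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
python -m sglang.launch_server \
  --model-path <org>/<model> \
  --mm-attention-backend fa3 \
  --mm-max-concurrent-calls 4 \
  --mm-per-request-timeout 120 \
  --keep-mm-feature-on-device \
  --port 30000
```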
