This page lists the server arguments used on the command line to configure the behavior and performance of the language model server during deployment. These arguments let you customize key aspects of the server, including model selection, parallelism policies, memory management, and optimization techniques. You can list all arguments with python3 -m sglang.launch_server --help.

Common launch commands

  • To use a configuration file, create a YAML file with your server arguments and specify it with --config. CLI arguments will override config file values.
    # Create config.yaml
    cat > config.yaml << EOF
    model-path: meta-llama/Meta-Llama-3-8B-Instruct
    host: 0.0.0.0
    port: 30000
    tensor-parallel-size: 2
    enable-metrics: true
    log-requests: true
    EOF
    
    # Launch server with config file
    python -m sglang.launch_server --config config.yaml
    
  • To enable multi-GPU tensor parallelism, add --tp 2. If it reports the error “peer access is not supported between these two devices”, add --enable-p2p-check to the server launch command.
    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
    
  • To enable multi-GPU data parallelism, add --dp 2. Data parallelism is better for throughput if there is enough memory. It can also be combined with tensor parallelism; the following command uses 4 GPUs in total. We recommend SGLang Model Gateway (formerly SGLang Router) for data parallelism.
    python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
    
  • If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9.
    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
    
  • See the hyperparameter tuning guide for tips on tuning hyperparameters for better performance.
  • For Docker and Kubernetes deployments, you need to set up shared memory, which is used for communication between processes: use --shm-size for Docker, and increase the /dev/shm size in Kubernetes manifests.
  • If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
    
  • To enable torch.compile acceleration, add --enable-torch-compile. It accelerates small models on small batch sizes. By default, the cache path is /tmp/torchinductor_root; you can customize it with the environment variable TORCHINDUCTOR_CACHE_DIR. For more details, please refer to the PyTorch official documentation and Enabling cache for torch.compile.
  • To enable torchao quantization, add --torchao-config int4wo-128. Other quantization strategies (INT8/FP8) are supported as well.
  • To enable FP8 weight quantization, add --quantization fp8 on an FP16 checkpoint, or directly load an FP8 checkpoint without specifying any arguments.
  • To enable FP8 KV cache quantization, add --kv-cache-dtype fp8_e5m2.
  • To enable deterministic inference and batch-invariant operations, add --enable-deterministic-inference. More details can be found in the deterministic inference document.
  • If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template. If the tokenizer has multiple named templates (e.g., ‘default’, ‘tool_use’), you can select one using --hf-chat-template-name tool_use.
  • To run tensor parallelism on multiple nodes, add --nnodes 2. If you have two nodes with two GPUs each and want to run TP=4, let sgl-dev-0 be the hostname of the first node and 50000 an available port; then you can use the following commands. If you encounter a deadlock, try adding --disable-cuda-graph.
    # Node 0
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --tp 4 \
      --dist-init-addr sgl-dev-0:50000 \
      --nnodes 2 \
      --node-rank 0
    
    # Node 1
    python -m sglang.launch_server \
      --model-path meta-llama/Meta-Llama-3-8B-Instruct \
      --tp 4 \
      --dist-init-addr sgl-dev-0:50000 \
      --nnodes 2 \
      --node-rank 1
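
Once a server is up (default http://127.0.0.1:30000), you can send requests to its OpenAI-compatible endpoint. The sketch below only builds the request rather than sending it; the model name and prompt are placeholders, and the payload shape follows the standard OpenAI Chat Completions API:

```python
import json

def build_chat_request(host="127.0.0.1", port=30000, api_key=None):
    # Target SGLang's OpenAI-compatible chat endpoint.
    url = f"http://{host}:{port}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    if api_key:  # matches --api-key on the server, if one was set
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
        "messages": [{"role": "user", "content": "Hello"}],
    }
    return url, headers, json.dumps(payload)
```

Send the result with any HTTP client (e.g. requests.post(url, headers=headers, data=body)).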
    
Please consult the documentation below and server_args.py to learn more about the arguments you may provide when launching a server.

Model and tokenizer

  • --model-path, --model: The path of the model weights. This can be a local folder or a Hugging Face repo ID. Default: None. Type: str.
  • --tokenizer-path: The path of the tokenizer. Default: None. Type: str.
  • --tokenizer-mode: Tokenizer mode. ‘auto’ uses the fast tokenizer if available; ‘slow’ always uses the slow tokenizer. Default: auto. Options: auto, slow.
  • --tokenizer-worker-num: The number of tokenizer manager workers. Default: 1. Type: int.
  • --skip-tokenizer-init: If set, skip tokenizer initialization and pass input_ids in generate requests. Default: False. Bool flag (set to enable).
  • --load-format: The format of the model weights to load. ‘auto’ tries the safetensors format and falls back to the PyTorch bin format if safetensors is unavailable. ‘pt’ loads the PyTorch bin format. ‘safetensors’ loads the safetensors format. ‘npcache’ loads the PyTorch format and stores a numpy cache to speed up loading. ‘dummy’ initializes the weights with random values, mainly for profiling. ‘gguf’ loads the GGUF format. ‘bitsandbytes’ loads the weights using bitsandbytes quantization. ‘layered’ loads weights layer by layer so that one layer can be quantized before loading another, shrinking the peak memory envelope. ‘flash_rl’ loads the weights in the FlashRL format. ‘fastsafetensors’ and ‘private’ are also supported. Default: auto. Options: auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private.
  • --model-loader-extra-config: Extra config for the model loader, passed to the loader corresponding to the chosen load format. Default: {}. Type: str.
  • --trust-remote-code: Whether to allow custom models defined on the Hub in their own modeling files. Default: False. Bool flag (set to enable).
  • --context-length: The model’s maximum context length. Default: None (uses the value from the model’s config.json). Type: int.
  • --is-embedding: Whether to use a CausalLM as an embedding model. Default: False. Bool flag (set to enable).
  • --enable-multimodal: Enable multimodal functionality for the served model. If the served model is not multimodal, this has no effect. Default: None. Bool flag (set to enable).
  • --revision: The specific model version to use: a branch name, a tag name, or a commit id. Default: None (uses the default version). Type: str.
  • --model-impl: Which implementation of the model to use. ‘auto’ tries the SGLang implementation and falls back to the Transformers implementation if no SGLang implementation exists; ‘sglang’ forces the SGLang implementation; ‘transformers’ forces the Transformers implementation. Default: auto. Type: str.

HTTP server

  • --host: The host of the HTTP server. Default: 127.0.0.1. Type: str.
  • --port: The port of the HTTP server. Default: 30000. Type: int.
  • --fastapi-root-path: Set this when the app is behind a path-based routing proxy. Default: "". Type: str.
  • --grpc-mode: If set, use a gRPC server instead of the HTTP server. Default: False. Bool flag (set to enable).
  • --skip-server-warmup: If set, skip warmup. Default: False. Bool flag (set to enable).
  • --warmups: Custom warmup functions (comma-separated) to run before the server starts, e.g. --warmups=warmup_name1,warmup_name2 runs the functions warmup_name1 and warmup_name2 defined in warmup.py before the server starts listening for requests. Default: None. Type: str.
  • --nccl-port: The port for NCCL distributed environment setup. Default: None (a random port). Type: int.
  • --checkpoint-engine-wait-weights-before-ready: If set, the server waits for initial weights to be loaded via checkpoint-engine or other update methods before serving inference requests. Default: False. Bool flag (set to enable).

Quantization and data type

  • --dtype: Data type for model weights and activations. ‘auto’ uses FP16 for FP32 and FP16 models, and BF16 for BF16 models. ‘half’ is FP16 (recommended for AWQ quantization); ‘float16’ is the same as ‘half’. ‘bfloat16’ balances precision and range. ‘float’ is shorthand for FP32; ‘float32’ is FP32. Default: auto. Options: auto, half, float16, bfloat16, float, float32.
  • --quantization: The quantization method. Default: None. Options: awq, fp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, auto-round, compressed-tensors, modelslim, quark_int4fp8_moe.
  • --quantization-param-path: Path to the JSON file containing the KV cache scaling factors. This should generally be supplied when the KV cache dtype is FP8; otherwise the scaling factors default to 1.0, which may cause accuracy issues. Default: None. Type: Optional[str].
  • --kv-cache-dtype: Data type for KV cache storage. ‘auto’ uses the model data type. ‘bf16’ or ‘bfloat16’ selects a BF16 KV cache. ‘fp8_e5m2’ and ‘fp8_e4m3’ are supported on CUDA 11.8+. ‘fp4_e2m1’ (mxfp4 only) requires CUDA 12.8+ and PyTorch 2.8.0+. Default: auto. Options: auto, fp8_e5m2, fp8_e4m3, bf16, bfloat16, fp4_e2m1.
  • --enable-fp32-lm-head: If set, the LM head outputs (logits) are in FP32. Default: False. Bool flag (set to enable).
  • --modelopt-quant: The ModelOpt quantization configuration. Supported values: ‘fp8’, ‘int4_awq’, ‘w4a8_awq’, ‘nvfp4’, ‘nvfp4_awq’. Requires the NVIDIA Model Optimizer library: pip install nvidia-modelopt. Default: None. Type: str.
  • --modelopt-checkpoint-restore-path: Path to restore a previously saved ModelOpt quantized checkpoint. If provided, quantization is skipped and the model is loaded from this checkpoint. Default: None. Type: str.
  • --modelopt-checkpoint-save-path: Path to save the ModelOpt quantized checkpoint after quantization, allowing reuse in future runs. Default: None. Type: str.
  • --modelopt-export-path: Path to export the quantized model in Hugging Face format after ModelOpt quantization; the exported model can then be served directly with SGLang. If not provided, the model is not exported. Default: None. Type: str.
  • --quantize-and-serve: Quantize the model with ModelOpt and immediately serve it without exporting. Useful for development and prototyping; for production, separate quantization and deployment steps are recommended. Default: False. Bool flag (set to enable).
  • --rl-quant-profile: Path to the FlashRL quantization profile. Required when using --load-format flash_rl. Default: None. Type: str.
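
For intuition when choosing between the FP8 KV cache formats (these are general FP8 facts, not SGLang-specific behavior): e5m2 trades mantissa precision for exponent range, while e4m3 does the opposite. Their maximum finite values follow directly from the format definitions:

```python
# Maximum finite values of the two FP8 formats (per the OCP FP8 spec):
# e5m2: 5 exponent bits, 2 mantissa bits -> wide dynamic range, coarse precision.
# e4m3 (fn variant): 4 exponent bits, 3 mantissa bits -> narrow range, finer precision.
E5M2_MAX = (2 - 2 ** -2) * 2 ** 15  # 57344.0
E4M3_MAX = (2 - 2 ** -2) * 2 ** 8   # 448.0
```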

Memory and scheduling

  • --mem-fraction-static: The fraction of memory used for static allocation (model weights and the KV cache memory pool). Use a smaller value if you see out-of-memory errors. Default: None. Type: float.
  • --max-running-requests: The maximum number of running requests. Default: None. Type: int.
  • --max-queued-requests: The maximum number of queued requests. Ignored when using disaggregation-mode. Default: None. Type: int.
  • --max-total-tokens: The maximum number of tokens in the memory pool. If not specified, it is computed automatically from the memory usage fraction. Typically used for development and debugging. Default: None. Type: int.
  • --chunked-prefill-size: The maximum number of tokens in a chunk for chunked prefill. Set to -1 to disable chunked prefill. Default: None. Type: int.
  • --prefill-max-requests: The maximum number of requests in a prefill batch. If not specified, there is no limit. Default: None. Type: int.
  • --enable-dynamic-chunking: Enable dynamic chunk size adjustment for pipeline parallelism. When enabled, chunk sizes are calculated from a fitted function to keep execution time consistent across chunks. Default: False. Bool flag (set to enable).
  • --max-prefill-tokens: The maximum number of tokens in a prefill batch. The real bound is the maximum of this value and the model’s maximum context length. Default: 16384. Type: int.
  • --schedule-policy: The request scheduling policy. Default: fcfs. Options: lpm, random, fcfs, dfs-weight, lof, priority, routing-key.
  • --enable-priority-scheduling: Enable priority scheduling. By default, requests with higher priority integer values are scheduled first. Default: False. Bool flag (set to enable).
  • --abort-on-priority-when-disabled: If set, abort requests that specify a priority when priority scheduling is disabled. Default: False. Bool flag (set to enable).
  • --schedule-low-priority-values-first: If specified with --enable-priority-scheduling, the scheduler schedules requests with lower priority integer values first. Default: False. Bool flag (set to enable).
  • --priority-scheduling-preemption-threshold: The minimum priority difference required for an incoming request to preempt running request(s). Default: 10. Type: int.
  • --schedule-conservativeness: How conservative the schedule policy is. A larger value means more conservative scheduling; use a larger value if requests are retracted frequently. Default: 1.0. Type: float.
  • --page-size: The number of tokens in a page. Default: 1. Type: int.
  • --swa-full-tokens-ratio: The ratio of SWA layer KV tokens to full layer KV tokens, regardless of the number of swa:full layers. Must be between 0 and 1; e.g. 0.5 means that if each SWA layer has 50 tokens, each full layer has 100 tokens. Default: 0.8. Type: float.
  • --disable-hybrid-swa-memory: Disable the hybrid SWA memory. Default: False. Bool flag (set to enable).
  • --radix-eviction-policy: The eviction policy of radix trees: ‘lru’ is Least Recently Used, ‘lfu’ is Least Frequently Used. Default: lru. Options: lru, lfu.
  • --enable-prefill-delayer: Enable the prefill delayer for DP attention to reduce idle time. Default: False. Bool flag (set to enable).
  • --prefill-delayer-max-delay-passes: Maximum forward passes to delay prefill. Default: 30. Type: int.
  • --prefill-delayer-token-usage-low-watermark: Token usage low watermark for the prefill delayer. Default: None. Type: float.
  • --prefill-delayer-forward-passes-buckets: Custom buckets for the prefill delayer forward-passes histogram. 0 and max_delay_passes-1 are added automatically. Default: None. Type: List[float].
  • --prefill-delayer-wait-seconds-buckets: Custom buckets for the prefill delayer wait-seconds histogram. 0 is added automatically. Default: None. Type: List[float].
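
How --chunked-prefill-size splits a long prompt can be sketched as follows. This is illustrative only: the function name is made up, and the real scheduler interleaves chunks with other work rather than just slicing token counts.

```python
def split_prefill(num_prompt_tokens, chunk_size):
    """Return the chunk sizes a prompt would be prefilled in.

    chunk_size == -1 mirrors the documented "disable chunked prefill"
    setting: the whole prompt is prefilled in one pass.
    """
    if chunk_size == -1:
        return [num_prompt_tokens]
    return [
        min(chunk_size, num_prompt_tokens - start)
        for start in range(0, num_prompt_tokens, chunk_size)
    ]
```

For example, a 10000-token prompt with --chunked-prefill-size 4096 would be prefilled in chunks of 4096, 4096, and 1808 tokens, bounding peak prefill memory.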

Runtime options

  • --device: The device to use (‘cuda’, ‘xpu’, ‘hpu’, ‘npu’, ‘cpu’). Default: None (auto-detect). Type: str.
  • --tensor-parallel-size, --tp-size: The tensor parallelism size. Default: 1. Type: int.
  • --pipeline-parallel-size, --pp-size: The pipeline parallelism size. Default: 1. Type: int.
  • --pp-max-micro-batch-size: The maximum micro batch size in pipeline parallelism. Default: None. Type: int.
  • --pp-async-batch-depth: The async batch depth of pipeline parallelism. Default: 0. Type: int.
  • --stream-interval: The interval (or buffer size) for streaming, in terms of token length. A smaller value makes streaming smoother; a larger value improves throughput. Default: 1. Type: int.
  • --stream-output: Whether to output as a sequence of disjoint segments. Default: False. Bool flag (set to enable).
  • --random-seed: The random seed. Default: None. Type: int.
  • --constrained-json-whitespace-pattern: (outlines and llguidance backends only) Regex pattern for syntactic whitespace allowed in JSON constrained output. For example, to allow the model to generate consecutive whitespace, set the pattern to [\n\t ]*. Default: None. Type: str.
  • --constrained-json-disable-any-whitespace: (xgrammar and llguidance backends only) Enforce a compact representation in JSON constrained output. Default: False. Bool flag (set to enable).
  • --watchdog-timeout: Watchdog timeout in seconds. If a forward batch takes longer than this, the server crashes to prevent hanging. Default: 300. Type: float.
  • --soft-watchdog-timeout: Soft watchdog timeout in seconds. If a forward batch takes longer than this, the server dumps information for debugging. Default: None. Type: float.
  • --dist-timeout: Timeout for torch.distributed initialization. Default: None. Type: int.
  • --download-dir: Model download directory for Hugging Face. Default: None. Type: str.
  • --model-checksum: Model file integrity verification. If provided without a value, uses model-path as the HF repo ID; otherwise, provide a checksums JSON file path or a Hugging Face repo ID. Default: None. Type: str.
  • --base-gpu-id: The base GPU ID to start allocating GPUs from. Useful when running multiple instances on the same machine. Default: 0. Type: int.
  • --gpu-id-step: The delta between consecutive GPU IDs that are used. For example, setting it to 2 will use GPUs 0, 2, 4, … Default: 1. Type: int.
  • --sleep-on-idle: Reduce CPU usage when SGLang is idle. Default: False. Bool flag (set to enable).
  • --custom-sigquit-handler: Register a custom SIGQUIT handler for additional cleanup after the server shuts down. Only available for Engine, not for the CLI. Default: None. Type: str.
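
--base-gpu-id and --gpu-id-step combine as in the sketch below (an illustrative helper matching the description above, not SGLang code):

```python
def assigned_gpu_ids(base_gpu_id, gpu_id_step, num_gpus):
    # Start at base_gpu_id and advance by gpu_id_step for each device used.
    return [base_gpu_id + i * gpu_id_step for i in range(num_gpus)]
```

With the defaults (base 0, step 1) an instance with 3 GPUs uses devices 0, 1, 2; with --gpu-id-step 2 it uses 0, 2, 4, as the description states.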

Logging

  • --log-level: The logging level of all loggers. Default: info. Type: str.
  • --log-level-http: The logging level of the HTTP server. Default: None (reuses --log-level). Type: str.
  • --log-requests: Log metadata, inputs, and outputs of all requests. Verbosity is controlled by --log-requests-level. Default: False. Bool flag (set to enable).
  • --log-requests-level: 0: log metadata (no sampling parameters). 1: log metadata and sampling parameters. 2: log metadata, sampling parameters, and partial input/output. 3: log every input/output. Default: 2. Options: 0, 1, 2, 3.
  • --log-requests-format: Format for request logging: ‘text’ (human-readable) or ‘json’ (structured). Default: text. Options: text, json.
  • --log-requests-target: Target(s) for request logging: ‘stdout’ and/or directory path(s) for file output. Multiple targets can be given, e.g. --log-requests-target stdout /my/path. Default: None. Type: List[str].
  • --uvicorn-access-log-exclude-prefixes: Exclude uvicorn access logs whose request path starts with any of these prefixes. Default: [] (disabled). Type: List[str].
  • --crash-dump-folder: Folder path to dump requests from the last 5 minutes before a crash (if any). If not specified, crash dumping is disabled. Default: None. Type: str.
  • --show-time-cost: Show the time cost of custom marks. Default: False. Bool flag (set to enable).
  • --enable-metrics: Enable Prometheus metrics logging. Default: False. Bool flag (set to enable).
  • --enable-metrics-for-all-schedulers: Enable this when you want schedulers on all TP ranks (not just TP 0) to record request metrics separately. Especially useful when dp_attention is enabled, as otherwise all metrics appear to come from TP 0. Default: False. Bool flag (set to enable).
  • --tokenizer-metrics-custom-labels-header: The HTTP header used to pass custom labels for tokenizer metrics. Default: x-custom-labels. Type: str.
  • --tokenizer-metrics-allowed-custom-labels: The custom labels allowed for tokenizer metrics. Labels are passed as a dict in the ‘--tokenizer-metrics-custom-labels-header’ field of HTTP requests; e.g. {‘label1’: ‘value1’, ‘label2’: ‘value2’} is allowed if ‘--tokenizer-metrics-allowed-custom-labels label1 label2’ is set. Default: None. Type: List[str].
  • --bucket-time-to-first-token: The buckets of time to first token, specified as a list of floats. Default: None. Type: List[float].
  • --bucket-inter-token-latency: The buckets of inter-token latency, specified as a list of floats. Default: None. Type: List[float].
  • --bucket-e2e-request-latency: The buckets of end-to-end request latency, specified as a list of floats. Default: None. Type: List[float].
  • --collect-tokens-histogram: Collect a prompt/generation tokens histogram. Default: False. Bool flag (set to enable).
  • --prompt-tokens-buckets: The bucket rule for prompt tokens. Supports 3 rule types: ‘default’ uses predefined buckets; ‘tse <middle> <base> <count>’ generates two-sided exponentially distributed buckets (e.g. ‘tse 1000 2 8’ generates buckets [984.0, 992.0, 996.0, 998.0, 1000.0, 1002.0, 1004.0, 1008.0, 1016.0]); ‘custom <value1> <value2> …’ uses custom bucket values (e.g. ‘custom 10 50 100 500’). Default: None. Type: List[str].
  • --generation-tokens-buckets: The bucket rule for the generation tokens histogram. Supports the same 3 rule types as --prompt-tokens-buckets. Default: None. Type: List[str].
  • --gc-warning-threshold-secs: The threshold for the long-GC warning. If a GC pass takes longer than this, a warning is logged. Set to 0 to disable. Default: 0.0. Type: float.
  • --decode-log-interval: The log interval of the decode batch. Default: 40. Type: int.
  • --enable-request-time-stats-logging: Enable per-request time stats logging. Default: False. Bool flag (set to enable).
  • --kv-events-config: JSON config for NVIDIA Dynamo KV event publishing. Publishing is enabled if this flag is used. Default: None. Type: str.
  • --enable-trace: Enable OpenTelemetry tracing. Default: False. Bool flag (set to enable).
  • --otlp-traces-endpoint: The OpenTelemetry collector endpoint when --enable-trace is set, in the format <ip>:<port>. Default: localhost:4317. Type: str.
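
The ‘tse’ bucket rule can be reproduced with a short sketch derived from the documented example (‘tse 1000 2 8’); the function is illustrative, not SGLang’s implementation:

```python
def tse_buckets(middle, base, count):
    """Two-sided exponential buckets: count//2 offsets (base**1 .. base**(count//2))
    on each side of `middle`, plus `middle` itself."""
    offsets = [base ** i for i in range(1, count // 2 + 1)]
    values = (
        [middle - o for o in reversed(offsets)]
        + [middle]
        + [middle + o for o in offsets]
    )
    return [float(v) for v in values]
```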

RequestMetricsExporter configuration

  • --export-metrics-to-file: Export performance metrics for each request to a local file (e.g. for forwarding to external systems). Default: False. Bool flag (set to enable).
  • --export-metrics-to-file-dir: Directory path for writing performance metrics files (required when --export-metrics-to-file is enabled). Default: None. Type: str.
  • --api-key: Set the API key of the server. Also used by the OpenAI API-compatible server. Default: None. Type: str.
  • --admin-api-key: Set the admin API key for administrative/control endpoints (e.g. weights update, cache flush, /get_server_info). Endpoints marked admin-only require Authorization: Bearer <admin_api_key> when this is set. Default: None. Type: str.
  • --served-model-name: Override the model name returned by the v1/models endpoint of the OpenAI API server. Default: None. Type: str.
  • --weight-version: Version identifier for the model weights. Default: default. Type: str.
  • --chat-template: The builtin chat template name or the path of a chat template file. Only used by the OpenAI-compatible API server. Default: None. Type: str.
  • --hf-chat-template-name: When the Hugging Face tokenizer has multiple chat templates (e.g. ‘default’, ‘tool_use’, ‘rag’), specify which named template to use. If not set, the first available template is used. Default: None. Type: str.
  • --completion-template: The builtin completion template name or the path of a completion template file. Only used by the OpenAI-compatible API server, and currently only for code completion. Default: None. Type: str.
  • --file-storage-path: The path of the file storage in the backend. Default: sglang_storage. Type: str.
  • --enable-cache-report: Return the number of cached tokens in usage.prompt_tokens_details for each OpenAI request. Default: False. Bool flag (set to enable).
  • --reasoning-parser: The parser for reasoning models. Default: None. Options: deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3.
  • --tool-call-parser: The parser for handling tool-call interactions. Default: None. Options: deepseekv3, deepseekv31, glm, glm45, glm47, gpt-oss, kimi_k2, llama3, mistral, pythonic, qwen, qwen25, qwen3_coder, step3.
  • --tool-server: Either ‘demo’ or a comma-separated list of tool server URLs to use for the model. If not specified, no tool server is used. Default: None. Type: str.
  • --sampling-defaults: Where to get default sampling parameters. ‘openai’ uses SGLang/OpenAI defaults (temperature=1.0, top_p=1.0, etc.); ‘model’ uses the recommended sampling parameters from the model’s generation_config.json if available. Default: model. Options: openai, model.

Data parallelism

  • --data-parallel-size, --dp-size: The data parallelism size. Default: 1. Type: int.
  • --load-balance-method: The load balancing strategy for data parallelism. The total_tokens algorithm can only be used when DP attention is applied; it balances load based on the real-time token load of the DP workers. Default: auto. Options: auto, round_robin, follow_bootstrap_room, total_requests, total_tokens.

Multi-node distributed serving

  • --dist-init-addr, --nccl-init-addr: The host address for initializing the distributed backend (e.g. 192.168.0.2:25000). Default: None. Type: str.
  • --nnodes: The number of nodes. Default: 1. Type: int.
  • --node-rank: The node rank. Default: 0. Type: int.

Model override args

  • --json-model-override-args: A dictionary in JSON string format used to override default model configurations. Default: {}. Type: str.
  • --preferred-sampling-params: JSON-formatted sampling settings that will be returned by /get_model_info. Default: None. Type: str.

LoRA

  • --enable-lora: Enable LoRA support for the model. Automatically set to True if --lora-paths is provided, for backward compatibility. Default: False. Bool flag (set to enable).
  • --enable-lora-overlap-loading: Enable asynchronous LoRA weight loading to overlap H2D transfers with GPU compute. Enable this if your LoRA workloads are bottlenecked by adapter weight loading, for example when frequently loading large LoRA adapters. Default: False. Bool flag (set to enable).
  • --max-lora-rank: The maximum LoRA rank to support. If not specified, it is inferred from the adapters provided in --lora-paths. Needed when you expect to dynamically load adapters of larger LoRA rank after server startup. Default: None. Type: int.
  • --lora-target-modules: The union of all target modules where LoRA should be applied (e.g. q_proj, k_proj, gate_proj). If not specified, it is inferred from the adapters provided in --lora-paths. You can also set it to all to enable LoRA for all supported modules; note this may introduce minor performance overhead. Default: None. Options: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, qkv_proj, gate_up_proj, all.
  • --lora-paths: The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: <PATH> | <NAME>=<PATH> | JSON with schema {"lora_name": str, "lora_path": str, "pinned": bool}. Default: None. Type: List[str] / JSON objects.
  • --max-loras-per-batch: Maximum number of adapters in a running batch, including base-only requests. Default: 8. Type: int.
  • --max-loaded-loras: If specified, limits the maximum number of LoRA adapters loaded in CPU memory at a time. Must be >= --max-loras-per-batch. Default: None. Type: int.
  • --lora-eviction-policy: LoRA adapter eviction policy when the GPU memory pool is full. Default: lru. Options: lru, fifo.
  • --lora-backend: The kernel backend for multi-LoRA serving. Default: csgmv. Options: triton, csgmv, ascend, torch_native.
  • --max-lora-chunk-size: Maximum chunk size for the ChunkedSGMV LoRA backend. Only used when --lora-backend is csgmv. Larger values may improve performance. Default: 16. Options: 16, 32, 64, 128.
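
The three --lora-paths formats can be illustrated with a small parser. This is a hypothetical helper; SGLang’s actual parsing may differ in details such as the display name chosen for a bare <PATH> entry:

```python
import json

def parse_lora_spec(spec):
    """Normalize one --lora-paths entry to (name, path, pinned)."""
    if spec.startswith("{"):
        # JSON form: {"lora_name": str, "lora_path": str, "pinned": bool}
        d = json.loads(spec)
        return d["lora_name"], d["lora_path"], d.get("pinned", False)
    if "=" in spec:
        # <NAME>=<PATH> form
        name, path = spec.split("=", 1)
        return name, path, False
    # Bare <PATH> form; here the path doubles as the name (an assumption)
    return spec, spec, False
```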

Kernel Backends (Attention, Sampling, Grammar, GEMM)

  • --attention-backend: The kernels for attention layers. Default: None. Options: triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend.
  • --prefill-attention-backend: The kernels for prefill attention layers (takes priority over --attention-backend). Default: None. Options: same as --attention-backend.
  • --decode-attention-backend: The kernels for decode attention layers (takes priority over --attention-backend). Default: None. Options: same as --attention-backend.
  • --sampling-backend: The kernels for sampling layers. Default: None. Options: flashinfer, pytorch, ascend.
  • --grammar-backend: The backend for grammar-guided decoding. Default: None. Options: xgrammar, outlines, llguidance, none.
  • --mm-attention-backend: The multimodal attention backend. Default: None. Options: sdpa, fa3, fa4, triton_attn, ascend_attn, aiter_attn.
  • --nsa-prefill-backend: The NSA backend for the prefill stage (overrides --attention-backend when running DeepSeek NSA-style attention). Default: flashmla_sparse. Options: flashmla_sparse, flashmla_kv, flashmla_auto, fa3, tilelang, aiter.
  • --nsa-decode-backend: The NSA backend for the decode stage when running DeepSeek NSA-style attention (overrides --attention-backend for decoding). Default: fa3. Options: flashmla_sparse, flashmla_kv, fa3, tilelang, aiter.
  • --fp8-gemm-backend: The runner backend for blockwise FP8 GEMM operations. ‘auto’ (default) selects based on hardware; ‘deep_gemm’ is JIT-compiled and enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed; ‘flashinfer_trtllm’ is optimal for Blackwell and low latency; ‘cutlass’ is optimal for Hopper/Blackwell GPUs and high throughput; ‘triton’ is a widely compatible fallback; ‘aiter’ is ROCm only. NOTE: this replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. Default: auto. Options: auto, deep_gemm, flashinfer_trtllm, cutlass, triton, aiter.
  • --fp4-gemm-backend: The runner backend for NVFP4 GEMM operations. ‘auto’ (default) selects between flashinfer_cudnn and flashinfer_cutlass based on the CUDA/cuDNN version; ‘flashinfer_cudnn’ (FlashInfer cuDNN backend) is optimal on CUDA 13+ with cuDNN 9.15+; ‘flashinfer_cutlass’ (FlashInfer CUTLASS backend) is optimal on CUDA 12; ‘flashinfer_trtllm’ (FlashInfer TensorRT-LLM backend) requires different weight preparation with shuffling. All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. NOTE: this replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. Default: auto. Options: auto, flashinfer_cudnn, flashinfer_cutlass, flashinfer_trtllm.
  • --disable-flashinfer-autotune: FlashInfer autotune is enabled by default; set this flag to disable it. Default: False. Bool flag (set to enable).

Speculative decoding

  • --speculative-algorithm: The speculative algorithm. Default: None. Options: EAGLE, EAGLE3, NEXTN, STANDALONE, NGRAM.
  • --speculative-draft-model-path, --speculative-draft-model: The path of the draft model weights. This can be a local folder or a Hugging Face repo ID. Default: None. Type: str.
  • --speculative-draft-model-revision: The specific draft model version to use: a branch name, a tag name, or a commit id. Default: None (uses the default version). Type: str.
  • --speculative-draft-load-format: The format of the draft model weights to load. If not specified, uses the same format as --load-format. Use ‘dummy’ to initialize draft model weights with random values for profiling. Default: None. Options: same as --load-format.
  • --speculative-num-steps: The number of steps sampled from the draft model in speculative decoding. Default: None. Type: int.
  • --speculative-eagle-topk: The number of tokens sampled from the draft model at each step in EAGLE-2. Default: None. Type: int.
  • --speculative-num-draft-tokens: The number of tokens sampled from the draft model in speculative decoding. Default: None. Type: int.
  • --speculative-accept-threshold-single: Accept a draft token if its probability in the target model is greater than this threshold. Default: 1.0. Type: float.
  • --speculative-accept-threshold-acc: The accept probability of a draft token is raised from its target probability p to min(1, p / threshold_acc). Default: 1.0. Type: float.
  • --speculative-token-map: The path of the draft model’s small vocab table. Default: None. Type: str.
  • --speculative-attention-mode: Attention backend mode for speculative decoding operations (both target verify and draft extend). Default: prefill. Options: prefill, decode.
  • --speculative-draft-attention-backend: Attention backend for speculative decoding drafting. Default: None. Options: same as --attention-backend.
  • --speculative-moe-runner-backend: MoE runner backend for EAGLE speculative decoding; see --moe-runner-backend for options. Uses the main MoE runner backend if unset. Default: None. Options: same as --moe-runner-backend.
  • --speculative-moe-a2a-backend: MoE A2A backend for EAGLE speculative decoding; see --moe-a2a-backend for options. Uses the main MoE A2A backend if unset. Default: None. Options: same as --moe-a2a-backend.
  • --speculative-draft-model-quantization: The quantization method for the draft model. Default: None. Options: same as --quantization.
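
The two acceptance thresholds can be read as the following sketch. This is illustrative only: it assumes the two thresholds compose in this order, which the descriptions above suggest but do not guarantee.

```python
def draft_accept_prob(p_target, threshold_single=1.0, threshold_acc=1.0):
    """Toy model of the documented acceptance knobs for one draft token.

    p_target: the draft token's probability under the target model.
    """
    # --speculative-accept-threshold-single: accept outright when the target
    # probability exceeds the threshold (default 1.0, so it never triggers).
    if p_target > threshold_single:
        return 1.0
    # --speculative-accept-threshold-acc: raise the accept probability from
    # p to min(1, p / threshold_acc). With the default 1.0 this is just p.
    return min(1.0, p_target / threshold_acc)
```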

Ngram speculative decoding

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --speculative-ngram-min-match-window-size | The minimum window size for pattern matching in ngram speculative decoding. | 1 | Type: int |
| --speculative-ngram-max-match-window-size | The maximum window size for pattern matching in ngram speculative decoding. | 12 | Type: int |
| --speculative-ngram-min-bfs-breadth | The minimum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | 1 | Type: int |
| --speculative-ngram-max-bfs-breadth | The maximum breadth for BFS (Breadth-First Search) in ngram speculative decoding. | 10 | Type: int |
| --speculative-ngram-match-type | The match type for the cache tree. | BFS | BFS, PROB |
| --speculative-ngram-branch-length | The branch length for ngram speculative decoding. | 18 | Type: int |
| --speculative-ngram-capacity | The cache capacity for ngram speculative decoding. | 10000000 | Type: int |

Multi-layer Eagle speculative decoding

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-multi-layer-eagle | Enable multi-layer Eagle speculative decoding. | False | bool flag (set to enable) |

MoE

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --expert-parallel-size, --ep-size, --ep | The expert parallelism size. | 1 | Type: int |
| --moe-a2a-backend | Select the backend for all-to-all communication for expert parallelism. | none | none, deepep, mooncake, ascend_fuseep |
| --moe-runner-backend | Choose the runner backend for MoE. | auto | auto, deep_gemm, triton, triton_kernel, flashinfer_trtllm, flashinfer_cutlass, flashinfer_mxfp4, flashinfer_cutedsl, cutlass |
| --flashinfer-mxfp4-moe-precision | Choose the computation precision of the FlashInfer mxfp4 MoE. | default | default, bf16 |
| --enable-flashinfer-allreduce-fusion | Enable FlashInfer allreduce fusion with Residual RMSNorm. | False | bool flag (set to enable) |
| --deepep-mode | Select the mode when DeepEP MoE is enabled; can be normal, low_latency, or auto. The default is auto, which means low_latency for decode batches and normal for prefill batches. | auto | normal, low_latency, auto |
| --ep-num-redundant-experts | Allocate this number of redundant experts in expert parallelism. | 0 | Type: int |
| --ep-dispatch-algorithm | The algorithm to choose ranks for redundant experts in expert parallelism. | None | Type: str |
| --init-expert-location | Initial location of EP experts. | trivial | Type: str |
| --enable-eplb | Enable the EPLB algorithm. | False | bool flag (set to enable) |
| --eplb-algorithm | The chosen EPLB algorithm. | auto | Type: str |
| --eplb-rebalance-num-iterations | Number of iterations to automatically trigger an EPLB rebalance. | 1000 | Type: int |
| --eplb-rebalance-layers-per-chunk | Number of layers to rebalance per forward pass. | None | Type: int |
| --eplb-min-rebalancing-utilization-threshold | Minimum threshold for GPU average utilization to trigger EPLB rebalancing. Must be in the range [0.0, 1.0]. | 1.0 | Type: float |
| --expert-distribution-recorder-mode | Mode of the expert distribution recorder. | None | Type: str |
| --expert-distribution-recorder-buffer-size | Circular buffer size of the expert distribution recorder. Set to -1 to denote an infinite buffer. | None | Type: int |
| --enable-expert-distribution-metrics | Enable logging metrics for expert balancedness. | False | bool flag (set to enable) |
| --deepep-config | Tuned DeepEP config suitable for your own cluster. It can be either a string with JSON content or a file path. | None | Type: str |
| --moe-dense-tp-size | TP size for MoE dense MLP layers. This flag is useful when, with a large TP size, there are errors caused by weights in MLP layers having a dimension smaller than the minimum dimension GEMM supports. | None | Type: int |
| --elastic-ep-backend | Specify the collective communication backend for elastic EP. Currently supports 'mooncake'. | none | none, mooncake |
| --mooncake-ib-device | The InfiniBand devices for Mooncake backend transfer; accepts multiple comma-separated devices (e.g., --mooncake-ib-device mlx5_0,mlx5_1). The default is None, which triggers automatic device detection when the Mooncake backend is enabled. | None | Type: str |

Mamba Cache

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --max-mamba-cache-size | The maximum size of the mamba cache. | None | Type: int |
| --mamba-ssm-dtype | The data type of the SSM states in the mamba cache. | float32 | float32, bfloat16 |
| --mamba-full-memory-ratio | The ratio of mamba state memory to full KV cache memory. | 0.9 | Type: float |
| --mamba-scheduler-strategy | The strategy for the mamba scheduler; auto currently defaults to no_buffer. (1) no_buffer does not support the overlap scheduler because it allocates no extra mamba state buffers; branching-point caching would be feasible but is not implemented. (2) extra_buffer supports the overlap scheduler by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running request becomes 2x for non-spec, and (1 + 1/(2 + speculative_num_draft_tokens))x for speculative decoding, e.g. 1.16x if speculative_num_draft_tokens == 4). (2a) extra_buffer is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs the reduced maximum number of running requests. (2b) Mamba caching at radix cache branching points is strictly better than non-branch caching but requires kernel support (currently only the FLA backend); only extra_buffer supports branching. | auto | auto, no_buffer, extra_buffer |
| --mamba-track-interval | The interval (in tokens) to track the mamba state during decode. Only used when --mamba-scheduler-strategy is extra_buffer. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | 256 | Type: int |

Hierarchical cache

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-hierarchical-cache | Enable hierarchical cache. | False | bool flag (set to enable) |
| --hicache-ratio | The ratio of the size of the host KV cache memory pool to the size of the device pool. | 2.0 | Type: float |
| --hicache-size | The size of the host KV cache memory pool in gigabytes; overrides hicache_ratio if set. | 0 | Type: int |
| --hicache-write-policy | The write policy of the hierarchical cache. | write_through | write_back, write_through, write_through_selective |
| --hicache-io-backend | The IO backend for KV cache transfer between CPU and GPU. | kernel | direct, kernel, kernel_ascend |
| --hicache-mem-layout | The layout of the host memory pool for the hierarchical cache. | layer_first | layer_first, page_first, page_first_direct, page_first_kv_split, page_head |
| --hicache-storage-backend | The storage backend for the hierarchical KV cache. Built-in backends: file, mooncake, hf3fs, nixl, aibrix. For a dynamic backend, use --hicache-storage-backend-extra-config to specify: backend_name (custom name), module_path (Python module path), class_name (backend class name). | None | file, mooncake, hf3fs, nixl, aibrix, dynamic, eic |
| --hicache-storage-prefetch-policy | Control when prefetching from the storage backend should stop. | best_effort | best_effort, wait_complete, timeout |
| --hicache-storage-backend-extra-config | A dictionary in JSON string format, or a string starting with @ followed by a config file path in JSON/YAML/TOML format, containing extra configuration for the storage backend. | None | Type: str |
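For a dynamic storage backend, the extra-config JSON carries the three fields named in the --hicache-storage-backend row. The sketch below builds that JSON programmatically so the shell quoting comes out right; the backend name, module path, and class name are hypothetical placeholders.

```python
import json

# Hypothetical dynamic-backend descriptor; only the three keys
# (backend_name, module_path, class_name) come from the docs above.
extra_config = {
    "backend_name": "my_backend",
    "module_path": "my_package.kv_storage",
    "class_name": "MyKVStorageBackend",
}
arg = json.dumps(extra_config)
print(arg)
```

The resulting string is passed as `--hicache-storage-backend dynamic --hicache-storage-backend-extra-config '<json>'`, or written to a file and referenced with the `@` prefix.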

Hierarchical sparse attention

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --hierarchical-sparse-attention-extra-config | A dictionary in JSON string format for hierarchical sparse attention configuration. Required fields: algorithm (str), backend (str). All other fields are algorithm-specific and passed to the algorithm constructor. | None | Type: str |

LMCache

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-lmcache | Use LMCache as an alternative hierarchical cache solution. | False | bool flag (set to enable) |

Ktransformers

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --kt-weight-path | [ktransformers parameter] The path of the quantized expert weights for the AMX kernel. A local folder. | None | Type: str |
| --kt-method | [ktransformers parameter] Quantization format for CPU execution. | AMXINT4 | Type: str |
| --kt-cpuinfer | [ktransformers parameter] The number of CPUInfer threads. | None | Type: int |
| --kt-threadpool-count | [ktransformers parameter] One-to-one with the number of NUMA nodes (one thread pool per NUMA node). | 2 | Type: int |
| --kt-num-gpu-experts | [ktransformers parameter] The number of GPU experts. | None | Type: int |
| --kt-max-deferred-experts-per-token | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | None | Type: int |

Diffusion LLM

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --dllm-algorithm | The diffusion LLM algorithm, such as LowConfidence. | None | Type: str |
| --dllm-algorithm-config | The diffusion LLM algorithm configuration. Must be a YAML file. | None | Type: str |

Double Sparsity

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-double-sparsity | Enable double sparsity attention. | False | bool flag (set to enable) |
| --ds-channel-config-path | The path of the double sparsity channel config. | None | Type: str |
| --ds-heavy-channel-num | The number of heavy channels in double sparsity attention. | 32 | Type: int |
| --ds-heavy-token-num | The number of heavy tokens in double sparsity attention. | 256 | Type: int |
| --ds-heavy-channel-type | The type of heavy channels in double sparsity attention. | qk | Type: str |
| --ds-sparse-decode-threshold | The minimum decode sequence length required before the double-sparsity backend switches from the dense fallback to the sparse decode kernel. | 4096 | Type: int |

Offloading

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --cpu-offload-gb | How many GBs of RAM to reserve for CPU offloading. | 0 | Type: int |
| --offload-group-size | Number of layers per group in offloading. | -1 | Type: int |
| --offload-num-in-group | Number of layers to be offloaded within a group. | 1 | Type: int |
| --offload-prefetch-step | Steps to prefetch in offloading. | 1 | Type: int |
| --offload-mode | Mode of offloading. | cpu | Type: str |

Args for multi-item scoring

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --multi-item-scoring-delimiter | Delimiter token ID for multi-item scoring. Used to combine Query and Items into a single sequence: Query&lt;delimiter&gt;Item1&lt;delimiter&gt;Item2&lt;delimiter&gt;… This enables efficient batch processing of multiple items against a single query. | None | Type: int |
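The combined sequence described above can be sketched in a few lines. The token IDs and the delimiter value below are made up for illustration; whether a trailing delimiter follows the last item is not specified here, so this sketch only inserts the delimiter before each item.

```python
DELIM = 32000  # hypothetical token ID passed via --multi-item-scoring-delimiter

def build_multi_item_sequence(query_ids, items_ids, delim=DELIM):
    """Assemble Query<delimiter>Item1<delimiter>Item2... as one token list."""
    seq = list(query_ids)
    for item in items_ids:
        seq.append(delim)
        seq.extend(item)
    return seq

seq = build_multi_item_sequence([1, 2], [[10, 11], [20]])
print(seq)  # [1, 2, 32000, 10, 11, 32000, 20]
```

All items share the query prefix in one sequence, so the query is prefilled once instead of once per item.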

Optimization/debug options

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --disable-radix-cache | Disable RadixAttention for prefix caching. | False | bool flag (set to enable) |
| --cuda-graph-max-bs | Set the maximum batch size for cuda graph. It will extend the cuda graph capture batch size to this value. | None | Type: int |
| --cuda-graph-bs | Set the list of batch sizes for cuda graph. | None | List[int] |
| --disable-cuda-graph | Disable cuda graph. | False | bool flag (set to enable) |
| --disable-cuda-graph-padding | Disable cuda graph when padding is needed; still use cuda graph when padding is not needed. | False | bool flag (set to enable) |
| --enable-profile-cuda-graph | Enable profiling of cuda graph capture. | False | bool flag (set to enable) |
| --enable-cudagraph-gc | Enable garbage collection during CUDA graph capture. If disabled (the default), GC is frozen during capture to speed up the process. | False | bool flag (set to enable) |
| --enable-layerwise-nvtx-marker | Enable layerwise NVTX profiling annotations for the model. This adds NVTX markers to every layer for detailed per-layer performance analysis with Nsight Systems. | False | bool flag (set to enable) |
| --enable-nccl-nvls | Enable NCCL NVLS for prefill-heavy requests when available. | False | bool flag (set to enable) |
| --enable-symm-mem | Enable NCCL symmetric memory for fast collectives. | False | bool flag (set to enable) |
| --disable-flashinfer-cutlass-moe-fp4-allgather | Disable quantization before all-gather for the FlashInfer cutlass MoE. | False | bool flag (set to enable) |
| --enable-tokenizer-batch-encode | Enable batch tokenization for improved performance when processing multiple text inputs. Do not use with image inputs, pre-tokenized input_ids, or input_embeds. | False | bool flag (set to enable) |
| --disable-tokenizer-batch-decode | Disable batch decoding when decoding multiple completions. | False | bool flag (set to enable) |
| --disable-outlines-disk-cache | Disable the disk cache of outlines to avoid possible crashes related to the file system or high concurrency. | False | bool flag (set to enable) |
| --disable-custom-all-reduce | Disable the custom all-reduce kernel and fall back to NCCL. | False | bool flag (set to enable) |
| --enable-mscclpp | Use mscclpp for small-message all-reduce and fall back to NCCL otherwise. | False | bool flag (set to enable) |
| --enable-torch-symm-mem | Use torch symmetric memory for the all-reduce kernel and fall back to NCCL otherwise. Only supports CUDA devices SM90 and above. SM90 supports world sizes 4, 6, 8; SM100 supports world sizes 6, 8. | False | bool flag (set to enable) |
| --disable-overlap-schedule | Disable the overlap scheduler, which overlaps the CPU scheduler with the GPU model worker. | False | bool flag (set to enable) |
| --enable-mixed-chunk | Enable mixing prefill and decode in a batch when using chunked prefill. | False | bool flag (set to enable) |
| --enable-dp-attention | Enable data parallelism for attention and tensor parallelism for FFN. The DP size should equal the TP size. Currently DeepSeek-V2 and Qwen 2/3 MoE models are supported. | False | bool flag (set to enable) |
| --enable-dp-lm-head | Enable vocabulary parallelism across the attention TP group to avoid all-gather across DP groups, optimizing performance under DP attention. | False | bool flag (set to enable) |
| --enable-two-batch-overlap | Enable overlapping of two micro-batches. | False | bool flag (set to enable) |
| --enable-single-batch-overlap | Let computation and communication overlap within one micro-batch. | False | bool flag (set to enable) |
| --tbo-token-distribution-threshold | The token-distribution threshold between the two batches in micro-batch overlap; determines whether to use two-batch-overlap or two-chunk-overlap. Set to 0 to disable two-chunk-overlap. | 0.48 | Type: float |
| --enable-torch-compile | Optimize the model with torch.compile. Experimental feature. | False | bool flag (set to enable) |
| --enable-torch-compile-debug-mode | Enable debug mode for torch compile. | False | bool flag (set to enable) |
| --enable-piecewise-cuda-graph | Optimize the model with piecewise cuda graph for extend/prefill only. Experimental feature. | False | bool flag (set to enable) |
| --piecewise-cuda-graph-tokens | Set the list of token counts when using piecewise cuda graph. | None | Type: JSON list |
| --piecewise-cuda-graph-compiler | Set the compiler for piecewise cuda graph. | eager | eager, inductor |
| --torch-compile-max-bs | Set the maximum batch size when using torch compile. | 32 | Type: int |
| --piecewise-cuda-graph-max-tokens | Set the maximum number of tokens when using piecewise cuda graph. | 4096 | Type: int |
| --torchao-config | Optimize the model with torchao. Experimental feature. Current choices are: int8dq, int8wo, int4wo-&lt;group_size&gt;, fp8wo, fp8dq-per_tensor, fp8dq-per_row. | "" | Type: str |
| --enable-nan-detection | Enable NaN detection for debugging purposes. | False | bool flag (set to enable) |
| --enable-p2p-check | Enable the P2P check for GPU access; otherwise P2P access is allowed by default. | False | bool flag (set to enable) |
| --triton-attention-reduce-in-fp32 | Cast the intermediate attention results to fp32 to avoid possible crashes related to fp16. This only affects Triton attention kernels. | False | bool flag (set to enable) |
| --triton-attention-num-kv-splits | The number of KV splits in the flash decoding Triton kernel. A larger value is better in longer-context scenarios. | 8 | Type: int |
| --triton-attention-split-tile-size | The size of the split KV tile in the flash decoding Triton kernel. Used for deterministic inference. | None | Type: int |
| --num-continuous-decode-steps | Run multiple continuous decoding steps to reduce scheduling overhead. This can potentially increase throughput but may also increase time-to-first-token latency. The default value of 1 means only one decoding step runs at a time. | 1 | Type: int |
| --delete-ckpt-after-loading | Delete the model checkpoint after loading the model. | False | bool flag (set to enable) |
| --enable-memory-saver | Allow saving memory using release_memory_occupation and resume_memory_occupation. | False | bool flag (set to enable) |
| --enable-weights-cpu-backup | Save model weights to CPU memory during release_weights_occupation and resume_weights_occupation. | False | bool flag (set to enable) |
| --enable-draft-weights-cpu-backup | Save draft model weights to CPU memory during release_weights_occupation and resume_weights_occupation. | False | bool flag (set to enable) |
| --allow-auto-truncate | Allow automatically truncating requests that exceed the maximum input length instead of returning an error. | False | bool flag (set to enable) |
| --enable-custom-logit-processor | Allow users to pass custom logit processors to the server (disabled by default for security). | False | bool flag (set to enable) |
| --flashinfer-mla-disable-ragged | Do not use the ragged prefill wrapper when running FlashInfer MLA. | False | bool flag (set to enable) |
| --disable-shared-experts-fusion | Disable the shared experts fusion optimization for DeepSeek V3/R1. | False | bool flag (set to enable) |
| --disable-chunked-prefix-cache | Disable the chunked prefix cache feature for DeepSeek, which should save overhead for short sequences. | False | bool flag (set to enable) |
| --disable-fast-image-processor | Use the base image processor instead of the fast image processor. | False | bool flag (set to enable) |
| --keep-mm-feature-on-device | Keep multimodal feature tensors on device after processing to save a D2H copy. | False | bool flag (set to enable) |
| --enable-return-hidden-states | Enable returning hidden states with responses. | False | bool flag (set to enable) |
| --enable-return-routed-experts | Enable returning the routed experts of each layer with responses. | False | bool flag (set to enable) |
| --scheduler-recv-interval | The interval to poll requests in the scheduler. Can be set to >1 to reduce polling overhead. | 1 | Type: int |
| --numa-node | Set the NUMA node for the subprocesses. The i-th element corresponds to the i-th subprocess. | None | List[int] |
| --enable-deterministic-inference | Enable deterministic inference mode with batch-invariant ops. | False | bool flag (set to enable) |
| --rl-on-policy-target | The training system that SGLang needs to match for true on-policy training. | None | fsdp |
| --enable-attn-tp-input-scattered | Allow the input of attention to be scattered when only tensor parallelism is used, to reduce the computational load of operations such as qkv latent. | False | bool flag (set to enable) |
| --enable-nsa-prefill-context-parallel | Enable context parallelism in the long-sequence prefill phase of DeepSeek V3.2. | False | bool flag (set to enable) |
| --nsa-prefill-cp-mode | Token splitting mode for the prefill phase of DeepSeek V3.2 under context parallelism. round-robin-split distributes tokens across ranks based on token_idx % cp_size and supports multi-batch prefill, fused MoE, and FP8 KV cache. | in-seq-split | in-seq-split, round-robin-split |
| --enable-fused-qk-norm-rope | Enable fused qk normalization and rope rotary embedding. | False | bool flag (set to enable) |
| --enable-precise-embedding-interpolation | Enable corner alignment when resizing the embeddings grid, for more accurate (but slower) evaluation of interpolated embedding values. | False | bool flag (set to enable) |

Dynamic batch tokenizer

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-dynamic-batch-tokenizer | Enable the async dynamic batch tokenizer for improved performance when multiple requests arrive concurrently. | False | bool flag (set to enable) |
| --dynamic-batch-tokenizer-batch-size | [Only used if --enable-dynamic-batch-tokenizer is set] Maximum batch size for the dynamic batch tokenizer. | 32 | Type: int |
| --dynamic-batch-tokenizer-batch-timeout | [Only used if --enable-dynamic-batch-tokenizer is set] Timeout in seconds for batching tokenization requests. | 0.002 | Type: float |

Debug tensor dumps

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --debug-tensor-dump-output-folder | The output folder for dumping tensors. | None | Type: str |
| --debug-tensor-dump-layers | The layer ids to dump. Dumps all layers if not specified. | None | Type: JSON list |
| --debug-tensor-dump-input-file | The input filename for dumping tensors. | None | Type: str |
| --debug-tensor-dump-inject | Inject the outputs from JAX as the input of every layer. | False | Type: str |

PD disaggregation

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --disaggregation-mode | Only used for PD disaggregation: "prefill" for a prefill-only server and "decode" for a decode-only server. If not specified, the server is not PD disaggregated. | null | null, prefill, decode |
| --disaggregation-transfer-backend | The backend for disaggregation transfer. | mooncake | mooncake, nixl, ascend, fake |
| --disaggregation-bootstrap-port | Bootstrap server port on the prefill server. | 8998 | Type: int |
| --disaggregation-decode-tp | Decode TP size. If not set, it matches the TP size of the current engine. This is only set on the prefill server. | None | Type: int |
| --disaggregation-decode-dp | Decode DP size. If not set, it matches the DP size of the current engine. This is only set on the prefill server. | None | Type: int |
| --disaggregation-prefill-pp | Prefill PP size. If not set, it defaults to 1. This is only set on the decode server. | 1 | Type: int |
| --disaggregation-ib-device | The InfiniBand devices for disaggregation transfer; accepts a single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). The default is None, which triggers automatic device detection when the mooncake backend is enabled. | None | Type: str |
| --disaggregation-decode-enable-offload-kvcache | Enable async KV cache offloading on the decode server (PD mode). | False | bool flag (set to enable) |
| --disaggregation-decode-enable-fake-auto | Automatically enable FAKE mode for decode-node testing; no need to pass bootstrap_host and bootstrap_room in the request. | False | bool flag (set to enable) |
| --num-reserved-decode-tokens | Number of decode tokens that will have memory reserved when adding a new request to the running batch. | 512 | Type: int |
| --disaggregation-decode-polling-interval | The interval to poll requests in the decode server. Can be set to >1 to reduce polling overhead. | 1 | Type: int |

Encode prefill disaggregation

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --encoder-only | For an MLLM with an encoder, launch an encoder-only server. | False | bool flag (set to enable) |
| --language-only | For a VLM, load weights for the language model only. | False | bool flag (set to enable) |
| --encoder-transfer-backend | The backend for encoder disaggregation transfer. | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake |
| --encoder-urls | List of encoder server URLs. | [] | Type: JSON list |

Custom weight loader

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --custom-weight-loader | The custom weight loader used to update the model. Should be set to a valid import path, such as my_package.weight_load_func. | None | List[str] |
| --weight-loader-disable-mmap | Disable mmap when loading weights using safetensors. | False | bool flag (set to enable) |
| --remote-instance-weight-loader-seed-instance-ip | The IP of the seed instance for loading weights from a remote instance. | None | Type: str |
| --remote-instance-weight-loader-seed-instance-service-port | The service port of the seed instance for loading weights from a remote instance. | None | Type: int |
| --remote-instance-weight-loader-send-weights-group-ports | The communication group ports for loading weights from a remote instance. | None | Type: JSON list |
| --remote-instance-weight-loader-backend | The backend for loading weights from a remote instance. Can be 'transfer_engine' or 'nccl'. | nccl | transfer_engine, nccl |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | Start the seed server via the transfer engine backend for the remote instance weight loader. | False | bool flag (set to enable) |

For PD-Multiplexing

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-pdmux | Enable PD-Multiplexing, with prefill and decode running on greenctx streams. | False | bool flag (set to enable) |
| --pdmux-config-path | The path of the PD-Multiplexing config file. | None | Type: str |
| --sm-group-num | Number of SM partition groups. | 8 | Type: int |

Configuration file support

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --config | Read CLI options from a config file. Must be a YAML file with configuration options. | None | Type: str |

For Multi-Modal

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --mm-max-concurrent-calls | The maximum number of concurrent calls for async multimodal data processing. | 32 | Type: int |
| --mm-per-request-timeout | The timeout for each multimodal request in seconds. | 10.0 | Type: float |
| --enable-broadcast-mm-inputs-process | Enable the broadcast mm-inputs process in the scheduler. | False | bool flag (set to enable) |
| --mm-process-config | Multimodal preprocessing config: a JSON config containing the keys image, video, and audio. | {} | Type: JSON / Dict |
| --mm-enable-dp-encoder | Enable data parallelism for the multimodal encoder. The DP size will be set to the TP size automatically. | False | bool flag (set to enable) |
| --limit-mm-data-per-request | Limit the number of multimodal inputs per request, e.g. '{"image": 1, "video": 1, "audio": 1}'. | None | Type: JSON / Dict |
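Because --limit-mm-data-per-request takes a JSON dictionary on the command line, it is easy to break the quoting by hand. A small sketch that builds the value with the json module, using the example limits from the table above:

```python
import json

# Per-request multimodal input limits, as shown in the table above.
limits = {"image": 1, "video": 1, "audio": 1}
arg = json.dumps(limits)
print(arg)  # {"image": 1, "video": 1, "audio": 1}
```

The printed string can then be passed, single-quoted, as the value of --limit-mm-data-per-request.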

For checkpoint decryption

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --decrypted-config-file | The path of the decrypted config file. | None | Type: str |
| --decrypted-draft-config-file | The path of the decrypted draft config file. | None | Type: str |
| --enable-prefix-mm-cache | Enable prefix multimodal cache. Currently only supports mm-only. | False | bool flag (set to enable) |

Forward hooks

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --forward-hooks | JSON-formatted list of forward hook specifications. Each element must include target_modules (a list of glob patterns matched against model.named_modules() names) and hook_factory (a Python import path to a factory, e.g. my_package.hooks:make_hook). An optional name field is used for logging, and an optional config object is passed as a dict to the factory. | None | Type: JSON list |
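A hypothetical hook factory matching the contract described above: the server imports the factory from the hook_factory path, calls it with the optional config dict, and registers the returned callable as a forward hook on every matching module. All names, the scale parameter, and the example spec below are illustrative, not SGLang internals.

```python
# my_package/hooks.py (hypothetical module referenced by hook_factory)
def make_hook(config=None):
    """Factory: receives the optional `config` dict from the hook spec
    and returns a forward hook callable."""
    scale = (config or {}).get("scale", 1.0)

    def hook(module, inputs, output):
        # A PyTorch-style forward hook may inspect or replace the output.
        return output * scale

    return hook

# A matching server argument might look like:
# --forward-hooks '[{"name": "scale-mlp", "target_modules": ["*.mlp"],
#                    "hook_factory": "my_package.hooks:make_hook",
#                    "config": {"scale": 0.5}}]'
hook = make_hook({"scale": 0.5})
print(hook(None, (), 4.0))  # 2.0
```

The hook signature follows the usual PyTorch forward-hook convention of (module, inputs, output), with a non-None return value replacing the module output.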

Deprecated arguments

| Argument | Description | Defaults | Options |
|---|---|---|---|
| --enable-ep-moe | Deprecated. Set --ep-size to the same value as --tp-size instead. | None | N/A |
| --enable-deepep-moe | Deprecated. Set --moe-a2a-backend to 'deepep' instead. | None | N/A |
| --prefill-round-robin-balance | Deprecated. | None | N/A |
| --enable-flashinfer-cutlass-moe | Deprecated. Set --moe-runner-backend to 'flashinfer_cutlass' instead. | None | N/A |
| --enable-flashinfer-cutedsl-moe | Deprecated. Set --moe-runner-backend to 'flashinfer_cutedsl' instead. | None | N/A |
| --enable-flashinfer-trtllm-moe | Deprecated. Set --moe-runner-backend to 'flashinfer_trtllm' instead. | None | N/A |
| --enable-triton-kernel-moe | Deprecated. Set --moe-runner-backend to 'triton_kernel' instead. | None | N/A |
| --enable-flashinfer-mxfp4-moe | Deprecated. Set --moe-runner-backend to 'flashinfer_mxfp4' instead. | None | N/A |
| --crash-on-nan | Crash the server on NaN logprobs. | False | Type: str |
| --hybrid-kvcache-ratio | Mix ratio in [0, 1] between uniform and hybrid KV buffers (0.0 = pure uniform: swa_size / full_size = 1; 1.0 = pure hybrid: swa_size / full_size = local_attention_size / context_length). | None | Optional[float] |
| --load-watch-interval | The interval of load watching in seconds. | 0.1 | Type: float |
| --nsa-prefill | Choose the NSA backend for the prefill stage (overrides --attention-backend when running DeepSeek NSA-style attention). | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter |
| --nsa-decode | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides --attention-backend for decoding. | flashmla_kv | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter |