Method 1: With pip or uv
It is recommended to use uv for faster installation.

- In some cases (e.g., GB200), the install command might install a wrong torch version (e.g., the CPU version) due to dependency resolution. To fix this, you can first run the install command and then force-reinstall the correct PyTorch build.
- For CUDA 13, Docker is recommended (see the Method 3 note on B300/GB300/CUDA 13). If you do not have Docker access, installing the matching `sgl_kernel` wheel from the sgl-project whl releases after installing SGLang also works. Replace `X.Y.Z` with the `sgl_kernel` version required by your SGLang (you can find this by running `uv pip show sgl_kernel`).
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions:
  - Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  - Install FlashInfer first following the FlashInfer installation doc, then install SGLang as described above.
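The uv-based installation mentioned above is a short sequence; a minimal sketch, assuming the standard `sglang[all]` extras (the current pinned version may differ, so check the official install docs for the exact command):

```shell
# Sketch of a uv-based install; the extras and version pin are assumptions.
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]"
```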
Method 2: From source
- If you want to develop SGLang, you can try the dev docker image. Please refer to setup docker container. The docker image is `lmsysorg/sglang:dev`.
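The from-source workflow can be sketched as follows (the clone URL and the `python[all]` editable-install path are assumptions based on the project's standard layout; check the repository README for the current instructions):

```shell
# Sketch of an editable from-source install; paths and extras are assumptions.
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```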
Method 3: Using docker
The docker images are available on Docker Hub at lmsysorg/sglang, built from Dockerfile. Replace `<secret>` below with your Hugging Face Hub token.

A runtime image variant is also available, which is significantly smaller (~40% reduction) by excluding build tools and development dependencies.
- On B300/GB300 (SM103) or CUDA 13 environments, we recommend using the nightly image `lmsysorg/sglang:dev-cu13` or the stable image `lmsysorg/sglang:latest-cu130-runtime`. Please do not re-install the project as editable inside the docker image, since that would override the library versions specified by the cu13 docker image.
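A typical single-node launch with the Docker image can be sketched as follows (the model name is only an example, and `<secret>` is your Hugging Face token as noted above):

```shell
# Sketch: run the SGLang server in Docker on all GPUs; the model is an example.
docker run --gpus all -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000
```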
Method 4: Using Kubernetes
Please check out OME, a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
- Option 1: For single-node serving (typically when the model fits into the GPUs on one node). Execute `kubectl apply -f docker/k8s-sglang-service.yaml` to create the k8s deployment and service, with llama-31-8b as an example.
- Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as DeepSeek-R1). Modify the LLM model path and arguments as necessary, then execute `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node k8s StatefulSet and serving service.
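After applying either manifest, you can verify the rollout with standard kubectl commands; a sketch (the service name below is an assumption, so substitute the name defined in the YAML you applied):

```shell
# Sketch: check pod status, then probe the server's health endpoint locally.
kubectl get pods
kubectl port-forward service/sglang 30000:30000 &
curl http://localhost:30000/health
```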
Method 5: Using docker compose
This method is recommended if you plan to run SGLang as a service; a better approach is to use the k8s-sglang-service.yaml.
- Copy the compose.yml to your local machine
- Execute the command `docker compose up -d` in your terminal.
Method 6: Run on Kubernetes or Clouds with SkyPilot
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
- Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot’s documentation.
- Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
- To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
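Assuming the sglang.yaml linked above, the single-command deployment can be sketched as follows (the cluster name is only an example):

```shell
# Sketch: launch on your own infra, then fetch the HTTP API endpoint.
sky launch -c sglang sglang.yaml
sky status --endpoint 30000 sglang
```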
Method 7: Run on AWS SageMaker
To deploy SGLang on AWS SageMaker, check out AWS SageMaker Inference. Amazon Web Services provides support for SGLang containers along with routine security patching. For available SGLang containers, check out AWS SGLang DLCs. To host a model with your own container, follow these steps:
- Build a docker container with sagemaker.Dockerfile alongside the serve script.
- Push your container onto AWS ECR.
Dockerfile Build Script: build-and-push.sh
- Deploy a model for serving on AWS SageMaker; refer to deploy_and_serve_endpoint.py. For more information, check out sagemaker-python-sdk.
- By default, the model server on SageMaker will run with the following command: `python3 -m sglang.launch_server --model-path /opt/ml/model --host 0.0.0.0 --port 8080`. This is optimal for hosting your own model with SageMaker.
- To modify your model serving parameters, the serve script exposes all options available in the `python3 -m sglang.launch_server --help` CLI via environment variables with the prefix `SM_SGLANG_`.
- The serve script automatically converts every environment variable with the prefix `SM_SGLANG_`, e.g. `SM_SGLANG_INPUT_ARGUMENT`, into `--input-argument` to be parsed by the `python3 -m sglang.launch_server` CLI.
- For example, to run Qwen/Qwen3-0.6B with the reasoning parser, simply add the additional environment variables `SM_SGLANG_MODEL_PATH=Qwen/Qwen3-0.6B` and `SM_SGLANG_REASONING_PARSER=qwen3`.
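The prefix convention described above amounts to a simple name rewrite: strip `SM_SGLANG_`, lowercase the rest, and turn underscores into dashes. A minimal sketch of that rewrite (an illustration only, not the actual serve script):

```shell
# Illustration of the SM_SGLANG_ naming convention (not the real serve script):
# SM_SGLANG_REASONING_PARSER=qwen3  ->  --reasoning-parser qwen3
env_name="SM_SGLANG_REASONING_PARSER"
flag="--$(echo "${env_name#SM_SGLANG_}" | tr 'A-Z' 'a-z' | tr '_' '-')"
echo "$flag"
```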
Common Notes
- FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- To reinstall FlashInfer locally, use `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps`, then delete the cache with `rm -rf ~/.cache/flashinfer`.
- When encountering `ptxas fatal : Value 'sm_103a' is not defined for option 'gpu-name'` on B300/GB300, fix it with `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas`.
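Per the first note above, switching off FlashInfer at launch looks like the following sketch (the model path is a placeholder):

```shell
# Sketch: launch with the Triton attention backend and PyTorch sampling backend.
python3 -m sglang.launch_server --model-path <your-model> \
  --attention-backend triton --sampling-backend pytorch
```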
