NVIDIA Jetson Orin

Prerequisites
Installing and running SGLang with Jetson Containers
Running Inference
Running quantization with TorchAO
Structured output with XGrammar
References

Prerequisites

Before starting, ensure the following:

NVIDIA Jetson AGX Orin Devkit is set up with JetPack 6.1 or later.
CUDA Toolkit and cuDNN are installed.
Verify that the Jetson AGX Orin is in high-performance mode:

sudo nvpmodel -m 0

Installing and running SGLang with Jetson Containers

Clone the jetson-containers github repository:

git clone https://github.com/dusty-nv/jetson-containers.git

Run the installation script:

bash jetson-containers/install.sh

Build the container image:

jetson-containers build sglang

Run the container:

jetson-containers run $(autotag sglang)

Or you can also manually run a container with this command:

docker run --runtime nvidia -it --rm --network=host IMAGE_NAME

Running Inference

Launch the server:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --dtype half \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192

The quantization and limited context length (--dtype half --context-length 8192) are due to the limited computational resources in Nvidia jetson kit. A detailed explanation can be found in Server Arguments. After launching the engine, refer to Chat completions to test the usability.

Running quantization with TorchAO

TorchAO is suggested to NVIDIA Jetson Orin.

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128

This enables TorchAO’s int4 weight-only quantization with a 128-group size. The usage of --torchao-config int4wo-128 is also for memory efficiency.

Structured output with XGrammar

Please refer to SGLang doc structured output.

Thanks to the support from Nurgaliyev Shakhizat, Dustin Franklin and Johnny Núñez Cano.

References

NVIDIA Jetson AGX Orin Documentation

TPU SGLang installation with NPUs support

Getting Started

Basic Usage

Advanced Features

Supported Models

Hardware Platforms

Developer Guide

References

Prerequisites

Installing and running SGLang with Jetson Containers

Running Inference

Running quantization with TorchAO

Structured output with XGrammar

References

Getting Started

Basic Usage

Advanced Features

Supported Models

Hardware Platforms

Developer Guide

References

​Prerequisites

​Installing and running SGLang with Jetson Containers

​Running Inference

​Running quantization with TorchAO

​Structured output with XGrammar

​References

Prerequisites

Installing and running SGLang with Jetson Containers

Running Inference

Running quantization with TorchAO

Structured output with XGrammar

References