This doc describes the sampling parameters of the SGLang Runtime. The /generate endpoint described here is the runtime's low-level API. If you want a high-level API that automatically handles chat templates, consider using the OpenAI Compatible API.

/generate Endpoint

The /generate endpoint accepts the following parameters in JSON format. For detailed usage, see the native API doc. The object is defined at io_struct.py::GenerateReqInput. You can also read the source code to find more arguments and docs.
| Argument | Type/Default | Description |
| --- | --- | --- |
| text | Optional[Union[List[str], str]] = None | The input prompt. Can be a single prompt or a batch of prompts. |
| input_ids | Optional[Union[List[List[int]], List[int]]] = None | The token IDs of the input; one can specify either text or input_ids. |
| input_embeds | Optional[Union[List[List[List[float]]], List[List[float]]]] = None | The embeddings of the input; one can specify either text, input_ids, or input_embeds. |
| image_data | Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None | The image input. Supports three formats: (1) raw images: a PIL Image, file path, URL, or base64 string; (2) processor output: a dict with format: "processor_output" containing HuggingFace processor outputs; (3) precomputed embeddings: a dict with format: "precomputed_embedding" and feature containing pre-calculated visual embeddings. Can be a single image, a list of images, or a list of lists of images. See Multimodal Input Formats for details. |
| audio_data | Optional[Union[List[AudioDataItem], AudioDataItem]] = None | The audio input. Can be a file name, URL, or base64 encoded string. |
| sampling_params | Optional[Union[List[Dict], Dict]] = None | The sampling parameters as described in the sections below. |
| rid | Optional[Union[List[str], str]] = None | The request ID. |
| return_logprob | Optional[Union[List[bool], bool]] = None | Whether to return log probabilities for tokens. |
| logprob_start_len | Optional[Union[List[int], int]] = None | If return_logprob, the start location in the prompt for returning logprobs. The default is -1, which returns logprobs for output tokens only. |
| top_logprobs_num | Optional[Union[List[int], int]] = None | If return_logprob, the number of top logprobs to return at each position. |
| token_ids_logprob | Optional[Union[List[List[int]], List[int]]] = None | If return_logprob, the token IDs to return logprobs for. |
| return_text_in_logprobs | bool = False | Whether to detokenize tokens in text in the returned logprobs. |
| stream | bool = False | Whether to stream the output. |
| lora_path | Optional[Union[List[Optional[str]], Optional[str]]] = None | The path to the LoRA adapter. |
| custom_logit_processor | Optional[Union[List[Optional[str]], str]] = None | A custom logit processor for advanced sampling control. Must be a serialized instance of CustomLogitProcessor, produced by its to_str() method. For usage, see below. |
| return_hidden_states | Union[List[bool], bool] = False | Whether to return hidden states. |
| return_routed_experts | bool = False | Whether to return routed experts for MoE models. Requires the --enable-return-routed-experts server flag. Returns base64-encoded int32 expert IDs as a flattened array with logical shape [num_tokens, num_layers, top_k]. |
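To make the return_routed_experts layout concrete, the sketch below decodes a base64-encoded flat int32 buffer back into the logical [num_tokens, num_layers, top_k] shape. This is an illustrative helper, not part of SGLang's API, and it assumes a little-endian int32 byte order:

```python
import base64
import struct

def decode_routed_experts(b64_payload, num_tokens, num_layers, top_k):
    """Decode base64-encoded int32 expert IDs into a nested
    [num_tokens][num_layers][top_k] list structure (little-endian assumed)."""
    raw = base64.b64decode(b64_payload)
    flat = list(struct.unpack(f"<{len(raw) // 4}i", raw))
    assert len(flat) == num_tokens * num_layers * top_k
    return [
        [
            flat[(t * num_layers + l) * top_k : (t * num_layers + l + 1) * top_k]
            for l in range(num_layers)
        ]
        for t in range(num_tokens)
    ]

# Round-trip a tiny synthetic buffer: 2 tokens, 2 layers, top_k = 2.
flat_ids = [0, 3, 1, 2, 4, 5, 6, 7]
payload = base64.b64encode(struct.pack("<8i", *flat_ids)).decode()
experts = decode_routed_experts(payload, num_tokens=2, num_layers=2, top_k=2)
print(experts[0][1])  # → [1, 2]
```
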

Sampling parameters

The object is defined at sampling_params.py::SamplingParams. You can also read the source code to find more arguments and docs.

Note on defaults

By default, SGLang initializes several sampling parameters from the model’s generation_config.json (when the server is launched with --sampling-defaults model, which is the default). To use SGLang/OpenAI constant defaults instead, start the server with --sampling-defaults openai. You can always override any parameter per request via sampling_params.
# Use model-provided defaults from generation_config.json (default behavior)
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults model

# Use SGLang/OpenAI constant defaults instead
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults openai

Core parameters

| Argument | Type/Default | Description |
| --- | --- | --- |
| max_new_tokens | int = 128 | The maximum output length, measured in tokens. |
| stop | Optional[Union[str, List[str]]] = None | One or multiple stop words. Generation stops if one of these words is sampled. |
| stop_token_ids | Optional[List[int]] = None | Stop words provided as token IDs. Generation stops if one of these token IDs is sampled. |
| stop_regex | Optional[Union[str, List[str]]] = None | Generation stops when the output matches any of these regex patterns. |
| temperature | float (model default; fallback 1.0) | The temperature used when sampling the next token. temperature = 0 corresponds to greedy sampling; a higher temperature leads to more diversity. |
| top_p | float (model default; fallback 1.0) | Top-p samples from the smallest set of highest-probability tokens whose cumulative probability exceeds top_p. When top_p = 1, this reduces to unrestricted sampling from all tokens. |
| top_k | int (model default; fallback -1) | Top-k randomly samples from the k highest-probability tokens. |
| min_p | float (model default; fallback 0.0) | Min-p samples only from tokens whose probability is larger than min_p * highest_token_probability. |
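To build intuition for how top-p restricts the candidate set, here is a minimal pure-Python sketch of nucleus filtering. It is illustrative only and not SGLang's actual sampling kernel:

```python
def top_p_filter(probs, top_p):
    """Illustrative nucleus (top-p) filtering: keep the smallest set of
    highest-probability tokens whose cumulative probability reaches top_p,
    zero out the rest, and renormalize. Not SGLang's implementation."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [p / mass if i in kept else 0.0 for i, p in enumerate(probs)]

# With top_p = 0.7, only the two most likely tokens survive
# and their probabilities are renormalized.
print(top_p_filter([0.25, 0.5, 0.10, 0.15], top_p=0.7))
```
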

Penalizers

| Argument | Type/Default | Description |
| --- | --- | --- |
| frequency_penalty | float = 0.0 | Penalizes tokens based on their frequency in the generation so far. Must be between -2 and 2, where negative values encourage repetition and positive values encourage sampling new tokens. The penalty grows linearly with each additional appearance of a token. |
| presence_penalty | float = 0.0 | Penalizes tokens that have appeared in the generation so far. Must be between -2 and 2, where negative values encourage repetition and positive values encourage sampling new tokens. The penalty is constant once a token has occurred. |
| repetition_penalty | float = 1.0 | Scales the logits of previously generated tokens to discourage (values > 1) or encourage (values < 1) repetition. The valid range is [0, 2]; 1.0 leaves probabilities unchanged. |
| min_new_tokens | int = 0 | Forces the model to generate at least min_new_tokens tokens before a stop word or EOS token takes effect. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens. |
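The difference between the linear frequency penalty and the flat presence penalty can be sketched as a direct adjustment of logits. This is an illustrative model of the semantics described above, not SGLang's implementation:

```python
def apply_penalties(logits, generated_counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Illustrative sketch of frequency and presence penalties.
    `generated_counts` maps a token id to how often it has been
    generated so far in this request."""
    adjusted = list(logits)
    for token_id, count in generated_counts.items():
        # The frequency penalty grows linearly with each appearance.
        adjusted[token_id] -= frequency_penalty * count
        # The presence penalty is a constant offset once a token has appeared.
        if count > 0:
            adjusted[token_id] -= presence_penalty
    return adjusted

# Token 0 appeared twice, token 1 once; token 2 is untouched.
print(apply_penalties([1.0, 1.0, 1.0], {0: 2, 1: 1},
                      frequency_penalty=0.5, presence_penalty=0.25))
# → [-0.25, 0.25, 1.0]
```
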

Constrained decoding

Please refer to our dedicated guide on constrained decoding for the following parameters.
| Argument | Type/Default | Description |
| --- | --- | --- |
| json_schema | Optional[str] = None | A JSON schema for structured outputs. |
| regex | Optional[str] = None | A regex for structured outputs. |
| ebnf | Optional[str] = None | An EBNF grammar for structured outputs. |
| structural_tag | Optional[str] = None | The structural tag for structured outputs. |

Other options

| Argument | Type/Default | Description |
| --- | --- | --- |
| n | int = 1 | The number of output sequences to generate per request. Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompt several times offers better control and efficiency. |
| ignore_eos | bool = False | Don't stop generation when the EOS token is sampled. |
| skip_special_tokens | bool = True | Remove special tokens during decoding. |
| spaces_between_special_tokens | bool = True | Whether to add spaces between special tokens during detokenization. |
| no_stop_trim | bool = False | Don't trim stop words or the EOS token from the generated text. |
| custom_params | Optional[List[Optional[Dict[str, Any]]]] = None | Used together with CustomLogitProcessor. For usage, see below. |

Examples

Normal

Launch a server:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
Send a request:
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
A detailed example can be found in the send request documentation.

Streaming

Send a request and stream the output:
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
A detailed example can be found in the OpenAI Compatible API documentation.

Multimodal

Launch a server:
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov
Download an image:
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true
Send a request:
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
The image_data can be a file name, a URL, or a base64 encoded string. See also python/sglang/srt/utils.py:load_image. Streaming is supported in the same manner as above. A detailed example can be found in the OpenAI API Vision documentation.
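Since image_data also accepts a base64-encoded string, a small helper for encoding a local file might look like this (the function name is illustrative, not part of SGLang's API):

```python
import base64

def encode_image_base64(path):
    # Read the raw image bytes and return them as a base64 string,
    # ready to be passed as the "image_data" field of a /generate request.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Example usage (hypothetical payload, assuming the image downloaded above):
# payload = {"text": "...", "image_data": encode_image_base64("example_image.png")}
```
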

Structured Outputs (JSON, Regex, EBNF)

You can specify a JSON schema, regular expression or EBNF to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified for a request. SGLang supports two grammar backends:
  • XGrammar (default): Supports JSON schema, regular expression, and EBNF constraints.
  • Outlines: Supports JSON schema and regular expression constraints.
XGrammar is used by default. To use the Outlines backend instead, launch the server with the --grammar-backend outlines flag:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 30000 --host 0.0.0.0 --grammar-backend outlines
import json
import requests

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

# JSON (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

# EBNF (XGrammar backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a greeting.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
        },
    },
)
print(response.json())
A detailed example can be found in the structured outputs documentation.

Custom logit processor

Launch a server with --enable-custom-logit-processor flag on.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --enable-custom-logit-processor
Define a custom logit processor that will always sample a specific token id.
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor

class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
Send a request:
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())
Send an OpenAI chat completion request:
import openai
from sglang.utils import print_highlight

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.0,
    max_tokens=32,
    extra_body={
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "custom_params": {"token_id": 5},
    },
)

print_highlight(f"Response: {response}")