1. Model Introduction
NVIDIA Nemotron 3 Nano is a 30B-parameter hybrid LLM that interleaves Mixture-of-Experts (MoE) feed-forward layers, Mamba2 sequence-modeling layers, and standard self-attention layers in a single stack, rather than using classic “attention + MLP” transformer blocks.
The BF16 variant (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) is designed as a high-fidelity reference model, while the FP8 variant (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8) targets optimized inference performance on modern NVIDIA GPUs.
At a high level:
- Hybrid layer stack (Mamba2 + MoE + attention): The network is composed of interleaved layers, each of which is a Mamba2 layer, an MoE feed-forward layer, or an attention layer.
- Non-uniform layer ordering: The order and mix of these specialized layers is not a simple, rigid pattern, enabling the model to trade off sequence modeling, routing capacity, and expressivity across depth.
- Deployment-friendly precision: Use BF16 for accuracy-sensitive and evaluation workloads; use FP8 for latency- and throughput-critical serving on recent NVIDIA GPUs.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. For a quick start, install the nightly wheel for SGLang.
3. Model Deployment
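The install command itself was not preserved here; as a hedged sketch, a typical pip-based install might look like the following (the package extras and the availability of pre-release wheels on the default index are assumptions; consult the SGLang installation docs for the current nightly channel):

```shell
# Upgrade pip, then install SGLang with all optional dependencies.
# --pre allows pre-release (nightly) wheels when the index publishes them.
pip install --upgrade pip
pip install --pre "sglang[all]"
```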
This section provides a progressive guide from quick deployment to performance tuning.
3.1 Basic Configuration
Use the interactive command generator to select your hardware, model variant, and common knobs, and it will generate a launch command for you.
3.2 Configuration Tips
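If the generator is unavailable, a minimal launch might look like the following sketch (the model path comes from Section 1; the port is an assumption, matching SGLang's default):

```shell
# Serve the FP8 checkpoint with SGLang's standard server entry point.
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --port 30000
```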
- Attention backend: On H200/B200, the flashinfer attention backend is used by default.
- TP support: To set the tensor-parallel size, use --tp <1|2|4|8>.
- FP8 KV cache: To enable the FP8 KV cache, append --kv-cache-dtype fp8_e4m3.
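Putting the tips together, a tuned launch command might look like this sketch (the TP size of 2 is an assumption; pick the value that matches your GPU count):

```shell
# B200 example: flashinfer attention (the default on this hardware),
# tensor parallelism across 2 GPUs, and an FP8 KV cache.
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --attention-backend flashinfer \
  --tp 2 \
  --kv-cache-dtype fp8_e4m3
```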
4. Model Invocation
4.1 Basic Usage (OpenAI-Compatible API)
SGLang provides an OpenAI-compatible endpoint, so any OpenAI-style client can be used against it.
4.2 Reasoning
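A minimal sketch of a chat-completion request, written with only the standard library so it stays dependency-free (the OpenAI Python client works the same way when pointed at the same base URL); the server is assumed to be listening on localhost:30000, SGLang's default port:

```python
import json
import urllib.request

# Request payload in the OpenAI chat-completions format.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "max_tokens": 64,
}

def chat(base_url="http://localhost:30000/v1"):
    """POST the payload to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```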
To enable reasoning, --reasoning-parser nano_v3 should be appended to the launch command. The model supports two modes: Reasoning ON (the default) and Reasoning OFF. Reasoning can be switched off by setting enable_thinking to False.
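A hedged sketch of the toggle, again using the standard library; passing the flag through a chat_template_kwargs field follows the common SGLang convention and is an assumption here:

```python
import json
import urllib.request

# Reasoning OFF: pass enable_thinking=False through the chat template kwargs.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Summarize Mamba2 in one sentence."}],
    "chat_template_kwargs": {"enable_thinking": False},
}

def chat(base_url="http://localhost:30000/v1"):
    """Send the request; reasoning content is omitted when thinking is off."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]
```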
4.3 Tool calling
To enable tool calling, --tool-call-parser qwen3_coder should be appended to the launch command. Call functions using the OpenAI Tools schema and inspect the returned tool_calls.
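A minimal sketch of a tool-calling request; the get_weather tool below is an illustrative placeholder, not an API the model ships with:

```python
import json
import urllib.request

# A single tool declared in the OpenAI Tools schema (name/parameters are
# hypothetical placeholders for illustration).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
}

def call_tools(base_url="http://localhost:30000/v1"):
    """POST the request and return any tool_calls the model emitted."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    return message.get("tool_calls", [])
```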
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU
- Model Deployment Command:
- Benchmark Command:
- Test Results:
- Model Deployment Command:
- Benchmark Command:
- Test Results:
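The benchmark commands above were not preserved; a typical SGLang serving benchmark might look like the following sketch (the prompt count and sequence lengths are assumptions):

```shell
# Measure serving throughput and latency against a running server.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 256 \
  --random-input-len 1024 \
  --random-output-len 1024
```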
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
Environment
- Hardware: NVIDIA B200 GPU
- Model: BF16 checkpoint
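A hedged sketch of a GSM8K accuracy run using SGLang's few-shot GSM8K benchmark script (the script path follows the sglang repository layout and is an assumption; it expects a server already running):

```shell
# Evaluate few-shot GSM8K accuracy against the served BF16 checkpoint.
python benchmark/gsm8k/bench_sglang.py --num-questions 1319
```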
