## Overview
The checkpoint engine integration allows SGLang to:

- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks like CUDA graph capture
- Support both single-node and multi-node deployments
## Installation
First, install the checkpoint engine package: `pip install 'checkpoint-engine[p2p]'`

## Architecture

The system consists of two main components:

- SGLang Server: Runs with the `--wait-for-initial-weights` flag so it waits for weights before becoming ready
- Checkpoint Engine Workers: Separate processes (managed by `torchrun`) that load and distribute model weights

The workers support three weight update methods:

- Broadcast mode: Weights are broadcast from loading processes to inference processes
- P2P mode: Direct peer-to-peer weight transfer between processes
- All mode: Combination of both broadcast and P2P methods
## Usage Examples
### Single Node Setup
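The original launch commands did not survive extraction, so the following is a hedged sketch: the SGLang flags and the checkpoint engine options come from the Configuration Options section below, while the `update.py` entry point name, the model path, and the process count are placeholders, not values confirmed by this document.

```shell
# Terminal 1: start SGLang with dummy weights and wait for the real ones
python3 -m sglang.launch_server \
  --model-path /path/to/model \
  --load-format dummy \
  --wait-for-initial-weights

# Terminal 2: load the checkpoint and push it to the waiting server
# (update.py stands in for the checkpoint engine's entry point)
torchrun --nproc_per_node 8 update.py \
  --update-method broadcast \
  --checkpoint-path /path/to/model \
  --inference-parallel-size 8 \
  --endpoint http://localhost:19730
```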
Terminal 1 - Launch the SGLang server; Terminal 2 - launch the checkpoint engine workers with `torchrun`.

### Multi-Node Setup (2 Nodes)
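A hedged sketch of a two-node loading setup, in which each node reads only a share of the checkpoint from disk. The `update.py` entry point, `NODE0_IP`, and the process counts are placeholders; the flags are taken from the Configuration Options section.

```shell
# Node 0: SGLang server, reachable from the other node
python3 -m sglang.launch_server \
  --model-path /path/to/model \
  --load-format dummy \
  --wait-for-initial-weights \
  --host 0.0.0.0

# Run on both nodes (with --node_rank 0 and 1 respectively):
# checkpoint engine workers that each load part of the weights
torchrun --nnodes 2 --node_rank 0 --master_addr NODE0_IP --nproc_per_node 8 \
  update.py \
  --update-method broadcast \
  --checkpoint-path /path/to/model \
  --inference-parallel-size 8 \
  --endpoint http://NODE0_IP:19730
```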
Node 0 launches the SGLang server with `--host` set so the other node can reach it; checkpoint engine workers run on both nodes so that each loads a share of the checkpoint.

### Multi-Node Setup with Tensor Parallelism (TP=16)
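A hedged sketch of a TP=16 deployment spanning two 8-GPU nodes. `--dist-init-addr` comes from the SGLang options below; the remaining flag values (`NODE0_IP`, the port, the model path) are placeholders, and whether node 1 also needs `--wait-for-initial-weights` is not confirmed by this document.

```shell
# Node 0: SGLang server spanning both nodes via tensor parallelism
python3 -m sglang.launch_server \
  --model-path /path/to/model \
  --load-format dummy \
  --wait-for-initial-weights \
  --host 0.0.0.0 \
  --tp-size 16 \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr NODE0_IP:5000

# Node 1: second server process joining the same tensor-parallel group
python3 -m sglang.launch_server \
  --model-path /path/to/model \
  --load-format dummy \
  --tp-size 16 \
  --nnodes 2 --node-rank 1 \
  --dist-init-addr NODE0_IP:5000
```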
Node 0 launches the SGLang server with `--dist-init-addr` so that node 1 can join the same tensor-parallel group; the checkpoint engine workers are then started as in the two-node setup.

## Configuration Options
### SGLang Server Options

- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
- `--wait-for-initial-weights`: Wait for the checkpoint engine to provide weights before becoming ready
- `--host`: Host address for multi-node setups
- `--dist-init-addr`: Distributed initialization address for tensor parallelism
### Checkpoint Engine Options

- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to the model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File to save checkpoint metadata to
- `--load-metas-file`: File to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for the weights
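The two metadata flags pair up across runs: one invocation records checkpoint metadata and a later one reuses it. The sketch below is hypothetical; the `update.py` entry point, the file name, and the exact semantics of the save/load cycle are assumptions, not confirmed by this document.

```shell
# First run: register the checkpoint and record its metadata
torchrun --nproc_per_node 8 update.py \
  --checkpoint-path /path/to/model \
  --checkpoint-name my-checkpoint-iter-0 \
  --save-metas-file metas.json

# Later run: reuse the recorded metadata rather than re-registering
torchrun --nproc_per_node 8 update.py \
  --load-metas-file metas.json \
  --update-method p2p
```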
## Performance Benefits

The checkpoint engine provides significant time savings in two main aspects:

- Multi-node Loading: Each node loads only a portion of the weights from disk, effectively increasing aggregate disk bandwidth; more participating nodes yield greater acceleration. Preliminary tests show a 20-second speedup when loading DeepSeek-R1 on H20-3e with two nodes.
- Single-Process Optimization: The dummy load format allows the disk-to-CPU transfer to overlap with CUDA graph capture and other initialization tasks, providing additional time savings.
## Troubleshooting

- Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor logs for connection errors between the SGLang server and the checkpoint engine
- Use the `--sleep-time` parameter to add delays if needed for debugging
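Because the server only becomes ready once the checkpoint engine has delivered the initial weights, polling it is a convenient way to detect completion when debugging. A minimal sketch, assuming SGLang's standard `/health` route (an assumption, not stated in this document):

```python
import time
import urllib.request
from urllib.error import URLError


def wait_until_ready(endpoint: str, timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Poll `endpoint`/health until the server answers 200 or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{endpoint}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

For example, `wait_until_ready("http://localhost:19730")` blocks until the server has received its initial weights or ten minutes have passed.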
