Optimized Model List
The following popular LLMs are optimized to run efficiently on CPU, including notable open-source models such as the Llama series, the Qwen series, and the DeepSeek series (for example DeepSeek-R1 and DeepSeek-V3.1-Terminus).

| Model Name | BF16 | W8A8_INT8 | FP8 |
|---|---|---|---|
| DeepSeek-R1 | | meituan/DeepSeek-R1-Channel-INT8 | deepseek-ai/DeepSeek-R1 |
| DeepSeek-V3.1-Terminus | | IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 | deepseek-ai/DeepSeek-V3.1-Terminus |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct | RedHatAI/Llama-3.2-3B-quantized.w8a8 | |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct | RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 | |
| QwQ-32B | | RedHatAI/QwQ-32B-quantized.w8a8 | |
| DeepSeek-Distilled-Llama | | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | |
| Qwen3-235B | | | Qwen/Qwen3-235B-A22B-FP8 |
Installation
Install Using Docker
It is recommended to use Docker for setting up the SGLang environment. A Dockerfile is provided to facilitate the installation. Replace <secret> below with your HuggingFace access token.
Install From Source
If you prefer to install SGLang in a bare metal environment, the setup process is as follows.

Please install the required packages and libraries beforehand if they are not already present on your system. You can refer to the Ubuntu-based installation commands in the Dockerfile for guidance.

- Install the uv package manager, then create and activate a virtual environment:
- Create a config file to direct the installation channel (a.k.a. index-url) of torch related packages:
  Using an editor such as vim, paste the index configuration into the created file. Save the file (in vim, press ‘esc’ to exit insert mode, then ‘:x+Enter’), and set it as the default uv config.
- Clone the sglang source code and build the packages
- Set the required environment variables
Notes:

- The environment variable SGLANG_USE_CPU_ENGINE=1 is required to enable the SGLang service with the CPU engine.
- If you encounter code compilation issues during the sgl-kernel building process, please check your gcc and g++ versions and upgrade them if they are outdated. It is recommended to use gcc-13 and g++-13, as they have been verified in the official Docker container.
- The system library path is typically located in one of the following directories: ~/.local/lib/, /usr/local/lib/, /usr/local/lib64/, /usr/lib/, /usr/lib64/ and /usr/lib/x86_64-linux-gnu/. In the above example commands, /usr/lib/x86_64-linux-gnu is used. Please adjust the path according to your server configuration.
- It is recommended to add the environment variable settings to your ~/.bashrc file so that you do not have to set these variables every time you open a new terminal.
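To find out which of the directories listed in the notes holds a given library on your server, a small Python sketch like the one below may help. The helper name `find_library_dir` is hypothetical and not part of SGLang; the candidate list is copied from the note on the system library path.

```python
from pathlib import Path

# Candidate directories, copied from the note on the system library path.
CANDIDATE_LIB_DIRS = [
    "~/.local/lib",
    "/usr/local/lib",
    "/usr/local/lib64",
    "/usr/lib",
    "/usr/lib64",
    "/usr/lib/x86_64-linux-gnu",
]


def find_library_dir(lib_name, candidates=CANDIDATE_LIB_DIRS):
    """Return the first candidate directory that contains lib_name, or None.

    Hypothetical helper for illustration; it only checks that the file exists.
    """
    for candidate in candidates:
        directory = Path(candidate).expanduser()  # resolves a leading "~"
        if (directory / lib_name).exists():
            return str(directory)
    return None
```

For example, `find_library_dir("libtcmalloc.so.4")` returns the directory to use in your environment variable settings if that file is present; the library file name here is only an illustrative assumption, so substitute the library your setup actually needs.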
Launch of the Serving Engine
Example command to launch SGLang serving:

Notes:

- For running W8A8 quantized models, please add the flag --quantization w8a8_int8.
- The flag --tp 6 specifies that tensor parallelism is applied using 6 ranks (TP6). On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC). The number of available SNCs can usually be obtained from the operating system. The specified TP size must not exceed the total number of SNCs available in the system; exceeding it results in an error. If the TP size is smaller than the total SNC count, the system automatically uses the first n SNCs. To specify the cores to be used, explicitly set the environment variable SGLANG_CPU_OMP_THREADS_BIND. For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server, which has 43-43-42 cores on the 3 SNCs of a socket, set this variable to the matching core ranges. Please be aware that with SGLANG_CPU_OMP_THREADS_BIND set, the amount of memory available to each rank may not be determined in advance. You may need to set a suitable --max-total-tokens to avoid out-of-memory errors.
- For optimizing decoding with torch.compile, please add the flag --enable-torch-compile. To specify the maximum batch size when using torch.compile, set the flag --torch-compile-max-bs. For example, --enable-torch-compile --torch-compile-max-bs 4 means using torch.compile and setting the maximum batch size to 4. Currently the maximum applicable batch size for optimizing with torch.compile is 16.
- A warmup step is automatically triggered when the service is started. The server is ready when you see the log "The server is fired up and ready to roll!".
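The core ranges for SGLANG_CPU_OMP_THREADS_BIND can be worked out by hand, but a short Python sketch makes the arithmetic explicit. It assumes that core IDs are numbered contiguously across SNCs and sockets, and that the variable takes one start-end range per TP rank separated by `|`; check both assumptions against your SGLang version and your server's core layout. The function name is hypothetical.

```python
def build_omp_threads_bind(snc_core_counts, cores_per_rank):
    """Build a bind string with one "start-end" core range per SNC.

    snc_core_counts: number of cores in each SNC, in core-ID order.
    cores_per_rank: how many leading cores of each SNC to use.
    Assumes contiguous core numbering and a "|"-separated range format.
    """
    ranges = []
    first_core = 0
    for count in snc_core_counts:
        if cores_per_rank > count:
            raise ValueError(
                f"SNC has only {count} cores, cannot bind {cores_per_rank}"
            )
        ranges.append(f"{first_core}-{first_core + cores_per_rank - 1}")
        first_core += count  # the next SNC starts after all cores of this one
    return "|".join(ranges)


# Two sockets of a Xeon 6980P, each with SNCs of 43, 43 and 42 cores,
# using the first 40 cores of every SNC.
bind = build_omp_threads_bind([43, 43, 42, 43, 43, 42], cores_per_rank=40)
print(bind)  # 0-39|43-82|86-125|128-167|171-210|214-253
```

The resulting string would then be exported as the value of SGLANG_CPU_OMP_THREADS_BIND before launching the server.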
Benchmarking with Requests
You can benchmark the performance via the bench_serving script.
Run the command in another terminal. An example command would be:

Requests can also be sent from the command line (e.g. using curl) or through your own scripts.
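As one possible shape for such a script, the Python sketch below builds and sends a single request using only the standard library. It assumes the server exposes an OpenAI-compatible `/v1/completions` route on `http://localhost:30000`; the base URL, the `model` value, and the function names are assumptions to adjust for your deployment.

```python
import json
import urllib.request


def build_completion_request(prompt, model="default", max_tokens=32,
                             base_url="http://localhost:30000"):
    """Build (but do not send) a POST request for the /v1/completions route.

    base_url and model are assumptions; adjust them to your deployment.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request, timeout=600):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_completion(build_completion_request("Hello"))` returns the parsed JSON response. The generous timeout is a deliberate choice, since CPU inference on large models can take a while per request.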
Example Usage Commands
Large Language Models can range from fewer than 1 billion to several hundred billion parameters. Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer, or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common 4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.
Example: Running DeepSeek-V3.1-Terminus
An example command to launch service of W8A8_INT8 DeepSeek-V3.1-Terminus on a Xeon® 6980P server:

Please set --torch-compile-max-bs to the maximum desired batch size for your deployment, which can be up to 16. The value 4 in the examples is illustrative.
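The flags discussed in this guide can be assembled programmatically, which also makes it easy to enforce the batch-size limit of 16. The Python sketch below is an illustration, not the original command: `sglang.launch_server`, `--model-path` and `--device cpu` are not named in this section and are assumptions based on SGLang's usual command line, and a real deployment may need further flags that are not reproduced here.

```python
import shlex

MAX_TORCH_COMPILE_BS = 16  # upper limit stated in the launch notes


def build_launch_command(model_path, tp, quantization=None,
                         torch_compile_max_bs=None):
    """Return an argument list for launching the SGLang server on CPU.

    The entry point and the --model-path / --device flags are assumptions.
    """
    command = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--device", "cpu",
        "--tp", str(tp),
    ]
    if quantization is not None:
        command += ["--quantization", quantization]
    if torch_compile_max_bs is not None:
        if not 1 <= torch_compile_max_bs <= MAX_TORCH_COMPILE_BS:
            raise ValueError(
                f"--torch-compile-max-bs must be between 1 and {MAX_TORCH_COMPILE_BS}"
            )
        command += [
            "--enable-torch-compile",
            "--torch-compile-max-bs", str(torch_compile_max_bs),
        ]
    return command


# W8A8_INT8 DeepSeek-V3.1-Terminus on 6 TP ranks, with an illustrative batch size of 4.
command = build_launch_command(
    "IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8",
    tp=6,
    quantization="w8a8_int8",
    torch_compile_max_bs=4,
)
print(shlex.join(command))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.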
Example: Running Llama-3.2-3B
An example command to launch service of Llama-3.2-3B with BF16 precision:

The --torch-compile-max-bs and --tp settings are examples that should be adjusted for your setup. For instance, use --tp 3 to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.
Once the server has been launched, you can test it using the bench_serving command or create your own commands or scripts following the benchmarking example.
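When adjusting --tp for your setup, it can help to check how many sub-NUMA clusters the operating system reports. The Python sketch below assumes a Linux system on which each SNC appears as one NUMA node directory (`node0`, `node1`, ...) under `/sys/devices/system/node`; both function names are hypothetical.

```python
import re
from pathlib import Path


def count_numa_nodes(sysfs_node_dir="/sys/devices/system/node"):
    """Count NUMA nodes exposed by Linux (entries named node0, node1, ...).

    Assumes each sub-NUMA cluster is reported as one NUMA node.
    Returns 0 if the directory does not exist.
    """
    root = Path(sysfs_node_dir)
    if not root.is_dir():
        return 0
    return sum(
        1 for entry in root.iterdir() if re.fullmatch(r"node\d+", entry.name)
    )


def check_tp(tp, available_sncs):
    """Return tp if it is usable, otherwise raise ValueError."""
    if tp < 1:
        raise ValueError("tp must be at least 1")
    if tp > available_sncs:
        raise ValueError(
            f"tp={tp} exceeds the {available_sncs} available sub-NUMA clusters"
        )
    return tp
```

On a server, `check_tp(3, count_numa_nodes())` would confirm that a --tp 3 setting fits the machine before the service is launched.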