1. Model Introduction
Llama 3.1 is a collection of pretrained and instruction-tuned generative models released by Meta in July 2024. The models are available in 8B, 70B, and 405B sizes, with the 405B variant being the most capable openly available model at the time of release. These models bring open intelligence to all, with several new features and improvements:
- Stronger General Intelligence: These models showcase significant improvements in coding, state-of-the-art tool use, and overall stronger reasoning capabilities.
- Extended Context Length: Llama 3.1 extends the context length to 128K tokens to improve performance over long context tasks such as summarization and code reasoning.
- Tool Use: Llama 3.1 is trained to interact with a search engine, a Python interpreter, and a mathematical engine, and it also improves zero-shot tool use, allowing interaction with potentially unseen tools.
- Multilinguality: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to generate a launch command for the Llama 3.1 collection of models.
3.2 Configuration Tips
Speculative Decoding (NVIDIA GPUs):
- Use speculative decoding for latency-sensitive scenarios:
- --speculative-algorithm EAGLE3: Speculative decoding algorithm
- --speculative-num-steps 3: Number of speculative verification rounds
- --speculative-eagle-topk 1: Top-k sampling for draft tokens
- --speculative-num-draft-tokens 4: Number of draft tokens per step
- --speculative-draft-model-path: Path to the draft model weights; this can be a local folder or a Hugging Face repo ID such as yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
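Putting these flags together, an illustrative launch command might look like the following sketch; the 8B Instruct model path and the port are assumptions, so adjust them to your deployment:

```shell
# Illustrative: serve Llama 3.1 8B Instruct with EAGLE3 speculative decoding.
# Model path, draft model, and port are assumptions; adapt to your setup.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path yuhuili/EAGLE3-LLaMA3.1-Instruct-8B \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --port 30000
```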
- Hardware-Aware TP: MI355X (256GB memory) supports lower TP values compared to MI300X/MI325X (192GB)
- Verified TP Configurations:
- MI300X/MI325X: 405B BF16 (TP=8), 405B FP8 (TP=4), 70B/8B (TP=1)
- MI355X: 405B BF16 (TP=4), 405B FP8 (TP=2), 70B/8B (TP=1)
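As an example, the verified MI300X/MI325X entry for 405B FP8 translates into a launch command like this sketch (model path and port are assumptions):

```shell
# Illustrative: Llama 3.1 405B FP8 on MI300X/MI325X with the verified TP=4.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tp 4 \
  --port 30000
```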
- FP8 Model Variants:
- 405B: Use Meta’s official meta-llama/Llama-3.1-405B-Instruct-FP8
- 70B/8B: Use AMD’s optimized amd/Llama-3.1-{size}-Instruct-FP8-KV
- Tool Calling: Enable with --tool-call-parser llama3 for Instruct models
4. Model Invocation
4.1 Basic Usage
SGLang exposes an OpenAI-compatible endpoint. First, start the server, then send requests with any OpenAI-compatible client.
4.2 Advanced Usage
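The basic usage in Section 4.1 can be sketched as follows, assuming the server is running locally on SGLang's default port 30000 and serving meta-llama/Llama-3.1-8B-Instruct (both assumptions):

```shell
# Illustrative request to the OpenAI-compatible chat completions endpoint.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```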
4.2.1 Tool Calling
Llama 3.1 supports tool calling. First, start the server with the tool call parser enabled (--tool-call-parser llama3, see Section 3.2), then declare your tools in the request.
5. Benchmark
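The tool-calling flow of Section 4.2.1 can be sketched as follows; the model path, port, and the get_weather tool are illustrative assumptions, not values fixed by this guide:

```shell
# 1) Illustrative: start the server with the Llama 3 tool-call parser enabled.
#    python3 -m sglang.launch_server \
#      --model-path meta-llama/Llama-3.1-8B-Instruct \
#      --tool-call-parser llama3 --port 30000

# 2) Send a request that declares a hypothetical get_weather tool.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

The model should respond with a tool call naming get_weather and its arguments rather than a plain text answer.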
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA A100 GPU (8x)
- Model: meta-llama/Llama-3.1-70B
- Tensor Parallelism: 8
- sglang version: 0.5.6
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
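As an illustrative sketch of such a deployment command, derived from the test environment stated above (70B model, TP=8; port is an assumption):

```shell
# Illustrative deployment matching the benchmark environment (70B, TP=8).
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B \
  --tp 8 \
  --port 30000
```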
5.1.1.1 Low Concurrency
- Benchmark Command:
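A typical SGLang serving-benchmark invocation for a low-concurrency run looks like the sketch below; the request counts, sequence lengths, and concurrency value are assumptions, not the exact settings used for the results here:

```shell
# Illustrative: sglang.bench_serving at low concurrency (values are assumptions).
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 512 \
  --num-prompts 32 \
  --max-concurrency 4
```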
- Test Results:
5.1.1.2 Medium Concurrency
- Test Results:
5.1.1.3 High Concurrency
- Test Results:
5.1.2 Summarization Scenario Benchmark
5.1.2.1 Low Concurrency
5.1.2.2 Medium Concurrency
5.1.2.3 High Concurrency
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
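SGLang ships a few-shot GSM8K test that is commonly used for this kind of accuracy check; the question count and parallelism below are assumptions:

```shell
# Illustrative GSM8K accuracy check against a running SGLang server.
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --parallel 128
```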
- Results:
