1. Model Introduction
Llama-3.3-70B-Instruct is Meta’s latest 70-billion-parameter instruction-tuned language model, featuring improved performance and efficiency over Llama 3.1. With a 128K token context window and enhanced capabilities across reasoning, coding, and multilingual tasks, Llama 3.3 delivers state-of-the-art results while remaining accessible for production deployment.
Key Features:
- Enhanced Performance: Improved instruction following, reasoning, and task completion over Llama 3.1
- Tool Calling: Native support for function calling and tool use scenarios
- Multilingual Support: Optimized for 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai)
- Extended Context: 128K token context window for processing long documents and complex tasks
- Efficient Deployment: 70B parameters enable deployment on a single AMD MI300X GPU
2. SGLang Installation
Please refer to the official SGLang installation guide for installation instructions.
3. Model Deployment
This section provides deployment configurations optimized for AMD GPUs (MI300X, MI325X, MI355X).
3.1 Interactive Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your AMD GPU setup.
3.2 Configuration Tips
AMD GPU Deployment:
- All AMD GPUs (MI300X, MI325X, MI355X) support TP=1 for both BF16 and FP8 variants
- FP8 Model Variant: Use AMD’s optimized amd/Llama-3.3-70B-Instruct-FP8-KV
- Tool Calling: Enable with --tool-call-parser llama3 for function calling support
- Higher Throughput: Optional TP=2 or TP=4 can be used for increased throughput
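Putting the tips above together, a representative launch command for a single AMD GPU is sketched below. The model path and port are assumptions based on a default local deployment; verify flag names against your SGLang version.

```shell
# Launch Llama 3.3 70B (BF16) on one AMD GPU with tool calling enabled.
# Model path and port are assumptions -- check `python3 -m sglang.launch_server --help`.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 1 \
  --tool-call-parser llama3 \
  --port 30000
```

For the FP8 variant, substitute amd/Llama-3.3-70B-Instruct-FP8-KV as the model path; for higher throughput, raise --tp to 2 or 4 as noted above.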
4. Model Invocation
4.1 Basic Usage
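As a concrete starting point, here is a minimal sketch of a chat-completion request against SGLang's OpenAI-compatible endpoint. The host, port, and model name are assumptions based on a default local deployment from Section 3.

```python
import json
from urllib import request

# Assumed local endpoint; SGLang serves an OpenAI-compatible API (default port 30000).
URL = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed model path
    "messages": [
        {"role": "user", "content": "Summarize the key features of Llama 3.3."}
    ],
    "max_tokens": 256,
}

req = request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server from Section 3 is running:
# with request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```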
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Tool Calling
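Assuming the server was launched with --tool-call-parser llama3 (see Section 3.2), a function-calling request can be sketched as follows. The endpoint, model name, and the get_weather tool schema are illustrative assumptions.

```python
import json
from urllib import request

URL = "http://localhost:30000/v1/chat/completions"  # assumed local endpoint

# Hypothetical tool definition, expressed in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed model path
    "messages": [{"role": "user", "content": "What's the weather in Austin?"}],
    "tools": tools,
    "tool_choice": "auto",
}

req = request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running; the parsed call, if any, appears under
# choices[0]["message"]["tool_calls"] in the response:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"].get("tool_calls"))
```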
Llama 3.3 70B Instruct supports native tool calling; enable the tool parser (--tool-call-parser llama3) during deployment.
4.2.2 Long Context Processing
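The 128K token window means an entire long document can go into a single prompt. A sketch of such a request follows; the endpoint and model name are assumptions, and the document here is synthetic.

```python
import json
from urllib import request

URL = "http://localhost:30000/v1/chat/completions"  # assumed local endpoint

# Stand-in for a real long document (in practice, read it from a file).
long_document = "Section 1. ...\n" * 2000

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed model path
    "messages": [
        {"role": "system", "content": "You summarize long documents faithfully."},
        {"role": "user",
         "content": f"Summarize the following document:\n\n{long_document}"},
    ],
    "max_tokens": 1024,
}

req = request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```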
Leverage the 128K context window for processing long documents.
5. Benchmarking
Use the SGLang benchmarking suite to test model performance under different workload patterns.
5.1 Basic Benchmark Command
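For reference, a baseline run against the server from Section 3 might look like the sketch below. The flag spellings follow the parameters discussed in Section 5.2, but verify them against your SGLang version.

```shell
# Benchmark the running server with a random synthetic workload.
# Flag spellings are assumptions -- check `python3 -m sglang.bench_serving --help`.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input 1024 \
  --random-output 1024 \
  --num-prompts 500 \
  --max-concurrency 16
```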
5.2 Adjusting Benchmark Parameters
Input/Output Length: Adjust --random-input and --random-output to test different workload patterns:
- Short conversations: --random-input 1024 --random-output 1024
- Long outputs: --random-input 1024 --random-output 8192
- Long inputs: --random-input 8192 --random-output 1024
Concurrency: Adjust --max-concurrency to test different load scenarios:
- Low concurrency (latency-focused): --max-concurrency 1 --num-prompts 100
- Medium concurrency (balanced): --max-concurrency 16 --num-prompts 1000
- High concurrency (throughput-focused): --max-concurrency 100 --num-prompts 2000
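Combining the knobs above, a throughput-focused, long-output stress run might look like this (a sketch; the same caveats about flag spellings apply):

```shell
# High-concurrency, long-output scenario -- verify flag names for your SGLang version.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input 1024 \
  --random-output 8192 \
  --num-prompts 2000 \
  --max-concurrency 100
```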
