
1. Model Introduction

The Qwen2.5-VL series is a family of vision-language models from the Qwen team, offering significant improvements over its predecessor in understanding, reasoning, and multi-modal processing. This generation delivers comprehensive upgrades across the board:
  • Enhanced Visual Understanding: Strong performance in document understanding, chart analysis, and scene recognition.
  • Improved Reasoning: Stronger logical reasoning and mathematical problem-solving in multi-modal contexts.
  • Multiple Sizes: Available in 3B, 7B, 32B, and 72B variants to suit different deployment needs.
  • ROCm Support: Compatible with AMD MI300X GPUs via SGLang (verified).
For more details, please refer to the official Qwen2.5-VL collection.

2. SGLang Installation

SGLang supports several installation methods; choose the one that best fits your hardware platform and requirements. See the official SGLang installation guide for detailed instructions.
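As a quick orientation, a typical install uses pip, while ROCm/MI300X deployments usually start from SGLang's prebuilt Docker images. The commands below are a sketch; the Docker tag is a placeholder, so check the installation guide for the current image name.

```shell
# Generic install via pip (CUDA platforms)
pip install "sglang[all]"

# For AMD ROCm (MI300X), the prebuilt Docker images are recommended.
# <rocm-tag> is a placeholder; see the installation guide for current tags.
docker pull lmsysorg/sglang:<rocm-tag>
```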

3. Model Deployment

This section provides deployment configurations optimized for AMD MI300X hardware platforms and different use cases.

3.1 Basic Configuration

The Qwen2.5-VL series offers models in several sizes, all of which have been verified on AMD MI300X GPUs. Choose the deployment command that matches your hardware platform and model size.
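As an illustration, a minimal single-GPU launch of the 7B model might look like the following (the port matches the client examples in Section 4; adjust the model path and port for your environment):

```shell
# Launch an OpenAI-compatible SGLang server for Qwen2.5-VL-7B
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```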

3.2 Configuration Tips

  • Memory Management: For the 72B model on MI300X, we have verified successful deployment with --context-length 128000. Smaller context lengths can be used to reduce memory usage if needed.
  • Multi-GPU Deployment: Use Tensor Parallelism (--tp) to scale across multiple GPUs. For example, use --tp 8 for the 72B model and --tp 2 for the 32B model on MI300X.
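Combining these tips, a sketch of a 72B launch across 8 MI300X GPUs, using the flags described in the bullets above:

```shell
# 72B model: 8-way tensor parallelism with a 128K context window
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-72B-Instruct \
  --tp 8 \
  --context-length 128000 \
  --port 30000
```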

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to the SGLang documentation on its OpenAI-compatible APIs.

4.2 Advanced Usage

4.2.1 Multi-Modal Inputs

Qwen2.5-VL supports image inputs. Here’s a basic example with image input:
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=messages,
    max_tokens=2048
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
Example Output:
Response time: 2.31s
Generated text: Auntie Anne's

CINNAMON SUGAR
1 x 17,000
SUB TOTAL
17,000

GRAND TOTAL
17,000

CASH IDR
20,000

CHANGE DUE
3,000
Multi-Image Input Example: Qwen2.5-VL can process multiple images in a single request for comparison or analysis:
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.civitatis.com/f/china/hong-kong/guia/taxi.jpg"
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.cheapoguides.com/wp-content/uploads/sites/7/2025/05/GettyImages-509614603-1280x600.jpg"
                }
            },
            {
                "type": "text",
                "text": "Compare these two images and describe the differences in 100 words or less."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=messages,
    max_tokens=2048
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
Example Output:
Response time: 13.79s
Generated text: The first image shows a single red taxi driving on a street with a few other taxis in the background. The second image shows a large number of taxis parked in a lot, with some appearing to be in various states of repair. The first image has a single taxi with a visible license plate, while the second image has multiple taxis with different license plates. The first image has a clear view of the street and surrounding area, while the second image is taken from an elevated perspective, showing a wider view of the parking lot and the surrounding area.
Note:
  • You can also provide local image files using the file:// protocol.
  • Larger images may require more GPU memory; adjust --mem-fraction-static accordingly.
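As a sketch of the local-file option above, here is a small helper (hypothetical, not part of SGLang or the OpenAI SDK) that packages a local image either as a base64 data URL, which OpenAI-compatible endpoints commonly accept, or as a file:// URL for when the file is visible to the server process:

```python
import base64
import mimetypes
from pathlib import Path


def local_image_content(path: str, as_data_url: bool = True) -> dict:
    """Build an image_url content entry from a local image file.

    as_data_url=True embeds the image as a base64 data URL;
    as_data_url=False emits a file:// URL instead.
    """
    p = Path(path).resolve()
    if as_data_url:
        mime = mimetypes.guess_type(p.name)[0] or "image/png"
        encoded = base64.b64encode(p.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    else:
        url = p.as_uri()  # e.g. file:///absolute/path/to/image.png
    return {"type": "image_url", "image_url": {"url": url}}
```

The returned dict drops directly into the content list of a message, alongside the text entry, exactly as in the examples above.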