- Basic Call: Directly pass images and text.
- Processor Output: Use HuggingFace processor for data preprocessing.
- Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.
## Understanding the Three Input Formats

SGLang supports three ways to pass visual data, each optimized for different scenarios:

1. Raw Images - Simplest approach
- Pass PIL Images, file paths, URLs, or base64 strings directly
- SGLang handles all preprocessing automatically
- Best for: Quick prototyping, simple applications
2. Processor Output - For custom preprocessing
- Pre-process images with HuggingFace processor
- Pass the complete processor output dict with `format: "processor_output"`
- Best for: Custom image transformations, integration with existing pipelines
- Requirement: Must use `input_ids` instead of a text prompt
3. Precomputed Embeddings - For maximum performance
- Pre-calculate visual embeddings using the vision encoder
- Pass embeddings with `format: "precomputed_embedding"`
- Best for: Repeated queries on the same images, caching, high-throughput serving
- Performance gain: Avoids redundant vision encoder computation (30-50% speedup)
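The speedup comes from reusing the encoder output across requests. A minimal content-hash cache illustrating the idea (the `encode_fn` argument is a hypothetical stand-in for the model's vision encoder, not an SGLang API):

```python
import hashlib

_embedding_cache: dict = {}

def cached_embedding(image_bytes: bytes, encode_fn):
    """Return a cached visual embedding for an image, computing it at most once.

    encode_fn stands in for the vision encoder; caching by content hash is
    what lets repeated queries on the same image skip that computation.
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = encode_fn(image_bytes)
    return _embedding_cache[key]
```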
## Querying Qwen2.5-VL Model
### Basic Offline Engine API Call
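A minimal sketch of the raw-image path, assuming SGLang's offline `sgl.Engine` API and the Qwen2.5-VL chat template; the model path, image URL, and sampling parameters are placeholders, and the engine call needs a GPU and model weights, so it is shown commented out:

```python
import base64

def encode_image_base64(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes, one of the raw-image forms SGLang accepts."""
    return base64.b64encode(image_bytes).decode("utf-8")

# Requires sglang, a GPU, and the model weights:
# import sglang as sgl
#
# llm = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")
# out = llm.generate(
#     prompt="<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
#            "Describe this image.<|im_end|>\n<|im_start|>assistant\n",
#     image_data=["https://example.com/cat.jpg"],  # file path, URL, PIL image, or base64
#     sampling_params={"temperature": 0.0, "max_new_tokens": 64},
# )
# print(out["text"])
```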
### Call with Processor Output
Use a HuggingFace processor to preprocess text and images, then pass the processor output dict directly into `Engine.generate`.
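A sketch of this flow, following the `format: "processor_output"` convention described above; the wrapper helper is illustrative, and the model-loading and `generate` calls need a GPU and model weights, so they are shown commented out:

```python
def as_processor_output(processor_out: dict) -> dict:
    """Tag a HuggingFace processor output so SGLang treats it as preprocessed input.

    The "format": "processor_output" key follows the convention described
    above; this helper itself is illustrative, not an SGLang API.
    """
    return {**processor_out, "format": "processor_output"}

# Requires transformers, sglang, a GPU, and the model weights:
# from transformers import AutoProcessor
# import sglang as sgl
#
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# inputs = processor(text=[prompt], images=[pil_image], return_tensors="pt")
# llm = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")
# out = llm.generate(
#     input_ids=inputs["input_ids"][0].tolist(),  # input_ids replace the text prompt
#     image_data=[as_processor_output(dict(inputs))],
#     sampling_params={"max_new_tokens": 64},
# )
```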
