SGLang can fall back to models that are available in Transformers. This works for most decoder-style language models, and support for vision-language models is coming soon!
Example launch command
By default, we use the SGLang implementation if it is available. Otherwise, we fall back to the Transformers one. However, you can force the implementation by setting --model-impl to transformers.
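For example, a launch sketch that forces the fallback (the model path and port here are illustrative, not prescriptive):

```shell
# Force the Transformers fallback explicitly; drop --model-impl to let
# SGLang pick its native implementation when one exists.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-0.5B-Instruct \
  --model-impl transformers \
  --port 30000
```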
Supported features
Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the Quantization page for more information about quantization support in SGLang.

Remote code
This fallback also means that any model on the Hub that can be used in transformers with trust_remote_code=True, and that correctly implements attention, can be used in production!
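Concretely, a remote-code model ships its modeling files next to a config.json whose auto_map points at the custom classes; a minimal sketch (the module and class names here are illustrative):

```json
{
  "model_type": "my_model",
  "auto_map": {
    "AutoConfig": "configuration_my_model.MyConfig",
    "AutoModelForCausalLM": "modeling_my_model.MyModel"
  }
}
```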
Here is what happens behind the scenes:
- The config is loaded.
- The `MyModel` Python class is loaded from the `auto_map`, and we check that the model supports the attention backend (`_supports_attention_backend`).
- The `TransformersModel` backend is used. See /srt/models/transformers, which leverages `self.config._attn_implementation = "sglang"`, hence the need to use `ALL_ATTENTION_FUNCTIONS`.
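The mechanism in the last step can be sketched in plain Python (no torch or transformers here; the registry and names only mirror the real `ALL_ATTENTION_FUNCTIONS` machinery, and the function bodies are placeholders):

```python
# Illustrative registry of attention implementations keyed by name,
# mimicking how transformers dispatches via config._attn_implementation.
ALL_ATTENTION_FUNCTIONS = {}

def register_attention(name):
    def deco(fn):
        ALL_ATTENTION_FUNCTIONS[name] = fn
        return fn
    return deco

@register_attention("eager")
def eager_attention(q, k, v):
    # Placeholder for the default Transformers attention path.
    return "eager-attention-output"

@register_attention("sglang")
def sglang_attention(q, k, v):
    # Placeholder for SGLang's attention backend.
    return "sglang-attention-output"

class Config:
    _attn_implementation = "eager"

class MyModel:
    def __init__(self, config):
        self.config = config

    def forward(self, q, k, v):
        # The model looks up its attention function by name at call time,
        # which is why overriding config._attn_implementation swaps backends
        # without touching the model code itself.
        attn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn(q, k, v)

config = Config()
model = MyModel(config)
print(model.forward(None, None, None))  # eager path

# This is the trick the fallback relies on: setting the name to "sglang"
# routes the same model through SGLang's attention kernels.
config._attn_implementation = "sglang"
print(model.forward(None, None, None))  # sglang path
```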
