
About SpecBundle
Speculative decoding, especially EAGLE3, offers strong theoretical guarantees alongside consistent empirical improvements in token acceptance rate and end-to-end inference speed. However, despite these advances, adoption of speculative decoding, especially EAGLE3, remains limited in the open-source ecosystem, primarily due to three key factors.
- Lack of production-ready training infrastructure: Existing speculative decoding toolchains are largely research prototypes, offering limited system-level optimization and inadequate support for diverse architectures and large-scale models.
- Scarcity of high-quality draft models: Effective speculative decoding depends on strong draft models, yet publicly available EAGLE3-compatible checkpoints are extremely limited, primarily originating from the original authors.
- Insufficient training scale of existing drafts: Most available draft models are trained on small or curated datasets and fail to generalize to the large, diverse corpora used in modern LLM training, resulting in low token acceptance rates and diminished practical speedups.
Installation
Usage
Launch SGLang Server with SpecBundle models
You can use the following command to launch the SGLang server with SpecBundle models. Add the --tp, --ep, and --mem-fraction-static arguments if you encounter memory issues.
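As a rough sketch, a launch command might look like the following. The target model, the draft-model path, and the speculative-decoding values are illustrative placeholders, not recommended defaults; consult the SGLang server documentation for the full set of speculative-decoding flags.

```shell
# Launch SGLang with an EAGLE3 draft model.
# Model paths and speculative parameters below are placeholders.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <path-to-specbundle-draft> \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --tp 1 \
  --mem-fraction-static 0.8 \
  --port 30000
```

Lower --mem-fraction-static or raise --tp if the server runs out of GPU memory at startup.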
Use SpecBundle to compare the performance of Speculative Decoding draft models
We provide a benchmark suite to evaluate the performance of SpecBundle draft models here. Example:
- Launch a SGLang Server
- Use the benchmark suite to evaluate the performance of SpecBundle draft models
bench_eagle3.py can launch a SGLang server process and a benchmarking process concurrently. This way, you don't have to launch the SGLang server yourself; the script automatically handles the SGLang launch under different speculative decoding configurations. Some important arguments are:
- --model-path: the path to the target model.
- --speculative-draft-model-path: the path to the draft model.
- --port: the port on which to launch the SGLang server.
- --trust-remote-code: trust the remote code.
- --mem-fraction-static: the memory fraction for the static memory.
- --tp-size: the tensor parallelism size.
- --attention-backend: the attention backend.
- --config-list: the list of speculative decoding configurations to test, in the format <batch-size>,<num-steps>,<topk>,<num-draft-tokens>.
- --benchmark-list: the list of benchmarks to test, in the format <benchmark-name>:<num-prompts>:<subset>.
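Putting the arguments together, a hypothetical invocation might look like the one below. The model paths, benchmark name, prompt count, subset, and configuration tuples are placeholders chosen to match the formats above, not values shipped with SpecBundle.

```shell
# Sweep two speculative configurations (<batch-size>,<num-steps>,<topk>,<num-draft-tokens>)
# against one benchmark. All paths and benchmark values are illustrative.
python bench_eagle3.py \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-draft-model-path <path-to-specbundle-draft> \
  --port 30000 \
  --tp-size 1 \
  --mem-fraction-static 0.8 \
  --trust-remote-code \
  --config-list 1,3,1,4 8,3,1,4 \
  --benchmark-list mtbench:80:default
```

Each entry in --config-list produces one server launch and one benchmark run, so a sweep over several tuples can take a while to complete.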
