Why HiCache Matters
SGLang HiCache extends the traditional RadixAttention with a three-tier hierarchical KV caching system that dramatically improves performance for long-context and multi-turn conversation scenarios. By intelligently managing KV caches across GPU memory, host memory, and external storage backends, HiCache addresses the fundamental capacity bottleneck that limits cache hit rates in conventional systems.
Configuration Guidelines
Core HiCache Parameters
- Besides configuring --hicache-storage-backend at startup, SGLang also supports runtime attach/detach of the HiCache storage backend (no restart required) via HTTP admin endpoints. See Runtime Attach/Detach HiCache Storage Backend.
Key Configurations with Storage Backends Enabled
Memory Layout Optimization
- page_first: Only compatible with the kernel I/O backend; automatically switches to layer_first with the direct backend
- page_first_direct: Specifically designed for the direct I/O backend, with optimized memory organization
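The fallback rule above can be expressed as a small check. This is only an illustration of the stated rule: `resolve_mem_layout` is a hypothetical helper, not a function provided by SGLang, and it models nothing beyond the page_first case described here.

```python
def resolve_mem_layout(layout: str, io_backend: str) -> str:
    """Return the layout expected to take effect for a given I/O backend.

    Hypothetical helper that mirrors the documented rule only:
    page_first requires the kernel backend and falls back to
    layer_first when the direct backend is selected.
    """
    if layout == "page_first" and io_backend == "direct":
        return "layer_first"
    # Every other combination is passed through unchanged in this sketch.
    return layout


effective = resolve_mem_layout("page_first", "direct")
```

Checking the effective layout up front can help explain why a requested page_first setting does not appear to be in use.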
Prefetch Policies
Integration with PD Disaggregation
HiCache works seamlessly with PD Disaggregation. You can choose between two configurations:
- Prefill-only HiCache: Enable HiCache only on Prefill nodes, allowing KV cache sharing among Prefill instances
- Full HiCache with async offloading: Enable HiCache on Prefill nodes and async KV cache offloading on Decode nodes, allowing Prefill nodes to reuse KV caches from Decode nodes in multi-turn dialogue scenarios
Deployment with HF3FS
Here is an example of deploying DeepSeek-R1 with HiCache-HF3FS. For more details, see the HF3FS Documentation.
Deployment with Mooncake
Here is an example of deploying Qwen3-235B-A22B-Instruct-2507 with Mooncake. For more details, see the Mooncake Documentation.
Custom Storage Backend Integration
To integrate a new storage backend:
- Implement three core methods:
  - get(key): Retrieve value by key
  - exists(key): Check key existence
  - set(key, value): Store key-value pair
- Register your backend: Add your storage backend to the HiCache BackendFactory
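As a rough illustration of the three core methods, here is a minimal in-memory sketch. The class name `InMemoryKVBackend`, the `str`/`bytes` types, and the boolean return value of `set` are all assumptions made for this example; a real backend would need to match the base class and signatures that HiCache actually expects, and would talk to an external store instead of a dictionary.

```python
from typing import Dict, Optional


class InMemoryKVBackend:
    """Toy backend exposing get, exists, and set (hypothetical sketch)."""

    def __init__(self) -> None:
        # A plain dict stands in for the external storage system.
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Return the stored value, or None on a miss.
        return self._store.get(key)

    def exists(self, key: str) -> bool:
        # Check key existence without fetching the value.
        return key in self._store

    def set(self, key: str, value: bytes) -> bool:
        # Store the pair and report success (assumed convention).
        self._store[key] = value
        return True


backend = InMemoryKVBackend()
backend.set("page-0", b"kv-bytes")
```

Keeping `exists` separate from `get` lets a caller probe for a cached page without paying the cost of transferring its contents.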
Dynamic Backend Loading
Alternatively, you can use dynamic loading to avoid hard-coding your backend in the repository:
- --hicache-storage-backend: Set to dynamic
- --hicache-storage-backend-extra-config: JSON configuration with:
  - backend_name: Custom backend identifier
  - module_path: Python module path to your implementation
  - class_name: Your HiCache implementation class name
  - interface_v1: 0 (disable) or 1 (enable) to control usage of batch_get_v1 and batch_set_v1 methods
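One way to avoid quoting mistakes in the JSON argument is to generate it programmatically. The sketch below builds the two flags described above; the values `my_backend`, `my_package.my_backend`, and `MyHiCacheStorage` are placeholders invented for illustration, and the rest of the launch command is intentionally left out.

```python
import json
import shlex

# Keys come from the list above; the values are hypothetical placeholders.
extra_config = {
    "backend_name": "my_backend",
    "module_path": "my_package.my_backend",
    "class_name": "MyHiCacheStorage",
    "interface_v1": 0,  # 0 disables, 1 enables batch_get_v1 / batch_set_v1
}

args = [
    "--hicache-storage-backend", "dynamic",
    "--hicache-storage-backend-extra-config", json.dumps(extra_config),
]

# Shell-quote each argument so the JSON survives being pasted into a terminal.
cli_fragment = " ".join(shlex.quote(a) for a in args)
print(cli_fragment)
```

The printed fragment can be appended to your usual server launch command.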
Community and Support
- GitHub Issues: Report bugs and feature requests
- Slack Channel: Join community discussions in #sgl-kv-cache-store
- Documentation: Refer to storage backend-specific guides
This document will be continuously updated based on community feedback and new features. Contributions and suggestions are welcome!
