Quickstart-SGLang#
This document describes how to install unified-cache-management with SGLang on the CUDA platform.
Prerequisites#
SGLang >= v0.5.9, device=cuda
Step 1: UCM Installation#
We offer 3 options to install UCM.
Option 1: Setup from docker#
Official pre-built image#
docker pull unifiedcachemanager/ucm-sglang:latest
Then run your container using the following command.
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --rm \
--gpus all \
--network=host \
--ipc=host \
-v <path_to_your_models>:/home/model \
-v <path_to_your_storage>:/home/storage \
--name <name_of_your_container> \
-it unifiedcachemanager/ucm-sglang:latest
Build image from source#
Download the pre-built lmsysorg/sglang:v0.5.9 docker image and build the unified-cache-management docker image with the commands below:
# Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
docker build -t ucm-sglang:latest -f ./docker/Dockerfile.ucm-sglang-cuda-v0.5.5 ./
The Dockerfile automatically invokes the build script (scripts/build_sglang.sh) to compile the wheel, installs from the built package, and applies the SGLang integration patch.
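After the build finishes, you can optionally run a quick sanity check that UCM is importable inside the new image. This is only a sketch; it overrides the entrypoint in case the base image defines one, and assumes the package is installed as the ucm module referenced in the configuration later in this guide:
docker run --rm --entrypoint python3 ucm-sglang:latest \
-c "from ucm.integration.sglang.unifiedcache_store import UnifiedCacheStore; print('UCM import OK')"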
Build image from pre-built package#
If you have a pre-built tar package (e.g. from CI), extract it and build the image in package mode:
mkdir -p /tmp/ucm-pkg && tar xzf AI-Storage-Kit_*.tar.gz -C /tmp/ucm-pkg
docker build --build-arg INSTALL_MODE=package \
-t ucm-sglang:latest -f /tmp/ucm-pkg/docker/Dockerfile.ucm-sglang-cuda-v0.5.5 /tmp/ucm-pkg
Option 2: Build from source#
Prepare SGLang Environment
For the sake of environment isolation and simplicity, we recommend preparing the SGLang environment by pulling the official, pre-built SGLang Docker image.
docker pull lmsysorg/sglang:v0.5.9
Use the following command to run your own container:
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run \
--gpus all \
--network=host \
--ipc=host \
-v <path_to_your_models>:/home/model \
-v <path_to_your_storage>:/home/storage \
--entrypoint /bin/bash \
--name <name_of_your_container> \
-it lmsysorg/sglang:v0.5.9
Refer to Using docker for more information on running your own SGLang container.
Build from source code
Follow commands below to install unified-cache-management:
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=cuda
pip install -v -e . --no-build-isolation
Option 3: Install by pip#
Prepare SGLang Environment
It is recommended to use a pre-built SGLang docker image; follow the guide in Option 2 to prepare one.
Install by pip
Install via pip, or find the pre-built wheels on PyPI.
export PLATFORM=cuda
pip install uc-manager
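To confirm the installation, you can run a quick import check. This is a minimal, optional check that assumes the package exposes the ucm module used in the configuration below:
python3 -c "from ucm.integration.sglang.unifiedcache_store import UnifiedCacheStore; print('UCM import OK')"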
Step 2: Configuration#
Feature : Prefix Caching#
UCM configuration is passed to SGLang via --hicache-storage-backend-extra-config in JSON format:
HICACHE_CONFIG='{
"backend_name":"unifiedcache",
"module_path":"ucm.integration.sglang.unifiedcache_store",
"class_name":"UnifiedCacheStore",
"interface_v1":1,
"kv_connector_extra_config":{
"ucm_connector_name":"UcmPipelineStore",
"ucm_connector_config":{
"storage_backends":"/mnt/test"
}
}
}'
Note: Replace /mnt/test with your actual storage directory.
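Before launching, it can help to confirm that the storage directory exists and that the configuration is valid JSON. The snippet below is a small optional check, using the /mnt/test path from the example above:
mkdir -p /mnt/test
echo "$HICACHE_CONFIG" | python3 -m json.tool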
Step 3: Launching Inference#
Offline Inference
SGLang already provides an offline batch inference example. No UCM-specific code changes are required; just pass the same hierarchical-cache flags used for the server.
# Prefix cache config (reuse from Step 2)
HICACHE_CONFIG='{
"backend_name":"unifiedcache",
"module_path":"ucm.integration.sglang.unifiedcache_store",
"class_name":"UnifiedCacheStore",
"interface_v1":1,
"kv_connector_extra_config":{
"ucm_connector_name":"UcmPipelineStore",
"ucm_connector_config":{
"storage_backends":"/mnt/test"
}
}
}'
python3 /path/to/sglang/examples/runtime/engine/offline_batch_inference.py \
--model-path Qwen/Qwen2.5-14B-Instruct \
--tensor-parallel-size 2 \
--page-size 128 \
--trust-remote-code \
--enable-hierarchical-cache \
--hicache-mem-layout page_first \
--hicache-write-policy write_through \
--hicache-storage-backend dynamic \
--hicache-storage-prefetch-policy wait_complete \
--hicache-storage-backend-extra-config "$HICACHE_CONFIG"
⚠️ Make sure to replace Qwen/Qwen2.5-14B-Instruct with your actual model path or HF repo ID.
⚠️ Make sure to replace /mnt/test (inside HICACHE_CONFIG) with your actual storage directory.
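After the run completes, you can check that KV blocks were actually persisted by listing the storage directory. The exact file layout depends on the connector, but the directory should no longer be empty:
ls -lh /mnt/test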
OpenAI-Compatible Online API
To start the SGLang server with the Qwen/Qwen2.5-14B-Instruct model, run:
# Prefix cache config (reuse from Step 2)
HICACHE_CONFIG='{
"backend_name":"unifiedcache",
"module_path":"ucm.integration.sglang.unifiedcache_store",
"class_name":"UnifiedCacheStore",
"interface_v1":1,
"kv_connector_extra_config":{
"ucm_connector_name":"UcmPipelineStore",
"ucm_connector_config":{
"storage_backends":"/mnt/test"
}
}
}'
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-14B-Instruct \
--tensor-parallel-size 2 \
--page-size 128 \
--port 7800 \
--trust-remote-code \
--enable-hierarchical-cache \
--hicache-mem-layout page_first \
--hicache-write-policy write_through \
--hicache-storage-backend dynamic \
--hicache-storage-prefetch-policy wait_complete \
--hicache-storage-backend-extra-config "$HICACHE_CONFIG"
⚠️ Make sure to replace Qwen/Qwen2.5-14B-Instruct with your actual model path or HF repo ID.
⚠️ Make sure to replace /mnt/test (inside HICACHE_CONFIG) with your actual storage directory.
If you see logs like:
INFO: Started server process [32890]
INFO: Waiting for application startup.
INFO: Application startup complete.
Then you can interact with the API:
curl http://localhost:7800/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-14B-Instruct",
"prompt": "Hello!",
"max_tokens": 64,
"temperature": 0
}'
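The server also exposes the OpenAI-compatible chat endpoint, so the same deployment can be queried with a chat-style request (same model name and port as above):
curl http://localhost:7800/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-14B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64,
"temperature": 0
}'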