# Quickstart: vLLM-Ascend

This document describes how to install unified-cache-management with vllm-ascend on the Ascend platform.
## Prerequisites

- vllm-ascend >= v0.9.1 (vllm == 0.9.2 to use the sparse feature)

Please refer to the vLLM-Ascend Installation guide to meet the required dependencies, and prepare the corresponding version of the vllm-ascend environment as needed.
## Step 1: UCM Installation

We offer three options for installing UCM.
### Option 1: Build from source

1. Follow the commands below to install unified-cache-management from source:

Note: The sparse module is not compiled by default. To enable it, set the environment variable `export ENABLE_SPARSE=TRUE` before you build.

```bash
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=ascend
pip install -v -e . --no-build-isolation
cd ..
```
Note: For the Atlas A3 series, the `PLATFORM` variable should be set to `ascend-a3`.
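For example, a sparse-enabled build from the same checkout might look like this (all flags as documented above):

```bash
cd unified-cache-management
export PLATFORM=ascend        # or ascend-a3 for the Atlas A3 series
export ENABLE_SPARSE=TRUE     # compile the optional sparse module
pip install -v -e . --no-build-isolation
```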
2. Apply the vLLM and vLLM-Ascend integration patches (not required for versions >= v0.17.0rc1). To enable Unified Cache Management (UCM) integration, you need to apply patches to both the vLLM and vLLM-Ascend source trees.
#### Option A: Monkey Patch (Recommended)

This method enables UCM features dynamically at runtime via environment variables, requiring no source code modifications.

Enable the monkey patch:

```bash
export ENABLE_UCM_PATCH=1
```

Note: Enabling ENABLE_UCM_PATCH is required to use the Prefix Caching feature with UCM.

Enable sparse attention (supported on v0.11.0):

```bash
export ENABLE_SPARSE=1
```
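Putting the two together, a shell session that enables both features before launching vLLM might look like this (the serve command is a placeholder; a full launch example is given in Step 3):

```bash
export ENABLE_UCM_PATCH=1   # required for Prefix Caching with UCM
export ENABLE_SPARSE=1      # optional: sparse attention (v0.11.0 only)
vllm serve <your_model>     # placeholder; see Step 3 for full options
```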
#### Option B: Manual Git Patch (Legacy/Alternative)

If you prefer modifying the source code directly, follow these steps:

Step 1: Apply the vLLM patch

First, apply the standard vLLM integration patch in the vLLM source directory:

```bash
cd <path_to_vllm>
# Replace <vLLM_VERSION> with 0.9.2 or 0.11.0
git apply <patch_to_ucm>/ucm/integration/vllm/patch/<vLLM_VERSION>/vllm-adapt.patch
```

Step 2: Apply the vLLM-Ascend patch

Then, switch to the vLLM-Ascend source directory and apply the Ascend-specific patch:

```bash
cd <path_to_vllm_ascend>
# Replace <vLLM_VERSION> with 0.9.2 or 0.11.0
git apply <patch_to_ucm>/ucm/integration/vllm/patch/<vLLM_VERSION>/vllm-ascend-adapt.patch
```
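If you want to verify that a patch applies cleanly before it modifies the tree, `git apply --check` performs a dry run (the same works for the Ascend patch):

```bash
# Dry run: report whether the patch applies cleanly, without changing any files
git apply --check <patch_to_ucm>/ucm/integration/vllm/patch/<vLLM_VERSION>/vllm-adapt.patch
```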
Note: The ReRoPE algorithm is not supported on Ascend at the moment. Only the standard UCM integration is applicable for vLLM-Ascend.
### Option 2: Install via pip

Install via pip, or find the pre-built wheels on PyPI.

```bash
export PLATFORM=ascend
pip install uc-manager
```
Note: If installing via `pip install`, you need to manually add the `config.yaml` file, similar to unified-cache-management/examples/ucm_config_example.yaml, because PyPI packages do not include YAML files.
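One way to obtain the example file without cloning the whole repository is to download it directly (this assumes the file lives on the repository's default branch; adjust the ref if needed):

```bash
curl -LO https://raw.githubusercontent.com/ModelEngine-Group/unified-cache-management/main/examples/ucm_config_example.yaml
```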
### Option 3: Setup from Docker

#### Build image from source

Use the following command to build UCM with vLLM-Ascend (v0.17.0rc1):

```bash
docker build -t ucm-vllm:latest -f ./docker/Dockerfile.ucm-vllm-ascend.a2-v0.17.0 ./
```

For vLLM-Ascend (v0.11.0) with sparse attention support:

```bash
docker build -t ucm-vllm-sparse:latest -f ./docker/Dockerfile.ucm-vllm-ascend.a2-v0.11.0 ./
```

The Dockerfile automatically invokes the build script (scripts/build_ascend.sh) to compile the wheel and install from the built package.
#### Build image from pre-built package

If you have a pre-built tar package (e.g. from CI), extract it and build the image in package mode:

```bash
mkdir -p /tmp/ucm-pkg && tar xzf AI-Storage-Kit_*.tar.gz -C /tmp/ucm-pkg
docker build --build-arg INSTALL_MODE=package \
  -t ucm-vllm:latest -f /tmp/ucm-pkg/docker/Dockerfile.ucm-vllm-ascend.a2-v0.17.0 /tmp/ucm-pkg
```

vllm-ascend provides two base-image variants: Ubuntu and openEuler. Dockerfile.ucm-vllm-ascend.a2-v0.17.0 uses the Ubuntu variant by default. To use the openEuler variant, override the base image with `--build-arg IMAGE_NAME_VERSION`, e.g. quay.io/ascend/vllm-ascend:v0.11.0-openeuler.
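For example, an openEuler-based build might look like the following (the base-image tag is the one given above; check the Dockerfile's ARG declaration to confirm the variable applies to your version):

```bash
docker build --build-arg IMAGE_NAME_VERSION=quay.io/ascend/vllm-ascend:v0.11.0-openeuler \
  -t ucm-vllm:latest -f ./docker/Dockerfile.ucm-vllm-ascend.a2-v0.17.0 ./
```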
Then run your container using the following command. You can add or remove Docker parameters as needed.

```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
docker run --rm \
  --network=host \
  --device $DEVICE \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -v <path_to_your_models>:/app/model \
  -v <path_to_your_storage>:/app/storage \
  --name <name_of_your_container> \
  -it <image_id> bash
```
## Step 2: Configuration

### Features Overview

UCM supports two key features: Prefix Caching and Sparse Attention. Each feature supports both offline inference and the OpenAI-compatible online API mode; see the respective feature documents for more details.

For a quick start, just follow the guide below to launch your own inference experience.
### Feature 1: Prefix Caching

You may directly edit the example file at unified-cache-management/examples/ucm_config_example.yaml. For more details, please refer to the Prefix Cache with NFS Store and Prefix Cache with Pipeline Store documents.

⚠️ Make sure to replace /mnt/test with your actual storage directory (one way to do this from the shell is shown below).
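For example (the replacement path is illustrative):

```bash
# Point the example config at your own storage directory
sed -i 's|/mnt/test|/path/to/your/storage|g' examples/ucm_config_example.yaml
```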
### Feature 2: Sparsity

The sparse module is not compiled by default. To enable it, set the environment variable `export ENABLE_SPARSE=TRUE` and re-compile the code you built, then uncomment the `ucm_sparse_config` code block in unified-cache-management/examples/ucm_config_example.yaml. Additionally, if you want to run GSA, you also need to set the environment variable `export VLLM_HASH_ATTENTION=1`. A combined sketch is shown below.
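Putting those steps together, a sparse-enabled setup might look like this (the build commands repeat Option 1 above and assume you are inside the source checkout):

```bash
export ENABLE_SPARSE=TRUE                   # compile the sparse module
pip install -v -e . --no-build-isolation    # re-compile UCM
export VLLM_HASH_ATTENTION=1                # only needed when running GSA
# ...then uncomment the ucm_sparse_config block in your config YAML
```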
## Step 3: Launching Inference

**Offline Inference**

In the examples/ directory, you will find the offline_inference.py script used for offline inference. Before executing the script, locate line 25 and replace the UCM_CONFIG_FILE value with the path to your own configuration file.
```python
from vllm import LLM
from vllm.config import KVTransferConfig


def build_llm_with_uc(module_path: str, name: str, model: str):
    ktc = KVTransferConfig(
        kv_connector=name,
        kv_connector_module_path=module_path,
        kv_role="kv_both",
        kv_connector_extra_config={
            # Line 25 of the script: replace with your own config file path
            "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
        },
    )
    # The script then builds the LLM from this config, roughly as follows
    # (other constructor arguments in the actual script may differ):
    return LLM(model=model, kv_transfer_config=ktc)
```
Then run the following commands:

```bash
cd examples/
# Change the model path to your own model path
python offline_inference.py
```
**OpenAI-Compatible Online API**

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.87 \
  --block_size 128 \
  --trust-remote-code \
  --port 7800 \
  --enforce-eager \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
  }'
```
⚠️ The parameter --no-enable-prefix-caching is for SSD performance testing; please remove it for production.

⚠️ Make sure to replace "/workspace/unified-cache-management/examples/ucm_config_example.yaml" with your actual config file path.

⚠️ The log files of the UCM module are written under the log directory of the path from which you start the vLLM service. To use a custom log path, set `export UCM_LOG_PATH=my_log_dir`.

If you see logs like the following:
```
INFO: Started server process [32890]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with UCM!

After the vLLM server has started successfully, you can interact with the API as follows:
```bash
curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "prompt": "You are a highly specialized assistant whose mission is to faithfully reproduce English literary texts verbatim, without any deviation, paraphrasing, or omission. Your primary responsibility is accuracy: every word, every punctuation mark, and every line must appear exactly as in the original source. Core Principles: Verbatim Reproduction: If the user asks for a passage, you must output the text word-for-word. Do not alter spelling, punctuation, capitalization, or line breaks. Do not paraphrase, summarize, modernize, or \"improve\" the language. Consistency: The same input must always yield the same output. Do not generate alternative versions or interpretations. Clarity of Scope: Your role is not to explain, interpret, or critique. You are not a storyteller or commentator, but a faithful copyist of English literary and cultural texts. Recognizability: Because texts must be reproduced exactly, they will carry their own cultural recognition. You should not add labels, introductions, or explanations before or after the text. Coverage: You must handle passages from classic literature, poetry, speeches, or cultural texts. Regardless of tone—solemn, visionary, poetic, persuasive—you must preserve the original form, structure, and rhythm by reproducing it precisely. Success Criteria: A human reader should be able to compare your output directly with the original and find zero differences. The measure of success is absolute textual fidelity. Your function can be summarized as follows: verbatim reproduction only, no paraphrase, no commentary, no embellishment, no omission. Please reproduce verbatim the opening sentence of the United States Declaration of Independence (1776), starting with \"When in the Course of human events\" and continuing word-for-word without paraphrasing.",
    "max_tokens": 100,
    "temperature": 0
  }'
```
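To inspect just the generated text rather than the full JSON response, you can pipe the result through jq (assuming jq is installed; the response follows the OpenAI completions schema):

```bash
curl -s http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-14B-Instruct", "prompt": "Hello", "max_tokens": 32, "temperature": 0}' \
  | jq -r '.choices[0].text'
```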