Quickstart-vLLM#

This document describes how to install unified-cache-management (UCM) with vLLM on the CUDA platform.

Prerequisites#

  • vLLM >= 0.9.1, device = CUDA (the sparse feature is supported in vLLM v0.9.2 and v0.11.0)

Step 1: UCM Installation#

We offer three options to install UCM.

Option 1: Setup from docker#

Official pre-built image#

docker pull unifiedcachemanager/ucm:latest

Then run your container using the following command.

# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --rm \
    --gpus all \
    --network=host \
    --ipc=host \
    -v <path_to_your_models>:/home/model \
    -v <path_to_your_storage>:/home/storage \
    --name <name_of_your_container> \
    -it unifiedcachemanager/ucm:latest
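
Once inside the container, you can optionally sanity-check the installation (assuming the image ships UCM as the uc-manager pip package and an importable ucm module):

pip show uc-manager        # assumed package name, matching the PyPI name in Option 3
python -c "import ucm"     # assumed module name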

Build image from source#

git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management

Use the following command to build UCM with vLLM (v0.17.0):

docker build -t ucm-vllm:latest -f ./docker/Dockerfile.ucm-vllm-cuda-v0.17.0 ./

For vLLM (v0.11.0) with sparse attention support:

docker build -t ucm-vllm-sparse:latest -f ./docker/Dockerfile.ucm-vllm-cuda-v0.11.0 ./

The Dockerfile automatically invokes the build script (scripts/build_cuda.sh) to compile the wheel and install it from the built package.
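
If you want to reproduce the compile step outside of Docker, the same script can be invoked directly; a sketch, assuming a host with the CUDA build toolchain (Option 2 below remains the supported source-build path):

bash scripts/build_cuda.sh   # same script the Dockerfile runs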

Build image from pre-built package#

If you have a pre-built tar package (e.g. from CI), extract it and build the image in package mode:

mkdir -p /tmp/ucm-pkg && tar xzf AI-Storage-Kit_*.tar.gz -C /tmp/ucm-pkg
docker build --build-arg INSTALL_MODE=package \
  -t ucm-vllm:latest -f /tmp/ucm-pkg/docker/Dockerfile.ucm-vllm-cuda-v0.17.0 /tmp/ucm-pkg
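
After the build completes, you can run the resulting image with the same flags as the docker run command shown above, substituting your local tag:

docker run --rm \
    --gpus all \
    --network=host \
    --ipc=host \
    -v <path_to_your_models>:/home/model \
    -v <path_to_your_storage>:/home/storage \
    -it ucm-vllm:latest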

Option 2: Build from source#

  1. Prepare vLLM Environment

    For the sake of environment isolation and simplicity, we recommend preparing the vLLM environment by pulling the official, pre-built vLLM Docker image.

    docker pull vllm/vllm-openai:<vllm_version>
    

    Use the following command to run your own container:

    # Use `--ipc=host` to make sure the shared memory is large enough.
    docker run \
        --gpus all \
        --network=host \
        --ipc=host \
        -v <path_to_your_models>:/home/model \
        -v <path_to_your_storage>:/home/storage \
        --entrypoint /bin/bash \
        --name <name_of_your_container> \
        -it vllm/vllm-openai:<vllm_version>
    

    Refer to Set up using docker for more information on running your own vLLM container.

  2. Build From Source Code

    Follow the commands below to install unified-cache-management:

    Note: The sparse module is not compiled by default. To enable it, set the environment variable export ENABLE_SPARSE=TRUE before you build.

    # Replace <branch_or_tag_name> with the branch or tag name needed
    git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
    cd unified-cache-management
    export PLATFORM=cuda
    pip install -v -e . --no-build-isolation
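
    If you need the sparse feature, set the flag from the note above before you build; a minimal sketch of the sparse-enabled build:

    export ENABLE_SPARSE=TRUE   # compile the optional sparse module
    export PLATFORM=cuda
    pip install -v -e . --no-build-isolation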
    
  3. Apply vLLM Integration Patches (Not required for versions > 0.11.0)

    To integrate UCM with vLLM 0.11.0, you can choose between a dynamic monkey patch (recommended) and a manual git patch.

    Option A: Dynamic Monkey Patch (Recommended)

    We highly recommend the Monkey Patch approach for its non-invasive nature and ease of use. This method enables UCM features dynamically at runtime via environment variables, requiring no source code modifications.

    1. Enable Monkey Patch:

    export ENABLE_UCM_PATCH=1
    
    2. Enable Sparse Attention (Optional):

    export ENABLE_SPARSE=1
    

    Note:

    • Monkey patch is only available for vLLM 0.11.0.

    • Enabling ENABLE_UCM_PATCH is required to use the Prefix Caching feature with UCM on vLLM 0.11.0.

    • ReRoPE support is currently only available via the Git Patch method.
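
    Putting the two together, a typical monkey-patched launch on vLLM 0.11.0 might look like the sketch below (the serve command is abbreviated here; see Step 3 for the full flags):

    export ENABLE_UCM_PATCH=1   # required for Prefix Caching with UCM on vLLM 0.11.0
    export ENABLE_SPARSE=1      # optional: sparse attention
    vllm serve <your_model> ...  # full command in Step 3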

    Option B: Manual Git Patch (Legacy/Alternative)

    If you prefer modifying the source code directly, follow these steps:

    1. Navigate to your vLLM source directory:

    cd <path_to_vllm>
    

    2. Apply the patch that corresponds to your vLLM version and requirements:

    vLLM 0.9.2

    • Full UCM integration (recommended):

    git apply <path_to_ucm>/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
    
    • Sparse attention only:

    git apply <path_to_ucm>/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
    
    • ReRoPE support only:

    git apply <path_to_ucm>/ucm/integration/vllm/patch/0.9.2/vllm-adapt-rerope.patch
    

    vLLM 0.11.0

    v0.11.0 only requires the sparse attention patch:

    git apply <path_to_ucm>/ucm/integration/vllm/patch/0.11.0/vllm-adapt-sparse.patch
    
    • ReRoPE support only (optional):

    git apply <path_to_ucm>/ucm/integration/vllm/patch/0.11.0/vllm-adapt-rerope.patch
    

    Choose the patch according to your development needs. If you are working on sparse attention or ReRoPE independently, applying only the corresponding patch is sufficient.
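
    Tip: you can dry-run any of the patches above before applying them, using git's built-in check mode (standard git behavior, not UCM-specific):

    git apply --check <path_to_ucm>/ucm/integration/vllm/patch/0.11.0/vllm-adapt-sparse.patch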

Option 3: Install by pip#

  1. Prepare vLLM Environment

    It is recommended to use a pre-built vLLM Docker image; please follow the guide in Option 2.

  2. Install by pip

    Install with pip, or find the pre-built wheels on PyPI.

    export PLATFORM=cuda
    pip install uc-manager
    

Note: If you install via pip, you need to manually add a config.yaml file (similar to unified-cache-management/examples/ucm_config_example.yaml), because the PyPI package does not include YAML files.
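
If you only need the example config file and not the full repository, one option is to fetch it from GitHub directly (the raw URL below is inferred from the repository layout):

curl -O https://raw.githubusercontent.com/ModelEngine-Group/unified-cache-management/<branch_or_tag_name>/examples/ucm_config_example.yaml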

Step 2: Configuration#

Features Overview#

UCM supports two key features: Prefix Cache and Sparse Attention. Each feature supports both Offline Inference and Online API modes. More details are available via the linked feature documentation.

For quick start, just follow the guide below to launch your own inference experience.

Feature 1: Prefix Caching#

You may directly edit the example file at unified-cache-management/examples/ucm_config_example.yaml. For more details, refer to the Prefix Cache with NFS Store and Prefix Cache with Pipeline Store documents.

⚠️ Make sure to replace /mnt/test with your actual storage directory.
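
For example, a one-line substitution of the placeholder path (assuming the example file uses /mnt/test as noted, and /home/storage is the storage directory mounted in Step 1):

sed -i 's#/mnt/test#/home/storage#g' examples/ucm_config_example.yaml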

Feature 2: Sparsity#

The sparse module is not compiled by default. To enable it, set the environment variable export ENABLE_SPARSE=TRUE and re-compile the code, then uncomment the ucm_sparse_config block in unified-cache-management/examples/ucm_config_example.yaml.

Step 3: Launching Inference#

Offline Inference

In the examples/ directory, you will find the offline_inference.py script used for offline inference. Before executing the script, locate line 25 and replace the UCM_CONFIG_FILE value with the path to your own configuration file.

from vllm import LLM
from vllm.config import KVTransferConfig


def build_llm_with_uc(module_path: str, name: str, model: str):
    ktc = KVTransferConfig(
        kv_connector=name,
        kv_connector_module_path=module_path,
        kv_role="kv_both",
        kv_connector_extra_config={
            # Replace this value with the path to your own config file
            "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
        },
    )
    # Build the LLM with the UCM transfer config (simplified; the example
    # script may pass additional engine arguments).
    return LLM(model=model, kv_transfer_config=ktc)

Then run the following commands:

cd examples/
# Change the model path to your own model path
python offline_inference.py

OpenAI-Compatible Online API

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:

vllm serve Qwen/Qwen2.5-14B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.87 \
--block-size 128 \
--trust-remote-code \
--port 7800 \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
}'

⚠️ The parameter --no-enable-prefix-caching is intended for SSD performance testing; remove it for production.

⚠️ Make sure to replace "/workspace/unified-cache-management/examples/ucm_config_example.yaml" with your actual config file path.

⚠️ The UCM module's log files are written to the log directory under the path from which you start the vLLM service. To use a custom log path, set export UCM_LOG_PATH=my_log_dir.

If you see logs like the following:

INFO:     Started server process [32890]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Congratulations, you have successfully started the vLLM server with UCM!

After the vLLM server has started successfully, you can interact with the API as follows:

curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "prompt": "You are a highly specialized assistant whose mission is to faithfully reproduce English literary texts verbatim, without any deviation, paraphrasing, or omission. Your primary responsibility is accuracy: every word, every punctuation mark, and every line must appear exactly as in the original source. Core Principles: Verbatim Reproduction: If the user asks for a passage, you must output the text word-for-word. Do not alter spelling, punctuation, capitalization, or line breaks. Do not paraphrase, summarize, modernize, or \"improve\" the language. Consistency: The same input must always yield the same output. Do not generate alternative versions or interpretations. Clarity of Scope: Your role is not to explain, interpret, or critique. You are not a storyteller or commentator, but a faithful copyist of English literary and cultural texts. Recognizability: Because texts must be reproduced exactly, they will carry their own cultural recognition. You should not add labels, introductions, or explanations before or after the text. Coverage: You must handle passages from classic literature, poetry, speeches, or cultural texts. Regardless of tone—solemn, visionary, poetic, persuasive—you must preserve the original form, structure, and rhythm by reproducing it precisely. Success Criteria: A human reader should be able to compare your output directly with the original and find zero differences. The measure of success is absolute textual fidelity. Your function can be summarized as follows: verbatim reproduction only, no paraphrase, no commentary, no embellishment, no omission. Please reproduce verbatim the opening sentence of the United States Declaration of Independence (1776), starting with \"When in the Course of human events\" and continuing word-for-word without paraphrasing.",
    "max_tokens": 100,
    "temperature": 0
  }'
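
You can also confirm that the server is up and serving the expected model via the standard OpenAI-compatible models endpoint that vLLM exposes:

curl http://localhost:7800/v1/models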