# Quickstart-vLLM-Ascend
This document describes how to install unified-cache-management with vLLM-Ascend on the Ascend platform.

## Prerequisites
vllm-ascend >= v0.9.1 (vllm == 0.9.2 is required to use the sparse feature)

**Please refer to the [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) guide to meet the required dependencies, and prepare the corresponding version of the vllm-ascend environment as needed.**

## Step 1: UCM Installation

We offer three options to install UCM.

### Option 1: Build from source

1. Install unified-cache-management from source by following the commands below.
**Note:** The sparse module is not compiled by default. To enable it, set the environment variable `export ENABLE_SPARSE=TRUE` before you build.
```bash
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=ascend
pip install -v -e . --no-build-isolation
cd ..
```

>**Note:** For the Atlas A3 series, the `PLATFORM` variable should be set to `ascend-a3`.

2. Apply the vLLM and vLLM-Ascend integration patches (not required for versions >= v0.17.0rc1).
To enable Unified Cache Management (UCM) integration, you need to apply patches to both the vLLM and vLLM-Ascend source trees.

#### Option A: Monkey Patch (Recommended)

This method enables UCM features dynamically at runtime via environment variables, requiring no source code modifications.

1. Enable Monkey Patch:
```bash
export ENABLE_UCM_PATCH=1
```
>**Note:** Enabling `ENABLE_UCM_PATCH` is required to use the Prefix Caching feature with UCM.

2. Enable Sparse Attention (supported on vLLM v0.11.0):
```bash
export ENABLE_SPARSE=1
```
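Putting the two steps together, a typical monkey-patch setup before launching vLLM might look like this (using only the variables introduced above):

```shell
# Enable the UCM monkey patch (required for Prefix Caching with UCM)
export ENABLE_UCM_PATCH=1
# Optionally enable sparse attention (vLLM v0.11.0 only)
export ENABLE_SPARSE=1

# Sanity-check that both variables are visible to child processes
printenv ENABLE_UCM_PATCH ENABLE_SPARSE
```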

#### Option B: Manual Git Patch (Legacy/Alternative)

If you prefer modifying the source code directly, follow these steps:

**Step 1:** Apply the vLLM Patch

First, apply the standard vLLM integration patch in the vLLM source directory:
    
```bash
cd <path_to_vllm>
# Replace <vLLM_VERSION> with 0.9.2 or 0.11.0
git apply <patch_to_ucm>/ucm/integration/vllm/patch/<vLLM_VERSION>/vllm-adapt.patch
```
    
**Step 2:** Apply the vLLM-Ascend Patch

Then, switch to the vLLM-Ascend source directory and apply the Ascend-specific patch:

```bash
cd <path_to_vllm_ascend>
# Replace <vLLM_VERSION> with 0.9.2 or 0.11.0
git apply <patch_to_ucm>/ucm/integration/vllm/patch/<vLLM_VERSION>/vllm-ascend-adapt.patch
```

>**Note:** The ReRoPE algorithm is not currently supported on Ascend. Only the standard UCM integration is applicable for vLLM-Ascend.


### Option 2: Install by pip
Install via pip, or find the pre-built wheels on [PyPI](https://pypi.org/project/uc-manager/).
```bash
export PLATFORM=ascend
pip install uc-manager
```
> **Note:** If installing via `pip install`, you need to manually add the `config.yaml` file, similar to `unified-cache-management/examples/ucm_config_example.yaml`, because PyPI packages do not include YAML files.

### Option 3: Setup from docker

#### Build image from source
Use the following command to build UCM with vLLM-Ascend (v0.17.0rc1):
```bash
docker build -t ucm-vllm:latest -f ./docker/Dockerfile.ucm-vllm-ascend.a2-v0.17.0 ./
```

For vLLM-Ascend(v0.11.0) with sparse attention support:
```bash
docker build -t ucm-vllm-sparse:latest -f ./docker/Dockerfile.ucm-vllm-ascend.a2-v0.11.0 ./
```

The Dockerfile automatically invokes the build script (`scripts/build_ascend.sh`) to compile the wheel and installs from the built package.

#### Build image from pre-built package

If you have a pre-built tar package (e.g. from CI), extract it and build the image in `package` mode:
```bash
mkdir -p /tmp/ucm-pkg && tar xzf AI-Storage-Kit_*.tar.gz -C /tmp/ucm-pkg
docker build --build-arg INSTALL_MODE=package \
  -t ucm-vllm:latest -f /tmp/ucm-pkg/docker/Dockerfile.ucm-vllm-ascend.a2-v0.17.0 /tmp/ucm-pkg
```

vllm-ascend provides two variants: **Ubuntu** and **openEuler**.
The `Dockerfile.ucm-vllm-ascend.a2-v0.17.0` uses the **Ubuntu** variant by default.

If you want to use the **openEuler** variant, override the base image with `--build-arg IMAGE_NAME_VERSION`:

```text
quay.io/ascend/vllm-ascend:v0.11.0-openeuler
```
Then run your container using the following command. You can add or remove Docker parameters as needed.
```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
docker run --rm \
    --network=host \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -v <path_to_your_models>:/app/model \
    -v <path_to_your_storage>:/app/storage \
    --name <name_of_your_container> \
    -it <image_id> bash
```
## Step 2: Configuration

### Features Overview

UCM supports two key features: **Prefix Cache** and **Sparse Attention**. Each feature supports both **Offline Inference** and **Online API** modes. More details are available via the links below:
- [Prefix Cache](../user-guide/prefix-cache/index.md)
- [GSA Sparsity](../user-guide/sparse-attention/gsa.md)

For a quick start, just follow the guide below to launch your own inference experience.

### Feature 1: Prefix Caching

You may directly edit the example file at `unified-cache-management/examples/ucm_config_example.yaml`. For more details, please refer to the [Prefix Cache with NFS Store](../user-guide/prefix-cache/nfs_store.md) and [Prefix Cache with Pipeline Store](../user-guide/prefix-cache/pipeline_store.md) documents.

⚠️ Make sure to replace `/mnt/test` with your actual storage directory. 

### Feature 2: Sparsity

The sparse module is not compiled by default. To enable it, set the environment variable `export ENABLE_SPARSE=TRUE` and recompile the code, then uncomment the `ucm_sparse_config` block in `unified-cache-management/examples/ucm_config_example.yaml`. Additionally, if you want to run GSA, you also need to set the environment variable `export VLLM_HASH_ATTENTION=1`.
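Taken together, a sketch of the environment setup for the sparsity feature (using only the variables named above; the rebuild command mirrors the source build from Step 1):

```shell
# Compile the sparse module on the next build
export ENABLE_SPARSE=TRUE
# Required to run GSA
export VLLM_HASH_ATTENTION=1
# Then rebuild UCM from the repo root, e.g.:
#   pip install -v -e . --no-build-isolation
printenv ENABLE_SPARSE VLLM_HASH_ATTENTION
```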

## Step 3: Launching Inference

<details open>
<summary><b>Offline Inference</b></summary>

In the `examples/` directory, you will find the `offline_inference.py` script used for offline inference. Before executing the script, locate line 25 and replace the `UCM_CONFIG_FILE` value with the path to your own configuration file.
```python
def build_llm_with_uc(module_path: str, name: str, model: str):
    ktc = KVTransferConfig(
        kv_connector=name,
        kv_connector_module_path=module_path,
        kv_role="kv_both",
        kv_connector_extra_config={
            "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
        },
    )
```
Then run the following commands:

```bash
cd examples/
# Change the model path to your own model path
python offline_inference.py
```

</details>



<details open>
<summary><b>OpenAI-Compatible Online API</b></summary>

For online inference, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--block_size 128 \
--trust-remote-code \
--port 7800 \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
}'
```
**⚠️ The parameter `--no-enable-prefix-caching` is intended for SSD performance testing; remove it for production.**

**⚠️ Make sure to replace `"/workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**

**⚠️ The log files of the UCM module will be placed under the `log` directory of the path from which you start the vLLM service. To use a custom log path, set `export UCM_LOG_PATH=my_log_dir`.**
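For example, to collect UCM logs in a dedicated directory before starting the server (the directory name `my_log_dir` is just the placeholder used above):

```shell
# Point UCM at a custom log directory and create it up front
export UCM_LOG_PATH=my_log_dir
mkdir -p "$UCM_LOG_PATH"
ls -d "$UCM_LOG_PATH"
```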

If you see logs like the following:

```bash
INFO:     Started server process [32890]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Congratulations, you have successfully started the vLLM server with UCM!

After the vLLM server has started successfully, you can interact with the API as follows:

```bash
curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "prompt": "You are a highly specialized assistant whose mission is to faithfully reproduce English literary texts verbatim, without any deviation, paraphrasing, or omission. Your primary responsibility is accuracy: every word, every punctuation mark, and every line must appear exactly as in the original source. Core Principles: Verbatim Reproduction: If the user asks for a passage, you must output the text word-for-word. Do not alter spelling, punctuation, capitalization, or line breaks. Do not paraphrase, summarize, modernize, or \"improve\" the language. Consistency: The same input must always yield the same output. Do not generate alternative versions or interpretations. Clarity of Scope: Your role is not to explain, interpret, or critique. You are not a storyteller or commentator, but a faithful copyist of English literary and cultural texts. Recognizability: Because texts must be reproduced exactly, they will carry their own cultural recognition. You should not add labels, introductions, or explanations before or after the text. Coverage: You must handle passages from classic literature, poetry, speeches, or cultural texts. Regardless of tone—solemn, visionary, poetic, persuasive—you must preserve the original form, structure, and rhythm by reproducing it precisely. Success Criteria: A human reader should be able to compare your output directly with the original and find zero differences. The measure of success is absolute textual fidelity. Your function can be summarized as follows: verbatim reproduction only, no paraphrase, no commentary, no embellishment, no omission. Please reproduce verbatim the opening sentence of the United States Declaration of Independence (1776), starting with \"When in the Course of human events\" and continuing word-for-word without paraphrasing.",
    "max_tokens": 100,
    "temperature": 0
  }'

```
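Since the server implements the OpenAI API protocol, the official `openai` Python client can be used as well. A minimal sketch, assuming the `openai` package is installed and the server above is running on port 7800 (the prompt here is just a placeholder):

```python
# Query the UCM-backed vLLM server through the OpenAI-compatible API
from openai import OpenAI

# vLLM does not check the API key by default; any string works
client = OpenAI(base_url="http://localhost:7800/v1", api_key="EMPTY")
response = client.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    prompt="When in the Course of human events",
    max_tokens=100,
    temperature=0,
)
print(response.choices[0].text)
```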
</details>
