Observability#
UCM (Unified Cache Management) exposes detailed metrics through Prometheus endpoints, enabling in-depth monitoring of cache performance and behavior. This document describes how to enable and configure observability via the embedded vLLM /metrics API endpoint.
Quick Start Guide#
1) On UCM Side#
First, set the PROMETHEUS_MULTIPROC_DIR environment variable.
export PROMETHEUS_MULTIPROC_DIR=/vllm-workspace
Then, uncomment metrics_config_path in UCM's config.yaml; this path points to the file that specifies which metrics to collect.
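For illustration only, the uncommented entry might look like the fragment below. The key name comes from the text above; the value shown is the example metrics file path used later in this document, and may differ in your deployment:

```yaml
# config.yaml (fragment, illustrative)
metrics_config_path: "unified-cache-management/examples/metrics/metrics_configs.yaml"
```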
After completing the two steps above, you can start the service to collect metrics.
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-14B-Instruct \
--max-model-len 5000 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--disable-log-requests \
--no-enable-prefix-caching \
--enforce-eager \
--max-num-batched-tokens 40000 \
--max-num-seqs 10 \
--host 0.0.0.0 \
--port 8000 \
--kv-transfer-config \
'{
"kv_connector": "UCMConnector",
"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config.yaml"
}
}'
You can use the vllm bench serve command to run benchmarks:
vllm bench serve \
--backend vllm \
--model /home/models/Qwen2.5-14B-Instruct \
--host 127.0.0.1 \
--port 8000 \
--dataset-name random \
--num-prompts 20 \
--random-input-len 200 \
--random-output-len 10 \
--request-rate 1 \
--ignore-eos
Once the HTTP server is running, you can access the UCM metrics at the /metrics endpoint.
curl http://<vllm-worker-ip>:8000/metrics | grep ucm:
You will also find some .db files in the $PROMETHEUS_MULTIPROC_DIR directory; these are temporary files the Prometheus client library uses to share metrics across worker processes.
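Those .db files come from the Prometheus Python client's multiprocess mode: each worker process writes its samples to memory-mapped files under PROMETHEUS_MULTIPROC_DIR, and a scrape merges them into one exposition. A minimal standalone sketch of that mechanism (the `ucm_demo_*` metric name is illustrative, not UCM's actual code):

```python
import os
import tempfile

# The env var must be set before prometheus_client is imported,
# otherwise the client stays in single-process mode.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

from prometheus_client import CollectorRegistry, Histogram, generate_latest
from prometheus_client import multiprocess

# Observations land in .db files under PROMETHEUS_MULTIPROC_DIR
# rather than in process-local memory.
load_ms = Histogram("ucm_demo_load_duration_ms",
                    "Demo KV-cache load time (ms)",
                    buckets=[1, 5, 10, 50, 100])
load_ms.observe(7.5)

# At scrape time, the collector merges every worker's .db files.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```

This is why the files must live in a directory shared by all vLLM worker processes, and why the env var has to be exported before the service starts.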
2) Start Prometheus and Grafana with Docker Compose#
Create Docker Compose Configuration Files#
First, create the docker-compose.yaml file:
# docker-compose.yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus:latest
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "9090:9090"
    volumes:
      - ${PWD}/prometheus.yaml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
Then, create the prometheus.yaml configuration file:
# prometheus.yaml
global:
  scrape_interval: 5s
  evaluation_interval: 30s
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets:
          - 'host.docker.internal:8000'
Note: Make sure the port number in prometheus.yaml matches the port number used when starting the vLLM service.
Start Services#
Run the following command in the directory containing docker-compose.yaml and prometheus.yaml:
docker compose up
This will start Prometheus and Grafana services.
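Once the stack is up, you can confirm that Prometheus is actually scraping the vLLM endpoint by querying its standard /api/v1/targets API. A small helper sketch (the default URL is an assumption matching the 9090:9090 port mapping in docker-compose.yaml):

```python
import json
import urllib.request

def scrape_targets(prom_url: str = "http://localhost:9090"):
    """Return (job, health) for every active Prometheus scrape target.

    Uses Prometheus' standard /api/v1/targets endpoint; prom_url is an
    assumption matching the port mapping in docker-compose.yaml.
    """
    with urllib.request.urlopen(f"{prom_url}/api/v1/targets") as resp:
        data = json.load(resp)
    return [(t["labels"]["job"], t["health"])
            for t in data["data"]["activeTargets"]]
```

In a healthy setup this should report the vllm job with health "up"; a "down" target usually means the port in prometheus.yaml does not match the running vLLM service.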
3) Configure Grafana Dashboard#
Access Grafana#
Navigate to http://<your-host>:3000. Log in with the default username (admin) and password (admin). You will be prompted to change the password on first login.
Add Prometheus Data Source#
Navigate to http://<your-host>:3000/connections/datasources/new and select Prometheus. On the Prometheus configuration page, add the Prometheus server URL in the Connection section. In this Docker Compose setup, Grafana and Prometheus run in separate containers, but Docker creates DNS names for each container, so you can use http://prometheus:9090 directly. Click Save & Test. You should see a green checkmark showing "Successfully queried the Prometheus API."
Import Dashboard#
Navigate to http://<your-host>:3000/dashboard/import. Click Upload JSON file, then upload the unified-cache-management/examples/metrics/grafana.json file. Select the Prometheus data source configured earlier, then click Import to complete the import.
You should now be able to see the UCM monitoring dashboard with real-time visualization of all 9 metrics.
Available Metrics#
UCM exposes various metrics to monitor its performance. The following table lists all available metrics organized by category:
| Metric Name | Type | Description |
|---|---|---|
| **Load Operation Metrics** | | |
| `ucm:load_requests_num` | Histogram | Number of requests loaded per `log_interval` |
| | Histogram | Number of blocks loaded per `log_interval` |
| | Histogram | Time to load KV cache from UCM (milliseconds) |
| | Histogram | Speed of loading from UCM (GB/s) |
| **Save Operation Metrics** | | |
| | Histogram | Number of requests saved per `log_interval` |
| | Histogram | Number of blocks saved per `log_interval` |
| | Histogram | Time to save to UCM (milliseconds) |
| | Histogram | Speed of saving to UCM (GB/s) |
| **Lookup Hit Rate Metrics** | | |
| | Histogram | Hit rate of UCM lookup requests |
Prometheus Configuration#
Metrics configuration is defined in the unified-cache-management/examples/metrics/metrics_configs.yaml file:
log_interval: 5 # Interval in seconds for logging metrics
multiproc_dir: "/vllm-workspace" # Prometheus directory
metric_prefix: "ucm:" # Metric name prefix
histograms:
  - name: "load_requests_num"
    documentation: "Number of requests loaded from ucm"
    buckets: [1, 5, 10, 20, 50, 100, 200, 500, 1000]
  # ... other metric configurations
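The bucket boundaries above determine how precisely Grafana can report percentiles: PromQL's histogram_quantile() finds the bucket containing the requested rank and linearly interpolates inside it. A standalone sketch of that interpolation (a hypothetical helper, not part of UCM or the Prometheus client):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound and
    ending with float('inf'), mirroring Prometheus '_bucket' series.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket;
                # fall back to the last finite boundary.
                return prev_bound
            # Linear interpolation within the containing bucket.
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because interpolation assumes observations are spread evenly inside a bucket, percentiles are only as accurate as the bucket edges are close to the values you care about, which is why the bucket lists in metrics_configs.yaml are worth tuning to your workload.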