How to Add A New Metric#

UCM allows developers to add new metrics for monitoring service health status, and this doc provides the methods for adding new metrics.

Getting Started#

Step 1: Define New Metrics in YAML#

Prometheus provides three fundamental metric types: Counter, Gauge, and Histogram. UCM implements corresponding wrappers for each type. After defining new metric in yaml, it will be registered to Prometheus automatically by below function:

def _register_metrics_by_type(self, metric_type):
        """
        Register metrics by different metric types.
        """
        metric_cls, default_kwargs = self.metric_type_config[metric_type]
        cfg_list = self.config.get(metric_type, [])

        for cfg in cfg_list:
            name = cfg.get("name")
            doc = cfg.get("documentation", "")
            # Prometheus metric name with prefix
            prometheus_name = f"{self.metric_prefix}{name}"
            ucmmetrics.create_stats(name, metric_type)

            metric_kwargs = {
                "name": prometheus_name,
                "documentation": doc,
                "labelnames": self.labelnames,
                **default_kwargs,
                **{k: v for k, v in cfg.items() if k in default_kwargs},
            }

            self.metric_mappings[name] = metric_cls(**metric_kwargs)

Example of yaml below:

# Prometheus Metrics Configuration
# This file defines which metrics should be enabled and their configurations
log_interval: 5  # Interval in seconds for logging metrics

multiproc_dir: "/vllm-workspace"  # Directory for Prometheus multiprocess mode

metric_prefix: "ucm:" 

histogram_max_length: 10000  # Maximum length of the vector for each histogram metric

# Counter metrics configuration
# counter:
#   - name: "received_requests"
#     documentation: "Total number of requests sent to ucm"

# Gauge metrics configuration
# gauge:
#   - name: "lookup_hit_rate"
#     documentation: "Hit rate of ucm lookup requests since last log"
#     multiprocess_mode: "livemostrecent"

# Histogram metrics configuration
histogram:
  - name: "load_requests_num"
    documentation: "Number of requests loaded from ucm"
    buckets: [1, 5, 10, 20, 50, 100, 200, 500, 1000]
  - name: "d2s_bandwidth"
    documentation: "Band width of uc store task d2s, copy tensors from device to storage"
    buckets: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
  - name: "s2d_bandwidth"
    documentation: "Band width of uc store task s2d, copy tensors from storage to device"
    buckets: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

Please refer to the example YAML for more detailed information.

Step 2: Use Metrics APIs to Update Stats#

After defining metrics in yaml, users only need to link metrics/import ucmmetrics and update them in suitable position, while observability component is responsible for fetching the stats.

Example: Import the ucmmetrics and then use update_stats to update new metrics.

# 1. Import ucmmetrics
from ucm.shared.metrics import ucmmetrics

# 2. Update a stat
ucmmetrics.update_stats(
  {"interval_lookup_hit_rates": external_hit_blocks / len(ucm_block_ids)},
)

# 2. Update stats
ucmmetrics.update_stats(
  {
      "load_requests_total": num_loaded_request,
      "load_blocks_total": num_loaded_block,
      "load_duration": load_end_time - load_start_time,
      "load_speed": load_speed,
  }
)

See more detailed example in test case.

Example: UCM supports custom metrics by following steps:

  • Step 1: linking the static library metrics

     target_link_libraries(xxxstore PUBLIC storeinfra metrics)
    
  • Step 2: Update using function UpdateStats

// 1. Include metrics api head file
#include "metrics_api.h"

// 2. Update metrics defined in yaml
auto Epilog(const size_t ioSize) const noexcept
  {
      auto total = ioSize * number_;
      auto costs = NowTp() - startTp;
      auto bw = double(total) / costs / 1e9;
      switch (type)
      {
      case Type::DUMP:
          UC::Metrics::UpdateStats("d2s_bandwidth", bw);
          break;
      case Type::LOAD:
          UC::Metrics::UpdateStats("s2d_bandwidth", bw);
          break;
      default:
          break;
      }
      return fmt::format("Task({},{},{},{}) finished, costs={:.06f}s, bw={:.06f}GB/s.", id,
                          brief_, number_, total, costs, bw);
  }

See more detailed example in test case.

How to See New Metrics#

After completing the above two steps, developers can view the newly added metrics via the /metrics endpoint.

Developers can also add a new panel in grafana.json to display the newly added metrics. Refer to grafana example for more information.