Prometheus metrics
OpenClaw can expose diagnostics metrics through the official diagnostics-prometheus plugin. It listens to trusted internal diagnostics and renders a Prometheus text endpoint at:

    GET /api/diagnostics/prometheus

Content type is `text/plain; version=0.0.4; charset=utf-8`, the standard Prometheus exposition format.
For traces, logs, OTLP push, and OpenTelemetry GenAI semantic attributes, see OpenTelemetry export.
Quick start
Install the plugin

    openclaw plugins install clawhub:@openclaw/diagnostics-prometheus

Enable the plugin

    {
      plugins: {
        allow: ["diagnostics-prometheus"],
        entries: {
          "diagnostics-prometheus": { enabled: true },
        },
      },
      diagnostics: {
        enabled: true,
      },
    }

From the command line:

    openclaw plugins enable diagnostics-prometheus

Restart the Gateway
The HTTP route is registered at plugin startup, so reload after enabling.
Scrape the protected route
Send the same gateway auth your operator clients use:

    curl -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" \
      http://127.0.0.1:18789/api/diagnostics/prometheus

Wire Prometheus

    # prometheus.yml
    scrape_configs:
      - job_name: openclaw
        scrape_interval: 30s
        metrics_path: /api/diagnostics/prometheus
        authorization:
          credentials_file: /etc/prometheus/openclaw-gateway-token
        static_configs:
          - targets: ["openclaw-gateway:18789"]
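
For a quick check outside Prometheus, the curl call above can be reproduced with Python's standard library. This is only a minimal sketch: the helper names `build_scrape_request` and `scrape` are hypothetical, and the base URL and the `OPENCLAW_GATEWAY_TOKEN` variable are taken from the curl example rather than from any plugin API.

```python
import urllib.request

METRICS_PATH = "/api/diagnostics/prometheus"


def build_scrape_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the metrics route with Gateway bearer auth."""
    url = base_url.rstrip("/") + METRICS_PATH
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def scrape(base_url: str, token: str, timeout: float = 5.0) -> str:
    """Fetch the exposition text. urlopen raises HTTPError on 401 and other error statuses."""
    request = build_scrape_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (requires a running Gateway, so it is left commented out):
#   import os
#   print(scrape("http://127.0.0.1:18789", os.environ["OPENCLAW_GATEWAY_TOKEN"])[:500])
```

Keeping request construction separate from the network call makes the auth header easy to verify without a live Gateway.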
Metrics exported
| Metric | Type | Labels |
|---|---|---|
| `openclaw_run_completed_total` | counter | channel, model, outcome, provider, trigger |
| `openclaw_run_duration_seconds` | histogram | channel, model, outcome, provider, trigger |
| `openclaw_model_call_total` | counter | api, error_category, model, outcome, provider, transport |
| `openclaw_model_call_duration_seconds` | histogram | api, error_category, model, outcome, provider, transport |
| `openclaw_model_tokens_total` | counter | agent, channel, model, provider, token_type |
| `openclaw_gen_ai_client_token_usage` | histogram | model, provider, token_type |
| `openclaw_model_cost_usd_total` | counter | agent, channel, model, provider |
| `openclaw_skill_used_total` | counter | activation, agent, skill, source |
| `openclaw_tool_execution_total` | counter | error_category, outcome, params_kind, tool, tool_owner, tool_source |
| `openclaw_tool_execution_duration_seconds` | histogram | error_category, outcome, params_kind, tool, tool_owner, tool_source |
| `openclaw_harness_run_total` | counter | channel, error_category, harness, model, outcome, phase, plugin, provider |
| `openclaw_harness_run_duration_seconds` | histogram | channel, error_category, harness, model, outcome, phase, plugin, provider |
| `openclaw_message_received_total` | counter | channel, source |
| `openclaw_message_dispatch_started_total` | counter | channel, source |
| `openclaw_message_dispatch_completed_total` | counter | channel, outcome, reason, source |
| `openclaw_message_dispatch_duration_seconds` | histogram | channel, outcome, reason, source |
| `openclaw_message_processed_total` | counter | channel, outcome, reason |
| `openclaw_message_processed_duration_seconds` | histogram | channel, outcome, reason |
| `openclaw_message_delivery_started_total` | counter | channel, delivery_kind |
| `openclaw_message_delivery_total` | counter | channel, delivery_kind, error_category, outcome |
| `openclaw_message_delivery_duration_seconds` | histogram | channel, delivery_kind, error_category, outcome |
| `openclaw_talk_event_total` | counter | brain, event_type, mode, provider, transport |
| `openclaw_talk_event_duration_seconds` | histogram | brain, event_type, mode, provider, transport |
| `openclaw_talk_audio_bytes` | histogram | brain, event_type, mode, provider, transport |
| `openclaw_queue_lane_size` | gauge | lane |
| `openclaw_queue_lane_wait_seconds` | histogram | lane |
| `openclaw_session_state_total` | counter | reason, state |
| `openclaw_session_queue_depth` | gauge | state |
| `openclaw_session_turn_created_total` | counter | agent, channel, trigger |
| `openclaw_session_recovery_total` | counter | action, active_work_kind, state, status |
| `openclaw_session_recovery_age_seconds` | histogram | action, active_work_kind, state, status |
| `openclaw_memory_bytes` | gauge | kind |
| `openclaw_memory_rss_bytes` | histogram | none |
| `openclaw_memory_pressure_total` | counter | level, reason |
| `openclaw_telemetry_exporter_total` | counter | exporter, reason, signal, status |
| `openclaw_prometheus_series_dropped_total` | counter | none |
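
To show how the table maps onto scrape output, here is a sketch that parses lines of the Prometheus text format into `(name, labels, value)` tuples. The sample payload, including the `lane="main"` value, is invented for illustration and was not captured from a real Gateway. The parser is deliberately simplified: it does not handle escaped quotes in label values or trailing timestamps.

```python
import re

# Matches: metric_name{label="value",...} 12.5   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric name, labels, value) for each sample line, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE metadata
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # outside what this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, for illustration only.
EXAMPLE = """\
# TYPE openclaw_queue_lane_size gauge
openclaw_queue_lane_size{lane="main"} 3
# TYPE openclaw_prometheus_series_dropped_total counter
openclaw_prometheus_series_dropped_total 0
"""

for sample in parse_samples(EXAMPLE):
    print(sample)
```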
Label policy
Bounded, low-cardinality labels
Prometheus labels stay bounded and low-cardinality. The exporter does not emit raw diagnostic identifiers such as runId, sessionKey, sessionId, callId, toolCallId, message IDs, chat IDs, or provider request IDs.
Label values are redacted and must match OpenClaw’s low-cardinality character policy. Values that fail the policy are replaced with unknown, other, or none, depending on the metric. Labels that look like scoped agent session keys are also replaced with unknown.
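
The replacement behavior described above can be pictured with a small sketch. The character set and length limit here are placeholders chosen for illustration, not OpenClaw's actual policy, and `bounded_label` is a hypothetical helper name. Detection of session-key-shaped values is left out because their format is not described here.

```python
import re
from typing import Optional

# Assumption: placeholder policy, NOT the real OpenClaw character policy.
ALLOWED_LABEL_VALUE = re.compile(r"[A-Za-z0-9_.\-]{1,64}")


def bounded_label(value: Optional[str], fallback: str = "unknown") -> str:
    """Return the value if it passes the placeholder policy, otherwise the fallback."""
    if value is None or ALLOWED_LABEL_VALUE.fullmatch(value) is None:
        return fallback
    return value


print(bounded_label("openai"))                  # passes the placeholder policy
print(bounded_label("chat 1234/abcd"))          # space and slash fail, so the fallback is used
print(bounded_label(None, fallback="none"))     # missing value with a per-metric fallback
```

The per-call `fallback` argument mirrors the point that the substitute is `unknown`, `other`, or `none` depending on the metric.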
Series cap and overflow accounting
The exporter caps retained time series in memory at 2048 series across counters, gauges, and histograms combined. New series beyond that cap are dropped, and openclaw_prometheus_series_dropped_total increments by one each time.
Watch this counter as a hard signal that an attribute upstream is leaking high-cardinality values. The exporter never lifts the cap automatically; if it climbs, fix the source rather than disabling the cap.
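
As a rough mental model of the cap, the sketch below keeps a fixed-size map of series and counts every rejected new series. `CappedCounters` is a hypothetical class written for this page, not the plugin's code. It covers counters only, and it assumes the drop counter itself sits outside the cap, which the text above does not confirm.

```python
class CappedCounters:
    """Counter store that refuses new series once the cap is reached."""

    def __init__(self, max_series: int = 2048) -> None:
        self.max_series = max_series
        self.series: dict[tuple, float] = {}
        # Plays the role of openclaw_prometheus_series_dropped_total.
        self.dropped_total = 0

    def inc(self, name: str, labels: dict[str, str], amount: float = 1.0) -> bool:
        """Increment a series; return False if a new series was dropped."""
        key = (name, tuple(sorted(labels.items())))
        if key not in self.series:
            if len(self.series) >= self.max_series:
                self.dropped_total += 1  # new series beyond the cap is dropped
                return False
            self.series[key] = 0.0
        self.series[key] += amount  # existing series keep updating normally
        return True


store = CappedCounters(max_series=2)
store.inc("demo_total", {"model": "a"})
store.inc("demo_total", {"model": "b"})
store.inc("demo_total", {"model": "c"})  # third distinct series is dropped
print(len(store.series), store.dropped_total)
```

Existing series keep counting after the cap is hit, so only newly appearing label combinations are lost. A leaking label therefore shows up in the drop counter instead of distorting established series.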
What never appears in Prometheus output
- prompt text, response text, tool inputs, tool outputs, system prompts
- Talk transcripts, audio payloads, call ids, room ids, handoff tokens, turn ids, and raw session ids
- raw provider request IDs (only bounded hashes, where applicable, on spans — never on metrics)
- session keys and session IDs
- hostnames, file paths, secret values
PromQL recipes

    # Tokens per second (averaged over 1m), split by provider
    sum by (provider) (rate(openclaw_model_tokens_total[1m]))

    # Spend (USD) over the last hour, by model
    sum by (model) (increase(openclaw_model_cost_usd_total[1h]))

    # 95th percentile model run duration
    histogram_quantile(
      0.95,
      sum by (le, provider, model) (rate(openclaw_run_duration_seconds_bucket[5m]))
    )

    # Queue wait time SLO (p95 under 2s)
    histogram_quantile(
      0.95,
      sum by (le, lane) (rate(openclaw_queue_lane_wait_seconds_bucket[5m]))
    ) < 2

    # Skill usage, split by bounded source
    sum by (skill, source) (increase(openclaw_skill_used_total[24h]))

    # Dropped Prometheus series (cardinality alarm)
    increase(openclaw_prometheus_series_dropped_total[15m]) > 0

Choosing between Prometheus and OpenTelemetry export
OpenClaw supports both surfaces independently. You can run either, both, or neither.

Prometheus (this plugin):
- Pull model: Prometheus scrapes `/api/diagnostics/prometheus`.
- No external collector required.
- Authenticated through normal Gateway auth.
- Surface is metrics only (no traces or logs).
- Best for stacks already standardized on Prometheus + Grafana.

OpenTelemetry export:
- Push model: OpenClaw sends OTLP/HTTP to a collector or OTLP-compatible backend.
- Surface includes metrics, traces, and logs.
- Bridges to Prometheus through an OpenTelemetry Collector (`prometheus` or `prometheusremotewrite` exporter) when you need both.
- See OpenTelemetry export for the full catalog.
Troubleshooting
Empty response body
- Check `diagnostics.enabled: true` in config.
- Confirm the plugin is enabled and loaded with `openclaw plugins list --enabled`.
- Generate some traffic; counters and histograms only emit lines after at least one event.
401 / unauthorized
The endpoint requires the Gateway operator scope (auth: "gateway" with gatewayRuntimeScopeSurface: "trusted-operator"). Use the same token or password Prometheus uses for any other Gateway operator route. There is no public unauthenticated mode.
`openclaw_prometheus_series_dropped_total` is climbing
A new attribute is exceeding the 2048-series cap. Inspect recent metrics for an unexpectedly high-cardinality label and fix it at the source. The exporter intentionally drops new series instead of silently rewriting labels.
Prometheus shows stale series after a restart
The plugin keeps state in memory only. After a Gateway restart, counters reset to zero and gauges restart at their next reported value. Use PromQL rate() and increase() to handle resets cleanly.
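
The reason `rate()` and `increase()` cope with restarts is that Prometheus treats any decrease in a counter as a reset to zero. The sketch below shows that idea on a plain list of samples. It is a simplification: real PromQL also extrapolates to the edges of the range window, which is omitted here, and `increase_with_resets` is a hypothetical helper name.

```python
def increase_with_resets(samples: list[float]) -> float:
    """Sum the growth between consecutive samples, treating any drop as a reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current < previous:
            # Counter restarted from zero, so the whole current value is new growth.
            total += current
        else:
            total += current - previous
    return total


# Counter goes 10 -> 15, the Gateway restarts, then it goes 2 -> 6.
print(increase_with_resets([10, 15, 2, 6]))  # 5 + 2 + 4 = 11.0
```

A naive `last - first` on the same samples would give -4, which is why dashboards should read counters through `rate()` or `increase()` and not through raw differences.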
Related
- Diagnostics export: local diagnostics zip for support bundles
- Health and readiness: `/healthz` and `/readyz` probes
- Logging: file-based logging
- OpenTelemetry export: OTLP push for traces, metrics, and logs