Stage 1 accounting (#51): capture real per-request usage and feed it to the spend ledger + per-principal metrics. Establishes the reserve→settle lifecycle that budget enforcement (#52) will tighten.

- cortex-gateway::metering: ReservationGuard makes reservation leaks impossible. settle() records actual spend and releases the remainder; dropping an un-settled guard releases the whole reservation, so any early return, error, or dropped stream resolves it. UsageSink is the completion hook; principal_from_headers reconstructs the principal from the middleware-stamped headers (uniform across all proxy paths, no handler-signature churn); record_spend emits per-principal counters.
- proxy::TokenMetrics gains an optional usage_sink, invoked exactly once in finish() with the observed (prompt, completion). It is restructured so it always runs (even when no body/usage arrived → settle 0 → release), while preserving the existing per-model metric emissions unchanged.
- All proxy paths metered: chat/completions/responses via proxy_with_metrics (reserve 0 → forward_request → settle in finish); Anthropic non-streaming settles from the buffered body; Anthropic streaming (anthropic_sse) now scans the upstream frames for the usage object (#48), where it captured none before, and settles at pump end.
- This phase reserves 0 tokens (metering only, no enforcement); #52 flips the reserved amount to prompt+max_output and surfaces BudgetError. The settle/release plumbing is identical, so that change is localized.
- New Prometheus counters: cortex_spend_tokens_total (+ prompt/completion splits), labelled by account/key.

2 integration tests: cumulative per-key spend after N requests with reservations settled to zero outstanding; anonymous requests record no spend.

Local fmt/clippy/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
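The reserve→settle lifecycle described in the commit message can be sketched with a drop guard. This is a minimal illustration using only the standard library, not the actual `cortex-gateway::metering` code: the `Ledger` struct, its fields, the `lock` helper, and the `reserve`/`settle` signatures are all assumptions made for the example.

```rust
use std::sync::{Arc, Mutex, MutexGuard};

/// Hypothetical in-memory ledger; the real spend ledger is not shown in the source.
#[derive(Debug, Default)]
pub struct Ledger {
    /// Tokens currently held by outstanding reservations.
    pub reserved: u64,
    /// Tokens recorded as actually used.
    pub spent: u64,
}

/// Lock the ledger, recovering the data even if another thread panicked
/// while holding the lock (so `Drop` never panics on a poisoned mutex).
fn lock(ledger: &Mutex<Ledger>) -> MutexGuard<'_, Ledger> {
    ledger.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Holds a reservation until it is settled or dropped.
pub struct ReservationGuard {
    ledger: Arc<Mutex<Ledger>>,
    reserved: u64,
    settled: bool,
}

impl ReservationGuard {
    /// Place a hold of `tokens` on the ledger (0 in a metering-only phase).
    pub fn reserve(ledger: Arc<Mutex<Ledger>>, tokens: u64) -> Self {
        lock(&ledger).reserved += tokens;
        Self { ledger, reserved: tokens, settled: false }
    }

    /// Record the actual spend and release the hold.
    /// Consumes the guard, so a reservation cannot be settled twice.
    pub fn settle(mut self, actual: u64) {
        self.settled = true;
        let mut ledger = lock(&self.ledger);
        ledger.spent += actual;
        ledger.reserved = ledger.reserved.saturating_sub(self.reserved);
    }
}

impl Drop for ReservationGuard {
    /// An un-settled guard releases its whole hold and records no spend,
    /// which covers early returns, errors, and dropped streams.
    fn drop(&mut self) {
        if !self.settled {
            let mut ledger = lock(&self.ledger);
            ledger.reserved = ledger.reserved.saturating_sub(self.reserved);
        }
    }
}
```

In this sketch, a handler would call `ReservationGuard::reserve` before forwarding the request and `settle` once usage is known; any path that exits without settling leaves the ledger with zero outstanding reservation for that request.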
79 lines
2.7 KiB
Rust
//! Prometheus metrics exporter.
//!
//! Runs on a separate port from the main API, exposing `/metrics`
//! in Prometheus text format.

use anyhow::Result;
use metrics_exporter_prometheus::PrometheusBuilder;
use std::net::SocketAddr;

/// Install the Prometheus metrics recorder and start its HTTP listener
/// on `listen` (a socket address such as `0.0.0.0:9090`).
/// The `/metrics` endpoint is served by the exporter's built-in HTTP server.
pub fn install(listen: &str) -> Result<()> {
    let addr: SocketAddr = listen.parse()?;

    PrometheusBuilder::new()
        .with_http_listener(addr)
        .install()
        .map_err(|e| anyhow::anyhow!("failed to install Prometheus exporter: {e}"))?;

    tracing::info!("prometheus metrics exporter on {addr}");
    describe_metrics();
    Ok(())
}

/// Install a recorder for testing (no HTTP listener). Returns a handle
/// that can render the current metrics as Prometheus text.
pub fn install_test_recorder() -> Result<metrics_exporter_prometheus::PrometheusHandle> {
    let handle = PrometheusBuilder::new()
        .install_recorder()
        .map_err(|e| anyhow::anyhow!("failed to install test recorder: {e}"))?;
    describe_metrics();
    Ok(handle)
}

fn describe_metrics() {
    metrics::describe_histogram!(
        "cortex_request_duration_seconds",
        "Total request latency in seconds"
    );
    metrics::describe_histogram!(
        "cortex_time_to_first_token_seconds",
        "Time to first token in seconds"
    );
    metrics::describe_histogram!(
        "cortex_tokens_per_second",
        "Generation throughput in tokens per second"
    );
    metrics::describe_counter!("cortex_requests_total", "Total number of proxied requests");
    metrics::describe_counter!(
        "cortex_prompt_tokens_total",
        "Total prompt tokens reported by upstream usage objects"
    );
    metrics::describe_counter!(
        "cortex_completion_tokens_total",
        "Total completion tokens reported by upstream usage objects"
    );
    metrics::describe_counter!(
        "cortex_request_errors_total",
        "Total number of failed proxy requests"
    );
    metrics::describe_counter!("cortex_evictions_total", "Total number of model evictions");
    metrics::describe_counter!(
        "cortex_cold_starts_total",
        "Total number of cold-start model loads"
    );
    metrics::describe_counter!(
        "cortex_spend_tokens_total",
        "Total metered tokens (prompt + completion) per principal, labelled by account/key (#51)"
    );
    metrics::describe_counter!(
        "cortex_spend_prompt_tokens_total",
        "Metered prompt tokens per principal, labelled by account/key (#51)"
    );
    metrics::describe_counter!(
        "cortex_spend_completion_tokens_total",
        "Metered completion tokens per principal, labelled by account/key (#51)"
    );
}