helexa/crates/cortex-gateway/tests/metrics.rs
rob thijssen 6a36d15ef1
feat(gateway): per-request token metrics — TTFT and tok/s (#21)
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.
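
The bounded tail described above can be sketched as follows. This is only an illustration under stated assumptions: `BoundedTail` and its methods are hypothetical names, not the types used by `TokenMetricsStream`, and the real inspector may trim its buffer differently.

```rust
/// Hypothetical sketch: keep only the last `cap` bytes of everything pushed.
struct BoundedTail {
    buf: Vec<u8>,
    cap: usize,
}

impl BoundedTail {
    fn new(cap: usize) -> Self {
        Self { buf: Vec::with_capacity(cap), cap }
    }

    /// Record a copy of a forwarded chunk; the chunk itself is not modified.
    fn push(&mut self, chunk: &[u8]) {
        if chunk.len() >= self.cap {
            // The chunk alone fills the tail: keep only its last `cap` bytes.
            self.buf.clear();
            self.buf.extend_from_slice(&chunk[chunk.len() - self.cap..]);
            return;
        }
        // Drop just enough of the oldest bytes to make room for the chunk.
        let overflow = (self.buf.len() + chunk.len()).saturating_sub(self.cap);
        if overflow > 0 {
            self.buf.drain(..overflow);
        }
        self.buf.extend_from_slice(chunk);
    }

    /// Lossy UTF-8 view: trimming may cut a multi-byte character in half.
    fn as_text(&self) -> String {
        String::from_utf8_lossy(&self.buf).into_owned()
    }
}
```

With a 64 KiB cap, memory stays constant however long the stream runs, and the final usage object is still in the tail because it arrives last.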

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)
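
The tok/s rule above might look like the sketch below. The function name, its parameters, and the choice to treat a zero-length decode window as the single-chunk case are assumptions made for illustration, not the gateway's actual code.

```rust
use std::time::Duration;

/// Hypothetical sketch: completion tokens divided by the decode window.
/// `first_chunk` and `last_chunk` are offsets from the start of the request.
fn tokens_per_second(
    completion_tokens: u64,
    first_chunk: Duration,
    last_chunk: Duration,
    total: Duration,
) -> Option<f64> {
    // Decode window: time between the first and the last body chunk.
    let decode = last_chunk.saturating_sub(first_chunk);
    // A single-chunk body has no decode window, so fall back to the
    // total request duration (assumed here to be signalled by zero).
    let window = if decode.is_zero() { total } else { decode };
    if completion_tokens == 0 || window.is_zero() {
        return None; // nothing meaningful to record
    }
    Some(completion_tokens as f64 / window.as_secs_f64())
}
```

For example, 42 completion tokens with chunks at 100 ms and 300 ms give 42 / 0.2 s, roughly 210 tok/s. The same 42 tokens in a single-chunk body that took 500 ms in total give about 84 tok/s.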

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).
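
A minimal sketch of quoted-needle matching, assuming the extractor scans the accumulated tail text for `"key": <integer>` pairs. `last_u64_field` is a hypothetical helper, not the function in this commit.

```rust
/// Hypothetical sketch: return the last `"key": <unsigned integer>` in `text`.
fn last_u64_field(text: &str, key: &str) -> Option<u64> {
    // Including both quotes in the needle means `"completion_tokens"` cannot
    // match inside `"completion_tokens_details"`.
    let needle = format!("\"{key}\"");
    let mut result = None;
    let mut from = 0;
    while let Some(pos) = text[from..].find(&needle) {
        let after = from + pos + needle.len();
        let rest = text[after..].trim_start();
        if let Some(rest) = rest.strip_prefix(':') {
            let rest = rest.trim_start();
            let end = rest
                .find(|c: char| !c.is_ascii_digit())
                .unwrap_or(rest.len());
            // Non-numeric values (objects, strings) fail to parse and are skipped.
            if let Ok(value) = rest[..end].parse::<u64>() {
                result = Some(value); // later matches overwrite earlier ones
            }
        }
        from = after;
    }
    result
}
```

Because the scan runs over the buffered tail rather than individual chunks, a usage object split across two network chunks is still seen whole.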

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00

mod common;

use serde_json::json;
use std::sync::OnceLock;

/// The metrics recorder is a process-wide global; both tests in this
/// binary run against one shared install. Assertions must therefore be
/// order-independent (presence of names / monotonic counters, not
/// "empty before").
fn recorder() -> &'static metrics_exporter_prometheus::PrometheusHandle {
    static HANDLE: OnceLock<metrics_exporter_prometheus::PrometheusHandle> = OnceLock::new();
    HANDLE.get_or_init(|| {
        cortex_gateway::metrics::install_test_recorder().expect("recorder should install")
    })
}

#[tokio::test]
async fn test_metrics_emitted_after_proxy() {
    let handle = recorder();
    let mock_url = common::spawn_mock_neuron().await;
    let gw_url = common::spawn_gateway(&mock_url).await;

    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{gw_url}/v1/chat/completions"))
        .header("content-type", "application/json")
        .json(&json!({
            "model": "test-model",
            "messages": [{"role": "user", "content": "Hi"}]
        }))
        .send()
        .await
        .expect("request should succeed");
    assert_eq!(resp.status(), 200);
    let _body: serde_json::Value = resp.json().await.unwrap();

    let after = handle.render();
    assert!(
        after.contains("cortex_requests_total"),
        "cortex_requests_total should be present after a request.\nMetrics:\n{after}"
    );
    assert!(
        after.contains("cortex_request_duration_seconds"),
        "cortex_request_duration_seconds should be present.\nMetrics:\n{after}"
    );
    assert!(
        !after.contains("cortex_request_errors_total"),
        "no errors expected for a successful request"
    );
}

#[tokio::test]
async fn test_token_metrics_emitted_for_streamed_request() {
    // #21: a streamed chat completion with a final usage chunk must
    // produce TTFT and tok/s histograms plus prompt/completion token
    // counters, labelled with model and node. The recorder is global
    // per process and shared with test_metrics_emitted_after_proxy
    // through recorder(), whichever test initialises it first. Both
    // tests render from that one recorder, so assert on metric names
    // and lower-bound counter values rather than exact totals.
    let handle = recorder();
    let mock_url = common::spawn_streaming_mock_neuron_with_usage(
        5,
        std::time::Duration::from_millis(40),
        225,
        42,
    )
    .await;
    let gw_url = common::spawn_gateway(&mock_url).await;

    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{gw_url}/v1/chat/completions"))
        .header("content-type", "application/json")
        .json(&json!({
            "model": "test-model",
            "messages": [{"role": "user", "content": "Hi"}],
            "stream": true
        }))
        .send()
        .await
        .expect("request should succeed");
    assert_eq!(resp.status(), 200);
    let body = resp.text().await.expect("stream should complete");
    assert!(body.contains("[DONE]"));

    let rendered = handle.render();
    for needle in [
        "cortex_time_to_first_token_seconds",
        "cortex_tokens_per_second",
    ] {
        assert!(
            rendered.contains(needle),
            "{needle} should be present.\nMetrics:\n{rendered}"
        );
    }

    // The recorder is shared with the sibling test (same model/node
    // labels), so counters are lower bounds, not exact values: this
    // request contributed prompt=225 / completion=42.
    let counter_value = |name: &str| -> u64 {
        rendered
            .lines()
            .find(|l| l.starts_with(name) && l.contains(r#"model="test-model""#))
            .and_then(|l| l.rsplit(' ').next())
            .and_then(|v| v.parse().ok())
            .unwrap_or_else(|| panic!("{name} should be present.\nMetrics:\n{rendered}"))
    };
    assert!(
        counter_value("cortex_prompt_tokens_total") >= 225,
        "prompt token counter should include this request's 225.\nMetrics:\n{rendered}"
    );
    assert!(
        counter_value("cortex_completion_tokens_total") >= 42,
        "completion token counter should include this request's 42.\nMetrics:\n{rendered}"
    );
}