The deferred Phase 6b, and the unblock for the 7→8 milestone's benchmark work (#22): until cortex measures itself per request, nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector (TokenMetricsStream): chunks are forwarded verbatim, never buffered or re-serialised, while the inspector records arrival times and keeps a bounded (64 KiB) tail of the body text. At stream end (or client disconnect, via Drop) it extracts the final OpenAI usage object, present on the last SSE chunk and non-streaming JSON bodies alike, for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram): time to the first body chunk
- cortex_tokens_per_second (histogram): completion tokens over the decode window (first→last chunk); falls back to total request duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total (counters)

The extractor is pure and chunk-boundary-safe; quoted-needle matching keeps completion_tokens_details from shadowing completion_tokens, and the last usage object wins. Covers chat completions, completions, the Responses API, and the Anthropic streaming path (which currently proxies OpenAI SSE).

Tests: 4 extractor unit tests; integration test with a streaming mock emitting a stream_options-style final usage chunk, asserting both histograms and exact-or-greater counter values (the test recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
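The bounded tail and the quoted-needle extraction described above can be illustrated with a minimal sketch. It is not the code from this commit: the names `push_tail` and `last_usage_field` are hypothetical, and the real extractor may handle more cases (for example, other field names used by the Responses API). The sketch only shows why including both quotes in the needle keeps `completion_tokens_details` from matching, and why scanning the accumulated tail from the end makes the result independent of chunk boundaries.

```rust
/// Appends `chunk` to `tail`, then drops the oldest bytes so that at most
/// `cap` bytes are retained. Hypothetical helper, for illustration only.
pub fn push_tail(tail: &mut Vec<u8>, chunk: &[u8], cap: usize) {
    tail.extend_from_slice(chunk);
    if tail.len() > cap {
        let excess = tail.len() - cap;
        tail.drain(..excess);
    }
}

/// Returns the value of the last `"field": <unsigned integer>` pair in `tail`.
/// Hypothetical helper, for illustration only.
pub fn last_usage_field(tail: &str, field: &str) -> Option<u64> {
    // Both quotes are part of the needle, so `"completion_tokens"` cannot
    // match inside `"completion_tokens_details"`.
    let needle = format!("\"{field}\"");
    // Walk the matches from the end so that the last usage object wins.
    tail.rmatch_indices(needle.as_str()).find_map(|(pos, _)| {
        let rest = tail[pos + needle.len()..].trim_start();
        let rest = rest.strip_prefix(':')?.trim_start();
        let end = rest
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(rest.len());
        // An empty digit run fails to parse, and the search moves on to
        // the previous match.
        rest[..end].parse::<u64>().ok()
    })
}
```

Because the extractor only ever sees the accumulated tail, a field name split across two chunks is reassembled before matching. Converting the byte tail with `String::from_utf8_lossy` before calling `last_usage_field` avoids slicing the retained bytes on a non-UTF-8 boundary.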
122 lines
4.3 KiB
Rust
mod common;

use serde_json::json;
use std::sync::OnceLock;

/// The metrics recorder is a process-wide global; both tests in this
/// binary run against one shared install. Assertions must therefore be
/// order-independent (presence of names / monotonic counters, not
/// "empty before").
fn recorder() -> &'static metrics_exporter_prometheus::PrometheusHandle {
    static HANDLE: OnceLock<metrics_exporter_prometheus::PrometheusHandle> = OnceLock::new();
    HANDLE.get_or_init(|| {
        cortex_gateway::metrics::install_test_recorder().expect("recorder should install")
    })
}

#[tokio::test]
async fn test_metrics_emitted_after_proxy() {
    let handle = recorder();

    let mock_url = common::spawn_mock_neuron().await;
    let gw_url = common::spawn_gateway(&mock_url).await;

    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{gw_url}/v1/chat/completions"))
        .header("content-type", "application/json")
        .json(&json!({
            "model": "test-model",
            "messages": [{"role": "user", "content": "Hi"}]
        }))
        .send()
        .await
        .expect("request should succeed");
    assert_eq!(resp.status(), 200);
    let _body: serde_json::Value = resp.json().await.unwrap();

    let after = handle.render();

    assert!(
        after.contains("cortex_requests_total"),
        "cortex_requests_total should be present after a request.\nMetrics:\n{after}"
    );
    assert!(
        after.contains("cortex_request_duration_seconds"),
        "cortex_request_duration_seconds should be present.\nMetrics:\n{after}"
    );
    assert!(
        !after.contains("cortex_request_errors_total"),
        "no errors expected for a successful request"
    );
}

#[tokio::test]
async fn test_token_metrics_emitted_for_streamed_request() {
    // #21: a streamed chat completion with a final usage chunk must
    // produce TTFT + tok/s histograms and prompt/completion token
    // counters, labelled with model and node. The recorder is global
    // per process, and cargo builds one binary per integration-test
    // file, so this test shares its recorder with
    // test_metrics_emitted_after_proxy. Whichever test installs it
    // first, both render from the same recorder, so assert on metric
    // names and counter lower bounds rather than exact totals.
    let handle = recorder();

    let mock_url = common::spawn_streaming_mock_neuron_with_usage(
        5,
        std::time::Duration::from_millis(40),
        225, // prompt tokens reported in the final usage chunk
        42,  // completion tokens reported in the final usage chunk
    )
    .await;
    let gw_url = common::spawn_gateway(&mock_url).await;

    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{gw_url}/v1/chat/completions"))
        .header("content-type", "application/json")
        .json(&json!({
            "model": "test-model",
            "messages": [{"role": "user", "content": "Hi"}],
            "stream": true
        }))
        .send()
        .await
        .expect("request should succeed");
    assert_eq!(resp.status(), 200);
    let body = resp.text().await.expect("stream should complete");
    assert!(body.contains("[DONE]"));

    let rendered = handle.render();
    for needle in [
        "cortex_time_to_first_token_seconds",
        "cortex_tokens_per_second",
    ] {
        assert!(
            rendered.contains(needle),
            "{needle} should be present.\nMetrics:\n{rendered}"
        );
    }
    // The recorder is shared with the sibling test (same model/node
    // labels), so counters are lower bounds, not exact values: this
    // request contributed prompt=225 / completion=42.
    let counter_value = |name: &str| -> u64 {
        rendered
            .lines()
            .find(|l| l.starts_with(name) && l.contains(r#"model="test-model""#))
            .and_then(|l| l.rsplit(' ').next())
            .and_then(|v| v.parse().ok())
            .unwrap_or_else(|| panic!("{name} should be present.\nMetrics:\n{rendered}"))
    };
    assert!(
        counter_value("cortex_prompt_tokens_total") >= 225,
        "prompt token counter should include this request's 225.\nMetrics:\n{rendered}"
    );
    assert!(
        counter_value("cortex_completion_tokens_total") >= 42,
        "completion token counter should include this request's 42.\nMetrics:\n{rendered}"
    );
}
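
The lower-bound assertions above follow from the shared recorder. One possible tightening, sketched here under assumptions and not taken from this commit, is to snapshot the rendered output before the request and compare the counter's growth instead of its absolute value. The helper name `counter_sum` is hypothetical. Like the closure in the test, it assumes counter samples are rendered as unsigned integers. It also requires the metric name to be followed by `{` or a space, so a shorter name cannot match a longer one by prefix, and it sums every series carrying the given label pair.

```rust
/// Sums all samples of counter `name` whose label set contains `label_pair`
/// (for example `model="test-model"`) in Prometheus text-format output.
/// Hypothetical helper, for illustration only.
pub fn counter_sum(rendered: &str, name: &str, label_pair: &str) -> u64 {
    rendered
        .lines()
        // The name must be followed by '{' or ' ', so `foo` does not match
        // `foo_total`. `# HELP` / `# TYPE` lines never start with the name.
        .filter(|line| {
            line.strip_prefix(name)
                .map_or(false, |rest| rest.starts_with('{') || rest.starts_with(' '))
        })
        .filter(|line| line.contains(label_pair))
        // The sample value is the last space-separated token on the line.
        .filter_map(|line| line.rsplit(' ').next()?.parse::<u64>().ok())
        .sum()
}
```

With such a helper, the test could render once before sending the request and once after, then assert that the difference is at least 225 for the prompt counter and at least 42 for the completion counter. The comparison stays a lower bound, because the sibling test may record against the same labels between the two snapshots.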