fix(gateway): full observability + stop leaking upstream bodies
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 42s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 4m53s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Comprehensive sweep across cortex-gateway's request handling. Every
failure path now emits exactly one structured warn (or error) event on
the cortex side with the wire-level detail an operator needs; the API
response carries only a generic message plus, where useful, the
upstream status code.

proxy.rs::forward_request:
- warn on network failure (network error, target URL).
- warn on upstream non-2xx (status, target URL). Streaming body still
  passes through to the client; we just can't snippet without breaking
  the stream.
- warn on response-build failure.
- ProxyError::into_response no longer interpolates the inner error
  into the API body; it returns a generic "upstream request failed" /
  "failed to build response" instead.

handlers.rs::chat_completions, handlers.rs::completions:
- warn on missing model field, with handler= label.
- warn on route resolve failure with model + error chain. The
  user-facing 404 keeps the RouteError Display string (which is short,
  informative, and contains no internal detail beyond the model id and
  config'd node names).

handlers.rs::anthropic_messages:
- warn on invalid Anthropic body, on translated-OpenAI serialise
  failure (which is internal), on route resolve, on upstream network
  error, on upstream non-2xx (with 512-char body snippet for parse
  errors), on upstream body read, on response parse.
- All warns share consistent field shape: handler, model, node, url,
  status / error / body as applicable.
- API response messages are now uniformly generic.
- Adds an info-level "proxying request" log on the non-streaming path
  so successful proxies are also visible.

handlers.rs::proxy_with_metrics:
- still calls e.into_response() but proxy::forward_request already
  warn'd at the wire layer, so no double-log here.

Tests:
- All 32 existing unit tests + 22 gateway integration tests + 4 new
  router tests pass.
- Tests that asserted on the "no healthy nodes" / "not found" strings
  still match because RouteError messages are preserved in the 404
  user-facing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
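
The commit message mentions a 512-char body snippet on the anthropic_messages non-2xx path, but the handlers.rs change is not shown on this page. As a rough sketch only, one way such a truncation could be written with the standard library alone is below. The name `body_snippet` and its signature are hypothetical and not taken from the commit; the point is that the cut lands on a character boundary, so a multi-byte UTF-8 body cannot cause a slicing panic.

```rust
/// Hypothetical helper (not from the commit): return at most `max_chars`
/// characters of `body`, always cutting on a UTF-8 character boundary.
fn body_snippet(body: &str, max_chars: usize) -> &str {
    // `nth(max_chars)` yields the byte offset of the first character that
    // should be dropped; if the body is shorter, keep all of it.
    match body.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &body[..byte_idx],
        None => body,
    }
}

fn main() {
    // A body made of two-byte characters, longer than the snippet limit.
    let upstream_body = "é".repeat(600);
    let snippet = body_snippet(&upstream_body, 512);

    // In the gateway this would be a field on a structured warn event;
    // stderr stands in for the log here. The client never sees it.
    eprintln!("upstream returned non-2xx, body snippet = {snippet}");
    println!("snippet length in chars: {}", snippet.chars().count());
}
```

Counting characters rather than bytes matches the "512-char" wording; a byte-based limit would need an extra boundary check before slicing.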
@@ -12,6 +12,13 @@ use axum::response::{IntoResponse, Response};
 use reqwest::Client;
 
 /// Proxy a request body to the resolved backend node and stream the response.
+///
+/// Logging contract: every call emits exactly one structured event at
+/// info / warn level for operator visibility, regardless of outcome.
+/// Network-level failures and non-2xx upstream statuses are warn'd here
+/// (closest to the wire); the user-facing response carries only the
+/// status code and a generic message — implementation detail (body,
+/// error chain) lives in the log, never in the API surface.
 pub async fn forward_request(
     client: &Client,
     route: &RouteDecision,
@@ -37,10 +44,33 @@ pub async fn forward_request(
         req_builder = req_builder.header(key, value);
     }
 
-    let upstream_resp = req_builder.send().await.map_err(ProxyError::Upstream)?;
+    let upstream_resp = match req_builder.send().await {
+        Ok(r) => r,
+        Err(e) => {
+            tracing::warn!(
+                node = %route.node_name,
+                url = %url,
+                error = %e,
+                "proxy: upstream request failed (network)"
+            );
+            return Err(ProxyError::Upstream(e));
+        }
+    };
 
-    let status =
-        StatusCode::from_u16(upstream_resp.status().as_u16()).unwrap_or(StatusCode::BAD_GATEWAY);
+    let upstream_status = upstream_resp.status();
+    if !upstream_status.is_success() {
+        // Streaming body — can't snippet without breaking the stream
+        // pass-through. Log status + URL; the client still gets the
+        // upstream status, just without the leaked body.
+        tracing::warn!(
+            node = %route.node_name,
+            url = %url,
+            status = upstream_status.as_u16(),
+            "proxy: upstream returned non-2xx"
+        );
+    }
+
+    let status = StatusCode::from_u16(upstream_status.as_u16()).unwrap_or(StatusCode::BAD_GATEWAY);
 
     let resp_headers = upstream_resp.headers().clone();
     let stream = upstream_resp.bytes_stream();
@@ -52,28 +82,37 @@ pub async fn forward_request(
         response = response.header(key, value);
     }
 
-    response
-        .body(body)
-        .map_err(|e| ProxyError::ResponseBuild(e.to_string()))
+    response.body(body).map_err(|e| {
+        tracing::warn!(
+            node = %route.node_name,
+            url = %url,
+            error = %e,
+            "proxy: failed to build response"
+        );
+        ProxyError::ResponseBuild(e.to_string())
+    })
 }
 
 #[derive(Debug, thiserror::Error)]
 pub enum ProxyError {
-    #[error("upstream request failed: {0}")]
+    #[error("upstream request failed")]
     Upstream(reqwest::Error),
-    #[error("failed to build response: {0}")]
+    #[error("failed to build response")]
     ResponseBuild(String),
 }
 
 impl IntoResponse for ProxyError {
     fn into_response(self) -> Response {
-        let status = match &self {
-            ProxyError::Upstream(_) => StatusCode::BAD_GATEWAY,
-            ProxyError::ResponseBuild(_) => StatusCode::INTERNAL_SERVER_ERROR,
+        let (status, message) = match &self {
+            ProxyError::Upstream(_) => (StatusCode::BAD_GATEWAY, "upstream request failed"),
+            ProxyError::ResponseBuild(_) => (
+                StatusCode::INTERNAL_SERVER_ERROR,
+                "failed to build response",
+            ),
         };
         let body = serde_json::json!({
             "error": {
-                "message": self.to_string(),
+                "message": message,
                 "type": "proxy_error",
             }
         });
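
The effect of the last hunk, where the error keeps its inner detail for the log while the client only ever sees a fixed string, can be illustrated with a dependency-free sketch. Everything here is a stand-in and an assumption: `String` replaces `reqwest::Error`, plain `u16` codes replace `StatusCode`, hand-built JSON replaces `serde_json::json!`, and `client_view`, `to_client_json`, and `detail` are hypothetical names that do not appear in the commit.

```rust
/// Stand-in for the gateway's ProxyError; the payload is the internal detail.
#[derive(Debug)]
enum ProxyError {
    Upstream(String),
    ResponseBuild(String),
}

impl ProxyError {
    /// Status code plus a fixed, client-safe message. The inner detail is
    /// deliberately ignored here so it cannot reach the API body.
    fn client_view(&self) -> (u16, &'static str) {
        match self {
            ProxyError::Upstream(_) => (502, "upstream request failed"),
            ProxyError::ResponseBuild(_) => (500, "failed to build response"),
        }
    }

    /// Client-facing JSON body, built only from the fixed message.
    fn to_client_json(&self) -> String {
        let (_, message) = self.client_view();
        format!(r#"{{"error":{{"message":"{message}","type":"proxy_error"}}}}"#)
    }

    /// Internal detail, intended for the operator log only.
    fn detail(&self) -> &str {
        match self {
            ProxyError::Upstream(d) | ProxyError::ResponseBuild(d) => d,
        }
    }
}

fn main() {
    let err = ProxyError::Upstream("connect to http://10.0.0.7:8080 refused".to_string());

    // Log side: full detail (stderr stands in for a structured warn event).
    eprintln!("proxy: upstream request failed (network): {}", err.detail());

    // API side: status and generic message only.
    let (status, _) = err.client_view();
    println!("{status} {}", err.to_client_json());
}
```

Keeping the client message as a `&'static str` chosen by variant makes it hard to interpolate request-specific detail into the response by accident, which is the property the diff is after.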