All checks were successful
CI / Format (push) Successful in 44s
CI / CUDA type-check (push) Successful in 1m31s
CI / Clippy (push) Successful in 2m13s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Defense-in-depth for the agent0 NoFeasibleNeuron storm (root cause fixed in neuron). Two cortex resilience gaps this incident exposed: 1. Brittle health flip: the poller marked a node unhealthy on a SINGLE missed /models poll, instantly yanking the node and all its models from routing. A busy neuron briefly slow to answer shouldn't be declared dead. Now debounced: NodeState.consecutive_poll_failures must reach POLL_FAILURE_THRESHOLD (3) before the node flips unhealthy (~20s at the 10s poll interval); any successful poll resets it. A never-healthy node stays unhealthy (the counter only protects an already-healthy node from blips). 2. Transient surfaced as permanent: when a catalogued model's only feasible neuron is momentarily unhealthy, the router returned 404 NoFeasibleNeuron — which litellm/clients treat as non-retryable, so agent0 hard-failed. pick_feasible_neuron now distinguishes "a feasible node exists but is unhealthy right now" → new RouteError::FeasibleNodeUnhealthy (503 + Retry-After: 3, retryable) from "no node could ever satisfy the topology" → 404 NoFeasibleNeuron (permanent). Mirrors the beast case exactly: healthy 1-GPU nodes + an unhealthy 2-GPU node → retry, don't fail. Tests: poller test updated to assert debounce (1 miss keeps healthy, 3 flip); new feasibility_routing tests cover transient-503 vs permanent-404. Local fmt/clippy/test green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
156 lines
7.1 KiB
Rust
156 lines
7.1 KiB
Rust
use crate::discovery::{ActivationStatus, DiscoveryResponse, ModelLoad};
|
||
use crate::harness::{ModelCost, ModelLimit};
|
||
use chrono::{DateTime, Utc};
|
||
use serde::{Deserialize, Serialize};
|
||
use std::collections::HashMap;
|
||
|
||
/// Runtime state of a single neuron in the fleet.
|
||
#[derive(Debug, Clone)]
|
||
pub struct NodeState {
|
||
pub name: String,
|
||
/// Base URL of the neuron daemon (e.g. "http://beast.internal:13131").
|
||
pub endpoint: String,
|
||
pub healthy: bool,
|
||
pub models: HashMap<String, ModelEntry>,
|
||
/// Number of load/unload cycles since last process restart.
|
||
pub lifecycle_cycles: u32,
|
||
pub last_poll: Option<DateTime<Utc>>,
|
||
/// Result of the most recent successful `GET /discovery` against
|
||
/// this neuron. Cached forever once obtained — device topology is
|
||
/// invariant for a given neuron process. `None` until the first
|
||
/// successful poll. Used by the router and `/v1/models` to do
|
||
/// catalogue × topology feasibility checks.
|
||
pub discovery: Option<DiscoveryResponse>,
|
||
/// Last-seen pre-warm progress from this neuron's `/health`
|
||
/// endpoint. `None` until the first /health poll succeeds. The
|
||
/// `/v1/models` handler reads `in_progress` + `pending` from here
|
||
/// to synthesize `Loading` locations so clients see a catalogued
|
||
/// model that's mid-prewarm as "loading", not "missing".
|
||
pub activation: Option<ActivationStatus>,
|
||
/// Last-seen per-model admission load from this neuron's `/health`
|
||
/// (#53), keyed by model id. The router (#55) reads it to pick the
|
||
/// least-busy replica when a model is loaded on more than one neuron.
|
||
/// Empty until the first /health poll reports load.
|
||
pub model_load: HashMap<String, ModelLoad>,
|
||
/// Consecutive failed `/models` polls. The poller marks a node
|
||
/// unhealthy only once this crosses a threshold, so a single transient
|
||
/// miss (e.g. a neuron momentarily slow to answer while busy) doesn't
|
||
/// yank the node — and all its models — out of routing. Reset to 0 on
|
||
/// any successful poll.
|
||
pub consecutive_poll_failures: u32,
|
||
}
|
||
|
||
/// A model registered on a node, with its runtime status.
|
||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||
pub struct ModelEntry {
|
||
pub id: String,
|
||
pub status: ModelStatus,
|
||
/// When this model was last used (for LRU eviction).
|
||
pub last_accessed: Option<DateTime<Utc>>,
|
||
/// Estimated VRAM usage in MB when loaded.
|
||
pub vram_estimate_mb: Option<u64>,
|
||
/// Modalities the loaded model advertises (e.g. `["text", "vision"]`),
|
||
/// copied verbatim from the neuron's `ModelInfo.capabilities` at poll
|
||
/// time. Empty when the neuron reports none. `#[serde(default)]` keeps
|
||
/// older persisted/serialised entries deserialisable.
|
||
#[serde(default)]
|
||
pub capabilities: Vec<String>,
|
||
/// Runtime-detected capability flags from the neuron's `/models`
|
||
/// response (`ModelInfo`). `false` when the neuron predates these
|
||
/// fields or hasn't reported them yet.
|
||
#[serde(default)]
|
||
pub tool_call: bool,
|
||
#[serde(default)]
|
||
pub reasoning: bool,
|
||
/// Self-derived token budget the neuron computed for this loaded
|
||
/// model (#67), copied from `ModelInfo.limit` at poll time. `None`
|
||
/// when the neuron doesn't compute one (arch without a context
|
||
/// profile, or derivation disabled). This is the authoritative
|
||
/// source the gateway advertises — operator-declared catalogue
|
||
/// limits are no longer consulted.
|
||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||
pub limit: Option<ModelLimit>,
|
||
}
|
||
|
||
/// Model lifecycle status.
|
||
///
|
||
/// `Loading` is a gateway-side synthetic status: neurons never emit it
|
||
/// on `/models` (that endpoint only knows about already-loaded handles).
|
||
/// The gateway populates it from a neuron's `/health` activation
|
||
/// snapshot so the unified `/v1/models` can distinguish "model is
|
||
/// catalogued but no one has it" from "model is materialising on
|
||
/// neuron N right now". Other status values are reported verbatim by
|
||
/// neurons.
|
||
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
|
||
#[serde(rename_all = "lowercase")]
|
||
pub enum ModelStatus {
|
||
Loaded,
|
||
Unloaded,
|
||
Reloading,
|
||
Loading,
|
||
/// Reported by neuron while a poisoned model auto-recovers via
|
||
/// unload→reload (#17/#20). Temporarily unservable but NOT
|
||
/// evicted: the gateway holds the route, answers with a transient
|
||
/// retry error instead of 404, and must not race a second
|
||
/// placement elsewhere.
|
||
Recovering,
|
||
}
|
||
|
||
/// Unified model entry as exposed by the gateway's `/v1/models` endpoint.
|
||
///
|
||
/// The first four fields (`id`, `object`, `created`, `owned_by`) match
|
||
/// OpenAI's `/v1/models` shape verbatim, so existing OpenAI-aware
|
||
/// tooling deserialises this without custom code. The remaining fields
|
||
/// are helexa-specific extensions — OpenAI clients ignore unknown
|
||
/// fields and other consumers can read them for placement / debugging.
|
||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||
pub struct CortexModelEntry {
|
||
pub id: String,
|
||
/// Always `"model"` per OpenAI's contract.
|
||
pub object: String,
|
||
/// Unix-second timestamp; cortex stamps this at response time.
|
||
pub created: u64,
|
||
/// OpenAI's "publisher" field — `"helexa"` for everything we serve.
|
||
pub owned_by: String,
|
||
/// True if any neuron currently has this model loaded. False for
|
||
/// catalogue entries that are feasible but not yet loaded.
|
||
pub loaded: bool,
|
||
/// Neurons whose discovered topology can satisfy this model's
|
||
/// catalogue placement constraints. Empty for models that are
|
||
/// loaded somewhere but not present in the catalogue (cortex has
|
||
/// no feasibility opinion on those).
|
||
pub feasible_on: Vec<String>,
|
||
/// Where this model is actually loaded right now. Subset of (or
|
||
/// disjoint from) `feasible_on` depending on whether the catalogue
|
||
/// covers this model.
|
||
pub locations: Vec<ModelLocation>,
|
||
/// Union of the modalities advertised by every neuron that has this
|
||
/// model loaded (e.g. `["text", "vision"]`). Empty for catalogue-only
|
||
/// entries with no loaded location — filled from catalogue profile
|
||
/// capabilities when available, then unioned with runtime-detected
|
||
/// values from loaded neurons.
|
||
#[serde(default)]
|
||
pub capabilities: Vec<String>,
|
||
// ── Enrichment (issue #62) ────────────────────────────────
|
||
/// Per-model token budget from the catalogue profile or discovered
|
||
/// at load time. `None` when neither source provides it.
|
||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||
pub limit: Option<ModelLimit>,
|
||
/// Operator-set pricing in USD per 1M tokens (0.0 = free/self-hosted).
|
||
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||
pub cost: Option<ModelCost>,
|
||
/// `true` when any neuron reports this model supports tool calls.
|
||
#[serde(default)]
|
||
pub tool_call: bool,
|
||
/// `true` when any neuron reports this model supports reasoning tokens.
|
||
#[serde(default)]
|
||
pub reasoning: bool,
|
||
}
|
||
|
||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||
pub struct ModelLocation {
|
||
pub node: String,
|
||
pub status: ModelStatus,
|
||
pub vram_estimate_mb: Option<u64>,
|
||
}
|