feat(gateway): surface mid-prewarm models as Loading on /v1/models

The poller now fetches /health alongside /models on each neuron and stashes the activation snapshot on NodeState. The /v1/models handler gains a Pass 3 that synthesises Loading locations from each neuron's activation.in_progress and activation.pending lists, so a catalogued model that's mid-prewarm surfaces as `status: "loading"` rather than appearing absent (loaded=false, locations=[]).

Without this, a client polling /v1/models during a beast restart sees Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then reappear. Now it stays visible the whole time with a clear status.

Adds ModelStatus::Loading to cortex-core. The router's per-node priority loop gets an explicit (no-op) arm: Loading models aren't routable yet, and falling through to the catalogue cold-load path is the existing race. That is no worse than before, but it is tagged as a known follow-up needing neuron-side in-flight tracking on /models/load.

New test_poller_captures_activation_from_health exercises the full round-trip: mock neuron with empty /models but a pre_warming /health → poller writes node.activation. Common test helpers gain spawn_mock_neuron_with_models_and_health and default_health_response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:26:12 +03:00
parent 800498f530
commit b9e7a76a7a
7 changed files with 211 additions and 2 deletions
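Note: the "Pass 3" step described in the commit message can be sketched roughly as below. This is a minimal std-only illustration, not the handler code from this commit. `ActivationSnapshot`, `Status`, `Location`, and `add_loading_locations` are hypothetical stand-ins; only the `in_progress` and `pending` list names come from the message. The sketch also skips the catalogue check and assumes a node that already has a location for a model should not get a second, synthetic one.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the activation snapshot taken from `/health`.
/// Only the two lists named in the commit message are modelled.
#[derive(Debug, Clone, Default)]
struct ActivationSnapshot {
    in_progress: Vec<String>,
    pending: Vec<String>,
}

/// Hypothetical, reduced status enum for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Loaded,
    Loading,
}

/// Hypothetical location record: which node holds (or is loading) a model.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Location {
    node: String,
    status: Status,
}

/// Adds a synthetic `Loading` location for every model the node reports as
/// in progress or pending, unless that node already has a location for it.
/// A node with no snapshot yet (`None`) contributes nothing.
fn add_loading_locations(
    locations: &mut BTreeMap<String, Vec<Location>>,
    node: &str,
    activation: Option<&ActivationSnapshot>,
) {
    let Some(act) = activation else { return };
    for model in act.in_progress.iter().chain(act.pending.iter()) {
        let entry = locations.entry(model.clone()).or_default();
        // Assumption: an existing location for this node takes precedence.
        if entry.iter().any(|loc| loc.node == node) {
            continue;
        }
        entry.push(Location {
            node: node.to_string(),
            status: Status::Loading,
        });
    }
}
```

Under these assumptions, a model that appears only in a neuron's `pending` list ends up with one `Loading` location instead of an empty location list.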
--- a/crates/cortex-core/src/node.rs
+++ b/crates/cortex-core/src/node.rs
@@ -1,4 +1,4 @@
-use crate::discovery::DiscoveryResponse;
+use crate::discovery::{ActivationStatus, DiscoveryResponse};
 use chrono::{DateTime, Utc};
 use serde::{Deserialize, Serialize};
 use std::collections::HashMap;
@@ -20,6 +20,12 @@ pub struct NodeState {
    /// successful poll. Used by the router and `/v1/models` to do
    /// catalogue × topology feasibility checks.
    pub discovery: Option<DiscoveryResponse>,
+    /// Last-seen pre-warm progress from this neuron's `/health`
+    /// endpoint. `None` until the first /health poll succeeds. The
+    /// `/v1/models` handler reads `in_progress` + `pending` from here
+    /// to synthesize `Loading` locations so clients see a catalogued
+    /// model that's mid-prewarm as "loading", not "missing".
+    pub activation: Option<ActivationStatus>,
 }

 /// A model registered on a node, with its runtime status.
@@ -34,12 +40,21 @@ pub struct ModelEntry {
 }

 /// Model lifecycle status.
+///
+/// `Loading` is a gateway-side synthetic status: neurons never emit it
+/// on `/models` (that endpoint only knows about already-loaded handles).
+/// The gateway populates it from a neuron's `/health` activation
+/// snapshot so the unified `/v1/models` can distinguish "model is
+/// catalogued but no one has it" from "model is materialising on
+/// neuron N right now". Other status values are reported verbatim by
+/// neurons.
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
 #[serde(rename_all = "lowercase")]
 pub enum ModelStatus {
    Loaded,
    Unloaded,
    Reloading,
+    Loading,
 }

 /// Unified model entry as exposed by the gateway's `/v1/models` endpoint.