feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load

Realises [project-unified-models-endpoint]: cortex now surfaces every model the operator has provisioned in the catalogue, transparently cold-loads on the first request, and routes the request once the load is done — without per-node configuration or client awareness of which neuron hosts what. cortex-core changes: - NodeState gains `discovery: Option<DiscoveryResponse>` — populated once per neuron on first successful poll, cached forever after (topology is invariant for a neuron process). - ModelProfile gains `is_feasible_on(neuron, devices)` with the pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests. - CortexModelEntry expanded with OpenAI-compatible (`id`, `object`, `created`, `owned_by`) plus helexa-specific extension fields (`loaded`, `feasible_on`, `locations`). cortex-gateway changes: - poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per neuron and caches on NodeState. - handlers.rs::list_models rewritten as union of (catalogue × topology feasibility) + (currently loaded somewhere). Catalogue-defined models surface even when not yet loaded. - router.rs::resolve gains priority 3 (catalogue cold-load): 1. loaded somewhere → route there 2. unloaded somewhere → route + lazy load via neuron 3. in catalogue → pick feasible neuron, POST /models/load, wait, route. Cache the new entry locally so subsequent requests skip the poll wait. 4. else 404 - pick_feasible_neuron prefers pinned_on neurons, falls back to any feasible one (stable by name). - profile_to_spec translates ModelProfile → ModelSpec, picking devices by VRAM floor and setting tensor_parallel = min_devices for multi- device profiles. - "already loaded" responses from neuron are tolerated (two concurrent requests racing the same cold-load is a benign outcome). models.example.toml rewritten to reflect the canonical helexa fleet (beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a working TP example (Qwen3.6-27B pinned on beast) plus single-GPU profiles for the smaller models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:39:04 +03:00
parent f72dee094f
commit 735945ee81
7 changed files with 528 additions and 54 deletions
--- a/crates/cortex-gateway/src/handlers.rs
+++ b/crates/cortex-gateway/src/handlers.rs
@@ -185,12 +185,62 @@ async fn anthropic_messages(
    }
 }

-/// `GET /v1/models` — aggregate models from all nodes.
+/// `GET /v1/models` — union of (catalogue × topology feasibility) and
+/// (currently loaded somewhere). The result is what the fleet *could*
+/// serve, not just what's already loaded — so OpenAI-compatible tools
+/// see every model the operator has provisioned, and cortex
+/// transparently cold-loads the first time one is requested.
 async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
+    use std::collections::HashMap;
+    let now = Utc::now().timestamp() as u64;
    let nodes = fleet.nodes.read().await;
-    let mut model_map: std::collections::HashMap<String, CortexModelEntry> =
-        std::collections::HashMap::new();
+    let catalogue = &fleet.catalogue;

+    let mut entries: HashMap<String, CortexModelEntry> = HashMap::new();
+
+    // Pass 1: catalogue × topology. For every catalogue profile, find
+    // healthy neurons whose discovered devices satisfy the profile.
+    // Catalogue-defined models surface here even if nothing has loaded
+    // them yet — that's the point of the unified endpoint.
+    for profile in &catalogue.models {
+        let mut feasible_on = Vec::new();
+        for node in nodes.values() {
+            if !node.healthy {
+                continue;
+            }
+            let Some(disc) = node.discovery.as_ref() else {
+                continue;
+            };
+            if profile.is_feasible_on(&node.name, &disc.devices) {
+                feasible_on.push(node.name.clone());
+            }
+        }
+        if feasible_on.is_empty() {
+            // The catalogue lists this model but no neuron's topology
+            // matches — surface it as not-loaded with no feasible
+            // location. Hides nothing; lets operators see why a
+            // configured model isn't reachable.
+            feasible_on.clear();
+        }
+        entries.insert(
+            profile.id.clone(),
+            CortexModelEntry {
+                id: profile.id.clone(),
+                object: "model".into(),
+                created: now,
+                owned_by: "helexa".into(),
+                loaded: false,
+                feasible_on,
+                locations: Vec::new(),
+            },
+        );
+    }
+
+    // Pass 2: layer the actually-loaded state on top. For each
+    // (node, model) entry, attach a ModelLocation. If the model isn't
+    // in the catalogue, create a new CortexModelEntry from scratch —
+    // cortex doesn't refuse to surface a manually-loaded model just
+    // because the operator didn't enumerate it in models.toml.
    for node in nodes.values() {
        for (model_id, entry) in &node.models {
            let location = ModelLocation {
@@ -198,19 +248,30 @@ async fn list_models(State(fleet): State<Arc<CortexState>>) -> Json<Value> {
                status: entry.status,
                vram_estimate_mb: entry.vram_estimate_mb,
            };
-            model_map
+            let was_loaded = matches!(entry.status, cortex_core::node::ModelStatus::Loaded);
+            entries
                .entry(model_id.clone())
-                .and_modify(|e| e.locations.push(location.clone()))
+                .and_modify(|e| {
+                    e.locations.push(location.clone());
+                    if was_loaded {
+                        e.loaded = true;
+                    }
+                })
                .or_insert_with(|| CortexModelEntry {
                    id: model_id.clone(),
                    object: "model".into(),
+                    created: now,
+                    owned_by: "helexa".into(),
+                    loaded: was_loaded,
+                    // Not in catalogue — cortex has no opinion on
+                    // feasibility; leave empty.
+                    feasible_on: Vec::new(),
                    locations: vec![location],
                });
        }
    }

-    let data: Vec<Value> = model_map.values().map(|e| json!(e)).collect();
-
+    let data: Vec<Value> = entries.values().map(|e| json!(e)).collect();
    Json(json!({
        "object": "list",
        "data": data,