feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load

Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done, without per-node configuration or client awareness of which
neuron hosts what.

cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>`, populated
  once per neuron on first successful poll and cached forever after
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
  pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible (`id`, `object`,
  `created`, `owned_by`) plus helexa-specific extension fields
  (`loaded`, `feasible_on`, `locations`).

cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
  feasibility) + (currently loaded somewhere). Catalogue-defined
  models surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests
       skip the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for
  multi-device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).

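
For reviewers who want the feasibility and spec-translation rules in one place, here is a minimal sketch of how they might fit together. It is not the code in this commit: the struct shapes, field types, and the `Device` type are simplified stand-ins, and it assumes that a non-empty `pinned_on` restricts a model to the listed neurons and that `min_devices = 0` is treated as 1.

```rust
// Sketch only: simplified stand-ins, not the real cortex-core types.

/// One accelerator as reported by a neuron (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Device {
    pub index: usize,
    pub vram_mb: u64,
}

/// Catalogue fields named in the commit message; the types are assumed.
#[derive(Debug, Clone, Default)]
pub struct ModelProfile {
    pub pinned_on: Vec<String>,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// Hypothetical subset of the load request sent to a neuron.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelSpec {
    pub devices: Vec<usize>,
    pub tensor_parallel: Option<usize>,
}

impl ModelProfile {
    /// Assumed rule: the neuron must not be excluded by pinning, and at
    /// least `min_devices` of its devices must meet the VRAM floor.
    pub fn is_feasible_on(&self, neuron: &str, devices: &[Device]) -> bool {
        let pinned_ok =
            self.pinned_on.is_empty() || self.pinned_on.iter().any(|n| n == neuron);
        let needed = self.min_devices.max(1);
        let eligible = devices
            .iter()
            .filter(|d| d.vram_mb >= self.min_device_vram_mb)
            .count();
        pinned_ok && eligible >= needed
    }
}

/// Picks the first `min_devices` devices that meet the VRAM floor and
/// sets tensor parallelism only for multi-device profiles.
pub fn profile_to_spec(profile: &ModelProfile, devices: &[Device]) -> Option<ModelSpec> {
    let needed = profile.min_devices.max(1);
    let chosen: Vec<usize> = devices
        .iter()
        .filter(|d| d.vram_mb >= profile.min_device_vram_mb)
        .take(needed)
        .map(|d| d.index)
        .collect();
    if chosen.len() < needed {
        return None; // not enough eligible devices on this neuron
    }
    Some(ModelSpec {
        devices: chosen,
        tensor_parallel: (needed > 1).then_some(needed),
    })
}
```

Under these assumptions, a profile pinned on beast with `min_devices = 2` is feasible only on a neuron named beast that reports two devices above the floor, and translates to a spec with `tensor_parallel = Some(2)`. A single-device profile leaves `tensor_parallel` unset.
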
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:39:04 +03:00
parent f72dee094f
commit 735945ee81
7 changed files with 528 additions and 54 deletions
--- a/crates/cortex-gateway/src/poller.rs
+++ b/crates/cortex-gateway/src/poller.rs
@@ -3,6 +3,7 @@

 use crate::state::CortexState;
 use chrono::Utc;
+use cortex_core::discovery::DiscoveryResponse;
 use cortex_core::harness::ModelInfo;
 use cortex_core::node::{ModelEntry, ModelStatus};
 use std::sync::Arc;
@@ -25,7 +26,59 @@ pub async fn poll_once(fleet: &CortexState) {
    }
 }

+/// One-shot fetch of `GET /discovery`. Cached on the NodeState forever
+/// after the first success — topology is invariant for a given neuron
+/// process. Skipped when the cache is already populated.
+async fn maybe_poll_discovery(fleet: &CortexState, name: &str, endpoint: &str) {
+    {
+        let nodes = fleet.nodes.read().await;
+        match nodes.get(name) {
+            Some(n) if n.discovery.is_some() => return,
+            _ => {}
+        }
+    }
+    let url = format!("{endpoint}/discovery");
+    let resp = match fleet
+        .http_client
+        .get(&url)
+        .timeout(Duration::from_secs(5))
+        .send()
+        .await
+    {
+        Ok(r) if r.status().is_success() => r,
+        Ok(r) => {
+            tracing::debug!(node = name, status = %r.status(), "discovery probe non-success");
+            return;
+        }
+        Err(e) => {
+            tracing::debug!(node = name, error = %e, "discovery probe unreachable");
+            return;
+        }
+    };
+    match resp.json::<DiscoveryResponse>().await {
+        Ok(d) => {
+            let mut nodes = fleet.nodes.write().await;
+            if let Some(node) = nodes.get_mut(name) {
+                tracing::info!(
+                    node = name,
+                    hostname = %d.hostname,
+                    devices = d.devices.len(),
+                    "discovery cached"
+                );
+                node.discovery = Some(d);
+            }
+        }
+        Err(e) => {
+            tracing::warn!(node = name, error = %e, "failed to parse /discovery response");
+        }
+    }
+}
+
 async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
+    // Topology first — cheap once cached, and the router needs it to
+    // route requests against catalogue entries that aren't loaded yet.
+    maybe_poll_discovery(fleet, name, endpoint).await;
+
    let url = format!("{endpoint}/models");

    let result = fleet