Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads a model on its first request, and routes the request once
the load is done, without per-node configuration or client awareness
of which neuron hosts what.
cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>`, populated
  once per neuron on the first successful poll and cached from then on
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)`, implementing
  the pinned_on / min_devices / min_device_vram_mb logic, plus 5 unit
  tests.
- CortexModelEntry expanded with OpenAI-compatible fields (`id`,
  `object`, `created`, `owned_by`) plus helexa-specific extension
  fields (`loaded`, `feasible_on`, `locations`).
cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches the result on NodeState.
- handlers.rs::list_models rewritten as the union of (catalogue ×
  topology feasibility) + (currently loaded somewhere).
  Catalogue-defined models surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests skip
       the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for
  multi-device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).
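The four-step resolution order and the neuron pick described above can
be sketched as a pure function. Everything here is illustrative:
`NodeView`, `CatalogueEntry` and `Resolution` are hypothetical types,
health checks and the actual `POST /models/load` call are left out, and
the real router.rs may be structured quite differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-neuron view: the models it reports, split by state.
#[derive(Debug, Default)]
pub struct NodeView {
    pub loaded: Vec<String>,
    pub unloaded: Vec<String>,
}

/// Hypothetical catalogue entry: neurons that passed the feasibility
/// check, and the subset the operator pinned the model to.
#[derive(Debug, Default)]
pub struct CatalogueEntry {
    pub feasible_on: Vec<String>,
    pub pinned_on: Vec<String>,
}

/// Outcome of resolving a model id; each variant names the target neuron.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    Route(String),    // 1. already loaded: route directly
    LazyLoad(String), // 2. known but unloaded: neuron reloads lazily
    ColdLoad(String), // 3. catalogue only: load first, then route
    NotFound,         // 4. unknown model: 404
}

/// Prefer a pinned neuron; otherwise take any feasible one. Sorting by
/// name keeps the choice stable across calls.
pub fn pick_feasible_neuron(entry: &CatalogueEntry) -> Option<String> {
    let mut feasible: Vec<&str> = entry.feasible_on.iter().map(String::as_str).collect();
    feasible.sort_unstable();
    let pinned = feasible
        .iter()
        .copied()
        .find(|n| entry.pinned_on.iter().any(|p| p.as_str() == *n));
    pinned.or_else(|| feasible.first().copied()).map(str::to_owned)
}

/// Apply the priority order: loaded, unloaded, catalogue, not found.
/// BTreeMap iteration is ordered by neuron name, so ties are deterministic.
pub fn resolve(
    model: &str,
    nodes: &BTreeMap<String, NodeView>,
    catalogue: &BTreeMap<String, CatalogueEntry>,
) -> Resolution {
    if let Some((name, _)) = nodes
        .iter()
        .find(|(_, n)| n.loaded.iter().any(|m| m == model))
    {
        return Resolution::Route(name.clone());
    }
    if let Some((name, _)) = nodes
        .iter()
        .find(|(_, n)| n.unloaded.iter().any(|m| m == model))
    {
        return Resolution::LazyLoad(name.clone());
    }
    if let Some(name) = catalogue.get(model).and_then(pick_feasible_neuron) {
        return Resolution::ColdLoad(name);
    }
    Resolution::NotFound
}
```

Keeping the decision separate from the HTTP side effects makes the
priority order easy to unit-test. In this sketch the caller would
perform the load and the local cache update when it receives
`ColdLoad`.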
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.
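To make the TP example concrete, here is a sketch of how a multi-device
profile could be translated into a load spec, following the
profile_to_spec description above. `Profile`, `Device` and `Spec` are
hypothetical stand-ins for ModelProfile, the discovery device list and
ModelSpec. The field names and VRAM figures are assumptions for
illustration only.

```rust
/// Hypothetical stand-in for a GPU reported by discovery.
#[derive(Debug, Clone)]
pub struct Device {
    pub index: usize,
    pub vram_mb: u64,
}

/// Hypothetical stand-in for `ModelProfile`.
#[derive(Debug, Clone)]
pub struct Profile {
    pub model_id: String,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// Hypothetical stand-in for `ModelSpec`.
#[derive(Debug, PartialEq, Eq)]
pub struct Spec {
    pub model_id: String,
    pub devices: Vec<usize>,
    pub tensor_parallel: Option<usize>,
}

/// Pick the first `min_devices` devices that meet the VRAM floor. For
/// multi-device profiles, set tensor_parallel to min_devices; single-GPU
/// profiles leave it unset. Returns None when too few devices qualify.
pub fn profile_to_spec(profile: &Profile, devices: &[Device]) -> Option<Spec> {
    let wanted = profile.min_devices.max(1);
    let chosen: Vec<usize> = devices
        .iter()
        .filter(|d| d.vram_mb >= profile.min_device_vram_mb)
        .map(|d| d.index)
        .take(wanted)
        .collect();
    if chosen.len() < wanted {
        return None;
    }
    Some(Spec {
        model_id: profile.model_id.clone(),
        devices: chosen,
        tensor_parallel: (wanted > 1).then_some(wanted),
    })
}
```

On a two-GPU neuron such as beast, a profile with min_devices = 2
yields both device indices and tensor_parallel = 2, while a single-GPU
profile takes the first qualifying device and leaves tensor_parallel
unset.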
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
155 lines
5.0 KiB
Rust
//! Background poller that periodically queries each neuron's API
//! to refresh the fleet state.

use crate::state::CortexState;
use chrono::Utc;
use cortex_core::discovery::DiscoveryResponse;
use cortex_core::harness::ModelInfo;
use cortex_core::node::{ModelEntry, ModelStatus};
use std::sync::Arc;
use std::time::Duration;

const POLL_INTERVAL: Duration = Duration::from_secs(10);

/// Runs forever, polling all neurons on a fixed interval.
pub async fn poll_loop(fleet: Arc<CortexState>) {
    loop {
        poll_once(&fleet).await;
        tokio::time::sleep(POLL_INTERVAL).await;
    }
}

/// Poll all neurons once. Used by `poll_loop` and available for testing.
pub async fn poll_once(fleet: &CortexState) {
    for nc in &fleet.neuron_configs {
        poll_neuron(fleet, &nc.name, &nc.endpoint).await;
    }
}

/// One-shot fetch of `GET /discovery`. Cached on the NodeState forever
/// after the first success, since topology is invariant for a given neuron
/// process. Skipped when the cache is already populated.
async fn maybe_poll_discovery(fleet: &CortexState, name: &str, endpoint: &str) {
    {
        let nodes = fleet.nodes.read().await;
        match nodes.get(name) {
            Some(n) if n.discovery.is_some() => return,
            _ => {}
        }
    }
    let url = format!("{endpoint}/discovery");
    let resp = match fleet
        .http_client
        .get(&url)
        .timeout(Duration::from_secs(5))
        .send()
        .await
    {
        Ok(r) if r.status().is_success() => r,
        Ok(r) => {
            tracing::debug!(node = name, status = %r.status(), "discovery probe non-success");
            return;
        }
        Err(e) => {
            tracing::debug!(node = name, error = %e, "discovery probe unreachable");
            return;
        }
    };
    match resp.json::<DiscoveryResponse>().await {
        Ok(d) => {
            let mut nodes = fleet.nodes.write().await;
            if let Some(node) = nodes.get_mut(name) {
                tracing::info!(
                    node = name,
                    hostname = %d.hostname,
                    devices = d.devices.len(),
                    "discovery cached"
                );
                node.discovery = Some(d);
            }
        }
        Err(e) => {
            tracing::warn!(node = name, error = %e, "failed to parse /discovery response");
        }
    }
}

async fn poll_neuron(fleet: &CortexState, name: &str, endpoint: &str) {
    // Topology first: cheap once cached, and the router needs it to
    // route requests against catalogue entries that aren't loaded yet.
    maybe_poll_discovery(fleet, name, endpoint).await;

    let url = format!("{endpoint}/models");

    let result = fleet
        .http_client
        .get(&url)
        .timeout(Duration::from_secs(5))
        .send()
        .await;

    let mut nodes = fleet.nodes.write().await;
    let Some(node) = nodes.get_mut(name) else {
        return;
    };

    match result {
        Ok(resp) if resp.status().is_success() => {
            match resp.json::<Vec<ModelInfo>>().await {
                Ok(models) => {
                    let mut seen = std::collections::HashSet::new();
                    for upstream in &models {
                        seen.insert(upstream.id.clone());
                        let status = parse_status(&upstream.status);

                        node.models
                            .entry(upstream.id.clone())
                            .and_modify(|e| {
                                e.status = status;
                                e.vram_estimate_mb = upstream.vram_used_mb;
                            })
                            .or_insert_with(|| ModelEntry {
                                id: upstream.id.clone(),
                                status,
                                last_accessed: None,
                                vram_estimate_mb: upstream.vram_used_mb,
                            });
                    }

                    // Remove models no longer reported by the neuron.
                    node.models.retain(|id, _| seen.contains(id));

                    node.healthy = true;
                    node.last_poll = Some(Utc::now());
                    tracing::debug!(node = name, models = models.len(), "poll ok");
                }
                Err(e) => {
                    tracing::warn!(node = name, error = %e, "failed to parse /models response");
                    node.healthy = false;
                }
            }
        }
        Ok(resp) => {
            tracing::warn!(
                node = name,
                status = %resp.status(),
                "neuron returned non-success status"
            );
            node.healthy = false;
        }
        Err(e) => {
            tracing::warn!(node = name, error = %e, "failed to reach neuron");
            node.healthy = false;
        }
    }
}

fn parse_status(s: &str) -> ModelStatus {
    match s {
        "loaded" => ModelStatus::Loaded,
        "unloaded" => ModelStatus::Unloaded,
        "reloading" => ModelStatus::Reloading,
        _ => ModelStatus::Loaded,
    }
}