feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 48s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m10s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.

cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>`, populated
  once per neuron on the first successful poll and cached from then on
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
  pinned_on / min_devices / min_device_vram_mb logic, plus 5 unit tests.
- CortexModelEntry expanded with the OpenAI-compatible fields (`id`,
  `object`, `created`, `owned_by`) plus helexa-specific extension
  fields (`loaded`, `feasible_on`, `locations`).
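
A minimal sketch of how the pinned_on / min_devices / min_device_vram_mb check might combine. The `Device` and `ModelProfile` shapes here are assumptions made for illustration (in particular, `pinned_on` as a list of neuron names and a `min_devices` of 0 being treated as 1); the real cortex-core types may differ.

```rust
/// Hypothetical stand-in for one GPU reported by a neuron's discovery response.
#[derive(Debug, Clone)]
pub struct Device {
    pub index: u32,
    pub vram_mb: u64,
}

/// Hypothetical subset of a catalogue profile's placement constraints.
#[derive(Debug, Clone, Default)]
pub struct ModelProfile {
    /// If non-empty, only these neurons may host the model (assumed shape).
    pub pinned_on: Vec<String>,
    /// Minimum number of qualifying devices; 0 is treated as 1 here.
    pub min_devices: usize,
    /// Per-device VRAM floor.
    pub min_device_vram_mb: u64,
}

impl ModelProfile {
    /// True when `neuron` is allowed by the pin list and has enough
    /// devices that each meet the VRAM floor.
    pub fn is_feasible_on(&self, neuron: &str, devices: &[Device]) -> bool {
        // A non-empty pin list excludes every neuron not named in it.
        if !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron) {
            return false;
        }
        // Count only the devices that individually satisfy the VRAM floor.
        let qualifying = devices
            .iter()
            .filter(|d| d.vram_mb >= self.min_device_vram_mb)
            .count();
        qualifying >= self.min_devices.max(1)
    }
}
```

Under these assumptions, a profile pinned on beast with min_devices = 2 is feasible only on a neuron named beast that reports at least two devices above the floor.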

cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
  feasibility) + (currently loaded somewhere). Catalogue-defined models
  surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests skip
       the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for
  multi-device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).
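
The four-step resolution order and the pinned-first neuron choice can be sketched as a pure decision function. This is an illustration only: `Route`, `CatalogueEntry`, and the map shapes are hypothetical, and the sketch stops at the routing decision. It does not issue `POST /models/load`, wait for the load, or update any cache.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelStatus {
    Loaded,
    Unloaded,
}

/// Hypothetical outcome of resolution; the real router's types may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Route {
    Loaded { neuron: String },
    LazyLoad { neuron: String },
    ColdLoad { neuron: String },
    NotFound,
}

/// Hypothetical catalogue entry with feasibility already computed.
pub struct CatalogueEntry {
    pub pinned_on: Vec<String>,
    pub feasible_on: BTreeSet<String>,
}

/// Prefer a pinned neuron; otherwise take the first feasible one.
/// BTreeSet iterates in sorted order, which keeps the choice stable by name.
pub fn pick_feasible_neuron(entry: &CatalogueEntry) -> Option<String> {
    entry
        .feasible_on
        .iter()
        .find(|n| entry.pinned_on.contains(*n))
        .or_else(|| entry.feasible_on.iter().next())
        .cloned()
}

/// `observed` maps model id -> neuron name -> last polled status.
pub fn resolve(
    model: &str,
    observed: &BTreeMap<String, BTreeMap<String, ModelStatus>>,
    catalogue: &BTreeMap<String, CatalogueEntry>,
) -> Route {
    if let Some(nodes) = observed.get(model) {
        // Priority 1: already loaded somewhere.
        if let Some((n, _)) = nodes.iter().find(|(_, s)| **s == ModelStatus::Loaded) {
            return Route::Loaded { neuron: n.clone() };
        }
        // Priority 2: known to a neuron but unloaded; the neuron lazy-loads it.
        if let Some((n, _)) = nodes.iter().find(|(_, s)| **s == ModelStatus::Unloaded) {
            return Route::LazyLoad { neuron: n.clone() };
        }
    }
    // Priority 3: catalogue cold-load on a feasible neuron.
    if let Some(neuron) = catalogue.get(model).and_then(pick_feasible_neuron) {
        return Route::ColdLoad { neuron };
    }
    // Priority 4: unknown model, which the gateway would answer with 404.
    Route::NotFound
}
```

Keeping the decision separate from the I/O makes the priority order easy to unit-test; whether router.rs is actually structured this way is not stated in the commit.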

models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working tensor-parallel (TP) example (Qwen3.6-27B pinned on beast) plus
single-GPU profiles for the smaller models.
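
To make the TP example concrete, here is a rough sketch of what a ModelProfile → ModelSpec translation could look like for a two-GPU neuron such as beast. The struct fields, the `Option` return, and the choice to leave `tensor_parallel` unset for single-device profiles are all assumptions for illustration, not the actual profile_to_spec signature.

```rust
/// Hypothetical device record from a neuron's discovery response.
#[derive(Debug, Clone)]
pub struct Device {
    pub index: u32,
    pub vram_mb: u64,
}

/// Hypothetical subset of a catalogue profile.
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub id: String,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// Hypothetical load request sent to a neuron.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ModelSpec {
    pub model_id: String,
    pub device_indices: Vec<u32>,
    pub tensor_parallel: Option<usize>,
}

/// Pick the first `min_devices` devices that meet the VRAM floor.
/// Returns None when the neuron cannot satisfy the profile.
pub fn profile_to_spec(profile: &ModelProfile, devices: &[Device]) -> Option<ModelSpec> {
    let wanted = profile.min_devices.max(1);
    let picked: Vec<u32> = devices
        .iter()
        .filter(|d| d.vram_mb >= profile.min_device_vram_mb)
        .take(wanted)
        .map(|d| d.index)
        .collect();
    if picked.len() < wanted {
        return None;
    }
    Some(ModelSpec {
        model_id: profile.id.clone(),
        device_indices: picked,
        // tensor_parallel = min_devices, set only for multi-device profiles.
        tensor_parallel: (wanted > 1).then_some(wanted),
    })
}
```

With two devices above the floor and min_devices = 2, the sketch yields both device indices and tensor_parallel = 2; on a single-GPU neuron the same profile yields no spec.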
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,3 +1,4 @@
use crate::discovery::DiscoveryResponse;
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
@@ -13,6 +14,12 @@ pub struct NodeState {
    /// Number of load/unload cycles since last process restart.
    pub lifecycle_cycles: u32,
    pub last_poll: Option<DateTime<Utc>>,
    /// Result of the most recent successful `GET /discovery` against
    /// this neuron. Cached forever once obtained — device topology is
    /// invariant for a given neuron process. `None` until the first
    /// successful poll. Used by the router and `/v1/models` to do
    /// catalogue × topology feasibility checks.
    pub discovery: Option<DiscoveryResponse>,
}

/// A model registered on a node, with its runtime status.
@@ -36,12 +43,32 @@ pub enum ModelStatus {
}

/// Unified model entry as exposed by the gateway's `/v1/models` endpoint.
/// Includes which node(s) host this model and their status.
///
/// The first four fields (`id`, `object`, `created`, `owned_by`) match
/// OpenAI's `/v1/models` shape verbatim, so existing OpenAI-aware
/// tooling deserialises this without custom code. The remaining fields
/// are helexa-specific extensions — OpenAI clients ignore unknown
/// fields and other consumers can read them for placement / debugging.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CortexModelEntry {
    pub id: String,
    /// Always `"model"` per OpenAI's contract.
    pub object: String,
    /// Unix-second timestamp; cortex stamps this at response time.
    pub created: u64,
    /// OpenAI's "publisher" field — `"helexa"` for everything we serve.
    pub owned_by: String,
    /// True if any neuron currently has this model loaded. False for
    /// catalogue entries that are feasible but not yet loaded.
    pub loaded: bool,
    /// Neurons whose discovered topology can satisfy this model's
    /// catalogue placement constraints. Empty for models that are
    /// loaded somewhere but not present in the catalogue (cortex has
    /// no feasibility opinion on those).
    pub feasible_on: Vec<String>,
    /// Where this model is actually loaded right now. Subset of (or
    /// disjoint from) `feasible_on` depending on whether the catalogue
    /// covers this model.
    pub locations: Vec<ModelLocation>,
}
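
The field semantics documented above, together with the "union of catalogue feasibility and currently loaded models" described in the commit message, can be sketched as follows. This is a simplified illustration, not the handlers.rs implementation: `Entry` mirrors `CortexModelEntry` but stores plain neuron names in `locations` instead of `ModelLocation`, and both input maps are hypothetical precomputed views.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified mirror of CortexModelEntry (locations reduced to neuron names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub id: String,
    pub object: String,
    pub created: u64,
    pub owned_by: String,
    pub loaded: bool,
    pub feasible_on: Vec<String>,
    pub locations: Vec<String>,
}

/// `feasible`: catalogue model id -> neurons whose topology can host it.
/// `loaded`: model id -> neurons that currently have it loaded.
pub fn list_models(
    feasible: &BTreeMap<String, BTreeSet<String>>,
    loaded: &BTreeMap<String, BTreeSet<String>>,
    now_unix: u64,
) -> Vec<Entry> {
    // Union of both key sets, deduplicated and sorted by model id.
    let ids: BTreeSet<&String> = feasible.keys().chain(loaded.keys()).collect();
    ids.into_iter()
        .map(|id| {
            let locations: Vec<String> = loaded
                .get(id)
                .map(|s| s.iter().cloned().collect())
                .unwrap_or_default();
            Entry {
                id: id.clone(),
                object: "model".to_string(),
                created: now_unix,
                owned_by: "helexa".to_string(),
                loaded: !locations.is_empty(),
                // Stays empty for models that are loaded but not in the catalogue.
                feasible_on: feasible
                    .get(id)
                    .map(|s| s.iter().cloned().collect())
                    .unwrap_or_default(),
                locations,
            }
        })
        .collect()
}
```

A catalogue-only model comes out with `loaded = false` and a populated `feasible_on`, while a model loaded outside the catalogue comes out with `loaded = true` and an empty `feasible_on`.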