Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.

cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>`, populated
  once per neuron on the first successful poll and cached from then on
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
  pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible fields (`id`,
  `object`, `created`, `owned_by`) plus helexa-specific extension
  fields (`loaded`, `feasible_on`, `locations`).
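
The feasibility rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in cortex-core: the `Device` struct, the field types, and the choice to treat `min_devices = 0` as 1 are all assumptions made for the example.

```rust
/// One GPU as reported by a neuron. Hypothetical shape for illustration.
#[derive(Debug, Clone)]
pub struct Device {
    pub index: usize,
    pub vram_mb: u64,
}

/// Hypothetical subset of the catalogue profile fields.
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub id: String,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
    pub pinned_on: Vec<String>,
}

impl ModelProfile {
    /// Returns true when `neuron` could host this profile on `devices`.
    pub fn is_feasible_on(&self, neuron: &str, devices: &[Device]) -> bool {
        // A non-empty pinned_on list narrows feasibility to the named neurons.
        if !self.pinned_on.is_empty() && !self.pinned_on.iter().any(|n| n == neuron) {
            return false;
        }
        // Count the devices that meet the per-device VRAM floor.
        let eligible = devices
            .iter()
            .filter(|d| d.vram_mb >= self.min_device_vram_mb)
            .count();
        // Assumption: a profile always needs at least one device.
        eligible >= self.min_devices.max(1)
    }
}
```

With a two-GPU neuron and a profile that sets `min_devices = 2`, both devices have to clear the floor; a single large GPU is not enough.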

cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
  feasibility) + (currently loaded somewhere). Catalogue-defined models
  surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests skip
       the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for
  multi-device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).
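
The resolution order and the neuron preference listed above might look something like the sketch below. It only models the decision, not the `POST /models/load` call or the wait. The `ModelStatus` and `Resolution` enums and the function signatures are hypothetical, and treating "in catalogue but feasible nowhere" as a 404 is an assumption the commit does not state.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-neuron status of the requested model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelStatus {
    Loaded,
    Unloaded,
}

/// Hypothetical outcome of routing one request.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    Route { neuron: String },
    LazyLoad { neuron: String },
    ColdLoad { neuron: String },
    NotFound,
}

/// Prefer a pinned neuron, else any feasible one; sorting by name keeps
/// the choice stable across calls.
pub fn pick_feasible_neuron<'a>(feasible: &'a [String], pinned_on: &[String]) -> Option<&'a str> {
    let mut names: Vec<&str> = feasible.iter().map(String::as_str).collect();
    names.sort_unstable();
    let first_pinned = names
        .iter()
        .copied()
        .find(|n| pinned_on.iter().any(|p| p.as_str() == *n));
    let first_any = names.first().copied();
    first_pinned.or(first_any)
}

/// `placements` maps neuron name to the model's status there; `feasible`
/// is empty when the model is not in the catalogue (or fits nowhere).
pub fn resolve(
    placements: &BTreeMap<String, ModelStatus>,
    feasible: &[String],
    pinned_on: &[String],
) -> Resolution {
    // Priority 1: already loaded somewhere, route there.
    if let Some((n, _)) = placements.iter().find(|(_, s)| **s == ModelStatus::Loaded) {
        return Resolution::Route { neuron: n.clone() };
    }
    // Priority 2: known but unloaded, route and let the neuron lazy-load.
    if let Some((n, _)) = placements.iter().find(|(_, s)| **s == ModelStatus::Unloaded) {
        return Resolution::LazyLoad { neuron: n.clone() };
    }
    // Priority 3: catalogue cold-load on a feasible neuron.
    if let Some(n) = pick_feasible_neuron(feasible, pinned_on) {
        return Resolution::ColdLoad { neuron: n.to_string() };
    }
    // Priority 4: nothing matched, the caller answers 404.
    Resolution::NotFound
}
```

Returning the decision as a value keeps the ordering testable without a running neuron; the caller would perform the load and the wait for the `ColdLoad` case.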

models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52 lines
1.9 KiB
TOML
# models.example.toml - model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
#   id                 - HuggingFace model id, exact match.
#   harness            - which engine handles inference (currently "candle").
#   quant              - GGUF quantisation tag for the file in the HF repo
#                        (e.g. "Q4_K_M"). Omit/empty for the dense
#                        safetensors path. TP requires dense.
#   vram_mb            - rough estimate; advisory only, not enforced.
#   min_devices        - GPU count this profile needs. TP profiles use
#                        the same value as the tensor-parallel size.
#   min_device_vram_mb - each device must meet this VRAM floor for the
#                        neuron to be considered "feasible".
#   pinned_on          - optional whitelist of neuron names. Non-empty
#                        narrows feasibility to just those neurons and
#                        protects the model from LRU eviction there.
#
# The examples below match the canonical helexa fleet
# (beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060).

# Tensor-parallel target - only beast has two big GPUs.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["beast"]

# Mid-size dense model - fits on benjy or beast.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000

# Small GGUF quantised - runs on the smallest neuron (quadbrat).
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000
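
To illustrate how a catalogue entry like the ones above could turn into a load request, here is a rough sketch of the profile-to-spec translation: pick devices by the VRAM floor and set the tensor-parallel size to `min_devices` for multi-device profiles. The struct shapes, the `Option` return, and the "first eligible devices in reported order" selection are assumptions for the example, not the gateway's actual types.

```rust
/// One GPU as reported by a neuron. Hypothetical shape for illustration.
#[derive(Debug, Clone)]
pub struct Device {
    pub index: usize,
    pub vram_mb: u64,
}

/// Hypothetical subset of a catalogue entry.
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub id: String,
    pub quant: Option<String>,
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
}

/// Hypothetical load request sent to a neuron.
#[derive(Debug, PartialEq, Eq)]
pub struct ModelSpec {
    pub model_id: String,
    pub quant: Option<String>,
    pub devices: Vec<usize>,
    pub tensor_parallel: Option<usize>,
}

/// Translate a profile into a spec for one neuron, or None when the
/// neuron lacks enough devices above the VRAM floor.
pub fn profile_to_spec(profile: &ModelProfile, devices: &[Device]) -> Option<ModelSpec> {
    // Assumption: a profile always needs at least one device.
    let want = profile.min_devices.max(1);
    // Take the first `want` devices that meet the per-device floor.
    let picked: Vec<usize> = devices
        .iter()
        .filter(|d| d.vram_mb >= profile.min_device_vram_mb)
        .map(|d| d.index)
        .take(want)
        .collect();
    if picked.len() < want {
        return None;
    }
    Some(ModelSpec {
        model_id: profile.id.clone(),
        quant: profile.quant.clone(),
        devices: picked,
        // Only multi-device profiles request tensor parallelism.
        tensor_parallel: if want > 1 { Some(want) } else { None },
    })
}
```

Under these assumptions the Qwen3.6-27B entry would map to two devices with a tensor-parallel size of 2 on a neuron with two sufficiently large GPUs, while the single-GPU entries would produce a one-device spec with no tensor-parallel setting.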