Files
cortex/models.example.toml
rob thijssen 735945ee81
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 48s
CI / Clippy (push) Successful in 2m12s
CI / Test (push) Successful in 4m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m10s
build-prerelease / Build neuron-blackwell (push) Successful in 3m35s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
feat(cortex): unified /v1/models — catalogue × topology feasibility + cold-load
Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.

cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
  once per neuron on first successful poll, cached forever after
  (topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
  pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible (`id`, `object`,
  `created`, `owned_by`) plus helexa-specific extension fields
  (`loaded`, `feasible_on`, `locations`).

cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
  neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
  feasibility) + (currently loaded somewhere). Catalogue-defined models
  surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
    1. loaded somewhere → route there
    2. unloaded somewhere → route + lazy load via neuron
    3. in catalogue → pick feasible neuron, POST /models/load, wait,
       route. Cache the new entry locally so subsequent requests skip
       the poll wait.
    4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
  feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
  by VRAM floor and setting tensor_parallel = min_devices for multi-
  device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
  requests racing the same cold-load is a benign outcome).

models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:39:04 +03:00

52 lines
1.9 KiB
TOML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# models.example.toml — model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
# id - HuggingFace model id, exact match.
# harness - which engine handles inference (currently "candle").
# quant - GGUF quantisation tag for the file in the HF repo
# (e.g. "Q4_K_M"). Omit/empty for the dense
# safetensors path. TP requires dense.
# vram_mb - rough estimate; advisory only, not enforced.
# min_devices - GPU count this profile needs. TP profiles use
# the same value as the tensor-parallel size.
# min_device_vram_mb - each device must meet this VRAM floor for the
# neuron to be considered "feasible".
# pinned_on - optional whitelist of neuron names. Non-empty
# narrows feasibility to just those neurons and
# protects the model from LRU eviction there.
#
# The examples below match the canonical helexa fleet
# (beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060).
# Tensor-parallel target — only beast has two big GPUs.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["beast"]
# Mid-size dense model — fits on benjy or beast.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000
# Small GGUF quantised — runs on the smallest neuron (quadbrat).
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000