Files
cortex/models.example.toml
rob thijssen 62ca125a68
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m22s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has started running
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
chore: keep models.example.toml generic; deploy.sh sync's local models.toml
Reverts the previous commit's naming of specific helexa neuron hosts
in the shipped example catalogue (`models.example.toml`) — the example
is supposed to be a generic starting point that any operator copies
and adapts, not a record of one particular fleet's layout.

- `pinned_on` in the TP example uses the placeholder
  `"your-multi-gpu-neuron"`. Other entries keep the model ids
  (since those are HuggingFace-canonical, not fleet-specific).
- New `models.toml` at repo root holds the helexa-fleet catalogue
  (beast / benjy / quadbrat). Added to `.gitignore` alongside
  `cortex.toml` — both are operator-owned, gitignored, RPM-marked
  `%config(noreplace)`, and synced by `deploy.sh`.
- `deploy.sh` now rsync's `models.toml` to `/etc/cortex/models.toml`
  on the gateway host on the same lifecycle as `cortex.toml`. Skips
  cleanly when no local file exists, so users without a catalogue
  aren't surprised by silent overwrites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:47:08 +03:00

51 lines
1.9 KiB
TOML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# models.example.toml — model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
# id - HuggingFace model id, exact match.
# harness - which engine handles inference (currently "candle").
# quant - GGUF quantisation tag for the file in the HF repo
# (e.g. "Q4_K_M"). Omit/empty for the dense
# safetensors path. TP requires dense.
# vram_mb - rough estimate; advisory only, not enforced.
# min_devices - GPU count this profile needs. TP profiles use
# the same value as the tensor-parallel size.
# min_device_vram_mb - each device must meet this VRAM floor for the
# neuron to be considered "feasible".
# pinned_on - optional whitelist of neuron names. Non-empty
# narrows feasibility to just those neurons and
# protects the model from LRU eviction there.
# Tensor-parallel target — needs a neuron with at least 2 large GPUs.
# The example pins to a specific neuron name; adjust or remove the
# pinned_on entry for your own fleet.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["your-multi-gpu-neuron"]
# Mid-size dense model — fits on any single GPU with ≥16 GB VRAM.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000
# Small GGUF quantised — runs on any small GPU.
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000