cortex/models.example.toml
rob thijssen d0292ed377
feat(cortex): catalogue source field + scheme-qualified /models/load
Phase 3 of plan-source-aware-loader-preflight. Adds an optional
`source` field to `ModelProfile` and threads it through the
router's cold-load path so a profile pointing at the helexa
registry forwards `helexa:<id>` to neuron's `/models/load`
instead of leaving neuron to substitute its `default_source`
(typically `huggingface`).

Without this, an operator who declares
`source = "helexa"` in models.toml would still see neuron fetch
from HuggingFace — the catalogue → ModelSpec translation in
`profile_to_spec` was dropping the scheme on the floor.

What lands:

- `cortex-core::catalogue::ModelProfile.source: Option<String>`.
  None is the default and preserves pre-Phase-3 behaviour.
- `cortex-gateway::router::qualified_model_id(profile)` —
  small pure helper, extracted from `profile_to_spec` so it can
  be unit-tested. Empty-string `source` is treated as None so
  operators who blank out a previously-set value don't trip a
  scheme-with-no-scheme failure mode in neuron.
- `models.example.toml` documents the new field with a
  commented-out helexa-scheme example pointing back at
  neuron.example.toml's matching sources block.
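
The helper's behaviour as described above can be sketched roughly as
follows. This is an illustration, not the code that landed: the
`ModelProfile` here is a hypothetical two-field stand-in for the real
catalogue struct, and the exact signature of `qualified_model_id` is
assumed.

```rust
/// Hypothetical stand-in for the catalogue profile. The real
/// `ModelProfile` carries more fields (harness, quant, vram_mb, ...).
pub struct ModelProfile {
    pub id: String,
    pub source: Option<String>,
}

/// Returns the id to forward to neuron's /models/load:
/// `scheme:id` when a non-empty source is set, the bare id otherwise.
/// Signature is an assumption made for this sketch.
pub fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        // A set, non-empty source becomes the scheme prefix.
        Some(scheme) if !scheme.is_empty() => format!("{}:{}", scheme, profile.id),
        // None and "" both fall through to the unqualified id, so
        // neuron applies its own `default_source`.
        _ => profile.id.clone(),
    }
}
```

With `source = Some("helexa")` and `id = "Helexa/Qwen3.6-27B-Uncensored"`
the sketch yields `helexa:Helexa/Qwen3.6-27B-Uncensored`; with `None` or
an empty string it yields the id unchanged.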

Tests:

- 2 new unit tests in `cortex-core::catalogue`: source-absent
  round-trip and source-present round-trip through TOML.
- 3 new unit tests in `cortex-gateway::router`: pass-through
  when None, prefix when Some, pass-through on empty-string
  source.
- ModelProfile literal in catalogue's existing test updated to
  carry `source: None`.

CI gate: cargo fmt --check, cargo clippy --workspace
--all-targets -- -D warnings, cargo test --workspace
(24 test groups ok, zero failures).

Completes Phase 3. With Phases 1+2+3 landed:
- neuron parses `scheme:org/name` and routes each source through
  its own hf-hub Api with a disambiguated cache.
- preflight returns structured errors before any device
  allocation.
- cortex catalogue declares per-model source jurisdiction
  and forwards it to neuron.
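
For orientation, the `scheme:org/name` split on the neuron side might
look something like the sketch below. The function name
`split_model_id`, its return type, and the choice to reject an empty
scheme or empty id are all assumptions made for illustration; neuron's
actual parser may differ.

```rust
/// Splits a possibly scheme-qualified model id.
/// `helexa:Org/Name` gives (Some("helexa"), "Org/Name"); a bare
/// `Org/Name` gives (None, "Org/Name"). Hypothetical helper, not
/// neuron's real API.
pub fn split_model_id(raw: &str) -> Result<(Option<&str>, &str), String> {
    match raw.split_once(':') {
        // No colon: unqualified id, caller falls back to default_source.
        None => Ok((None, raw)),
        // Colon present but one side is empty: treated as malformed here.
        Some((scheme, id)) if scheme.is_empty() || id.is_empty() => {
            Err(format!("malformed model id: {:?}", raw))
        }
        // Well-formed `scheme:id`.
        Some((scheme, id)) => Ok((Some(scheme), id)),
    }
}
```

Rejecting `:Org/Name` in this sketch mirrors the reason cortex treats
an empty-string `source` as None before building the id.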

The registry itself (registry.helexa.ai service, MinIO,
nginx, mirror fabric) is the next moving piece — landing
under a separate project per the design discussion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 14:53:58 +03:00

# models.example.toml — model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
#   id                 - Repo id in the source registry (e.g. "Qwen/Qwen3.6-27B").
#                        Exact match.
#   harness            - which engine handles inference (currently "candle").
#   quant              - GGUF quantisation tag for the file in the HF repo
#                        (e.g. "Q4_K_M"). Omit/empty for the dense
#                        safetensors path. TP requires dense.
#   vram_mb            - rough estimate; advisory only, not enforced.
#   min_devices        - GPU count this profile needs. TP profiles use
#                        the same value as the tensor-parallel size.
#   min_device_vram_mb - each device must meet this VRAM floor for the
#                        neuron to be considered "feasible".
#   pinned_on          - optional whitelist of neuron names. Non-empty
#                        narrows feasibility to just those neurons and
#                        protects the model from LRU eviction there.
#   source             - optional source scheme ("huggingface", "helexa",
#                        operator mirror tag). When set, cortex forwards
#                        the load to neuron as `scheme:id` so the daemon
#                        fetches from the right registry. Omit to let
#                        neuron substitute its own `default_source`.

# Tensor-parallel target — needs a neuron with at least 2 large GPUs.
# The example pins to a specific neuron name; adjust or remove the
# pinned_on entry for your own fleet.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["your-multi-gpu-neuron"]

# Mid-size dense model — fits on any single GPU with ≥16 GB VRAM.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000

# Small GGUF quantised — runs on any small GPU.
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000

# Helexa registry model — `source` pins this entry to the helexa
# scheme so cortex forwards `helexa:Helexa/Qwen3.6-27B-Uncensored` to
# neuron's /models/load. Requires the neuron config to declare a
# matching [harness.candle.sources.helexa] entry pointing at the
# helexa registry endpoint (see neuron.example.toml).
#
# [[models]]
# id = "Helexa/Qwen3.6-27B-Uncensored"
# harness = "candle"
# source = "helexa"
# vram_mb = 54000
# min_devices = 2
# min_device_vram_mb = 24000

# -- Tier aliases ------------------------------------------------------------
# Optional. Clients can request inference against an alias (e.g.
# `model: "helexa/small"` in /v1/chat/completions) and cortex
# transparently routes to the concrete model id below — including
# rewriting the body's model field so neuron sees a name that matches
# its loaded handle. Both the alias and the target appear in
# /v1/models so clients can discover either. Operators can swap
# targets here without changing client code.
#
# [aliases]
# "helexa/small" = "Qwen/Qwen3-1.7B"
# "helexa/balanced" = "Qwen/Qwen3-8B"
# "helexa/large" = "Qwen/Qwen3.6-27B"