Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 40s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m13s
build-prerelease / Build neuron-ampere (push) Successful in 7m31s
build-prerelease / Build neuron-ada (push) Successful in 8m16s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m21s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Build cortex binary (push) Successful in 4m5s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Phase 3 of plan-source-aware-loader-preflight.

Adds an optional `source` field to `ModelProfile` and threads it through the router's cold-load path so a profile pointing at the helexa registry forwards `helexa:<id>` to neuron's `/models/load` instead of leaving neuron to substitute its `default_source` (typically `huggingface`). Without this, an operator who declares `source = "helexa"` in models.toml would still see neuron fetch from HuggingFace: the catalogue → ModelSpec translation in `profile_to_spec` was dropping the scheme on the floor.

What lands:
- `cortex-core::catalogue::ModelProfile.source: Option<String>`. None is the default and preserves pre-Phase-3 behaviour.
- `cortex-gateway::router::qualified_model_id(profile)`: small pure helper, extracted from `profile_to_spec` so it can be unit-tested. Empty-string `source` is treated as None so operators who blank out a previously-set value don't trip a scheme-with-no-scheme failure mode in neuron.
- `models.example.toml` documents the new field with a commented-out helexa-scheme example pointing back at neuron.example.toml's matching sources block.

Tests:
- 2 new unit tests in `cortex-core::catalogue`: source-absent round-trip and source-present round-trip through TOML.
- 3 new unit tests in `cortex-gateway::router`: pass-through when None, prefix when Some, pass-through on empty-string source.
- ModelProfile literal in catalogue's existing test updated to carry `source: None`.

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (24 test groups ok, zero failures).

Completes Phase 3. With Phases 1+2+3 landed:
- neuron parses `scheme:org/name`, routes per-source hf-hub Api with disambiguated cache.
- preflight returns structured errors before any device allocation.
- cortex catalogue declares per-model source jurisdiction and forwards it to neuron.

The registry itself (registry.helexa.ai service, MinIO, nginx, mirror fabric) is the next moving piece, landing under a separate project per the design discussion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
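
The behaviour described for `qualified_model_id` (pass-through when `source` is None, `scheme:id` prefix when it is set, pass-through when it is the empty string) could be sketched roughly as follows. This is a minimal illustration, not the code in `cortex-gateway::router`: the `ModelProfile` struct here is a hypothetical two-field stand-in for the real catalogue type, and the function body is an assumption reconstructed from the three unit-test cases listed above.

```rust
/// Hypothetical, trimmed-down stand-in for `cortex-core::catalogue::ModelProfile`.
/// Only the two fields this helper reads are modelled; the real struct has more.
pub struct ModelProfile {
    pub id: String,
    pub source: Option<String>,
}

/// Build the model id forwarded to neuron's `/models/load` (assumed shape).
pub fn qualified_model_id(profile: &ModelProfile) -> String {
    match profile.source.as_deref() {
        // A non-empty scheme is prefixed onto the repo id as `scheme:id`.
        Some(scheme) if !scheme.is_empty() => format!("{}:{}", scheme, profile.id),
        // `None` and `Some("")` both pass the bare id through, leaving neuron
        // free to substitute its own `default_source`.
        _ => profile.id.clone(),
    }
}
```

Folding the empty string into the None branch keeps a blanked-out `source = ""` from producing an id such as `:Qwen/Qwen3-8B`, which is the scheme-with-no-scheme case the message mentions.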
85 lines
3.4 KiB
TOML
# models.example.toml: model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
#   id                 - Repo id in the source registry (e.g. "Qwen/Qwen3.6-27B").
#                        Exact match.
#   harness            - which engine handles inference (currently "candle").
#   quant              - GGUF quantisation tag for the file in the HF repo
#                        (e.g. "Q4_K_M"). Omit/empty for the dense
#                        safetensors path. TP requires dense.
#   vram_mb            - rough estimate; advisory only, not enforced.
#   min_devices        - GPU count this profile needs. TP profiles use
#                        the same value as the tensor-parallel size.
#   min_device_vram_mb - each device must meet this VRAM floor for the
#                        neuron to be considered "feasible".
#   pinned_on          - optional whitelist of neuron names. Non-empty
#                        narrows feasibility to just those neurons and
#                        protects the model from LRU eviction there.
#   source             - optional source scheme ("huggingface", "helexa",
#                        operator mirror tag). When set, cortex forwards
#                        the load to neuron as `scheme:id` so the daemon
#                        fetches from the right registry. Omit to let
#                        neuron substitute its own `default_source`.

# Tensor-parallel target: needs a neuron with at least 2 large GPUs.
# The example pins to a specific neuron name; adjust or remove the
# pinned_on entry for your own fleet.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["your-multi-gpu-neuron"]

# Mid-size dense model: fits on any single GPU with ≥16 GB VRAM.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000

# Small GGUF quantised: runs on any small GPU.
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000

# Helexa registry model: `source` pins this entry to the helexa
# scheme so cortex forwards `helexa:Helexa/Qwen3.6-27B-Uncensored` to
# neuron's /models/load. Requires the neuron config to declare a
# matching [harness.candle.sources.helexa] entry pointing at the
# helexa registry endpoint (see neuron.example.toml).
#
# [[models]]
# id = "Helexa/Qwen3.6-27B-Uncensored"
# harness = "candle"
# source = "helexa"
# vram_mb = 54000
# min_devices = 2
# min_device_vram_mb = 24000

# -- Tier aliases ------------------------------------------------------------
# Optional. Clients can request inference against an alias (e.g.
# `model: "helexa/small"` in /v1/chat/completions) and cortex
# transparently routes to the concrete model id below, including
# rewriting the body's model field so neuron sees a name that matches
# its loaded handle. Both the alias and the target appear in
# /v1/models so clients can discover either. Operators can swap
# targets here without changing client code.
#
# [aliases]
# "helexa/small" = "Qwen/Qwen3-1.7B"
# "helexa/balanced" = "Qwen/Qwen3-8B"
# "helexa/large" = "Qwen/Qwen3.6-27B"