Operators can now define tier aliases in models.toml:
    [aliases]
    "helexa/small" = "Qwen/Qwen3-1.7B"
    "helexa/balanced" = "Qwen/Qwen3-8B"
    "helexa/large" = "Qwen/Qwen3.6-27B"
A client request for `model: "helexa/small"` is resolved to the concrete
model id at routing time. The gateway also rewrites the proxied body's
`model` field to the concrete id so neuron sees a name that matches its
loaded handle (otherwise the harness rejects the request).
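As a rough illustration of the routing-time lookup described above, the sketch below resolves an alias with a plain `HashMap`. Only the names `ModelCatalogue`, `aliases`, and `resolve_alias` come from this commit message; the struct layout, the explicit lifetimes, the single-hop lookup, and the `example_catalogue` helper are assumptions made for the example and may not match the real code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the catalogue; the real struct has more fields.
pub struct ModelCatalogue {
    /// alias -> concrete model id, as declared under `[aliases]`.
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Return the concrete id for `model`, or `model` itself when it is
    /// not an alias. Assumes a single hop (no alias-of-alias chains).
    pub fn resolve_alias<'a>(&'a self, model: &'a str) -> &'a str {
        self.aliases.get(model).map(String::as_str).unwrap_or(model)
    }
}

/// Hypothetical helper: a catalogue holding the three tier aliases above.
pub fn example_catalogue() -> ModelCatalogue {
    let mut aliases = HashMap::new();
    let tiers = [
        ("helexa/small", "Qwen/Qwen3-1.7B"),
        ("helexa/balanced", "Qwen/Qwen3-8B"),
        ("helexa/large", "Qwen/Qwen3.6-27B"),
    ];
    for (alias, target) in tiers.iter() {
        aliases.insert(alias.to_string(), target.to_string());
    }
    ModelCatalogue { aliases }
}
```

A request for `helexa/small` would resolve to `Qwen/Qwen3-1.7B`, while a concrete id such as `Qwen/Qwen3-8B` passes through unchanged. The resolved id is what would then be written back into the proxied body's `model` field.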
Motivated by the finger-in-the-wind benchmark: the same "what's the
capital of Georgia" probe runs in 2.5s on the 1.7B model vs 6.7s on the
27B, with identical correctness. Aliases let clients pick a latency tier
without hardcoding model ids, and let operators swap targets without
changing client code.
Changes:
* cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>`
  and `resolve_alias(&str) -> &str`. Unit tests cover basic resolution
  and the TOML round-trip.
* cortex-gateway:
  * `RouteDecision` gains `resolved_model_id: String`. `router::resolve`
    consumes aliases at entry and threads the concrete id through.
  * Handlers (chat_completions, completions, anthropic_messages
    streaming + non-streaming) rewrite the body's `model` field with
    `rewrite_model_in_body` before proxying, using the resolved id
    for metrics labels, LRU touch, and the body itself.
  * `/v1/models` (Pass 4) emits each alias as its own entry mirroring
    the target's `loaded` flag, feasible_on, and locations, so clients
    browsing the endpoint see both names and can pick either.
* `models.toml` declares the three tier aliases; `models.example.toml`
  documents the section as opt-in.
* Integration tests verify: end-to-end alias→concrete request flow,
  alias surfacing in /v1/models, and no-op fall-through for
  non-alias model ids.
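To make the `/v1/models` change concrete, here is a minimal sketch of how alias entries could be derived from the target entries. The `ModelEntry` struct and the `with_alias_entries` function are hypothetical names invented for this example; only the mirrored fields (`loaded`, feasible_on, locations) are taken from the description above. Skipping aliases whose target is absent from the listing is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical shape of one `/v1/models` entry; the real type may differ.
#[derive(Clone, Debug, PartialEq)]
pub struct ModelEntry {
    pub id: String,
    pub loaded: bool,
    pub feasible_on: Vec<String>,
    pub locations: Vec<String>,
}

/// Append one entry per alias, copying the target's state under the alias
/// name. Aliases whose target is not listed are skipped (an assumption).
pub fn with_alias_entries(
    entries: &[ModelEntry],
    aliases: &HashMap<String, String>,
) -> Vec<ModelEntry> {
    let mut out = entries.to_vec();
    // Sort alias names so the output order is deterministic.
    let mut names: Vec<&String> = aliases.keys().collect();
    names.sort();
    for alias in names {
        let target = &aliases[alias];
        if let Some(t) = entries.iter().find(|e| &e.id == target) {
            // Same loaded flag, feasible_on, and locations; only the id differs.
            out.push(ModelEntry { id: alias.clone(), ..t.clone() });
        }
    }
    out
}
```

With this shape, a client listing models would see both `Qwen/Qwen3-8B` and `helexa/balanced` with identical state, which matches the "pick either" behaviour described in the change list.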
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# models.example.toml — model catalogue
#
# Copy to /etc/cortex/models.toml and adjust for your environment.
# Describes how to serve each model. Cortex matches these profiles
# against discovered neuron topologies for placement decisions; the
# resulting `(catalogue × topology)` set is what `GET /v1/models`
# returns and what the router can cold-load on demand.
#
# Field reference:
#   id                 - HuggingFace model id, exact match.
#   harness            - which engine handles inference (currently "candle").
#   quant              - GGUF quantisation tag for the file in the HF repo
#                        (e.g. "Q4_K_M"). Omit/empty for the dense
#                        safetensors path. TP requires dense.
#   vram_mb            - rough estimate; advisory only, not enforced.
#   min_devices        - GPU count this profile needs. TP profiles use
#                        the same value as the tensor-parallel size.
#   min_device_vram_mb - each device must meet this VRAM floor for the
#                        neuron to be considered "feasible".
#   pinned_on          - optional whitelist of neuron names. Non-empty
#                        narrows feasibility to just those neurons and
#                        protects the model from LRU eviction there.

# Tensor-parallel target — needs a neuron with at least 2 large GPUs.
# The example pins to a specific neuron name; adjust or remove the
# pinned_on entry for your own fleet.
[[models]]
id = "Qwen/Qwen3.6-27B"
harness = "candle"
vram_mb = 54000
min_devices = 2
min_device_vram_mb = 24000
pinned_on = ["your-multi-gpu-neuron"]

# Mid-size dense model — fits on any single GPU with ≥16 GB VRAM.
[[models]]
id = "Qwen/Qwen3-8B"
harness = "candle"
vram_mb = 18000
min_devices = 1
min_device_vram_mb = 16000

# Small GGUF quantised — runs on any small GPU.
[[models]]
id = "unsloth/Qwen3-0.6B-GGUF"
harness = "candle"
quant = "Q4_K_M"
vram_mb = 500
min_devices = 1
min_device_vram_mb = 4000

# -- Tier aliases ------------------------------------------------------------
# Optional. Clients can request inference against an alias (e.g.
# `model: "helexa/small"` in /v1/chat/completions) and cortex
# transparently routes to the concrete model id below — including
# rewriting the body's model field so neuron sees a name that matches
# its loaded handle. Both the alias and the target appear in
# /v1/models so clients can discover either. Operators can swap
# targets here without changing client code.
#
# [aliases]
# "helexa/small" = "Qwen/Qwen3-1.7B"
# "helexa/balanced" = "Qwen/Qwen3-8B"
# "helexa/large" = "Qwen/Qwen3.6-27B"
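The field reference in this file describes when a neuron counts as "feasible" for a profile. One possible reading is sketched below, under stated assumptions: the `Profile` and `Neuron` structs and the `is_feasible` function are hypothetical, and the check counts devices that meet the VRAM floor rather than requiring every device on the neuron to meet it. The actual placement logic in cortex may differ.

```rust
/// Hypothetical subset of a catalogue profile; field names follow the
/// field reference in models.example.toml.
pub struct Profile {
    pub min_devices: usize,
    pub min_device_vram_mb: u64,
    pub pinned_on: Vec<String>,
}

/// Hypothetical view of a discovered neuron: its name and per-GPU VRAM in MB.
pub struct Neuron {
    pub name: String,
    pub device_vram_mb: Vec<u64>,
}

/// True when the pin list (if any) allows this neuron and enough of its
/// devices meet the per-device VRAM floor. `vram_mb` is deliberately
/// ignored here because the file calls it advisory only.
pub fn is_feasible(profile: &Profile, neuron: &Neuron) -> bool {
    // An empty pinned_on list places no restriction on the neuron name.
    let pinned_ok = profile.pinned_on.is_empty()
        || profile.pinned_on.iter().any(|n| n == &neuron.name);
    // Count the devices that meet the floor (assumed interpretation).
    let qualifying = neuron
        .device_vram_mb
        .iter()
        .filter(|&&mb| mb >= profile.min_device_vram_mb)
        .count();
    pinned_ok && qualifying >= profile.min_devices
}
```

Under this reading, the `Qwen/Qwen3.6-27B` profile (`min_devices = 2`, `min_device_vram_mb = 24000`) would be feasible only on a pinned neuron that has at least two GPUs of 24000 MB or more.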