Realises [project-unified-models-endpoint]: cortex now surfaces every
model the operator has provisioned in the catalogue, transparently
cold-loads on the first request, and routes the request once the load
is done — without per-node configuration or client awareness of which
neuron hosts what.
cortex-core changes:
- NodeState gains `discovery: Option<DiscoveryResponse>` — populated
once per neuron on first successful poll, cached forever after
(topology is invariant for a neuron process).
- ModelProfile gains `is_feasible_on(neuron, devices)` with the
pinned_on / min_devices / min_device_vram_mb logic + 5 unit tests.
- CortexModelEntry expanded with OpenAI-compatible (`id`, `object`,
`created`, `owned_by`) plus helexa-specific extension fields
(`loaded`, `feasible_on`, `locations`).
cortex-gateway changes:
- poller.rs: `maybe_poll_discovery` fetches `GET /discovery` once per
neuron and caches on NodeState.
- handlers.rs::list_models rewritten as union of (catalogue × topology
feasibility) + (currently loaded somewhere). Catalogue-defined models
surface even when not yet loaded.
- router.rs::resolve gains priority 3 (catalogue cold-load):
1. loaded somewhere → route there
2. unloaded somewhere → route + lazy load via neuron
3. in catalogue → pick feasible neuron, POST /models/load, wait,
route. Cache the new entry locally so subsequent requests skip
the poll wait.
4. else 404
- pick_feasible_neuron prefers pinned_on neurons, falls back to any
feasible one (stable by name).
- profile_to_spec translates ModelProfile → ModelSpec, picking devices
by VRAM floor and setting tensor_parallel = min_devices for multi-
device profiles.
- "already loaded" responses from neuron are tolerated (two concurrent
requests racing the same cold-load is a benign outcome).
models.example.toml rewritten to reflect the canonical helexa fleet
(beast = 2x RTX 5090, benjy = RTX 4090, quadbrat = RTX 3060) with a
working TP example (Qwen3.6-27B pinned on beast) plus single-GPU
profiles for the smaller models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 1 of the candle-native pivot. Replaces the external-process
harness model (mistralrs over HTTP, llamacpp placeholder) with an
in-process Harness trait whose sole implementation is candle. The
trait keeps its shape so future engines slot in additively, but
start/stop default to no-ops and HarnessConfig drops endpoint and
systemd_unit since no harness needs external supervision.
Behaviour is unchanged on the wire: load_model returns a "not
implemented yet (Stage 2)" error and list_models is empty. The
gateway-side proxy, poller, and router are untouched.
CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are
marked superseded; the staged plan lives in
~/.claude/plans/create-a-more-aggressive-calm-naur.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous defaults collided with well-trodden infra services and with
the Linux ephemeral port range:
- cortex API 8000 — common dev-server default (Django, minio UI)
- cortex metrics 9100 — Prometheus node_exporter default
- neuron API 9090 — Cockpit default on Fedora, Prometheus self
Move to helexa-themed palindromic ports, all below Linux's
32768-60999 ephemeral range and not registered to any well-known
service:
- cortex API 31313
- cortex metrics 31314
- neuron API 13131
Updated places:
- cortex.example.toml, neuron.example.toml defaults
- default impls in cortex-core and neuron config
- cortex-cli --endpoint default for the status subcommand
- doc comments citing example URLs
- README.md and CLAUDE.md snippets
Consumers already on the old ports need a one-line edit in their
/etc/cortex/cortex.toml or /etc/neuron/neuron.toml to match;
firewall rules and prometheus scrape configs will also need
updating.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace NodeConfig (static vram_mb, pinned) with NeuronEndpoint.
Hardware discovery and model pinning now come from neuron API and
models.toml catalogue respectively.
- config.rs: nodes -> neurons, add models_config path
- catalogue.rs: ModelProfile with pinned_on, ModelCatalogue
- poller.rs: poll neuron GET /models (ModelInfo format)
- router.rs: resolve inference endpoint via neuron GET /models/{id}/endpoint
- evictor.rs: call neuron POST /models/unload
- node.rs: remove vram_mb, pinned fields (come from discovery/catalogue)
- All 22 gateway tests updated to mock neuron API
- Remove MistralModelsResponse, ModelLifecycleRequest (no longer needed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .gitea/workflows/ci.yml with fmt/clippy/test on all branches
and SRPM build + COPR publish on version tags
- Add cortex.spec for Fedora RPM packaging
- Add GPL-3.0-or-later LICENSE file
- Add cortex.example.toml with generic hostnames; gitignore cortex.toml
- Scrub infrastructure-specific hostnames from README.md, CLAUDE.md,
and doc comments
- Fix unused imports and clippy warnings to pass -D warnings
- Fix missing deps (bytes, reqwest, serde_json) exposed during build
- Run cargo fmt across workspace
- Update SPDX license identifier to GPL-3.0-or-later
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>