Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT
placement feasibility check that runs before any device allocation,
NCCL handshake, or weight fetch. Replaces today's opaque
"fetch config.json … 404" failure mode (when an operator points
`tensor_parallel = 2` at a GGUF-only repo) with a structured
error that names the failure class and points at the fix.
What lands:
- `crates/neuron/src/harness/preflight.rs` — new module. Classifies
a repo's siblings listing into `SourceFormat` (Gguf | DenseSafetensors
| Mixed | Empty), applies the tp/quant feasibility table, returns a
`PlacementPlan` on success or a typed `PreflightError` on rejection.
`PreflightError` is `serde::Serialize` so the HTTP layer can emit
the structured shape verbatim; it's `thiserror::Error` so log lines
get a single-line Display when downcasting from anyhow. Includes
best-effort Levenshtein-nearest suggestion for malformed quant names
(the second sharp edge the HauhauCS scenario surfaced — operator
writes `q6k` against filenames containing `Q6_K_P`, and today's
matcher just says "no GGUF file matching quant").
- `CandleHarness::load_model` — calls `preflight(...)` first thing
after the "already loaded" guard, before any `ensure_device_worker`
or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so
the existing trait surface is unchanged; the HTTP handler and the
startup logger downcast to recover the structured form.
- `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError`
to 422 Unprocessable Entity with `{"error": {"kind": "...",
"model_id": "...", "suggestion": "..." }}`. Other failures keep
the existing 400 + free-form `format!("{e:#}")` shape.
- `crates/neuron/src/startup.rs::load_default_models` — when the
failure is a preflight rejection, log as `reason=<kind> detail=<msg>`
instead of the opaque `error=<chain>`, so journalctl on beast will
now show `reason=tp_requires_safetensors detail="repo is GGUF-only
(8 .gguf files); TP requires dense safetensors..."` instead of
`error=fetch config.json from HauhauCS/...: 404 Not Found`.
Tests:
- 18 unit tests in `harness/preflight.rs` covering classifier,
quant matching, Levenshtein, error serialization, and the full
feasibility table (gguf+tp rejected, gguf+bad-quant suggests
nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed
prefers safetensors).
- 7 integration tests in `tests/preflight.rs` exercising the
network path through an axum mock that serves hf-hub-compatible
`/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile`
as a dev-dependency for per-test cache dirs.
Out of scope (deferred to subsequent phases):
- Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing,
per-scheme `SourceConfig`, cache disambiguation). Preflight runs
against the single configured HuggingFace source today; the scheme
threading lands cleanly when Phase 1 ships.
- Phase 3 (cortex catalogue source field).
- GGUF tensor-parallel loading. Preflight rejects this combination
with `TpRequiresSafetensors`; the underlying loader gap is the
separate `Helexa` curated-registry / heretic-rs conversation.
Refs #4-#9 architectural follow-up; no specific issue closed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single
/v1/modelscatalogue merging every model that can be served by any neuron in the fleet. - Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle — load on demand, unload cold models, pin
critical ones — by calling each neuron's
/models/{load,unload}API. - Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
cortex-core |
Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
cortex-gateway |
Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
neuron |
Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
cortex-cli |
CLI entrypoint (cortex serve, cortex status, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
Inside the daemon, every CUDA device gets one dedicated OS thread
(named cuda-dev-N) that owns the device's CUDA context for the
daemon's lifetime. Model loads, forward passes, KV-cache resets,
NCCL collectives, VRAM queries, and unloads all route through that
thread via a job channel; tensors never escape it alive. This pins
context binding to a known thread, makes the CUDA Drop contract
structurally safe, and isolates driver-error poisoning to one worker
rather than the whole process. See CLAUDE.md for the design
rationale and crates/neuron/src/harness/device_worker/ for the code.
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0