GGUF-only HF repos (unsloth/Qwen3-*-GGUF, Qwen/Qwen3-*-GGUF) ship the .gguf file but not tokenizer.json — the tokenizer data is embedded in the GGUF metadata itself, and the standalone tokenizer.json lives in the base non-GGUF repo (unsloth/Qwen3-0.6B, Qwen/Qwen3-0.6B, etc.). Live validation against quadbrat hit: HTTP 400 fetch tokenizer.json from unsloth/Qwen3-0.6B-GGUF: HTTP status client error (404 Not Found) resolve_files now derives the tokenizer repo by stripping a `-GGUF` or `-gguf` suffix from the model_id; non-GGUF ids fall through to fetching from the same repo. The error message includes the attempted tokenizer repo id so the next failure (e.g. base repo doesn't exist) is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single
/v1/modelscatalogue merging every model that can be served by any neuron in the fleet. - Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle — load on demand, unload cold models, pin
critical ones — by calling each neuron's
/models/{load,unload}API. - Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
cortex-core |
Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
cortex-gateway |
Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
neuron |
Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
cortex-cli |
CLI entrypoint (cortex serve, cortex status, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0