Stage 1 of the candle-native pivot. Replaces the external-process harness model (mistralrs over HTTP, llamacpp placeholder) with an in-process Harness trait whose sole implementation is candle. The trait keeps its shape so future engines slot in additively, but start/stop default to no-ops and HarnessConfig drops endpoint and systemd_unit since no harness needs external supervision. Behaviour is unchanged on the wire: load_model returns a "not implemented yet (Stage 2)" error and list_models is empty. The gateway-side proxy, poller, and router are untouched. CLAUDE.md Phase 11 (llama.cpp) and Phase 12 (mistral.rs COPR) are marked superseded; the staged plan lives in ~/.claude/plans/create-a-more-aggressive-calm-naur.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
131 lines
4.5 KiB
Markdown
131 lines
4.5 KiB
Markdown
# cortex
|
|
|
|
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
|
|
clusters. Cortex sits in front of one or more `neuron` daemons (each running
|
|
candle-based inference on a local GPU host) and presents a unified OpenAI +
|
|
Anthropic compatible API surface.
|
|
|
|
## Problem
|
|
|
|
Running local LLMs across multiple GPU nodes (different VRAM tiers, different
|
|
model affinities) requires a unified API surface that:
|
|
|
|
- Presents a **single `/v1/models` catalogue** merging every model that can be
|
|
served by any neuron in the fleet.
|
|
- **Routes requests** to the correct node based on where a model is loaded
|
|
(or can be loaded), handling cold-load and eviction transparently.
|
|
- Manages **model lifecycle** — load on demand, unload cold models, pin
|
|
critical ones — by calling each neuron's `/models/{load,unload}` API.
|
|
- Translates between **OpenAI and Anthropic** request/response envelopes so
|
|
every client speaks whichever dialect it prefers.
|
|
- Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
|
|
them as Prometheus counters/histograms.
|
|
|
|
## Architecture
|
|
|
|
```
|
|
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
|
|
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
|
|
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
|
|
│ │ │ │
|
|
└────────────────┴──────┬───────┴───────────────┘
|
|
│
|
|
┌──────────▼──────────┐
|
|
│ cortex │
|
|
│ (cortex-gateway) │
|
|
│ │
|
|
│ Router · Metrics │
|
|
│ Evictor · Translate│
|
|
└──┬──────┬────────┬──┘
|
|
│ │ │
|
|
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
|
|
│ neuron │ │ neuron │ │ neuron │
|
|
│ :13131 │ │ :13131 │ │ :13131 │
|
|
│ candle │ │ candle │ │ candle │
|
|
└───────────┘ └────────┘ └───────────┘
|
|
private network (.internal)
|
|
```
|
|
|
|
### Crates
|
|
|
|
| Crate | Purpose |
|
|
|---|---|
|
|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
|
|
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
|
|
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
|
|
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
|
|
|
|
## Node setup
|
|
|
|
Each GPU node runs `neuron` (listening on `:13131`). Neuron uses
|
|
huggingface/candle for in-process inference — there is no external
|
|
inference subprocess to manage.
|
|
|
|
The neuron RPM (`helexa-neuron`) ships a systemd unit:
|
|
|
|
```sh
|
|
dnf copr enable helexa/helexa
|
|
dnf install helexa-neuron
|
|
systemctl enable --now neuron
|
|
```
|
|
|
|
## Gateway config
|
|
|
|
```toml
|
|
# /etc/cortex/cortex.toml
|
|
[gateway]
|
|
listen = "0.0.0.0:31313"
|
|
metrics_listen = "0.0.0.0:31314"
|
|
|
|
[eviction]
|
|
strategy = "lru" # lru | priority
|
|
defrag_after_cycles = 50
|
|
|
|
[[neurons]]
|
|
name = "beast"
|
|
endpoint = "http://beast.internal:13131"
|
|
|
|
[[neurons]]
|
|
name = "benjy"
|
|
endpoint = "http://benjy.internal:13131"
|
|
```
|
|
|
|
Model placement profiles live in `models.toml` — see `models.example.toml`.
|
|
|
|
## Building
|
|
|
|
```sh
|
|
cargo build --release
|
|
```
|
|
|
|
## CI
|
|
|
|
Every push triggers format, lint, and test checks. Ensure these pass
|
|
locally before pushing:
|
|
|
|
```sh
|
|
cargo fmt --check --all # must be clean
|
|
cargo clippy --workspace -- -D warnings # warnings are errors
|
|
cargo test --workspace # all tests must pass
|
|
```
|
|
|
|
Tagged releases (`v*`) additionally build SRPMs for both `cortex` and
|
|
`helexa-neuron` and publish to COPR.
|
|
|
|
## Running
|
|
|
|
```sh
|
|
# start the gateway
|
|
cortex serve --config /etc/cortex/cortex.toml
|
|
|
|
# check fleet status
|
|
cortex status
|
|
|
|
# list all models across nodes
|
|
curl http://localhost:31313/v1/models
|
|
```
|
|
|
|
## License
|
|
|
|
GPL-3.0
|