CLAUDE.md — cortex
Project overview
cortex is a Rust reverse-proxy that sits in front of multiple mistral.rs inference nodes and presents a unified OpenAI + Anthropic compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.
Repository layout
cortex/
├── Cargo.toml                  # workspace root
├── cortex.toml                 # example gateway config
├── README.md
├── CLAUDE.md                   # ← you are here
├── crates/
│   ├── cortex-core/            # shared types, config, envelopes
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── config.rs       # figment-based config structs
│   │       ├── node.rs         # NodeState, ModelStatus
│   │       ├── openai.rs       # OpenAI request/response types
│   │       ├── anthropic.rs    # Anthropic request/response types
│   │       ├── translate.rs    # OpenAI <-> Anthropic translation
│   │       └── metrics.rs      # RequestMetrics, histogram helpers
│   ├── cortex-gateway/         # the HTTP proxy server
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── state.rs        # CortexState: Arc<RwLock<...>>
│   │       ├── router.rs       # model -> node routing logic
│   │       ├── proxy.rs        # streaming HTTP proxy to backends
│   │       ├── evictor.rs      # LRU/priority eviction logic
│   │       ├── poller.rs       # background task polling node status
│   │       ├── handlers.rs     # axum handlers (chat, completions, models, etc.)
│   │       └── metrics.rs      # prometheus exporter endpoint
│   ├── cortex-agent/           # per-node sidecar (future: defrag, restart)
│   │   └── src/
│   │       ├── lib.rs
│   │       └── agent.rs        # local node management
│   └── cortex-cli/             # CLI entrypoint
│       └── src/
│           └── main.rs
└── tests/                      # integration tests (future)
Key design decisions
mistral.rs HTTP API for model lifecycle
mistral.rs (v0.8+) supports dynamic model loading/unloading at runtime:
- POST /v1/models/unload {"model_id": "..."}: frees VRAM, preserves config
- POST /v1/models/reload {"model_id": "..."}: explicitly reload
- POST /v1/models/status {"model_id": "..."}: loaded/unloaded/reloading
- GET /v1/models: lists all models with status field
- Lazy loading: requests to unloaded models trigger automatic reload
The gateway does NOT manage systemd units for model swaps. It calls these HTTP endpoints directly. The only systemd interaction is for full-process restarts after VRAM fragmentation accumulates (defrag_after_cycles).
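As a rough illustration of how the gateway might describe these calls before handing them to an HTTP client, here is a minimal std-only sketch. The `LifecycleCall` enum and the `url` and `escape_json` helpers are hypothetical names, not existing cortex code; only the methods, paths, and the `model_id` body field come from the list above.

```rust
/// One of the mistral.rs lifecycle calls listed above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleCall {
    Unload,
    Reload,
    Status,
    List,
}

impl LifecycleCall {
    /// HTTP method: only the model listing is a GET.
    pub fn method(self) -> &'static str {
        match self {
            LifecycleCall::List => "GET",
            _ => "POST",
        }
    }

    /// Request path on the backend node.
    pub fn path(self) -> &'static str {
        match self {
            LifecycleCall::Unload => "/v1/models/unload",
            LifecycleCall::Reload => "/v1/models/reload",
            LifecycleCall::Status => "/v1/models/status",
            LifecycleCall::List => "/v1/models",
        }
    }

    /// JSON body for the call, or `None` for the body-less `GET /v1/models`.
    pub fn body(self, model_id: &str) -> Option<String> {
        match self {
            LifecycleCall::List => None,
            _ => Some(format!("{{\"model_id\": \"{}\"}}", escape_json(model_id))),
        }
    }
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Full URL for a call against a node's base URL, tolerating a trailing slash.
pub fn url(base: &str, call: LifecycleCall) -> String {
    format!("{}{}", base.trim_end_matches('/'), call.path())
}
```

In the real gateway the body would more likely be produced with serde and sent through reqwest; the hand-rolled escaping is only there to keep the sketch dependency-free.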
Streaming proxy
Chat completions are proxied as SSE streams. The gateway must:
- Parse the inbound request to extract the model name
- Route to the correct backend node
- Stream the response back, capturing token timing for metrics
- NOT buffer the full response — true streaming passthrough
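The passthrough requirement can be sketched without any async machinery. The example below assumes the backend emits OpenAI-style SSE lines (`data: {...}` followed by `data: [DONE]`) and models the stream as an iterator of `(elapsed, line)` pairs; `StreamTiming` and `relay_sse` are hypothetical names, and treating the first `data:` event as the first token is an approximation.

```rust
use std::time::Duration;

/// Timing observed while relaying one response stream (hypothetical type).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct StreamTiming {
    /// Offset from request start of the first non-empty `data:` event.
    pub time_to_first_token: Option<Duration>,
    /// Number of `data:` events seen, excluding the `[DONE]` sentinel.
    pub data_events: u64,
    /// True once `data: [DONE]` has been relayed.
    pub finished: bool,
}

/// Relays SSE lines to `sink` one at a time, never holding more than the
/// current line. `lines` yields `(elapsed_since_request_start, line)` pairs;
/// in a real proxy these would come from the backend's response body stream.
pub fn relay_sse<I, F>(lines: I, mut sink: F) -> StreamTiming
where
    I: IntoIterator<Item = (Duration, String)>,
    F: FnMut(&str),
{
    let mut timing = StreamTiming::default();
    for (elapsed, line) in lines {
        // Forward first so that bookkeeping never delays the client.
        sink(&line);
        if let Some(payload) = line.strip_prefix("data:") {
            let payload = payload.trim();
            if payload == "[DONE]" {
                timing.finished = true;
            } else if !payload.is_empty() {
                timing.data_events += 1;
                if timing.time_to_first_token.is_none() {
                    timing.time_to_first_token = Some(elapsed);
                }
            }
        }
    }
    timing
}
```

Because each line is handed to `sink` before it is inspected, memory use stays constant regardless of response length, which is the property the "NOT buffer" rule is after.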
Anthropic translation
When a request arrives at /v1/messages (Anthropic format), the gateway
translates it to OpenAI format before proxying to mistral.rs, then
translates the response back. This is a stateless envelope transformation.
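A minimal sketch of the request direction, under simplifying assumptions: message content is a plain string (both APIs also accept structured content blocks), and only a handful of fields are carried over. The struct and function names are hypothetical and do not reflect the actual contents of translate.rs.

```rust
/// Simplified chat message; real payloads also allow structured content blocks.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Subset of an Anthropic `/v1/messages` request (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct AnthropicRequest {
    pub model: String,
    /// Anthropic carries the system prompt as a top-level field.
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
    pub stream: bool,
}

/// Subset of an OpenAI chat completions request (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
    pub stream: bool,
}

/// Anthropic -> OpenAI: the top-level system prompt becomes a leading
/// `system` message; everything else is copied across unchanged.
pub fn to_openai(req: AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    if let Some(system) = req.system {
        messages.push(Message { role: "system".into(), content: system });
    }
    messages.extend(req.messages);
    OpenAiRequest {
        model: req.model,
        messages,
        max_tokens: req.max_tokens,
        stream: req.stream,
    }
}

/// OpenAI `finish_reason` -> Anthropic `stop_reason` for the response path.
/// Mapping every other value to `end_turn` is a simplification.
pub fn to_anthropic_stop_reason(finish_reason: &str) -> &'static str {
    match finish_reason {
        "length" => "max_tokens",
        "tool_calls" => "tool_use",
        _ => "end_turn",
    }
}
```

Both functions are pure and keep no per-conversation state, which is what makes the translation a stateless envelope transformation.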
Eviction
The evictor runs as a background task. Before loading a model on a node where VRAM is tight:
- Check if the model is already loaded elsewhere → route there instead
- Find the LRU model on the target node (excluding pinned models)
- Call /v1/models/unload on that model
- The incoming request's lazy-load triggers the new model load
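The decision part of these steps can be sketched as a pure function that returns what to do rather than doing it. The types below are hypothetical, and `last_used` is assumed to be a monotonic tick where a smaller value means an older access.

```rust
/// A model currently resident on a node (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct LoadedModel {
    pub id: String,
    /// Monotonic tick of the last request served; smaller means older.
    pub last_used: u64,
    pub pinned: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub name: String,
    pub loaded: Vec<LoadedModel>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// The model is already loaded on this node; no eviction needed.
    RouteTo(String),
    /// Unload `victim` on `node`; the request's lazy-load then loads the new model.
    EvictThenLoad { node: String, victim: String },
    /// The target is unknown or holds only pinned models.
    NoCandidate,
}

pub fn plan_load(model: &str, target: &str, fleet: &[Node]) -> Plan {
    // Step 1: prefer any node that already has the model loaded.
    if let Some(node) = fleet
        .iter()
        .find(|n| n.loaded.iter().any(|m| m.id == model))
    {
        return Plan::RouteTo(node.name.clone());
    }
    // Step 2: pick the least recently used unpinned model on the target node.
    let victim = fleet
        .iter()
        .find(|n| n.name == target)
        .and_then(|n| n.loaded.iter().filter(|m| !m.pinned).min_by_key(|m| m.last_used));
    match victim {
        Some(m) => Plan::EvictThenLoad {
            node: target.to_string(),
            victim: m.id.clone(),
        },
        None => Plan::NoCandidate,
    }
}
```

Keeping the plan separate from the HTTP calls means the evictor's choice can be unit tested without a running backend.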
Metrics
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port.
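One possible shape for the per-request record is sketched below. The field names follow the list above; the constructor and the definition of `tok_per_sec` (completion tokens divided by the time elapsed after the first token) are assumptions, since this document does not say how throughput is measured.

```rust
use std::time::Duration;

/// Per-request metrics record; field names follow the list above.
#[derive(Debug, Clone, PartialEq)]
pub struct RequestMetrics {
    pub model: String,
    pub node: String,
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    pub tok_per_sec: f64,
    pub time_to_first_token_ms: u64,
    pub total_latency_ms: u64,
}

impl RequestMetrics {
    /// Hypothetical constructor deriving the computed fields from raw timings.
    pub fn from_timing(
        model: &str,
        node: &str,
        prompt_tokens: u32,
        completion_tokens: u32,
        ttft: Duration,
        total: Duration,
    ) -> Self {
        // Assumption: throughput covers only the generation window, i.e. the
        // time after the first token, so prompt processing does not skew it.
        let generation_secs = total.saturating_sub(ttft).as_secs_f64();
        let tok_per_sec = if generation_secs > 0.0 {
            f64::from(completion_tokens) / generation_secs
        } else {
            0.0
        };
        RequestMetrics {
            model: model.to_string(),
            node: node.to_string(),
            prompt_tokens,
            completion_tokens,
            total_tokens: prompt_tokens + completion_tokens,
            tok_per_sec,
            time_to_first_token_ms: ttft.as_millis() as u64,
            total_latency_ms: total.as_millis() as u64,
        }
    }
}
```

A record like this would then be fed into the histograms and counters exposed by the metrics exporter.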
Tech stack
- Rust 2024 edition — workspace with 4 crates
- Axum 0.8 — HTTP framework (same as mistral.rs itself)
- reqwest — HTTP client for proxying to backends
- figment — config loading (TOML + env vars)
- tokio — async runtime
- metrics + metrics-exporter-prometheus — observability
- tracing — structured logging
Build commands
cargo build --release # build all crates
cargo run -p cortex-cli -- serve # run the gateway
cargo test # run all tests
cargo clippy --workspace # lint
CI
Gitea Actions runs on every push to any branch. All three checks must pass before merging:
cargo fmt --check --all # formatting
cargo clippy --workspace -- -D warnings # lint (warnings are errors)
cargo test --workspace # tests
Run these locally before pushing. cargo fmt --all fixes formatting
automatically. Clippy warnings must be resolved, not suppressed with
#[allow(...)] unless there is a clear rationale.
Environment
- Targets Fedora 43 (systemd, SELinux enforcing)
- Nodes communicate over a private network (e.g. WireGuard mesh)
- One or more GPU nodes running mistral.rs on port 8080
- Optionally a metrics-only node (no GPU) for Prometheus/Grafana
- Each node runs mistralrs serve on port 8080
- Gateway listens on port 8000 (API) and 9100 (metrics)
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
Conventions
- Error handling: anyhow for binaries, thiserror for library crates
- No unwrap() in library code; expect() only with clear rationale
- All public types derive Debug, Clone, Serialize, Deserialize where sensible
- Config structs use figment with TOML as primary source, env vars as override
- Prefer Arc<RwLock<...>> for shared fleet state; minimize lock duration
- SSE streaming uses tokio_stream + eventsource-stream for parsing
- Log at info for request routing, debug for proxy details, warn for eviction and node health, error for proxy failures
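To illustrate the "minimize lock duration" convention, here is a small sketch using `std::sync::RwLock`; the gateway may well use tokio's async `RwLock` instead, and `FleetState`, `lookup_route`, and `set_route` are hypothetical names. The idea is to copy the needed value out under the lock and release it before any network I/O.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Shared fleet view (hypothetical): model id -> base URL of the node serving it.
#[derive(Debug, Clone, Default)]
pub struct FleetState {
    pub routes: HashMap<String, String>,
}

pub type SharedFleet = Arc<RwLock<FleetState>>;

/// Copies the route out under a short read lock. The guard is dropped when
/// this function returns, so the caller proxies the request without holding it.
pub fn lookup_route(fleet: &SharedFleet, model: &str) -> Option<String> {
    // A poisoned lock is treated as "no route" rather than unwrapped.
    let state = fleet.read().ok()?;
    state.routes.get(model).cloned()
}

/// Records a route under a short write lock; returns false if the lock is poisoned.
pub fn set_route(fleet: &SharedFleet, model: &str, node_url: &str) -> bool {
    match fleet.write() {
        Ok(mut state) => {
            state.routes.insert(model.to_string(), node_url.to_string());
            true
        }
        Err(_) => false,
    }
}
```

Returning an owned `String` costs one clone per lookup, but it keeps readers from blocking the poller's writes while a slow backend request is in flight.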
Current status
Scaffold phase: crate structure, types, and handler stubs are in place. The following items still need implementation:
- cortex-core: Flesh out OpenAI/Anthropic envelope types with all fields needed for chat completions (streaming + non-streaming)
- cortex-gateway/proxy.rs: Implement streaming HTTP proxy with SSE passthrough
- cortex-gateway/router.rs: Model-to-node routing with fallback to least-loaded
- cortex-gateway/evictor.rs: LRU eviction with pinning support
- cortex-gateway/poller.rs: Background polling of node /v1/models endpoints
- cortex-gateway/handlers.rs: Wire up axum routes to proxy logic
- cortex-core/translate.rs: OpenAI <-> Anthropic request/response translation
- cortex-agent: Sidecar for VRAM defrag restarts (lower priority)
- Integration tests: Mock mistralrs backends, test routing + eviction
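As a starting point for the router.rs item, the routing rule with a least-loaded fallback might look like the sketch below. It assumes "load" means the number of in-flight requests and that the poller maintains a `healthy` flag; both are assumptions, and `NodeLoad` and `route` are hypothetical names.

```rust
/// Snapshot of one node as seen by the router (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct NodeLoad {
    pub name: String,
    pub healthy: bool,
    pub loaded_models: Vec<String>,
    /// Requests currently in flight; "load" is assumed to mean this.
    pub in_flight: u32,
}

/// Prefers a healthy node that already has `model` loaded (the least busy one
/// if there are several); otherwise falls back to the least-loaded healthy
/// node, where the request's lazy-load would bring the model in.
pub fn route<'a>(model: &str, nodes: &'a [NodeLoad]) -> Option<&'a NodeLoad> {
    let warm = nodes
        .iter()
        .filter(|n| n.healthy && n.loaded_models.iter().any(|m| m == model))
        .min_by_key(|n| n.in_flight);
    warm.or_else(|| {
        nodes
            .iter()
            .filter(|n| n.healthy)
            .min_by_key(|n| n.in_flight)
    })
}
```

The function returns `None` only when no healthy node exists, which a handler could translate into a 503 response.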