feat: scaffold cortex workspace

Rust reverse-proxy for multi-node mistral.rs inference clusters. Includes crate structure (cortex-core, cortex-gateway, cortex-agent, cortex-cli), config loading, OpenAI/Anthropic translation stubs, model routing, eviction, polling, and streaming proxy scaffolding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:13:30 +03:00
commit 0da68833af
28 changed files with 4659 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,141 @@
+# CLAUDE.md — cortex
+
+## Project overview
+
+cortex is a Rust reverse-proxy that sits in front of multiple
+mistral.rs inference nodes and presents a unified OpenAI + Anthropic
+compatible API surface. It handles model routing, lifecycle management
+(load/unload/evict), request translation, and metrics collection.
+
+## Repository layout
+
+```
+cortex/
+├── Cargo.toml              # workspace root
+├── cortex.toml      # example gateway config
+├── README.md
+├── CLAUDE.md               # ← you are here
+├── crates/
+│   ├── cortex-core/            # shared types, config, envelopes
+│   │   └── src/
+│   │       ├── lib.rs
+│   │       ├── config.rs       # figment-based config structs
+│   │       ├── node.rs         # NodeState, ModelStatus
+│   │       ├── openai.rs       # OpenAI request/response types
+│   │       ├── anthropic.rs    # Anthropic request/response types
+│   │       ├── translate.rs    # OpenAI <-> Anthropic translation
+│   │       └── metrics.rs      # RequestMetrics, histogram helpers
+│   ├── cortex-gateway/         # the HTTP proxy server
+│   │   └── src/
+│   │       ├── lib.rs
+│   │       ├── state.rs        # CortexState: Arc<RwLock<...>>
+│   │       ├── router.rs       # model -> node routing logic
+│   │       ├── proxy.rs        # streaming HTTP proxy to backends
+│   │       ├── evictor.rs      # LRU/priority eviction logic
+│   │       ├── poller.rs       # background task polling node status
+│   │       ├── handlers.rs     # axum handlers (chat, completions, models, etc.)
+│   │       └── metrics.rs      # prometheus exporter endpoint
+│   ├── cortex-agent/           # per-node sidecar (future: defrag, restart)
+│   │   └── src/
+│   │       ├── lib.rs
+│   │       └── agent.rs        # local node management
+│   └── cortex-cli/             # CLI entrypoint
+│       └── src/
+│           └── main.rs
+└── tests/                  # integration tests (future)
+```
+
+## Key design decisions
+
+### mistral.rs HTTP API for model lifecycle
+mistral.rs (v0.8+) supports dynamic model loading/unloading at runtime:
+- `POST /v1/models/unload {"model_id": "..."}` — frees VRAM, preserves config
+- `POST /v1/models/reload {"model_id": "..."}` — explicitly reload
+- `POST /v1/models/status {"model_id": "..."}` — loaded/unloaded/reloading
+- `GET /v1/models` — lists all models with status field
+- Lazy loading: requests to unloaded models trigger automatic reload
+
+The gateway does NOT manage systemd units for model swaps. It calls these
+HTTP endpoints directly. The only systemd interaction is for full-process
+restarts after VRAM fragmentation accumulates (defrag_after_cycles).
+
+### Streaming proxy
+Chat completions are proxied as SSE streams. The gateway must:
+1. Parse the inbound request to extract the model name
+2. Route to the correct backend node
+3. Stream the response back, capturing token timing for metrics
+4. NOT buffer the full response — true streaming passthrough
+
+### Anthropic translation
+When a request arrives at `/v1/messages` (Anthropic format), the gateway
+translates it to OpenAI format before proxying to mistral.rs, then
+translates the response back. This is stateless envelope transformation.
+
+### Eviction
+The evictor runs as a background task. Before loading a model on a node
+where VRAM is tight:
+1. Check if the model is already loaded elsewhere → route there instead
+2. Find the LRU model on the target node (excluding pinned models)
+3. Call `/v1/models/unload` on that model
+4. The incoming request's lazy-load triggers the new model load
+
+### Metrics
+Per-request: model, node, prompt_tokens, completion_tokens, total_tokens,
+tok_per_sec, time_to_first_token_ms, total_latency_ms.
+Exposed as Prometheus histograms/counters on a separate port.
+
+## Tech stack
+
+- **Rust 2024 edition** — workspace with 4 crates
+- **Axum 0.8** — HTTP framework (same as mistral.rs itself)
+- **reqwest** — HTTP client for proxying to backends
+- **figment** — config loading (TOML + env vars)
+- **tokio** — async runtime
+- **metrics + metrics-exporter-prometheus** — observability
+- **tracing** — structured logging
+
+## Build commands
+
+```sh
+cargo build --release           # build all crates
+cargo run -p cortex-cli -- serve    # run the gateway
+cargo test                      # run all tests
+cargo clippy --workspace        # lint
+```
+
+## Environment
+
+- Targets Fedora 43 (systemd, SELinux enforcing)
+- Nodes communicate over a private network (e.g. WireGuard mesh)
+  - One or more GPU nodes running mistral.rs on port 8080
+  - Optionally a metrics-only node (no GPU) for Prometheus/Grafana
+- Each node runs `mistralrs serve` on port 8080
+- Gateway listens on port 8000 (API) and 9100 (metrics)
+- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
+
+## Conventions
+
+- Error handling: `anyhow` for binaries, `thiserror` for library crates
+- No `unwrap()` in library code; `expect()` only with clear rationale
+- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
+- Config structs use `figment` with TOML as primary source, env vars as override
+- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
+- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
+- Log at `info` for request routing, `debug` for proxy details, `warn` for
+  eviction and node health, `error` for proxy failures
+
+## Current status
+
+**Scaffold phase** — crate structure, types, and handler stubs are in place.
+The following needs implementation:
+
+1. **cortex-core**: Flesh out OpenAI/Anthropic envelope types with all fields
+   needed for chat completions (streaming + non-streaming)
+2. **cortex-gateway/proxy.rs**: Implement streaming HTTP proxy with SSE passthrough
+3. **cortex-gateway/router.rs**: Model-to-node routing with fallback to least-loaded
+4. **cortex-gateway/evictor.rs**: LRU eviction with pinning support
+5. **cortex-gateway/poller.rs**: Background polling of node `/v1/models` endpoints
+6. **cortex-gateway/handlers.rs**: Wire up axum routes to proxy logic
+7. **cortex-core/translate.rs**: OpenAI <-> Anthropic request/response translation
+8. **cortex-agent**: Sidecar for VRAM defrag restarts (lower priority)
+9. **Integration tests**: Mock mistralrs backends, test routing + eviction