feat: scaffold cortex workspace
Rust reverse-proxy for multi-node mistral.rs inference clusters. Includes crate structure (cortex-core, cortex-gateway, cortex-agent, cortex-cli), config loading, OpenAI/Anthropic translation stubs, model routing, eviction, polling, and streaming proxy scaffolding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
141
CLAUDE.md
Normal file
141
CLAUDE.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# CLAUDE.md — cortex
|
||||
|
||||
## Project overview
|
||||
|
||||
cortex is a Rust reverse-proxy that sits in front of multiple
|
||||
mistral.rs inference nodes and presents a unified OpenAI + Anthropic
|
||||
compatible API surface. It handles model routing, lifecycle management
|
||||
(load/unload/evict), request translation, and metrics collection.
|
||||
|
||||
## Repository layout
|
||||
|
||||
```
|
||||
cortex/
|
||||
├── Cargo.toml # workspace root
|
||||
├── cortex.toml # example gateway config
|
||||
├── README.md
|
||||
├── CLAUDE.md # ← you are here
|
||||
├── crates/
|
||||
│ ├── cortex-core/ # shared types, config, envelopes
|
||||
│ │ └── src/
|
||||
│ │ ├── lib.rs
|
||||
│ │ ├── config.rs # figment-based config structs
|
||||
│ │ ├── node.rs # NodeState, ModelStatus
|
||||
│ │ ├── openai.rs # OpenAI request/response types
|
||||
│ │ ├── anthropic.rs # Anthropic request/response types
|
||||
│ │ ├── translate.rs # OpenAI <-> Anthropic translation
|
||||
│ │ └── metrics.rs # RequestMetrics, histogram helpers
|
||||
│ ├── cortex-gateway/ # the HTTP proxy server
|
||||
│ │ └── src/
|
||||
│ │ ├── lib.rs
|
||||
│ │ ├── state.rs # CortexState: Arc<RwLock<...>>
|
||||
│ │ ├── router.rs # model -> node routing logic
|
||||
│ │ ├── proxy.rs # streaming HTTP proxy to backends
|
||||
│ │ ├── evictor.rs # LRU/priority eviction logic
|
||||
│ │ ├── poller.rs # background task polling node status
|
||||
│ │ ├── handlers.rs # axum handlers (chat, completions, models, etc.)
|
||||
│ │ └── metrics.rs # prometheus exporter endpoint
|
||||
│ ├── cortex-agent/ # per-node sidecar (future: defrag, restart)
|
||||
│ │ └── src/
|
||||
│ │ ├── lib.rs
|
||||
│ │ └── agent.rs # local node management
|
||||
│ └── cortex-cli/ # CLI entrypoint
|
||||
│ └── src/
|
||||
│ └── main.rs
|
||||
└── tests/ # integration tests (future)
|
||||
```
|
||||
|
||||
## Key design decisions
|
||||
|
||||
### mistral.rs HTTP API for model lifecycle
|
||||
mistral.rs (v0.8+) supports dynamic model loading/unloading at runtime:
|
||||
- `POST /v1/models/unload {"model_id": "..."}` — frees VRAM, preserves config
|
||||
- `POST /v1/models/reload {"model_id": "..."}` — explicitly reload
|
||||
- `POST /v1/models/status {"model_id": "..."}` — loaded/unloaded/reloading
|
||||
- `GET /v1/models` — lists all models with status field
|
||||
- Lazy loading: requests to unloaded models trigger automatic reload
|
||||
|
||||
The gateway does NOT manage systemd units for model swaps. It calls these
|
||||
HTTP endpoints directly. The only systemd interaction is for full-process
|
||||
restarts after VRAM fragmentation accumulates (defrag_after_cycles).
|
||||
|
||||
### Streaming proxy
|
||||
Chat completions are proxied as SSE streams. The gateway must:
|
||||
1. Parse the inbound request to extract the model name
|
||||
2. Route to the correct backend node
|
||||
3. Stream the response back, capturing token timing for metrics
|
||||
4. NOT buffer the full response — true streaming passthrough
|
||||
|
||||
### Anthropic translation
|
||||
When a request arrives at `/v1/messages` (Anthropic format), the gateway
|
||||
translates it to OpenAI format before proxying to mistral.rs, then
|
||||
translates the response back. This is stateless envelope transformation.
|
||||
|
||||
### Eviction
|
||||
The evictor runs as a background task. Before loading a model on a node
|
||||
where VRAM is tight:
|
||||
1. Check if the model is already loaded elsewhere → route there instead
|
||||
2. Find the LRU model on the target node (excluding pinned models)
|
||||
3. Call `/v1/models/unload` on that model
|
||||
4. The incoming request's lazy-load triggers the new model load
|
||||
|
||||
### Metrics
|
||||
Per-request: model, node, prompt_tokens, completion_tokens, total_tokens,
|
||||
tok_per_sec, time_to_first_token_ms, total_latency_ms.
|
||||
Exposed as Prometheus histograms/counters on a separate port.
|
||||
|
||||
## Tech stack
|
||||
|
||||
- **Rust 2024 edition** — workspace with 4 crates
|
||||
- **Axum 0.8** — HTTP framework (same as mistral.rs itself)
|
||||
- **reqwest** — HTTP client for proxying to backends
|
||||
- **figment** — config loading (TOML + env vars)
|
||||
- **tokio** — async runtime
|
||||
- **metrics + metrics-exporter-prometheus** — observability
|
||||
- **tracing** — structured logging
|
||||
|
||||
## Build commands
|
||||
|
||||
```sh
|
||||
cargo build --release # build all crates
|
||||
cargo run -p cortex-cli -- serve # run the gateway
|
||||
cargo test # run all tests
|
||||
cargo clippy --workspace # lint
|
||||
```
|
||||
|
||||
## Environment
|
||||
|
||||
- Targets Fedora 43 (systemd, SELinux enforcing)
|
||||
- Nodes communicate over a private network (e.g. WireGuard mesh)
|
||||
- One or more GPU nodes running mistral.rs on port 8080
|
||||
- Optionally a metrics-only node (no GPU) for Prometheus/Grafana
|
||||
- Each node runs `mistralrs serve` on port 8080
|
||||
- Gateway listens on port 8000 (API) and 9100 (metrics)
|
||||
- TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard
|
||||
|
||||
## Conventions
|
||||
|
||||
- Error handling: `anyhow` for binaries, `thiserror` for library crates
|
||||
- No `unwrap()` in library code; `expect()` only with clear rationale
|
||||
- All public types derive `Debug, Clone, Serialize, Deserialize` where sensible
|
||||
- Config structs use `figment` with TOML as primary source, env vars as override
|
||||
- Prefer `Arc<RwLock<...>>` for shared fleet state; minimize lock duration
|
||||
- SSE streaming uses `tokio_stream` + `eventsource-stream` for parsing
|
||||
- Log at `info` for request routing, `debug` for proxy details, `warn` for
|
||||
eviction and node health, `error` for proxy failures
|
||||
|
||||
## Current status
|
||||
|
||||
**Scaffold phase** — crate structure, types, and handler stubs are in place.
|
||||
The following needs implementation:
|
||||
|
||||
1. **cortex-core**: Flesh out OpenAI/Anthropic envelope types with all fields
|
||||
needed for chat completions (streaming + non-streaming)
|
||||
2. **cortex-gateway/proxy.rs**: Implement streaming HTTP proxy with SSE passthrough
|
||||
3. **cortex-gateway/router.rs**: Model-to-node routing with fallback to least-loaded
|
||||
4. **cortex-gateway/evictor.rs**: LRU eviction with pinning support
|
||||
5. **cortex-gateway/poller.rs**: Background polling of node `/v1/models` endpoints
|
||||
6. **cortex-gateway/handlers.rs**: Wire up axum routes to proxy logic
|
||||
7. **cortex-core/translate.rs**: OpenAI <-> Anthropic request/response translation
|
||||
8. **cortex-agent**: Sidecar for VRAM defrag restarts (lower priority)
|
||||
9. **Integration tests**: Mock mistralrs backends, test routing + eviction
|
||||
Reference in New Issue
Block a user