helexa/cortex

rob thijssen 3ad8c72276

docs: add CI expectations to CLAUDE.md and README.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 18:27:17 +03:00

CLAUDE.md — cortex

Project overview

cortex is a Rust reverse-proxy that sits in front of multiple mistral.rs inference nodes and presents a unified OpenAI + Anthropic compatible API surface. It handles model routing, lifecycle management (load/unload/evict), request translation, and metrics collection.

Repository layout

cortex/
├── Cargo.toml              # workspace root
├── cortex.toml             # example gateway config
├── README.md
├── CLAUDE.md               # ← you are here
├── crates/
│   ├── cortex-core/            # shared types, config, envelopes
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── config.rs       # figment-based config structs
│   │       ├── node.rs         # NodeState, ModelStatus
│   │       ├── openai.rs       # OpenAI request/response types
│   │       ├── anthropic.rs    # Anthropic request/response types
│   │       ├── translate.rs    # OpenAI <-> Anthropic translation
│   │       └── metrics.rs      # RequestMetrics, histogram helpers
│   ├── cortex-gateway/         # the HTTP proxy server
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── state.rs        # CortexState: Arc<RwLock<...>>
│   │       ├── router.rs       # model -> node routing logic
│   │       ├── proxy.rs        # streaming HTTP proxy to backends
│   │       ├── evictor.rs      # LRU/priority eviction logic
│   │       ├── poller.rs       # background task polling node status
│   │       ├── handlers.rs     # axum handlers (chat, completions, models, etc.)
│   │       └── metrics.rs      # prometheus exporter endpoint
│   ├── cortex-agent/           # per-node sidecar (future: defrag, restart)
│   │   └── src/
│   │       ├── lib.rs
│   │       └── agent.rs        # local node management
│   └── cortex-cli/             # CLI entrypoint
│       └── src/
│           └── main.rs
└── tests/                  # integration tests (future)

Key design decisions

mistral.rs HTTP API for model lifecycle

mistral.rs (v0.8+) supports dynamic model loading/unloading at runtime:

POST /v1/models/unload {"model_id": "..."} — frees VRAM, preserves config
POST /v1/models/reload {"model_id": "..."} — explicitly reload
POST /v1/models/status {"model_id": "..."} — loaded/unloaded/reloading
GET /v1/models — lists all models with status field
Lazy loading: requests to unloaded models trigger automatic reload

The gateway does NOT manage systemd units for model swaps. It calls these HTTP endpoints directly. The only systemd interaction is for full-process restarts after VRAM fragmentation accumulates (defrag_after_cycles).
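
As a minimal sketch of how the gateway might address the endpoints listed above, the following builds the URL and JSON body for one lifecycle call. `LifecycleOp` and `lifecycle_request` are hypothetical names, not part of the existing crates, and the body shape is taken only from the `{"model_id": "..."}` examples in this section. Sending the request (for example with reqwest) is left out so the sketch stays std-only.

```rust
/// The three POST lifecycle operations listed in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleOp {
    Unload,
    Reload,
    Status,
}

impl LifecycleOp {
    /// Path on the mistral.rs node, as written in this document.
    pub fn path(self) -> &'static str {
        match self {
            LifecycleOp::Unload => "/v1/models/unload",
            LifecycleOp::Reload => "/v1/models/reload",
            LifecycleOp::Status => "/v1/models/status",
        }
    }
}

/// Hypothetical helper: returns (url, json_body) for a lifecycle call.
pub fn lifecycle_request(base_url: &str, op: LifecycleOp, model_id: &str) -> (String, String) {
    let url = format!("{}{}", base_url.trim_end_matches('/'), op.path());
    // Minimal escaping of backslashes and quotes only; real code would
    // serialize a struct with serde_json instead of formatting by hand.
    let escaped = model_id.replace('\\', "\\\\").replace('"', "\\\"");
    let body = format!("{{\"model_id\": \"{}\"}}", escaped);
    (url, body)
}
```

The evictor would call this with `LifecycleOp::Unload`, while the poller could use `LifecycleOp::Status` or the plain `GET /v1/models` listing.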

Streaming proxy

Chat completions are proxied as SSE streams. The gateway must:

Parse the inbound request to extract the model name
Route to the correct backend node
Stream the response back, capturing token timing for metrics
NOT buffer the full response — true streaming passthrough
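
One way to capture token timing without buffering is to observe each chunk as it passes through and keep only the unfinished trailing line in memory. The sketch below assumes newline-delimited SSE with `data:` lines and the OpenAI-style `[DONE]` sentinel. `SseObserver` is a hypothetical name, and the real proxy would likely use tokio_stream and eventsource-stream as noted under Conventions.

```rust
/// Hypothetical helper that inspects SSE bytes while they are forwarded.
#[derive(Default)]
pub struct SseObserver {
    /// Holds only the trailing line that has not been terminated yet.
    partial: Vec<u8>,
}

impl SseObserver {
    /// Feed one chunk and get back the payload of every `data:` line it completed.
    /// The caller forwards `chunk` to the client unchanged, so this adds no delay.
    pub fn feed(&mut self, chunk: &[u8]) -> Vec<String> {
        let mut events = Vec::new();
        for &byte in chunk {
            if byte != b'\n' {
                self.partial.push(byte);
                continue;
            }
            // A full line is available: decode it and reset the buffer.
            let line = String::from_utf8_lossy(&self.partial).into_owned();
            self.partial.clear();
            let line = line.trim_end_matches('\r');
            if let Some(payload) = line.strip_prefix("data:") {
                // SSE allows one optional space after the colon.
                let payload = payload.strip_prefix(' ').unwrap_or(payload);
                events.push(payload.to_string());
            }
        }
        events
    }
}

/// OpenAI-style streams end with a literal `[DONE]` payload.
pub fn is_done(payload: &str) -> bool {
    payload == "[DONE]"
}
```

The time of the first non-empty payload gives time-to-first-token, and the `[DONE]` payload marks the end of generation.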

Anthropic translation

When a request arrives at /v1/messages (Anthropic format), the gateway translates it to OpenAI format before proxying to mistral.rs, then translates the response back. This is stateless envelope transformation.
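
The request half of that transformation can be sketched as below, using deliberately simplified, text-only types. The struct names and field subsets are assumptions for illustration, not the types in cortex-core. The sketch relies on two known differences between the formats: Anthropic carries the system prompt as a top-level `system` field and uses `stop_sequences`, while OpenAI expects a leading `system` message and `stop`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Simplified Anthropic Messages request (text-only content).
#[derive(Debug, Clone)]
pub struct AnthropicRequest {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
    pub max_tokens: u32,
    pub stop_sequences: Vec<String>,
    pub stream: bool,
}

/// Simplified OpenAI chat completions request.
#[derive(Debug, Clone)]
pub struct OpenAiRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub max_tokens: Option<u32>,
    pub stop: Vec<String>,
    pub stream: bool,
}

/// Stateless request translation: no data survives beyond this call.
pub fn anthropic_to_openai(req: AnthropicRequest) -> OpenAiRequest {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    // The top-level system prompt becomes the first OpenAI message.
    if let Some(system) = req.system {
        messages.push(Message { role: "system".to_string(), content: system });
    }
    messages.extend(req.messages);
    OpenAiRequest {
        model: req.model,
        messages,
        max_tokens: Some(req.max_tokens),
        stop: req.stop_sequences,
        stream: req.stream,
    }
}

/// Maps an OpenAI `finish_reason` to an Anthropic `stop_reason`.
/// Assumption: anything unrecognized, including "stop", becomes "end_turn".
pub fn finish_to_stop_reason(finish_reason: &str) -> &'static str {
    match finish_reason {
        "length" => "max_tokens",
        "tool_calls" => "tool_use",
        _ => "end_turn",
    }
}
```

A fuller version would also handle content blocks, tool definitions, and the streaming event shapes, which differ more between the two APIs than the request envelope does.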

Eviction

The evictor runs as a background task. Before loading a model on a node where VRAM is tight:

Check if the model is already loaded elsewhere → route there instead
Find the LRU model on the target node (excluding pinned models)
Call /v1/models/unload on that model
The incoming request's lazy-load triggers the new model load
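
The first three steps above can be sketched as a pure decision function. All type and function names here (`LoadedModel`, `Node`, `Plan`, `plan`) are hypothetical, and how "VRAM is tight" is detected is left to the caller.

```rust
use std::time::Instant;

#[derive(Debug, Clone)]
pub struct LoadedModel {
    pub model_id: String,
    pub last_used: Instant,
    pub pinned: bool,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    pub loaded: Vec<LoadedModel>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    /// The model is already loaded on another node; send the request there.
    RouteTo(String),
    /// Unload this model on the target node, then let lazy loading do the rest.
    Evict(String),
    /// Every model on the target node is pinned, or the node is empty.
    NoCandidate,
}

/// Least-recently-used model on a node, skipping pinned models.
pub fn pick_victim(models: &[LoadedModel]) -> Option<&LoadedModel> {
    models.iter().filter(|m| !m.pinned).min_by_key(|m| m.last_used)
}

pub fn plan(wanted: &str, target: &Node, others: &[Node]) -> Plan {
    // Step 1: prefer a node that already has the model loaded.
    if let Some(node) = others
        .iter()
        .find(|n| n.loaded.iter().any(|m| m.model_id == wanted))
    {
        return Plan::RouteTo(node.name.clone());
    }
    // Steps 2 and 3: choose the LRU non-pinned model to unload.
    match pick_victim(&target.loaded) {
        Some(victim) => Plan::Evict(victim.model_id.clone()),
        None => Plan::NoCandidate,
    }
}
```

Keeping the decision free of I/O means it can be unit-tested without a mock backend; the caller turns `Plan::Evict` into the `/v1/models/unload` call.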

Metrics

Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port.
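
The derived fields can be computed from two timings and the token counts. This is a sketch under one stated assumption: `tok_per_sec` is taken to mean completion tokens divided by the time spent generating after the first token, which the document does not define. The struct mirrors the field names above but omits `model` and `node` for brevity, and it is not the `RequestMetrics` type in cortex-core.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct RequestMetrics {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
    pub tok_per_sec: f64,
    pub time_to_first_token_ms: u64,
    pub total_latency_ms: u64,
}

/// `ttft` is the time from request start to the first streamed token;
/// `total` is the time from request start to the end of the stream.
pub fn derive_metrics(
    prompt_tokens: u32,
    completion_tokens: u32,
    ttft: Duration,
    total: Duration,
) -> RequestMetrics {
    // Assumption: throughput covers the generation phase only.
    let generation = total.saturating_sub(ttft);
    let tok_per_sec = if generation.is_zero() {
        0.0 // avoid dividing by zero for empty or instant responses
    } else {
        f64::from(completion_tokens) / generation.as_secs_f64()
    };
    RequestMetrics {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
        tok_per_sec,
        // Millisecond values fit in u64 for any realistic request duration.
        time_to_first_token_ms: ttft.as_millis() as u64,
        total_latency_ms: total.as_millis() as u64,
    }
}
```

The resulting values would then be recorded through the metrics crate as histograms and counters labeled by model and node.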

Tech stack

Rust 2024 edition — workspace with 4 crates
Axum 0.8 — HTTP framework (same as mistral.rs itself)
reqwest — HTTP client for proxying to backends
figment — config loading (TOML + env vars)
tokio — async runtime
metrics + metrics-exporter-prometheus — observability
tracing — structured logging

Build commands

cargo build --release           # build all crates
cargo run -p cortex-cli -- serve    # run the gateway
cargo test                      # run all tests
cargo clippy --workspace        # lint

CI

Gitea Actions runs on every push to any branch. All three checks must pass before merging:

cargo fmt --check --all                    # formatting
cargo clippy --workspace -- -D warnings   # lint (warnings are errors)
cargo test --workspace                     # tests

Run these locally before pushing. cargo fmt --all fixes formatting automatically. Clippy warnings must be resolved, not suppressed with #[allow(...)] unless there is a clear rationale.

Environment

Targets Fedora 43 (systemd, SELinux enforcing)
Nodes communicate over a private network (e.g. WireGuard mesh)
- One or more GPU nodes, each running mistralrs serve on port 8080
- Optionally a metrics-only node (no GPU) for Prometheus/Grafana
Gateway listens on port 8000 (API) and 9100 (metrics)
TLS terminated at gateway or via nginx; internal traffic is plaintext over WireGuard

Conventions

Error handling: anyhow for binaries, thiserror for library crates
No unwrap() in library code; expect() only with clear rationale
All public types derive Debug, Clone, Serialize, Deserialize where sensible
Config structs use figment with TOML as primary source, env vars as override
Prefer Arc<RwLock<...>> for shared fleet state; minimize lock duration
SSE streaming uses tokio_stream + eventsource-stream for parsing
Log at info for request routing, debug for proxy details, warn for eviction and node health, error for proxy failures
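
To illustrate the "minimize lock duration" convention, the sketch below copies the one value a request needs out of the shared state and releases the read guard before any network I/O happens. It uses `std::sync::RwLock` to stay dependency-free; the gateway may well use tokio's async `RwLock` instead. `NodeInfo`, `Fleet`, and `base_url_for` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeInfo {
    pub base_url: String,
    pub loaded_models: Vec<String>,
}

/// Shared fleet state keyed by node name.
pub type Fleet = Arc<RwLock<HashMap<String, NodeInfo>>>;

/// Returns the base URL of a node that has `model` loaded, if any.
pub fn base_url_for(fleet: &Fleet, model: &str) -> Option<String> {
    // A poisoned lock is treated as "no route" in this sketch.
    let guard = fleet.read().ok()?;
    let url = guard
        .values()
        .find(|n| n.loaded_models.iter().any(|m| m == model))
        .map(|n| n.base_url.clone());
    // Release the lock before the caller starts proxying.
    drop(guard);
    url
}
```

Cloning a short string under the lock is cheap compared with holding the guard across a streamed response, which would block the poller from writing status updates.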

Current status

Scaffold phase — crate structure, types, and handler stubs are in place. The following still need implementation:

cortex-core: Flesh out OpenAI/Anthropic envelope types with all fields needed for chat completions (streaming + non-streaming)
cortex-gateway/proxy.rs: Implement streaming HTTP proxy with SSE passthrough
cortex-gateway/router.rs: Model-to-node routing with fallback to least-loaded
cortex-gateway/evictor.rs: LRU eviction with pinning support
cortex-gateway/poller.rs: Background polling of node /v1/models endpoints
cortex-gateway/handlers.rs: Wire up axum routes to proxy logic
cortex-core/translate.rs: OpenAI <-> Anthropic request/response translation
cortex-agent: Sidecar for VRAM defrag restarts (lower priority)
Integration tests: Mock mistralrs backends, test routing + eviction
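
For the router item in the list above, one possible shape is sketched here. "Least-loaded" is not defined in this document, so the sketch assumes it means the fewest in-flight requests among healthy nodes. `NodeLoad` and `route` are hypothetical names.

```rust
#[derive(Debug, Clone)]
pub struct NodeLoad {
    pub name: String,
    pub loaded_models: Vec<String>,
    pub in_flight: usize,
    pub healthy: bool,
}

/// Picks a node for `model`, or `None` when no healthy node exists.
pub fn route<'a>(model: &str, nodes: &'a [NodeLoad]) -> Option<&'a NodeLoad> {
    // First choice: a healthy node that already has the model loaded.
    let with_model = nodes
        .iter()
        .filter(|n| n.healthy && n.loaded_models.iter().any(|m| m == model))
        .min_by_key(|n| n.in_flight);
    // Fallback: the least-loaded healthy node, which will lazy-load the model.
    with_model.or_else(|| {
        nodes
            .iter()
            .filter(|n| n.healthy)
            .min_by_key(|n| n.in_flight)
    })
}
```

A production router would probably also weigh free VRAM and pinned models before choosing the fallback node, since the fallback may trigger an eviction.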
