feat: scaffold cortex workspace

Rust reverse-proxy for multi-node mistral.rs inference clusters. Includes crate structure (cortex-core, cortex-gateway, cortex-agent, cortex-cli), config loading, OpenAI/Anthropic translation stubs, model routing, eviction, polling, and streaming proxy scaffolding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:13:30 +03:00
commit 0da68833af
28 changed files with 4659 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,138 @@
+# cortex
+
+A Rust reverse-proxy and fleet management layer for multi-node
+[mistral.rs](https://github.com/EricLBuehler/mistral.rs) inference clusters.
+
+## Problem
+
+Running local LLMs across multiple GPU nodes (different VRAM tiers, different
+model affinities) requires a unified API surface that:
+
+- Presents a **single `/v1/models` catalogue** merging every model across every
+  node.
+- **Routes requests** to the correct node based on where a model is loaded (or
+  *can* be loaded).
+- Manages **model lifecycle** — unload cold models, reload on demand, pin
+  critical ones — using the mistral.rs
+  `/v1/models/{unload,reload,status}` HTTP API (PR #1828+).
+- Translates between **OpenAI and Anthropic** request/response envelopes so
+  every client in the homelab speaks whichever dialect it prefers.
+- Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes
+  them as Prometheus counters/histograms.
+
+## Architecture
+
+```
+┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
+│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
+└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
+       │                │              │               │
+       └────────────────┴──────┬───────┴───────────────┘
+                               │
+                    ┌──────────▼──────────┐
+                    │   cortex     │
+                    │   (cortex-gateway)      │
+                    │                     │
+                    │  Router · Metrics   │
+                    │  Evictor · Translate│
+                    └──┬──────┬────────┬──┘
+                       │      │        │
+            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
+            │ gpu-large │  │gpu-med │  │ gpu-small │
+            │ mistralrs │  │mistral │  │ mistralrs │
+            │ serve     │  │rs serve│  │ serve     │
+            │ :8080     │  │ :8080  │  │  :8080    │
+            └───────────┘  └────────┘  └───────────┘
+                  private network (.internal)
+```
+
+### Crates
+
+| Crate | Purpose |
+|---|---|
+| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes |
+| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter |
+| `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag |
+| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
+
+## Node setup
+
+Each GPU node runs `mistralrs serve` with a multi-model config. Models are
+declared but start **unloaded** — mistral.rs lazy-loads on first request and
+the gateway can explicitly unload/reload via the HTTP API.
+
+Example node systemd unit:
+
+```ini
+# /etc/systemd/system/mistralrs.service
+[Unit]
+Description=mistral.rs inference server
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/mistralrs serve \
+    --from-config /etc/mistralrs/config.toml \
+    --port 8080
+Restart=on-failure
+RestartSec=5
+Environment=CUDA_VISIBLE_DEVICES=0,1
+
+[Install]
+WantedBy=multi-user.target
+```
+
+## Gateway config
+
+```toml
+# cortex.toml
+[gateway]
+listen = "0.0.0.0:8000"
+metrics_listen = "0.0.0.0:9100"
+
+[eviction]
+strategy = "lru"        # lru | priority
+defrag_after_cycles = 50
+
+[[nodes]]
+name = "gpu-large"
+endpoint = "http://gpu-large.internal:8080"
+vram_mb = 49_152        # e.g. 2x RTX 4090
+pinned = ["your-org/large-model"]
+
+[[nodes]]
+name = "gpu-medium"
+endpoint = "http://gpu-medium.internal:8080"
+vram_mb = 24_576        # e.g. RTX 4090
+pinned = ["your-org/medium-model"]
+
+[[nodes]]
+name = "gpu-small"
+endpoint = "http://gpu-small.internal:8080"
+vram_mb = 12_288        # e.g. RTX 3060
+pinned = ["your-org/embedding-model"]
+```
+
+## Building
+
+```sh
+cargo build --release
+```
+
+## Running
+
+```sh
+# start the gateway
+cortex serve --config cortex.toml
+
+# check fleet status
+cortex status
+
+# list all models across nodes
+curl http://localhost:8000/v1/models
+```
+
+## License
+
+GPL-3.0