feat: add neuron daemon with GPU discovery and health endpoints

Replace cortex-agent stub with neuron (cortex-neuron binary).

cortex-core additions:
- discovery.rs: DeviceInfo, DiscoveryResponse, DeviceHealth, HealthResponse
- harness.rs: Harness async trait, HarnessConfig, ModelSpec, ModelInfo

neuron crate (crates/neuron/):
- discovery.rs: nvidia-smi CSV parsing (pure functions) + system
  discovery via uname/nvidia-smi/nvcc
- health.rs: cached GPU health polling every 5s
- api.rs: GET /discovery and GET /health axum handlers
- main.rs: CLI entrypoint with --port flag (default 9090)
- harness stubs for mistralrs (Phase 8) and llamacpp (Phase 11)

12 new tests (9 unit + 3 integration), 35 total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:23:42 +03:00
parent 67b9b044d3
commit 6dc717ebcd
22 changed files with 1239 additions and 112 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -277,15 +277,458 @@ histograms appear after a proxied request.
 Token-level metrics (tok/s, TTFT) deferred — requires parsing the
 response body or final SSE chunk, which is Phase 6b work.

-### Phase 7 (lower priority): Agent sidecar
+## 2026-04-15 addendum

-**Goal:** Per-node binary that handles VRAM defrag restarts and
-reports real VRAM usage via `nvidia-smi`.
+**Phases 1–6 complete.** The gateway proxies requests (streaming and
+non-streaming), routes by model name to the correct node, polls node
+`/v1/models` for live state, evicts LRU models with pinning, translates
+Anthropic ↔ OpenAI envelopes, and emits Prometheus metrics. CI is green.

-This is deferred. The gateway handles the critical path (model
-lifecycle) entirely via the mistral.rs HTTP API. The agent adds
-operational polish: automatic process restart when `lifecycle_cycles`
-exceeds threshold, real VRAM reporting (vs. estimates), and
-potentially GPU temperature/power monitoring.
+**Phase 7 onward** introduces `neuron` — the per-node daemon that replaces
+the placeholder `cortex-agent` crate — along with hardware discovery,
+a harness abstraction (so cortex is not permanently wedded to mistral.rs),
+and a model catalogue for placement decisions.

-**Defer until:** Phases 1-6 are merged and running in production.
+
+### Architecture: cortex + neuron
+
+cortex is the **control plane**. It exposes the unified API, routes
+requests, manages model lifecycle across the fleet, and collects metrics.
+
+neuron is the **node plane**. One instance runs on every GPU host. It:
+- **Discovers** local hardware (GPU count, types, VRAM, CUDA compute
+  capability, driver version) and reports it to cortex.
+- **Manages harnesses** — inference engines like mistral.rs, llama.cpp,
+  or ComfyUI. Each harness is a trait implementation. neuron starts,
+  stops, health-checks, and proxies to whichever harness is serving a
+  given model.
+- **Manages model lifecycle** — load, unload, status — abstracting the
+  differences between harnesses (mistral.rs has HTTP lifecycle endpoints;
+  llama.cpp may need process management).
+- **Reports runtime state** — per-device VRAM usage, GPU utilisation,
+  temperature, loaded models with actual VRAM consumption.
+
+cortex never shells out to `nvidia-smi`, never touches systemd units,
+and never talks directly to a harness. It talks only to neurons.
+
+```
+                    ┌─────────────────────┐
+                    │      cortex         │
+                    │  (cortex-gateway)   │
+                    │  Router · Evictor   │
+                    │  Metrics · Translate│
+                    └──┬──────┬────────┬──┘
+                       │      │        │
+            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
+            │  neuron   │  │ neuron │  │  neuron   │
+            │  beast    │  │ benjy  │  │ quadbrat  │
+            │           │  │        │  │           │
+            │ harness:  │  │harness:│  │ harness:  │
+            │ mistralrs │  │mistral │  │ mistralrs │
+            │ (+ comfy) │  │rs      │  │           │
+            └───────────┘  └────────┘  └───────────┘
+```
+
+
+## The Harness trait
+
+Defined in `cortex-core` so both cortex and neuron share the type
+definitions. neuron provides the runtime implementations.
+
+```rust
+/// What an inference harness must do, from neuron's perspective.
+#[async_trait]
+pub trait Harness: Send + Sync {
+    /// Human-readable name (e.g. "mistralrs", "llamacpp", "comfyui").
+    fn name(&self) -> &str;
+
+    /// Start the harness process if it is not already running.
+    async fn start(&self, config: &HarnessConfig) -> Result<()>;
+
+    /// Stop the harness process gracefully.
+    async fn stop(&self) -> Result<()>;
+
+    /// Health check. Returns the harness process status.
+    async fn health(&self) -> HarnessHealth;
+
+    /// List models the harness knows about (loaded + unloaded).
+    async fn list_models(&self) -> Result<Vec<ModelInfo>>;
+
+    /// Load a model with the given spec (quant, TP, device assignment).
+    async fn load_model(&self, spec: &ModelSpec) -> Result<()>;
+
+    /// Unload a model, freeing device memory.
+    async fn unload_model(&self, model_id: &str) -> Result<()>;
+
+    /// Return the URL where inference requests for this model should
+    /// be sent. None if the model is not loaded.
+    async fn inference_endpoint(&self, model_id: &str) -> Option<String>;
+}
+```
+
+The mistral.rs implementation wraps the HTTP API:
+- `list_models` → `GET /v1/models`
+- `load_model` → `POST /v1/models/reload`
+- `unload_model` → `POST /v1/models/unload`
+- `inference_endpoint` → returns the base URL (the model name routes
+  internally within mistral.rs)
+- `start`/`stop` → manage the `mistralrs.service` systemd unit
+
+A future llama.cpp implementation would manage per-model `llama-server`
+processes (one process per loaded model, each on its own port).
+
+
+## neuron API
+
+neuron exposes an HTTP API on port 9090 that cortex polls and calls.
+
+```
+GET  /discovery
+     → {
+         hostname, os, kernel,
+         cuda_version, driver_version,
+         devices: [{ index, name, vram_total_mb, compute_capability }],
+         harnesses: ["mistralrs", ...]
+       }
+
+GET  /health
+     → {
+         uptime_secs,
+         devices: [{ index, vram_used_mb, vram_free_mb, utilization_pct, temp_c }]
+       }
+
+GET  /models
+     → [{ id, harness, status, devices: [int], vram_used_mb }]
+
+POST /models/load
+     ← { model_id, harness, quant, tensor_parallel, devices: [int] }
+     → { status: "loaded" | "loading" }
+
+POST /models/unload
+     ← { model_id }
+     → { status: "unloaded" }
+
+GET  /models/{model_id}/endpoint
+     → { url: "http://localhost:8080" }
+```
+
+cortex never constructs a harness-specific URL. It asks neuron for the
+inference endpoint and proxies there.
+
+
+## Discovery replaces static device config
+
+cortex.toml no longer contains device types, VRAM sizes, or CUDA
+architectures. That information comes from neuron's `/discovery`
+endpoint. cortex.toml shrinks to:
+
+```toml
+[gateway]
+listen = "0.0.0.0:8000"
+metrics_listen = "0.0.0.0:9100"
+
+[eviction]
+strategy = "lru"
+defrag_after_cycles = 50
+
+[[neurons]]
+name = "beast"
+endpoint = "http://beast.hanzalova.internal:9090"
+
+[[neurons]]
+name = "benjy"
+endpoint = "http://benjy.kosherinata.internal:9090"
+
+[[neurons]]
+name = "quadbrat"
+endpoint = "http://quadbrat.hanzalova.internal:9090"
+```
+
+On startup and periodically, cortex calls `GET /discovery` and
+`GET /health` on each neuron to build its topology map. The router
+uses this topology — not config — to make placement decisions.
+
+
+## Model catalogue
+
+Model serving profiles live in a separate file (`models.toml`) because
+they describe how to serve a model, not where. cortex matches these
+profiles against the discovered topology to determine valid placements.
+
+```toml
+[[models]]
+id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
+harness = "mistralrs"
+quant = "Q4_K_M"
+vram_mb = 19000
+min_devices = 2
+min_device_vram_mb = 10000
+pinned_on = ["beast"]       # optional: never evict from these neurons
+
+[[models]]
+id = "Qwen/Qwen3-VL-8B"
+harness = "mistralrs"
+quant = "Q8_0"
+vram_mb = 10000
+min_devices = 1
+
+[[models]]
+id = "Qwen/Qwen2.5-Coder-14B-Instruct"
+harness = "mistralrs"
+quant = "Q6_K"
+vram_mb = 12000
+min_devices = 1
+pinned_on = ["benjy"]
+```
+
+The router consults the catalogue to answer: "model X needs 2 devices
+with ≥10GB each; beast has 2× RTX 5090 at 32GB each; that's a valid
+placement." This replaces the current per-node `pinned` list in config
+and the hardcoded `vram_mb` per node.
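The matching step described above can be sketched with std-only Rust. This is a hedged illustration, not the planned `catalogue.rs`: the struct shapes, the `DiscoveredNode` type, and the exact rule (take the `min_devices` largest devices that meet the per-device floor and require their combined total VRAM to cover `vram_mb`) are assumptions. It also looks only at total VRAM from discovery, not at free VRAM from `/health`.

```rust
/// Subset of a `models.toml` entry relevant to placement (illustrative shape).
#[derive(Debug, Clone)]
pub struct ModelProfile {
    pub id: String,
    pub vram_mb: u64,
    pub min_devices: usize,
    /// `None` when the profile omits `min_device_vram_mb`.
    pub min_device_vram_mb: Option<u64>,
}

/// Discovered topology for one neuron: total VRAM per device, in MB.
/// Hypothetical type; the real discovery types live in cortex-core.
#[derive(Debug, Clone)]
pub struct DiscoveredNode {
    pub name: String,
    pub device_vram_mb: Vec<u64>,
}

/// Return the names of nodes that could host the model.
///
/// Assumed rule: a node qualifies when it has at least `min_devices`
/// devices meeting the per-device floor, and the `min_devices` largest
/// of those devices together provide at least `vram_mb`.
pub fn find_valid_placements<'a>(
    profile: &ModelProfile,
    nodes: &'a [DiscoveredNode],
) -> Vec<&'a str> {
    let per_device_min = profile.min_device_vram_mb.unwrap_or(0);
    nodes
        .iter()
        .filter(|node| {
            // Devices that individually satisfy the per-device floor.
            let mut eligible: Vec<u64> = node
                .device_vram_mb
                .iter()
                .copied()
                .filter(|&mb| mb >= per_device_min)
                .collect();
            if eligible.len() < profile.min_devices {
                return false;
            }
            // Largest devices first, then sum the ones the model would use.
            eligible.sort_unstable_by(|a, b| b.cmp(a));
            let capacity: u64 = eligible.iter().take(profile.min_devices).sum();
            capacity >= profile.vram_mb
        })
        .map(|node| node.name.as_str())
        .collect()
}
```

With the first catalogue entry above (19000 MB, 2 devices, 10000 MB per device) and a node reporting two 32000 MB devices, the node qualifies; a single-device node does not, regardless of how much VRAM that device has.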
+
+
+## Revised repository layout
+
+```
+cortex/
+├── Cargo.toml
+├── cortex.toml                 # gateway config (neurons only)
+├── models.toml                 # model catalogue
+├── README.md
+├── CLAUDE.md
+├── crates/
+│   ├── cortex-core/            # shared types
+│   │   └── src/
+│   │       ├── lib.rs
+│   │       ├── config.rs       # GatewayConfig, NeuronEndpoint
+│   │       ├── catalogue.rs    # ModelProfile, placement matching
+│   │       ├── discovery.rs    # DeviceInfo, DiscoveryResponse
+│   │       ├── harness.rs      # Harness trait, HarnessConfig, HarnessHealth
+│   │       ├── node.rs         # NodeState, ModelEntry, ModelStatus
+│   │       ├── openai.rs       # OpenAI envelope types
+│   │       ├── anthropic.rs    # Anthropic envelope types
+│   │       ├── translate.rs    # OpenAI <-> Anthropic translation
+│   │       └── metrics.rs      # RequestMetrics
+│   ├── cortex-gateway/         # control plane (existing, modified)
+│   │   └── src/
+│   │       ├── lib.rs
+│   │       ├── state.rs        # CortexState (updated: discovery topology)
+│   │       ├── router.rs       # updated: catalogue + discovery placement
+│   │       ├── proxy.rs        # streaming proxy (unchanged)
+│   │       ├── evictor.rs      # updated: talks to neuron, not mistralrs
+│   │       ├── poller.rs       # updated: polls neuron, not mistralrs
+│   │       ├── handlers.rs     # axum handlers (unchanged API surface)
+│   │       └── metrics.rs      # prometheus exporter (unchanged)
+│   ├── neuron/                 # node plane (replaces cortex-agent)
+│   │   └── src/
+│   │       ├── main.rs         # binary entrypoint, axum server on :9090
+│   │       ├── discovery.rs    # nvidia-smi, device enumeration
+│   │       ├── health.rs       # runtime GPU polling
+│   │       ├── api.rs          # HTTP handlers for /discovery, /models, etc.
+│   │       ├── harness/
+│   │       │   ├── mod.rs      # Harness trait re-export, registry
+│   │       │   ├── mistralrs.rs  # mistral.rs HTTP API wrapper
+│   │       │   └── llamacpp.rs   # stub for future llama.cpp support
+│   │       └── models.rs       # local model lifecycle orchestration
+│   └── cortex-cli/             # CLI entrypoint (unchanged)
+│       └── src/
+│           └── main.rs
+└── tests/
+```
+
+The `cortex-agent` crate is deleted. Its replacement is `neuron/`.
+
+
+## Implementation plan (phases 7+)
+
+Phases 1–6 are merged and passing CI. Each subsequent phase is a
+branch → PR. CI (fmt, clippy, test) must pass before merge.
+
+### Phase 7: neuron scaffold and discovery ✅
+
+Completed. Deleted `cortex-agent`, created `crates/neuron/` (binary:
+`cortex-neuron`). Added shared types to cortex-core: `discovery.rs`
+(DeviceInfo, DiscoveryResponse, DeviceHealth, HealthResponse) and
+`harness.rs` (Harness async trait, HarnessConfig, ModelSpec, ModelInfo).
+
+neuron discovers GPUs via nvidia-smi, caches health readings, and
+serves `GET /discovery` and `GET /health`. Pure parsing functions are
+separated from command execution for testability. 9 unit tests cover
+nvidia-smi CSV parsing; 3 integration tests cover the HTTP endpoints.
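To illustrate why keeping parsing pure helps testing, here is a minimal sketch of a parser that takes captured command output as a string and never runs a process. It assumes the output comes from something like `nvidia-smi --query-gpu=index,name,memory.total,compute_cap --format=csv,noheader,nounits`; the query fields actually used by neuron, and the `GpuDevice` type and function names below, are assumptions rather than the committed code.

```rust
/// Illustrative device record. Field names follow the `/discovery`
/// response described in this document, not the actual cortex-core type.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDevice {
    pub index: u32,
    pub name: String,
    pub vram_total_mb: u64,
    pub compute_capability: String,
}

/// Parse one CSV row of the assumed form `index, name, memory.total, compute_cap`.
pub fn parse_device_line(line: &str) -> Result<GpuDevice, String> {
    let fields: Vec<&str> = line.split(',').map(str::trim).collect();
    if fields.len() != 4 {
        return Err(format!("expected 4 fields, got {}: {:?}", fields.len(), line));
    }
    let index = fields[0]
        .parse::<u32>()
        .map_err(|e| format!("bad index {:?}: {}", fields[0], e))?;
    let vram_total_mb = fields[2]
        .parse::<u64>()
        .map_err(|e| format!("bad memory.total {:?}: {}", fields[2], e))?;
    Ok(GpuDevice {
        index,
        name: fields[1].to_string(),
        vram_total_mb,
        compute_capability: fields[3].to_string(),
    })
}

/// Parse the whole command output, skipping blank lines.
/// The first malformed row aborts parsing with an error.
pub fn parse_devices(output: &str) -> Result<Vec<GpuDevice>, String> {
    output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(parse_device_line)
        .collect()
}
```

Because `parse_devices` only sees a `&str`, unit tests can feed it hand-written rows (including malformed ones) on machines with no GPU, while the code that actually spawns `nvidia-smi` stays a thin wrapper.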
+
+### Phase 8: neuron harness — mistral.rs implementation
+
+**Goal:** neuron can manage mistral.rs: start/stop the process, list
+models, load/unload models, and report the inference endpoint.
+
+**Steps:**
+1. In `crates/neuron/src/harness/mistralrs.rs`:
+   - Implement the `Harness` trait.
+   - `start()` — invoke `systemctl start mistralrs.service` (or a
+     configured unit name). Wait for the health endpoint to respond.
+   - `stop()` — `systemctl stop mistralrs.service`.
+   - `health()` — `GET {mistralrs_endpoint}/health`.
+   - `list_models()` — `GET {mistralrs_endpoint}/v1/models`, parse the
+     response including the `status` field.
+   - `load_model()` — `POST {mistralrs_endpoint}/v1/models/reload`.
+   - `unload_model()` — `POST {mistralrs_endpoint}/v1/models/unload`.
+   - `inference_endpoint()` — return `mistralrs_endpoint` (mistral.rs
+     routes internally by model name in the request body).
+2. In `crates/neuron/src/harness/mod.rs`:
+   - A `HarnessRegistry` that maps harness name → `Box<dyn Harness>`.
+   - On neuron startup, register the mistralrs harness (configured with
+     the local mistralrs endpoint, e.g. `http://localhost:8080`).
+3. Add neuron API endpoints:
+   - `GET /models` — aggregate across all registered harnesses.
+   - `POST /models/load` — dispatch to the correct harness.
+   - `POST /models/unload` — dispatch to the correct harness.
+   - `GET /models/{model_id}/endpoint` — ask the harness.
+4. neuron config (`neuron.toml`):
+   ```toml
+   port = 9090
+   
+   [[harnesses]]
+   name = "mistralrs"
+   endpoint = "http://localhost:8080"
+   systemd_unit = "mistralrs.service"
+   ```
+5. Tests:
+   - Mock HTTP server standing in for mistral.rs. Test that the harness
+     implementation correctly translates list/load/unload calls.
+   - Integration test: start neuron with mock mistralrs backend, call
+     `GET /models`, assert it returns models from the mock.
+
+**Done when:** neuron manages a (mock) mistral.rs instance. All API
+endpoints return correct data. Tests pass.
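A rough sketch of the registry from step 2, under simplifying assumptions: the trait here is a synchronous, two-method stand-in for the async `Harness` trait so the example stays std-only, and `StaticHarness` is a hypothetical test double in the spirit of the mock backend from step 5. Method names on the registry are illustrative.

```rust
use std::collections::HashMap;

/// Simplified, synchronous stand-in for the async `Harness` trait.
pub trait Harness: Send + Sync {
    fn name(&self) -> &str;
    fn list_models(&self) -> Vec<String>;
}

/// Maps harness name to implementation, as described in step 2.
#[derive(Default)]
pub struct HarnessRegistry {
    harnesses: HashMap<String, Box<dyn Harness>>,
}

impl HarnessRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a harness under its own name.
    /// Returns the previously registered harness with that name, if any.
    pub fn register(&mut self, harness: Box<dyn Harness>) -> Option<Box<dyn Harness>> {
        self.harnesses.insert(harness.name().to_string(), harness)
    }

    /// Look up the harness a request should be dispatched to.
    pub fn get(&self, name: &str) -> Option<&dyn Harness> {
        self.harnesses.get(name).map(|h| h.as_ref())
    }

    /// Aggregate `(harness, model_id)` pairs across all harnesses,
    /// sorted so the result does not depend on HashMap iteration order.
    pub fn list_all_models(&self) -> Vec<(String, String)> {
        let mut all: Vec<(String, String)> = self
            .harnesses
            .values()
            .flat_map(|h| {
                let harness_name = h.name().to_string();
                h.list_models()
                    .into_iter()
                    .map(move |model| (harness_name.clone(), model))
            })
            .collect();
        all.sort();
        all
    }
}

/// Hypothetical test double that returns a fixed model list.
pub struct StaticHarness {
    pub name: String,
    pub models: Vec<String>,
}

impl Harness for StaticHarness {
    fn name(&self) -> &str {
        &self.name
    }

    fn list_models(&self) -> Vec<String> {
        self.models.clone()
    }
}
```

In this shape, a `GET /models` handler would call `list_all_models`, and `POST /models/load` would use `get` with the `harness` field from the request body, returning an error when the name is not registered.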
+
+### Phase 9: cortex talks to neurons
+
+**Goal:** cortex-gateway's poller, router, and evictor talk to neuron
+instead of directly to mistral.rs. Discovery replaces static config.
+
+**Steps:**
+1. Update `cortex-core/src/config.rs`:
+   - Replace `NodeConfig { endpoint, vram_mb, pinned }` with
+     `NeuronEndpoint { name, endpoint }`.
+   - Add `ModelCatalogue` loaded from `models.toml`.
+   - Remove per-node `vram_mb` and `pinned` fields (these come from
+     discovery and the catalogue respectively).
+2. Add `cortex-core/src/catalogue.rs`:
+   - `ModelProfile { id, harness, quant, vram_mb, min_devices,
+     min_device_vram_mb, pinned_on }`.
+   - `fn find_valid_placements(profile, discovered_nodes) -> Vec<PlacementOption>`
+     that matches a model profile against discovered topologies.
+3. Update `cortex-gateway/src/state.rs`:
+   - `CortexState` holds discovered topology per neuron (devices, VRAM,
+     harnesses) alongside the existing model status map.
+4. Update `cortex-gateway/src/poller.rs`:
+   - Poll `GET {neuron}/discovery` on startup and every 60s (topology
+     changes rarely).
+   - Poll `GET {neuron}/health` every 10s (VRAM usage, utilisation).
+   - Poll `GET {neuron}/models` every 10s (model status).
+   - Merge all three into `CortexState`.
+5. Update `cortex-gateway/src/router.rs`:
+   - `resolve()` now consults the model catalogue to determine valid
+     placements, then picks the best node (loaded > unloaded-on-capable-node).
+   - For models needing TP=2, only nodes with ≥2 devices are candidates.
+6. Update `cortex-gateway/src/evictor.rs`:
+   - `evict_lru_on_node()` calls `POST {neuron}/models/unload` instead
+     of calling mistral.rs directly.
+   - Eviction respects `pinned_on` from the catalogue.
+7. Update `cortex-gateway/src/proxy.rs`:
+   - Before proxying, ask neuron for the inference endpoint:
+     `GET {neuron}/models/{model_id}/endpoint`. This decouples cortex
+     from knowing which port or harness is serving the model.
+8. Tests:
+   - Update existing integration tests to use a mock neuron (mock
+     `/discovery`, `/health`, `/models`, `/models/load`, etc.) instead
+     of a mock mistralrs.
+   - New test: model catalogue placement — profile requires TP=2,
+     assert it only routes to a node with ≥2 discovered devices.
+   - New test: eviction calls neuron's unload endpoint, not mistralrs.
+
+**Done when:** cortex has zero direct references to mistral.rs endpoints.
+All existing tests are updated and pass. New placement tests pass.
+`cortex.toml` only contains neuron endpoints. `models.toml` drives
+placement and pinning.
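The pinning rule from step 6 can be sketched as a pure selection function, separate from the HTTP call to `POST {neuron}/models/unload`. This is an assumption-laden illustration: `LoadedModel`, its `last_used_secs` field, and the function name are hypothetical, and the set of pinned ids is assumed to have been derived from `pinned_on` for the node in question.

```rust
use std::collections::HashSet;

/// A model currently loaded on one node (illustrative shape).
#[derive(Debug, Clone)]
pub struct LoadedModel {
    pub id: String,
    /// Timestamp of the most recent request, in seconds; larger is newer.
    pub last_used_secs: u64,
}

/// Pick the least recently used model on a node, skipping any model
/// whose catalogue entry pins it to this node.
/// Returns `None` when every loaded model is pinned (or none is loaded).
pub fn pick_eviction_candidate<'a>(
    loaded: &'a [LoadedModel],
    pinned_here: &HashSet<&str>,
) -> Option<&'a LoadedModel> {
    loaded
        .iter()
        .filter(|model| !pinned_here.contains(model.id.as_str()))
        .min_by_key(|model| model.last_used_secs)
}
```

Keeping the choice pure means the "eviction respects `pinned_on`" behaviour can be unit-tested without a mock neuron; the evictor would then send the chosen id to the neuron's unload endpoint.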
+
+### Phase 10: neuron packaging (RPM)
+
+**Goal:** `neuron` and `cortex` are installable via `dnf` from the
+grenade COPR repo.
+
+**Steps:**
+1. `neuron.spec` — RPM spec file for the neuron binary. Install to
+   `/usr/libexec/cortex/neuron`. Systemd unit
+   `cortex-neuron.service`. Config at `/etc/cortex/neuron.toml`.
+2. Update `cortex.spec` — ensure the cortex binary, config, and
+   `models.toml` are packaged correctly.
+3. Gitea Actions CI job: on tag push, build SRPM, submit to COPR.
+4. Document the install path:
+   ```sh
+   dnf copr enable grenade/cortex
+   # on the gateway host:
+   dnf install cortex
+   # on each GPU node:
+   dnf install cortex-neuron
+   ```
+
+**Done when:** `dnf install cortex-neuron` on a Fedora 43 host drops
+the binary, config, and systemd unit. `systemctl start cortex-neuron`
+runs discovery and serves `/discovery`.
+
+### Phase 11: llama.cpp harness stub
+
+**Goal:** Prove the harness abstraction works with a second engine.
+
+**Steps:**
+1. `crates/neuron/src/harness/llamacpp.rs` — implement the `Harness`
+   trait for llama.cpp's `llama-server`.
+   - `start()` — launch `llama-server` with the correct model path,
+     `--port`, `--n-gpu-layers`, `--tensor-split` args. Track the
+     child process.
+   - `stop()` — send SIGTERM to the child process.
+   - `list_models()` — llama-server serves one model per process, so
+     return a single-element list.
+   - `load_model()` — start a new llama-server process for this model.
+   - `unload_model()` — stop the process.
+   - `inference_endpoint()` — return `http://localhost:{assigned_port}`.
+2. Port allocation: neuron assigns ports from a range (e.g. 8100-8199)
+   to llama-server instances.
+3. Register in `HarnessRegistry` when configured:
+   ```toml
+   [[harnesses]]
+   name = "llamacpp"
+   binary = "/usr/local/bin/llama-server"
+   port_range = [8100, 8199]
+   ```
+4. Tests: mock llama-server (simple HTTP server returning canned
+   responses), test load/unload/endpoint lifecycle.
+
+**Done when:** A model with `harness = "llamacpp"` in `models.toml` can
+be loaded and served through cortex. Tests pass with mock llama-server.
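The port allocation in step 2 might look roughly like the following. It is a sketch under stated assumptions: the `PortAllocator` type and its methods are hypothetical, it only tracks neuron's own bookkeeping, and it does not check whether the operating system actually has the port free.

```rust
use std::collections::BTreeMap;

/// Hands out ports from an inclusive range, one per loaded model.
pub struct PortAllocator {
    first: u16,
    last: u16,
    /// port -> model_id for every port currently in use.
    assigned: BTreeMap<u16, String>,
}

impl PortAllocator {
    pub fn new(first: u16, last: u16) -> Self {
        Self { first, last, assigned: BTreeMap::new() }
    }

    /// Port currently assigned to `model_id`, if any.
    pub fn port_for(&self, model_id: &str) -> Option<u16> {
        self.assigned
            .iter()
            .find(|(_, id)| id.as_str() == model_id)
            .map(|(port, _)| *port)
    }

    /// Assign the lowest free port to `model_id`.
    /// Idempotent: a model that already has a port keeps it.
    /// Returns `None` when the range is exhausted.
    pub fn allocate(&mut self, model_id: &str) -> Option<u16> {
        if let Some(port) = self.port_for(model_id) {
            return Some(port);
        }
        let port = (self.first..=self.last).find(|p| !self.assigned.contains_key(p))?;
        self.assigned.insert(port, model_id.to_string());
        Some(port)
    }

    /// Free the port held by `model_id`, returning it if one was held.
    pub fn release(&mut self, model_id: &str) -> Option<u16> {
        let port = self.port_for(model_id)?;
        self.assigned.remove(&port);
        Some(port)
    }

    /// URL in the form `inference_endpoint()` is described as returning.
    pub fn endpoint(&self, model_id: &str) -> Option<String> {
        self.port_for(model_id)
            .map(|port| format!("http://localhost:{}", port))
    }
}
```

In this design `load_model()` would call `allocate` before spawning `llama-server --port <port>`, and `unload_model()` would call `release` after the child process exits, so a released port becomes available to the next model.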
+
+### Phase 12 (lower priority): mistral.rs COPR packaging
+
+**Goal:** Fedora RPMs for mistral.rs built against specific CUDA versions.
+
+**Steps:**
+1. `mistralrs-cuda.spec` — RPM spec that clones a pinned mistral.rs git
+   tag, builds with `--features cuda`, links against the system CUDA
+   toolkit. Produces `mistralrs-cuda13-server` (CUDA 13.x / sm_120) and
+   `mistralrs-cuda12-server` (CUDA 12.x / sm_89). Install binary to
+   `/usr/local/bin/mistralrs`.
+2. COPR build config: enable the NVIDIA CUDA repo as a build dependency.
+   Pin the CUDA toolkit version in `BuildRequires`.
+3. Gitea Actions or manual workflow: bump the mistral.rs tag in the spec,
+   trigger COPR rebuild.
+4. neuron's mistralrs harness config references which binary/package
+   provides the mistral.rs binary. neuron could warn at startup if the
+   installed mistral.rs CUDA version doesn't match the discovered driver.
+
+**Done when:** `dnf install mistralrs-cuda13-server` on beast provides a
+working `mistralrs` binary built for Blackwell GPUs. `dnf install
+mistralrs-cuda12-server` on benjy provides one built for Ada GPUs.
+
+This is a separate repo/spec — not part of the cortex workspace — but
+tightly coupled operationally. Track it as a sibling project.