diff --git a/CLAUDE.md b/CLAUDE.md index ec5398f..53e7857 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -616,58 +616,45 @@ dnf install cortex # gateway host dnf install helexa-neuron # GPU nodes ``` -### Phase 11: llama.cpp harness stub +## 2026-05-18 addendum: candle-native pivot -**Goal:** Prove the harness abstraction works with a second engine. +Phases 11 (llama.cpp harness) and 12 (mistral.rs COPR) below are +**superseded**. The project no longer treats mistral.rs or llama.cpp as +dependencies — both are conceptually out of scope. neuron becomes a +candle-native inference daemon, with `Harness` retained as an +internal seam for adding future engines (vision/audio/diffusion) but +its only implementation being in-process candle. -**Steps:** -1. `crates/neuron/src/harness/llamacpp.rs` — implement the `Harness` - trait for llama.cpp's `llama-server`. - - `start()` — launch `llama-server` with the correct model path, - `--port`, `--n-gpu-layers`, `--tensor-split` args. Track the - child process. - - `stop()` — send SIGTERM to the child process. - - `list_models()` — llama-server serves one model per process, so - return a single-element list. - - `load_model()` — start a new llama-server process for this model. - - `unload_model()` — stop the process. - - `inference_endpoint()` — return `http://localhost:{assigned_port}`. -2. Port allocation: neuron assigns ports from a range (e.g. 8100-8199) - to llama-server instances. -3. Register in `HarnessRegistry` when configured: - ```toml - [[harnesses]] - name = "llamacpp" - binary = "/usr/local/bin/llama-server" - port_range = [8100, 8199] - ``` -4. Tests: mock llama-server (simple HTTP server returning canned - responses), test load/unload/endpoint lifecycle. +The full staged plan for this pivot lives at +`~/.claude/plans/create-a-more-aggressive-calm-naur.md`. Summary: -**Done when:** A model with `harness = "llamacpp"` in `models.toml` can -be loaded and served through cortex. Tests pass with mock llama-server. 
+- **Stage 1 (this commit):** delete `mistralrs.rs` and `llamacpp.rs`, + scaffold inert `CandleHarness`, drop `endpoint`/`systemd_unit` from + `HarnessConfig`, default no-op `start`/`stop` on the `Harness` trait. +- **Stages 2–4:** wire up candle model load/unload (quantized Qwen3 + first), add OpenAI-compatible inference endpoint in neuron, then SSE + streaming. +- **Stages 5–6:** load-on-activation (default models in config) and + unload-on-deactivation (graceful shutdown). +- **Stages 7–8:** multi-GPU tensor parallelism and broader model/quant + coverage. -### Phase 12 (lower priority): mistral.rs COPR packaging +Sections of this document that describe mistral.rs HTTP behaviour +("mistral.rs API gotchas") are retained as historical context for +Phases 1–10 — they document what was true while the project depended +on mistral.rs. They do not describe current behaviour. -**Goal:** Fedora RPMs for mistral.rs built against specific CUDA versions. +--- -**Steps:** -1. `mistralrs-cuda.spec` — RPM spec that clones a pinned mistral.rs git - tag, builds with `--features cuda`, links against the system CUDA - toolkit. Produces `mistralrs-cuda13-server` (CUDA 13.x / sm_120) and - `mistralrs-cuda12-server` (CUDA 12.x / sm_89). Install binary to - `/usr/local/bin/mistralrs`. -2. COPR build config: enable the NVIDIA CUDA repo as a build dependency. - Pin the CUDA toolkit version in `BuildRequires`. -3. Gitea Actions or manual workflow: bump the mistral.rs tag in the spec, - trigger COPR rebuild. -4. neuron's mistralrs harness config references which binary/package - provides the mistral.rs binary. neuron could warn at startup if the - installed mistral.rs CUDA version doesn't match the discovered driver. +### Phase 11 (superseded): llama.cpp harness stub -**Done when:** `dnf install mistralrs-cuda13-server` on beast provides a -working `mistralrs` binary built for Blackwell GPUs. `dnf install -mistralrs-cuda12-server` on benjy provides one built for Ada GPUs. 
+~~Originally planned as a second engine to prove the harness +abstraction.~~ Replaced by the candle harness work in the 2026-05-18 +addendum above. llama.cpp's any-model/any-hardware breadth is no +longer in scope for helexa. -This is a separate repo/spec — not part of the cortex workspace — but -tightly coupled operationally. Track it as a sibling project. +### Phase 12 (superseded): mistral.rs COPR packaging + +~~Originally planned to ship CUDA-versioned mistral.rs RPMs.~~ Replaced +by the candle harness work in the 2026-05-18 addendum above. With +mistral.rs out of the dependency tree, there is nothing to package. diff --git a/Cargo.toml b/Cargo.toml index 0f8b6ef..da2ffaf 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -27,7 +27,7 @@ serde = { version = "1", features = ["derive"] } serde_json = "1" toml = "0.8" -# http client (for proxying to mistralrs backends) +# http client (for proxying to neuron backends) reqwest = { version = "0.12", features = ["json", "stream"] } # observability diff --git a/README.md b/README.md index ef44918..9425cda 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,23 @@ # cortex -A Rust reverse-proxy and fleet management layer for multi-node -[mistral.rs](https://github.com/EricLBuehler/mistral.rs) inference clusters. +A Rust reverse-proxy and fleet management layer for multi-node GPU inference +clusters. Cortex sits in front of one or more `neuron` daemons (each running +candle-based inference on a local GPU host) and presents a unified OpenAI + +Anthropic compatible API surface. ## Problem Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that: -- Presents a **single `/v1/models` catalogue** merging every model across every - node. -- **Routes requests** to the correct node based on where a model is loaded (or - *can* be loaded). 
-- Manages **model lifecycle** — unload cold models, reload on demand, pin - critical ones — using the mistral.rs - `/v1/models/{unload,reload,status}` HTTP API (PR #1828+). +- Presents a **single `/v1/models` catalogue** merging every model that can be + served by any neuron in the fleet. +- **Routes requests** to the correct node based on where a model is loaded + (or can be loaded), handling cold-load and eviction transparently. +- Manages **model lifecycle** — load on demand, unload cold models, pin + critical ones — by calling each neuron's `/models/{load,unload}` API. - Translates between **OpenAI and Anthropic** request/response envelopes so - every client in the homelab speaks whichever dialect it prefers. + every client speaks whichever dialect it prefers. - Captures **per-request metrics** (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms. @@ -30,18 +31,17 @@ model affinities) requires a unified API surface that: └────────────────┴──────┬───────┴───────────────┘ │ ┌──────────▼──────────┐ - │ cortex │ - │ (cortex-gateway) │ + │ cortex │ + │ (cortex-gateway) │ │ │ │ Router · Metrics │ │ Evictor · Translate│ └──┬──────┬────────┬──┘ │ │ │ ┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐ - │ gpu-large │ │gpu-med │ │ gpu-small │ - │ mistralrs │ │mistral │ │ mistralrs │ - │ serve │ │rs serve│ │ serve │ - │ :8080 │ │ :8080 │ │ :8080 │ + │ neuron │ │ neuron │ │ neuron │ + │ :13131 │ │ :13131 │ │ :13131 │ + │ candle │ │ candle │ │ candle │ └───────────┘ └────────┘ └───────────┘ private network (.internal) ``` @@ -50,43 +50,29 @@ model affinities) requires a unified API surface that: | Crate | Purpose | |---|---| -| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic request/response envelopes | -| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, metrics exporter | -| `cortex-agent` | Per-node sidecar: polls local mistralrs, reports to gateway, handles restart/defrag | +| `cortex-core` | Shared types: 
config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types | +| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter | +| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API | | `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) | ## Node setup -Each GPU node runs `mistralrs serve` with a multi-model config. Models are -declared but start **unloaded** — mistral.rs lazy-loads on first request and -the gateway can explicitly unload/reload via the HTTP API. +Each GPU node runs `neuron` (listening on `:13131`). Neuron uses +huggingface/candle for in-process inference — there is no external +inference subprocess to manage. -Example node systemd unit: +The neuron RPM (`helexa-neuron`) ships a systemd unit: -```ini -# /etc/systemd/system/mistralrs.service -[Unit] -Description=mistral.rs inference server -After=network-online.target -Wants=network-online.target - -[Service] -Type=simple -ExecStart=/usr/local/bin/mistralrs serve \ - --from-config /etc/mistralrs/config.toml \ - --port 8080 -Restart=on-failure -RestartSec=5 -Environment=CUDA_VISIBLE_DEVICES=0,1 - -[Install] -WantedBy=multi-user.target +```sh +dnf copr enable helexa/helexa +dnf install helexa-neuron +systemctl enable --now neuron ``` ## Gateway config ```toml -# cortex.toml +# /etc/cortex/cortex.toml [gateway] listen = "0.0.0.0:31313" metrics_listen = "0.0.0.0:31314" @@ -95,25 +81,17 @@ metrics_listen = "0.0.0.0:31314" strategy = "lru" # lru | priority defrag_after_cycles = 50 -[[nodes]] -name = "gpu-large" -endpoint = "http://gpu-large.internal:8080" -vram_mb = 49_152 # e.g. 2x RTX 4090 -pinned = ["your-org/large-model"] +[[neurons]] +name = "beast" +endpoint = "http://beast.internal:13131" -[[nodes]] -name = "gpu-medium" -endpoint = "http://gpu-medium.internal:8080" -vram_mb = 24_576 # e.g. 
RTX 4090 -pinned = ["your-org/medium-model"] - -[[nodes]] -name = "gpu-small" -endpoint = "http://gpu-small.internal:8080" -vram_mb = 12_288 # e.g. RTX 3060 -pinned = ["your-org/embedding-model"] +[[neurons]] +name = "benjy" +endpoint = "http://benjy.internal:13131" ``` +Model placement profiles live in `models.toml` — see `models.example.toml`. + ## Building ```sh @@ -131,13 +109,14 @@ cargo clippy --workspace -- -D warnings # warnings are errors cargo test --workspace # all tests must pass ``` -Tagged releases (`v*`) additionally build an SRPM and publish to COPR. +Tagged releases (`v*`) additionally build SRPMs for both `cortex` and +`helexa-neuron` and publish to COPR. ## Running ```sh # start the gateway -cortex serve --config cortex.toml +cortex serve --config /etc/cortex/cortex.toml # check fleet status cortex status diff --git a/cortex.example.toml b/cortex.example.toml index 7eb7058..b770148 100644 --- a/cortex.example.toml +++ b/cortex.example.toml @@ -11,14 +11,14 @@ metrics_listen = "0.0.0.0:31314" [eviction] strategy = "lru" -# Restart mistralrs after this many load/unload cycles to defragment VRAM. +# Restart neurons after this many load/unload cycles to defragment VRAM. # Set to 0 to disable. defrag_after_cycles = 50 # -- Nodes --------------------------------------------------------------- -# Each [[nodes]] entry declares a mistral.rs instance in the fleet. -# Models are discovered by polling the node's /v1/models endpoint. -# Pinned models are never evicted. +# Each [[nodes]] entry declares a neuron daemon in the fleet. +# Models are discovered by polling the neuron's /models endpoint. +# Pinned models (see models.toml) are never evicted. [[nodes]] name = "gpu-large" diff --git a/crates/cortex-core/src/anthropic.rs b/crates/cortex-core/src/anthropic.rs index 921eda7..3ba5756 100644 --- a/crates/cortex-core/src/anthropic.rs +++ b/crates/cortex-core/src/anthropic.rs @@ -2,7 +2,7 @@ //! //! 
These mirror the `/v1/messages` format used by the Anthropic API. //! The gateway accepts these, translates to OpenAI format, proxies to -//! mistral.rs, then translates the response back. +//! the inference backend (neuron), then translates the response back. use serde::{Deserialize, Serialize}; use serde_json::Value; diff --git a/crates/cortex-core/src/harness.rs b/crates/cortex-core/src/harness.rs index 6bf8fb8..1fcae56 100644 --- a/crates/cortex-core/src/harness.rs +++ b/crates/cortex-core/src/harness.rs @@ -9,13 +9,13 @@ use async_trait::async_trait; use serde::{Deserialize, Serialize}; /// Configuration for a harness instance on a neuron. +/// +/// All current harnesses are in-process (candle); per-harness tuning +/// (cache paths, device policies, etc.) lives in dedicated config +/// blocks rather than on this struct. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct HarnessConfig { pub name: String, - /// Base URL of the harness (e.g. "http://localhost:8080" for mistral.rs). - pub endpoint: Option, - /// Systemd unit name, if the harness is managed via systemd. - pub systemd_unit: Option, } /// Health status of a harness process. @@ -47,16 +47,24 @@ pub struct ModelInfo { } /// What an inference harness must do, from neuron's perspective. +/// +/// All current harnesses are in-process — they share neuron's address +/// space and lifecycle. `start`/`stop` therefore default to no-ops; a +/// future process-supervising harness would override them. #[async_trait] pub trait Harness: Send + Sync { - /// Human-readable name (e.g. "mistralrs", "llamacpp", "comfyui"). + /// Human-readable name (e.g. "candle"). fn name(&self) -> &str; - /// Start the harness process if it is not already running. - async fn start(&self, config: &HarnessConfig) -> Result<()>; + /// Start the harness. Default no-op for in-process harnesses. + async fn start(&self, _config: &HarnessConfig) -> Result<()> { + Ok(()) + } - /// Stop the harness process gracefully. 
- async fn stop(&self) -> Result<()>; + /// Stop the harness. Default no-op for in-process harnesses. + async fn stop(&self) -> Result<()> { + Ok(()) + } /// Health check. Returns the harness process status. async fn health(&self) -> HarnessHealth; diff --git a/crates/cortex-core/src/openai.rs b/crates/cortex-core/src/openai.rs index efdf7de..837a8de 100644 --- a/crates/cortex-core/src/openai.rs +++ b/crates/cortex-core/src/openai.rs @@ -3,7 +3,7 @@ //! These are a subset sufficient for chat completions (streaming + non-streaming). //! Fields not relevant to proxying are captured as `serde_json::Value` via //! `#[serde(flatten)]` so we forward them without needing to enumerate every -//! extension field mistral.rs supports. +//! extension field a backend might support. use serde::{Deserialize, Serialize}; use serde_json::Value; @@ -22,7 +22,7 @@ pub struct ChatCompletionRequest { pub max_tokens: Option, #[serde(skip_serializing_if = "Option::is_none")] pub stream: Option, - /// All other fields (tools, response_format, mistral.rs extensions, etc.) + /// All other fields (tools, response_format, backend extensions, etc.) #[serde(flatten)] pub extra: Value, } diff --git a/crates/cortex-gateway/tests/common/mod.rs b/crates/cortex-gateway/tests/common/mod.rs index bb8dee2..c080e53 100644 --- a/crates/cortex-gateway/tests/common/mod.rs +++ b/crates/cortex-gateway/tests/common/mod.rs @@ -22,6 +22,7 @@ use tokio::net::TcpListener; /// - GET /models/:id/endpoint (returns the inference URL) /// - POST /models/unload (accepts unload requests) /// - GET /v1/chat/completions + POST /v1/chat/completions (inference) +/// /// Returns the neuron base URL. 
pub async fn spawn_mock_neuron() -> String { let listener = TcpListener::bind("127.0.0.1:0").await.unwrap(); @@ -54,7 +55,7 @@ pub async fn spawn_mock_neuron() -> String { async fn mock_neuron_list_models() -> Json { Json(json!([ - {"id": "test-model", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000} + {"id": "test-model", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000} ])) } diff --git a/crates/cortex-gateway/tests/poller.rs b/crates/cortex-gateway/tests/poller.rs index 0167e73..c3f9d23 100644 --- a/crates/cortex-gateway/tests/poller.rs +++ b/crates/cortex-gateway/tests/poller.rs @@ -12,8 +12,8 @@ use std::sync::Arc; async fn test_poller_discovers_models() { // Mock neuron reports 2 models via /models endpoint (neuron format). let mock_url = common::spawn_mock_neuron_with_models(json!([ - {"id": "model-a", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": 8000}, - {"id": "model-b", "harness": "mistralrs", "status": "unloaded", "devices": [], "vram_used_mb": null} + {"id": "model-a", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": 8000}, + {"id": "model-b", "harness": "candle", "status": "unloaded", "devices": [], "vram_used_mb": null} ])) .await; @@ -63,8 +63,8 @@ async fn test_poller_discovers_models() { #[tokio::test] async fn test_poller_updates_gateway_models_endpoint() { let mock_url = common::spawn_mock_neuron_with_models(json!([ - {"id": "model-x", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}, - {"id": "model-y", "harness": "mistralrs", "status": "loaded", "devices": [1], "vram_used_mb": null} + {"id": "model-x", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}, + {"id": "model-y", "harness": "candle", "status": "loaded", "devices": [1], "vram_used_mb": null} ])) .await; @@ -152,8 +152,8 @@ async fn test_poller_marks_unreachable_node_unhealthy() { #[tokio::test] async fn 
test_poller_removes_stale_models() { let mock_url = common::spawn_mock_neuron_with_models(json!([ - {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null}, - {"id": "drop-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null} + {"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null}, + {"id": "drop-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null} ])) .await; @@ -183,7 +183,7 @@ async fn test_poller_removes_stale_models() { // New mock with only one model. let new_mock_url = common::spawn_mock_neuron_with_models(json!([ - {"id": "keep-me", "harness": "mistralrs", "status": "loaded", "devices": [0], "vram_used_mb": null} + {"id": "keep-me", "harness": "candle", "status": "loaded", "devices": [0], "vram_used_mb": null} ])) .await; diff --git a/crates/cortex-gateway/tests/streaming.rs b/crates/cortex-gateway/tests/streaming.rs index 8b31d30..bb2f5d8 100644 --- a/crates/cortex-gateway/tests/streaming.rs +++ b/crates/cortex-gateway/tests/streaming.rs @@ -51,18 +51,18 @@ async fn test_streaming_sse_passthrough() { } assert!( - chunks.len() >= chunk_count + 1, - "expected at least {} chunks (got {}): {:?}", - chunk_count + 1, + chunks.len() > chunk_count, + "expected more than {} chunks (got {}): {:?}", + chunk_count, chunks.len(), chunks, ); assert_eq!(chunks.last().unwrap(), "[DONE]"); - for i in 0..chunk_count { + for (i, chunk) in chunks.iter().enumerate().take(chunk_count) { let chunk_json: serde_json::Value = - serde_json::from_str(&chunks[i]).expect("chunk should be valid JSON"); + serde_json::from_str(chunk).expect("chunk should be valid JSON"); assert_eq!( chunk_json["choices"][0]["delta"]["content"], format!("token{i}") diff --git a/crates/neuron/src/harness/candle.rs b/crates/neuron/src/harness/candle.rs new file mode 100644 index 0000000..c872fdf --- /dev/null +++ b/crates/neuron/src/harness/candle.rs @@ -0,0 
+1,54 @@ +//! Candle harness — in-process inference using huggingface/candle. +//! +//! This is the sole `Harness` implementation. Unlike the previous +//! mistralrs/llamacpp harnesses, candle inference runs inside the neuron +//! process itself — no external subprocess, no systemd indirection. +//! +//! Stage 1 ships this as an inert skeleton; Stage 2 wires up actual +//! model load/unload via `candle-transformers`. + +use anyhow::Result; +use async_trait::async_trait; +use cortex_core::harness::{Harness, HarnessHealth, ModelInfo, ModelSpec}; + +pub struct CandleHarness { + /// URL where this neuron serves inference (its own bind address). + bind_url: String, +} + +impl CandleHarness { + pub fn new(bind_url: String) -> Self { + Self { bind_url } + } +} + +#[async_trait] +impl Harness for CandleHarness { + fn name(&self) -> &str { + "candle" + } + + async fn health(&self) -> HarnessHealth { + HarnessHealth { + name: "candle".into(), + running: true, + uptime_secs: None, + } + } + + async fn list_models(&self) -> Result<Vec<ModelInfo>> { + Ok(Vec::new()) + } + + async fn load_model(&self, _spec: &ModelSpec) -> Result<()> { + anyhow::bail!("candle harness load_model not implemented yet (Stage 2)") + } + + async fn unload_model(&self, _model_id: &str) -> Result<()> { + anyhow::bail!("candle harness unload_model not implemented yet (Stage 2)") + } + + async fn inference_endpoint(&self, _model_id: &str) -> Option<String> { + Some(self.bind_url.clone()) + } +} diff --git a/crates/neuron/src/harness/llamacpp.rs b/crates/neuron/src/harness/llamacpp.rs deleted file mode 100644 index 2424d06..0000000 --- a/crates/neuron/src/harness/llamacpp.rs +++ /dev/null @@ -1 +0,0 @@ -// llama.cpp harness implementation — Phase 11. diff --git a/crates/neuron/src/harness/mistralrs.rs b/crates/neuron/src/harness/mistralrs.rs deleted file mode 100644 index c1d5780..0000000 --- a/crates/neuron/src/harness/mistralrs.rs +++ /dev/null @@ -1,163 +0,0 @@ -//! mistral.rs harness implementation. -//! -//! 
Wraps the mistral.rs HTTP API for model lifecycle management -//! and optionally manages the process via systemd. - -use anyhow::Result; -use async_trait::async_trait; -use cortex_core::harness::{Harness, HarnessConfig, HarnessHealth, ModelInfo, ModelSpec}; -use reqwest::Client; -use serde::Deserialize; - -pub struct MistralRsHarness { - endpoint: String, - systemd_unit: Option<String>, - client: Client, -} - -impl MistralRsHarness { - pub fn new(endpoint: String, systemd_unit: Option<String>) -> Self { - Self { - endpoint, - systemd_unit, - client: Client::builder() - .timeout(std::time::Duration::from_secs(30)) - .build() - .expect("failed to build HTTP client"), - } - } -} - -/// Response from mistral.rs `GET /v1/models`. -#[derive(Debug, Deserialize)] -struct ModelsResponse { - data: Vec<ModelEntry>, -} - -#[derive(Debug, Deserialize)] -struct ModelEntry { - id: String, - #[serde(default)] - status: Option<String>, -} - -#[async_trait] -impl Harness for MistralRsHarness { - fn name(&self) -> &str { - "mistralrs" - } - - async fn start(&self, _config: &HarnessConfig) -> Result<()> { - let Some(unit) = &self.systemd_unit else { - anyhow::bail!("no systemd unit configured for mistralrs harness"); - }; - - let output = tokio::process::Command::new("systemctl") - .args(["start", unit]) - .output() - .await?; - - if !output.status.success() { - let stderr = String::from_utf8_lossy(&output.stderr); - anyhow::bail!("systemctl start {unit} failed: {stderr}"); - } - - // Wait for the health endpoint to respond (up to 30s). 
- let url = format!("{}/health", self.endpoint); - for _ in 0..30 { - tokio::time::sleep(std::time::Duration::from_secs(1)).await; - if self.client.get(&url).send().await.is_ok() { - tracing::info!(unit, "mistralrs started and healthy"); - return Ok(()); - } - } - anyhow::bail!("mistralrs started but health endpoint did not respond within 30s"); - } - - async fn stop(&self) -> Result<()> { - let Some(unit) = &self.systemd_unit else { - anyhow::bail!("no systemd unit configured for mistralrs harness"); - }; - - let output = tokio::process::Command::new("systemctl") - .args(["stop", unit]) - .output() - .await?; - - if !output.status.success() { - let stderr = String::from_utf8_lossy(&output.stderr); - anyhow::bail!("systemctl stop {unit} failed: {stderr}"); - } - Ok(()) - } - - async fn health(&self) -> HarnessHealth { - let url = format!("{}/health", self.endpoint); - let running = self.client.get(&url).send().await.is_ok(); - HarnessHealth { - name: "mistralrs".into(), - running, - uptime_secs: None, - } - } - - async fn list_models(&self) -> Result<Vec<ModelInfo>> { - let url = format!("{}/v1/models", self.endpoint); - let resp = self.client.get(&url).send().await?; - - if !resp.status().is_success() { - anyhow::bail!("GET /v1/models returned {}", resp.status()); - } - - let models_resp: ModelsResponse = resp.json().await?; - Ok(models_resp - .data - .into_iter() - .map(|m| ModelInfo { - id: m.id, - harness: "mistralrs".into(), - status: m.status.unwrap_or_else(|| "loaded".into()), - devices: vec![], - vram_used_mb: None, - }) - .collect()) - } - - async fn load_model(&self, spec: &ModelSpec) -> Result<()> { - let url = format!("{}/v1/models/reload", self.endpoint); - let resp = self - .client - .post(&url) - .json(&serde_json::json!({ "model_id": spec.model_id })) - .send() - .await?; - - if !resp.status().is_success() { - let body = resp.text().await.unwrap_or_default(); - anyhow::bail!("POST /v1/models/reload failed: {body}"); - } - Ok(()) - } - - async fn 
unload_model(&self, model_id: &str) -> Result<()> { - let url = format!("{}/v1/models/unload", self.endpoint); - let resp = self - .client - .post(&url) - .json(&serde_json::json!({ "model_id": model_id })) - .send() - .await?; - - if !resp.status().is_success() { - let body = resp.text().await.unwrap_or_default(); - anyhow::bail!("POST /v1/models/unload failed: {body}"); - } - Ok(()) - } - - async fn inference_endpoint(&self, _model_id: &str) -> Option<String> { - // mistral.rs routes internally by model name in the request body, - // so the inference endpoint is always the base URL. - Some(self.endpoint.clone()) - } -} diff --git a/crates/neuron/src/harness/mod.rs b/crates/neuron/src/harness/mod.rs index 285ea3a..b8746fb 100644 --- a/crates/neuron/src/harness/mod.rs +++ b/crates/neuron/src/harness/mod.rs @@ -1,7 +1,6 @@ //! Harness registry — maps harness names to trait implementations. -pub mod llamacpp; -pub mod mistralrs; +pub mod candle; use anyhow::Result; use cortex_core::harness::{Harness, HarnessConfig, ModelInfo, ModelSpec}; @@ -81,19 +80,16 @@ impl HarnessRegistry { } /// Build a registry from harness configs. - pub fn from_configs(configs: &[HarnessConfig]) -> Self { + /// + /// `bind_url` is the URL where this neuron serves inference (its own + /// listen address). In-process harnesses (currently the only kind) + /// return this URL from `inference_endpoint`. 
+ pub fn from_configs(configs: &[HarnessConfig], bind_url: &str) -> Self { let mut registry = Self::new(); for config in configs { match config.name.as_str() { - "mistralrs" => { - if let Some(endpoint) = &config.endpoint { - registry.register(Box::new(mistralrs::MistralRsHarness::new( - endpoint.clone(), - config.systemd_unit.clone(), - ))); - } else { - tracing::warn!("mistralrs harness missing endpoint, skipping"); - } + "candle" => { + registry.register(Box::new(candle::CandleHarness::new(bind_url.to_string()))); } other => { tracing::warn!(harness = other, "unknown harness type, skipping"); diff --git a/crates/neuron/src/main.rs b/crates/neuron/src/main.rs index 8d70c7f..ef1d13e 100644 --- a/crates/neuron/src/main.rs +++ b/crates/neuron/src/main.rs @@ -37,6 +37,7 @@ async fn main() -> Result<()> { }); let port = args.port.unwrap_or(cfg.port); + let bind_url = format!("http://localhost:{port}"); let start_time = Instant::now(); tracing::info!("running hardware discovery"); @@ -47,8 +48,10 @@ async fn main() -> Result<()> { "discovery complete" ); - // Build harness registry from config. - let registry = HarnessRegistry::from_configs(&cfg.harnesses); + // Build harness registry from config. In-process harnesses (candle) + // need to know neuron's own bind URL so they can return it from + // inference_endpoint. + let registry = HarnessRegistry::from_configs(&cfg.harnesses, &bind_url); discovery_result.harnesses = registry.names(); let health_cache = Arc::new(health::HealthCache::new()); diff --git a/crates/neuron/tests/api.rs b/crates/neuron/tests/api.rs index f5f5434..f6448ae 100644 --- a/crates/neuron/tests/api.rs +++ b/crates/neuron/tests/api.rs @@ -135,50 +135,18 @@ async fn test_models_empty_registry() { assert!(body.as_array().unwrap().is_empty()); } -/// Spawn a mock mistral.rs backend and a neuron with the mistralrs harness -/// pointing at it, then test the full model lifecycle through neuron's API. 
+/// Verify the candle harness registers and the load endpoint returns a +/// "not implemented" error in Stage 1 (Stage 2 wires up actual loading). #[tokio::test] -async fn test_models_via_mistralrs_harness() { - use axum::routing::{get, post}; - use axum::{Json, Router}; +async fn test_candle_harness_registers_but_load_unimplemented() { use cortex_core::harness::HarnessConfig; - use serde_json::Value; - // Mock mistral.rs backend. - let mock_app = Router::new() - .route( - "/v1/models", - get(|| async { - Json(json!({ - "data": [ - {"id": "test-model", "status": "loaded"}, - {"id": "other-model", "status": "unloaded"} - ] - })) - }), - ) - .route( - "/v1/models/unload", - post(|Json(_body): Json| async { Json(json!({"status": "ok"})) }), - ) - .route( - "/v1/models/reload", - post(|Json(_body): Json| async { Json(json!({"status": "ok"})) }), - ); - - let mock_listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap(); - let mock_addr = mock_listener.local_addr().unwrap(); - tokio::spawn(async move { - axum::serve(mock_listener, mock_app).await.unwrap(); - }); - let mock_url = format!("http://{mock_addr}"); - - // Build neuron with mistralrs harness pointing at mock. - let registry = HarnessRegistry::from_configs(&[HarnessConfig { - name: "mistralrs".into(), - endpoint: Some(mock_url.clone()), - systemd_unit: None, - }]); + let registry = HarnessRegistry::from_configs( + &[HarnessConfig { + name: "candle".into(), + }], + "http://localhost:13131", + ); let health_cache = Arc::new(HealthCache::new()); let state = Arc::new(NeuronState { @@ -197,7 +165,7 @@ async fn test_models_via_mistralrs_harness() { let client = reqwest::Client::new(); - // GET /models — should return models from mock mistralrs. + // GET /models — candle harness has no models loaded yet. 
let resp = client .get(format!("{neuron_url}/models")) .send() @@ -205,45 +173,14 @@ async fn test_models_via_mistralrs_harness() { .unwrap(); assert_eq!(resp.status(), 200); let models: Vec = resp.json().await.unwrap(); - assert_eq!(models.len(), 2); - assert_eq!(models[0]["id"], "test-model"); - assert_eq!(models[0]["harness"], "mistralrs"); - assert_eq!(models[0]["status"], "loaded"); - assert_eq!(models[1]["id"], "other-model"); - assert_eq!(models[1]["status"], "unloaded"); + assert!(models.is_empty()); - // GET /models/test-model/endpoint — should return mock URL. - let resp = client - .get(format!("{neuron_url}/models/test-model/endpoint")) - .send() - .await - .unwrap(); - assert_eq!(resp.status(), 200); - let body: serde_json::Value = resp.json().await.unwrap(); - assert_eq!(body["url"], mock_url); - - // POST /models/unload — should succeed. - let resp = client - .post(format!("{neuron_url}/models/unload")) - .json(&json!({"model_id": "test-model"})) - .send() - .await - .unwrap(); - assert_eq!(resp.status(), 200); - let body: serde_json::Value = resp.json().await.unwrap(); - assert_eq!(body["status"], "unloaded"); - - // POST /models/load — should succeed. + // POST /models/load — Stage 1 skeleton returns an error. let resp = client .post(format!("{neuron_url}/models/load")) - .json(&json!({ - "model_id": "test-model", - "harness": "mistralrs" - })) + .json(&json!({"model_id": "some-model", "harness": "candle"})) .send() .await .unwrap(); - assert_eq!(resp.status(), 200); - let body: serde_json::Value = resp.json().await.unwrap(); - assert_eq!(body["status"], "loaded"); + assert_eq!(resp.status(), 400); } diff --git a/helexa-neuron.spec b/helexa-neuron.spec index 4c23c8a..dfec856 100644 --- a/helexa-neuron.spec +++ b/helexa-neuron.spec @@ -37,8 +37,9 @@ Provides: user(neuron) %description Neuron is a per-node daemon for cortex inference clusters. 
It discovers -local GPU hardware via nvidia-smi, manages inference harnesses (mistral.rs, -llama.cpp), and exposes an HTTP API for model lifecycle management. +local GPU hardware via nvidia-smi, runs in-process inference via +huggingface/candle, and exposes an HTTP API for model lifecycle +management (load, unload, list, inference endpoint). %prep %autosetup diff --git a/models.example.toml b/models.example.toml index 073ece2..0f2c9c3 100644 --- a/models.example.toml +++ b/models.example.toml @@ -6,7 +6,7 @@ [[models]] id = "your-org/large-model" -harness = "mistralrs" +harness = "candle" quant = "Q4_K_M" vram_mb = 19000 min_devices = 2 @@ -15,7 +15,7 @@ pinned_on = ["gpu-large"] [[models]] id = "your-org/medium-model" -harness = "mistralrs" +harness = "candle" quant = "Q6_K" vram_mb = 12000 min_devices = 1 @@ -23,7 +23,7 @@ pinned_on = ["gpu-medium"] [[models]] id = "your-org/embedding-model" -harness = "mistralrs" +harness = "candle" quant = "Q8_0" vram_mb = 8000 min_devices = 1 diff --git a/neuron.example.toml b/neuron.example.toml index 7bf1220..b728b3d 100644 --- a/neuron.example.toml +++ b/neuron.example.toml @@ -8,9 +8,9 @@ port = 13131 # -- Harnesses --------------------------------------------------------------- -# Each [[harnesses]] entry declares an inference engine managed by neuron. +# Each [[harnesses]] entry declares an inference engine. Currently only +# "candle" is supported — it runs in-process and uses huggingface/candle +# for inference on local CUDA devices. [[harnesses]] -name = "mistralrs" -endpoint = "http://localhost:8080" -systemd_unit = "mistralrs.service" +name = "candle"