feat(neuron): wire candle harness load/unload via GGUF

Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a LoadedModel registry keyed by model_id, hf-hub-backed GGUF download, and Qwen3 quantized weight construction via candle-transformers' quantized_qwen3 module. unload_model drops the entry; Drop on the candle ModelWeights frees device memory.

Device selection prefers CUDA (gated behind the new `cuda` feature), falling back to CPU when CUDA is unavailable so default builds work on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless `--features cuda` is passed, keeping CI green on CPU runners.

Config gains a [harness.candle] block with an optional hf_cache path. HarnessRegistry::from_configs now takes HarnessSettings so per-harness config flows through.

A gated tests/candle_lifecycle.rs exercises real load → list → unload → list-empty when run with `--features cuda-integration` against a host with HF network access. The default-feature test in tests/api.rs covers the wrong-harness rejection path without needing the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:02:49 +03:00
parent 3cccc2c56b
commit 5c2bd1a1da
9 changed files with 1934 additions and 47 deletions
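
The CUDA-first, CPU-fallback device selection described in the commit message might look roughly like the sketch below. It uses only the standard library: the `Device` enum, `select_device`, and the `probe` callback are hypothetical stand-ins, not the candle API or the code in this commit. In a real build the first argument would presumably be `cfg!(feature = "cuda")` and the probe would attempt to open CUDA device 0.

```rust
/// Hypothetical stand-in for a compute device handle (not candle's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Cuda(usize),
}

/// Pick a device: CUDA ordinal 0 if CUDA support was compiled in AND the
/// probe succeeds, otherwise CPU.
///
/// `cuda_compiled_in` models the `cuda` cargo feature; `probe` models the
/// runtime attempt to initialise the GPU. Both names are assumptions.
fn select_device(
    cuda_compiled_in: bool,
    probe: impl FnOnce(usize) -> Result<(), String>,
) -> Device {
    if !cuda_compiled_in {
        // Default builds never touch the CUDA toolchain.
        return Device::Cpu;
    }
    match probe(0) {
        Ok(()) => Device::Cuda(0),
        // CUDA was compiled in but is unusable on this host: fall back.
        Err(_) => Device::Cpu,
    }
}
```

Passing the feature flag and the probe as parameters keeps the fallback logic testable on CPU-only runners, where no GPU probe can succeed.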
--- a/crates/neuron/tests/api.rs
+++ b/crates/neuron/tests/api.rs
@@ -135,17 +135,21 @@ async fn test_models_empty_registry() {
    assert!(body.as_array().unwrap().is_empty());
 }

-/// Verify the candle harness registers and the load endpoint returns a
-/// "not implemented" error in Stage 1 (Stage 2 wires up actual loading).
+/// Verify the candle harness registers, the model list is empty by
+/// default, and a load request naming the wrong harness ("not-candle")
+/// returns 400 without crashing the daemon. Real load/unload with actual
+/// GGUF download is covered by `tests/candle_lifecycle.rs` (cuda-integration).
 #[tokio::test]
-async fn test_candle_harness_registers_but_load_unimplemented() {
+async fn test_candle_harness_registers_and_rejects_bogus_model() {
    use cortex_core::harness::HarnessConfig;
+    use neuron::config::HarnessSettings;

    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:13131",
+        &HarnessSettings::default(),
    );

    let health_cache = Arc::new(HealthCache::new());
@@ -165,7 +169,6 @@ async fn test_candle_harness_registers_but_load_unimplemented() {

    let client = reqwest::Client::new();

-    // GET /models — candle harness has no models loaded yet.
    let resp = client
        .get(format!("{neuron_url}/models"))
        .send()
@@ -175,12 +178,22 @@ async fn test_candle_harness_registers_but_load_unimplemented() {
    let models: Vec<serde_json::Value> = resp.json().await.unwrap();
    assert!(models.is_empty());

-    // POST /models/load — Stage 1 skeleton returns an error.
+    // Sending a wrong-harness spec should be rejected synchronously
+    // without touching the network or the model registry.
    let resp = client
        .post(format!("{neuron_url}/models/load"))
-        .json(&json!({"model_id": "some-model", "harness": "candle"}))
+        .json(&json!({"model_id": "definitely/not-real", "harness": "not-candle"}))
        .send()
        .await
        .unwrap();
    assert_eq!(resp.status(), 400);
+
+    // Registry still empty.
+    let resp = client
+        .get(format!("{neuron_url}/models"))
+        .send()
+        .await
+        .unwrap();
+    let models: Vec<serde_json::Value> = resp.json().await.unwrap();
+    assert!(models.is_empty());
 }
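
For orientation, the registry behaviour this test and the gated lifecycle test rely on (wrong-harness rejection, then load → list → unload → list-empty) can be modelled with a small in-memory sketch. This is a simplified, synchronous illustration under stated assumptions: `CandleRegistry`, `LoadError`, and the `weights` placeholder are hypothetical names, and no GGUF download or candle weight construction takes place.

```rust
use std::collections::HashMap;

/// Hypothetical entry; the real harness would hold candle ModelWeights here.
struct LoadedModel {
    weights: Vec<u8>, // placeholder for device-resident weights
}

#[derive(Debug, PartialEq)]
enum LoadError {
    /// The request named a harness other than "candle".
    WrongHarness(String),
}

/// Hypothetical registry keyed by model_id.
struct CandleRegistry {
    models: HashMap<String, LoadedModel>,
}

impl CandleRegistry {
    fn new() -> Self {
        Self { models: HashMap::new() }
    }

    /// Reject wrong-harness requests before doing any work, so the
    /// registry (and, in a real harness, the network) is never touched.
    fn load(&mut self, model_id: &str, harness: &str) -> Result<(), LoadError> {
        if harness != "candle" {
            return Err(LoadError::WrongHarness(harness.to_string()));
        }
        // A real implementation would download and parse the GGUF here.
        self.models
            .insert(model_id.to_string(), LoadedModel { weights: Vec::new() });
        Ok(())
    }

    /// Remove the entry; dropping the value releases its resources.
    /// Returns false if the model was not loaded.
    fn unload(&mut self, model_id: &str) -> bool {
        self.models.remove(model_id).is_some()
    }

    /// Loaded model ids, sorted for deterministic output.
    fn list(&self) -> Vec<String> {
        let mut ids: Vec<String> = self.models.keys().cloned().collect();
        ids.sort();
        ids
    }
}
```

Checking the harness name first is what lets the default-feature test assert an empty model list after the failed load without needing network access.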