feat(neuron): wire candle harness load/unload via GGUF

Stage 2 of the candle-native pivot. Fleshes out CandleHarness with a
LoadedModel registry keyed by model_id, hf-hub-backed GGUF download,
and Qwen3 quantized weight construction via candle-transformers'
quantized_qwen3 module. unload_model drops the entry; Drop on the
candle ModelWeights frees device memory.
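
As a rough illustration of the load/unload lifecycle described above, the sketch below keeps a registry keyed by model_id and relies on Drop to release resources when an entry is removed. It is not the code from this commit: `FakeWeights`, `ModelRegistry`, and the live-counter are hypothetical stand-ins, and the real `LoadedModel` holds candle `ModelWeights` instead.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for candle's `ModelWeights`. The counter mimics
/// device memory: incremented on construction, decremented on drop.
struct FakeWeights {
    live: Arc<AtomicUsize>,
}

impl FakeWeights {
    fn new(live: Arc<AtomicUsize>) -> Self {
        live.fetch_add(1, Ordering::SeqCst);
        Self { live }
    }
}

impl Drop for FakeWeights {
    fn drop(&mut self) {
        // In the real harness, dropping the weights frees device memory.
        self.live.fetch_sub(1, Ordering::SeqCst);
    }
}

/// One registry entry (assumed shape; the real struct may carry more).
struct LoadedModel {
    _weights: FakeWeights,
}

/// Registry keyed by model_id.
#[derive(Default)]
struct ModelRegistry {
    models: HashMap<String, LoadedModel>,
}

impl ModelRegistry {
    fn load(&mut self, model_id: &str, weights: FakeWeights) {
        self.models
            .insert(model_id.to_string(), LoadedModel { _weights: weights });
    }

    /// Removing the entry drops it, which in turn drops the weights.
    fn unload(&mut self, model_id: &str) -> bool {
        self.models.remove(model_id).is_some()
    }

    fn list(&self) -> Vec<String> {
        let mut ids: Vec<String> = self.models.keys().cloned().collect();
        ids.sort();
        ids
    }
}
```

The point of the design is that unload needs no explicit free call: ownership of the weights lives in the map entry, so removing the entry is enough.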

Device selection prefers CUDA (gated behind the new `cuda` feature),
falling back to CPU when CUDA is unavailable so default builds work
on non-GPU hosts. The candle CUDA toolchain isn't pulled in unless
`--features cuda` is passed, keeping CI green on CPU runners.
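
The fallback logic above can be sketched without any candle dependency. Everything here is an assumption for illustration: `DeviceChoice`, `select_device`, and the `probe` closure are hypothetical names, and in a real crate the `cuda_compiled` flag would come from `cfg!(feature = "cuda")` while the probe would try to construct a CUDA device.

```rust
/// Hypothetical device descriptor; the real code uses candle's device type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeviceChoice {
    Cpu,
    Cuda(usize),
}

/// Prefer CUDA ordinal 0 when CUDA support is compiled in and the probe
/// succeeds; otherwise fall back to CPU.
fn select_device(
    cuda_compiled: bool,
    probe: impl Fn(usize) -> Result<(), String>,
) -> DeviceChoice {
    if cuda_compiled && probe(0).is_ok() {
        return DeviceChoice::Cuda(0);
    }
    // Either the feature is off or no usable GPU was found.
    DeviceChoice::Cpu
}
```

With the feature off the probe is never called, which mirrors how a default build never touches the CUDA toolchain.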

Config gains a [harness.candle] block with an optional hf_cache path.
HarnessRegistry::from_configs now takes HarnessSettings so per-harness
config flows through.
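
A minimal sketch of how an optional hf_cache path might be modeled and resolved. The field layout, the `CandleSettings` name, and the `effective_cache_dir` helper are assumptions, not taken from the commit; deserialization from the [harness.candle] block is omitted to keep the example dependency-free.

```rust
use std::path::{Path, PathBuf};

/// Assumed shape of the [harness.candle] block.
#[derive(Debug, Default, Clone, PartialEq)]
struct CandleSettings {
    /// Optional override for the Hugging Face cache directory.
    hf_cache: Option<PathBuf>,
}

/// Assumed container for per-harness settings.
#[derive(Debug, Default, Clone, PartialEq)]
struct HarnessSettings {
    candle: CandleSettings,
}

/// Hypothetical helper: use the configured path when present, otherwise
/// the caller-supplied fallback.
fn effective_cache_dir<'a>(settings: &'a HarnessSettings, fallback: &'a Path) -> &'a Path {
    settings.candle.hf_cache.as_deref().unwrap_or(fallback)
}
```

Because both structs derive Default, a call such as `HarnessSettings::default()` yields a configuration with no cache override, which is what the default-feature test passes in.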

A gated tests/candle_lifecycle.rs exercises real load → list → unload
→ list-empty when run with `--features cuda-integration` against a
host with HF network access. The default-feature test in tests/api.rs
covers the wrong-harness rejection path without needing the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:02:49 +03:00
parent 3cccc2c56b
commit 5c2bd1a1da
9 changed files with 1934 additions and 47 deletions

@@ -135,17 +135,21 @@ async fn test_models_empty_registry() {
     assert!(body.as_array().unwrap().is_empty());
 }
-/// Verify the candle harness registers and the load endpoint returns a
-/// "not implemented" error in Stage 1 (Stage 2 wires up actual loading).
+/// Verify the candle harness registers, list is empty by default, and a
+/// load attempt for an obviously-bogus model id returns a 4xx error
+/// without crashing the daemon. Real load/unload exercising actual GGUF
+/// download is covered by `tests/candle_lifecycle.rs` (cuda-integration).
 #[tokio::test]
-async fn test_candle_harness_registers_but_load_unimplemented() {
+async fn test_candle_harness_registers_and_rejects_bogus_model() {
     use cortex_core::harness::HarnessConfig;
+    use neuron::config::HarnessSettings;
     let registry = HarnessRegistry::from_configs(
         &[HarnessConfig {
             name: "candle".into(),
         }],
         "http://localhost:13131",
+        &HarnessSettings::default(),
     );
     let health_cache = Arc::new(HealthCache::new());
@@ -165,7 +169,6 @@ async fn test_candle_harness_registers_but_load_unimplemented() {
     let client = reqwest::Client::new();
     // GET /models — candle harness has no models loaded yet.
     let resp = client
         .get(format!("{neuron_url}/models"))
         .send()
@@ -175,12 +178,22 @@ async fn test_candle_harness_registers_but_load_unimplemented() {
     let models: Vec<serde_json::Value> = resp.json().await.unwrap();
     assert!(models.is_empty());
-    // POST /models/load — Stage 1 skeleton returns an error.
+    // Sending a wrong-harness spec should be rejected synchronously
+    // without touching the network or the model registry.
     let resp = client
         .post(format!("{neuron_url}/models/load"))
-        .json(&json!({"model_id": "some-model", "harness": "candle"}))
+        .json(&json!({"model_id": "definitely/not-real", "harness": "not-candle"}))
         .send()
         .await
         .unwrap();
     assert_eq!(resp.status(), 400);
+    // Registry still empty.
+    let resp = client
+        .get(format!("{neuron_url}/models"))
+        .send()
+        .await
+        .unwrap();
+    let models: Vec<serde_json::Value> = resp.json().await.unwrap();
+    assert!(models.is_empty());
 }