Stage 3 of the candle-native pivot.

neuron now serves POST /v1/chat/completions backed by candle's quantized_qwen3 forward pass on a per-model serialised generation loop, returning the standard OpenAI ChatCompletionResponse envelope.

Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire the per-model arch lock and run prefill + greedy/temperature/top-p sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason "stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with prompt/completion/total usage counts.

Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard can be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli (missed during the Stage 1 sweep) are corrected here.

Three new default-feature tests cover the no-candle 503, model-not-loaded 404, and stream=true 501 paths. The cuda-integration test from Stage 2 still covers real load/unload; a streaming-feature gated test exercising actual generation will arrive with Stage 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
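The status-code and finish_reason rules described in the commit message can be illustrated with a small, std-only sketch. It is not the code from this commit: the variant payloads, the `Other` variant, and the `status_for` and `finish_reason` helpers are hypothetical names chosen for illustration; only the ModelNotLoaded → 404, other → 500, and "stop" vs "length" rules come from the message itself.

```rust
use std::fmt;

/// Hypothetical stand-in for the typed error mentioned in the commit
/// message. The real enum's variants and payloads may differ.
#[derive(Debug)]
enum InferenceError {
    /// The requested model id is not present in the registry.
    ModelNotLoaded(String),
    /// Any other failure (tokenization, forward pass, sampling, ...).
    Other(String),
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferenceError::ModelNotLoaded(id) => write!(f, "model not loaded: {id}"),
            InferenceError::Other(msg) => write!(f, "inference failed: {msg}"),
        }
    }
}

impl std::error::Error for InferenceError {}

/// Map a typed error to an HTTP status by matching on the variant,
/// rather than by inspecting the error message text.
fn status_for(err: &InferenceError) -> u16 {
    match err {
        InferenceError::ModelNotLoaded(_) => 404,
        InferenceError::Other(_) => 500,
    }
}

/// Choose the OpenAI-style finish_reason: "stop" when generation ended on
/// an EOS token, "length" when it ended because max_tokens was reached.
fn finish_reason(hit_eos: bool) -> &'static str {
    if hit_eos {
        "stop"
    } else {
        "length"
    }
}

fn main() {
    let missing = InferenceError::ModelNotLoaded("Qwen/Qwen3-0.6B-GGUF".to_string());
    let failed = InferenceError::Other("forward pass failed".to_string());
    println!("{missing} -> {}", status_for(&missing));
    println!("{failed} -> {}", status_for(&failed));
    println!("finish_reason = {}", finish_reason(false));
}
```

Matching on a variant keeps the handler's status mapping stable even if the wording of an error message changes, which appears to be the motivation the commit message gives for introducing the typed error.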
88 lines
3.0 KiB
Rust
//! Real model load/unload lifecycle through the candle harness.
//!
//! Gated behind the `cuda-integration` feature because it downloads a
//! real (small) GGUF from HuggingFace and materialises tensors on the
//! configured device. Run on a host with network access and either a
//! CUDA GPU (when built with `--features cuda`) or enough CPU RAM to
//! hold the model.
//!
//! Usage:
//!   cargo test -p neuron --features cuda-integration --test candle_lifecycle
//!
//! Optional environment variables:
//!   NEURON_TEST_MODEL_ID — HuggingFace repo to load (default: a small
//!                          public Qwen3 GGUF repo).
//!   NEURON_TEST_QUANT    — quant substring matched against GGUF
//!                          filenames (default: "Q4_K_M").
//!   HF_HOME              — HuggingFace cache directory.

#![cfg(feature = "cuda-integration")]

use cortex_core::harness::{HarnessConfig, ModelSpec};
use neuron::config::HarnessSettings;
use neuron::harness::HarnessRegistry;
use std::path::PathBuf;

#[tokio::test]
async fn test_candle_qwen3_load_unload_lifecycle() {
    let _ = tracing_subscriber::fmt()
        .with_test_writer()
        .with_env_filter("info,neuron=debug")
        .try_init();

    let model_id = std::env::var("NEURON_TEST_MODEL_ID")
        .unwrap_or_else(|_| "Qwen/Qwen3-0.6B-GGUF".to_string());
    let quant = std::env::var("NEURON_TEST_QUANT").unwrap_or_else(|_| "Q4_K_M".to_string());

    let mut settings = HarnessSettings::default();
    if let Ok(home) = std::env::var("HF_HOME") {
        settings.candle.hf_cache = Some(PathBuf::from(home));
    }

    let registry = HarnessRegistry::from_configs(
        &[HarnessConfig {
            name: "candle".into(),
        }],
        "http://localhost:13131",
        &settings,
    );

    let spec = ModelSpec {
        model_id: model_id.clone(),
        harness: "candle".into(),
        quant: Some(quant),
        tensor_parallel: None,
        devices: Some(vec![0]),
    };

    registry
        .load_model(&spec)
        .await
        .expect("load_model should succeed");

    let models = registry.list_all_models().await.expect("list_all_models");
    assert_eq!(models.len(), 1, "expected exactly one loaded model");
    assert_eq!(models[0].id, model_id);
    assert_eq!(models[0].harness, "candle");
    assert_eq!(models[0].status, "loaded");

    let url = registry.inference_endpoint(&model_id).await;
    assert_eq!(url, Some("http://localhost:13131".into()));

    // Re-loading the same model should be rejected.
    let again = registry.load_model(&spec).await;
    assert!(again.is_err(), "second load should error");

    registry
        .unload_model(&model_id)
        .await
        .expect("unload_model should succeed");

    let models = registry.list_all_models().await.expect("list_all_models");
    assert!(models.is_empty(), "registry should be empty after unload");

    // Unloading a model that isn't loaded should error.
    let err = registry.unload_model(&model_id).await;
    assert!(err.is_err(), "unload of missing model should error");
}