feat(neuron): OpenAI-compatible non-streaming chat completion

Stage 3 of the candle-native pivot. neuron now serves
POST /v1/chat/completions backed by candle's quantized_qwen3 forward
pass on a per-model serialised generation loop, returning the standard
OpenAI ChatCompletionResponse envelope.

Pipeline per request:
- Look up the LoadedModel by request.model (404 if absent).
- Apply the Qwen3 chat template across all messages.
- Tokenize, then spawn_blocking onto tokio's blocking pool to acquire
  the per-model arch lock and run prefill + greedy/temperature/top-p
  sampling via LogitsProcessor.
- Stop on <|im_end|>/<|endoftext|> EOS or max_tokens (finish_reason
  "stop" vs "length").
- Decode with skip_special_tokens=true, build OpenAI response with
  prompt/completion/total usage counts.
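
The stop conditions and usage accounting above can be sketched roughly as follows. This is an illustration, not the code in this commit: `generate`, `FinishReason`, `Usage`, and the `next_token` closure are hypothetical names, the closure stands in for the candle forward pass plus LogitsProcessor sampling, and excluding the EOS token from the completion count is an assumption.

```rust
/// Why generation ended; the strings mirror OpenAI's `finish_reason`.
#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
    Length,
}

impl FinishReason {
    fn as_str(&self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
        }
    }
}

/// Token accounting for the response's `usage` object.
#[derive(Debug, PartialEq)]
struct Usage {
    prompt_tokens: usize,
    completion_tokens: usize,
    total_tokens: usize,
}

/// Hypothetical decode loop. `next_token` stands in for the model forward
/// pass plus sampling: it sees the full context and returns one token id.
fn generate(
    prompt: &[u32],
    eos_ids: &[u32],
    max_tokens: usize,
    mut next_token: impl FnMut(&[u32]) -> u32,
) -> (Vec<u32>, FinishReason, Usage) {
    let mut context = prompt.to_vec();
    let mut completion = Vec::new();
    // If the budget runs out before an EOS token appears, the reason is "length".
    let mut reason = FinishReason::Length;

    while completion.len() < max_tokens {
        let token = next_token(&context);
        if eos_ids.contains(&token) {
            // Assumption: the EOS token itself is neither emitted nor counted.
            reason = FinishReason::Stop;
            break;
        }
        context.push(token);
        completion.push(token);
    }

    let usage = Usage {
        prompt_tokens: prompt.len(),
        completion_tokens: completion.len(),
        total_tokens: prompt.len() + completion.len(),
    };
    (completion, reason, usage)
}
```

Passing the sampler in as a closure keeps the stop logic testable without a loaded model; a stub that replays a fixed token script is enough to exercise both finish reasons.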

Supporting changes:
- HarnessRegistry now stores Arc<dyn Harness> and caches a typed
  Arc<CandleHarness> so inference routes bypass dyn-Trait dispatch.
- LoadedModel.arch becomes Arc<Mutex<ModelArch>> so the lock guard
  can be moved into spawn_blocking.
- NeuronState gains an Option<Arc<CandleHarness>> field for the new
  inference route.
- Typed InferenceError lets the handler map ModelNotLoaded → 404 and
  other failures → 500 without string-matching anyhow messages.
- stream=true returns 501 until Stage 4 wires up SSE.
- Two leftover mistral.rs string references in proxy.rs and cortex-cli
  (missed during the Stage 1 sweep) are corrected here.
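
As a minimal sketch of the typed-error mapping described above: only the names InferenceError and ModelNotLoaded come from this commit. The variant payloads, the `Internal` catch-all, and the `status_code` helper are assumptions made for illustration, and the real handler presumably returns a framework status type rather than a bare `u16`.

```rust
use std::fmt;

/// Sketch of a typed inference error. Only `ModelNotLoaded` is named in the
/// commit message; the payloads and the second variant are hypothetical.
#[derive(Debug)]
enum InferenceError {
    ModelNotLoaded(String),
    /// Hypothetical catch-all for tokenizer, sampling, or forward-pass failures.
    Internal(String),
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferenceError::ModelNotLoaded(id) => write!(f, "model not loaded: {id}"),
            InferenceError::Internal(msg) => write!(f, "inference failed: {msg}"),
        }
    }
}

impl std::error::Error for InferenceError {}

/// Map an error to an HTTP status by matching on the variant, not the message.
fn status_code(err: &InferenceError) -> u16 {
    match err {
        InferenceError::ModelNotLoaded(_) => 404,
        InferenceError::Internal(_) => 500,
    }
}
```

Because the match is exhaustive, adding a variant later forces a decision about its status code at compile time, which string-matching on anyhow messages cannot do.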

Three new default-feature tests cover the no-candle 503,
model-not-loaded 404, and stream=true 501 paths. The cuda-integration
test from Stage 2 still covers real load/unload; a test gated on the
streaming feature that exercises actual generation will arrive with
Stage 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rob thijssen

2026-05-18 16:47:58 +03:00

parent 5c2bd1a1da

commit 729317d1ef

10 changed files with 412 additions and 22 deletions

									

crates/neuron/tests/candle_lifecycle.rs
									
												
@@ -60,10 +60,7 @@ async fn test_candle_qwen3_load_unload_lifecycle() {
         .await
         .expect("load_model should succeed");
-    let models = registry
-        .list_all_models()
-        .await
-        .expect("list_all_models");
+    let models = registry.list_all_models().await.expect("list_all_models");
     assert_eq!(models.len(), 1, "expected exactly one loaded model");
     assert_eq!(models[0].id, model_id);
     assert_eq!(models[0].harness, "candle");
