feat(neuron): OpenAI-compatible SSE streaming chat completions

Stage 4 of the candle-native pivot. /v1/chat/completions now switches to text/event-stream when the request sets stream: true, emitting one chat.completion.chunk per generated token followed by the OpenAI [DONE] terminator.

Pipeline:

- chat_completion_stream creates a bounded mpsc::channel<ChatCompletionChunk>(32), sends the leading role chunk, then spawns a blocking task that acquires the per-model arch lock and runs the streaming generation loop.
- run_inference_streaming tracks a cumulative decoded prefix so each chunk's delta.content is the substring added since the last chunk. This is safe across BPE byte-fallback boundaries that would otherwise split multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client disconnected), so generation stops when the SSE consumer hangs up.
- Final chunk carries finish_reason ("stop" on EOS, "length" on max_tokens). The handler appends data: [DONE] after the channel closes.

The Stage 3 streaming 501 placeholder test is repurposed: with the streaming path live, an unloaded model now hits the same 404 surface as the non-streaming path (the model lookup happens first).

cortex-gateway's existing proxy is unchanged: it already forwards SSE bytes verbatim from Phase 2 work, so the candle SSE format passes through unmodified.

Neuron Cargo.toml gains futures + tokio-stream (both already in workspace deps) for ReceiverStream and stream combinators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
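A minimal sketch of the cumulative-prefix idea, for readers who want to see why it avoids emitting half a character. This is not the code from run_inference_streaming: DeltaTracker and push are hypothetical names, and raw bytes stand in for tokenizer output (the real loop presumably re-decodes token ids instead).

```rust
/// Hypothetical stand-in for the cumulative-prefix bookkeeping.
/// Assumption: tokens arrive as raw bytes rather than token ids.
struct DeltaTracker {
    /// Every byte decoded so far (the cumulative prefix).
    decoded: Vec<u8>,
    /// How many bytes of `decoded` were already emitted as deltas.
    emitted: usize,
}

impl DeltaTracker {
    fn new() -> Self {
        Self { decoded: Vec::new(), emitted: 0 }
    }

    /// Append one token's bytes and return the newly complete text, if any.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.decoded.extend_from_slice(token_bytes);
        let pending = &self.decoded[self.emitted..];

        // Length of the longest valid UTF-8 prefix of the pending bytes.
        // A trailing partial character is left for the next call.
        let valid_len = match std::str::from_utf8(pending) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        if valid_len == 0 {
            return None;
        }

        let delta = std::str::from_utf8(&pending[..valid_len])
            .expect("prefix was validated above")
            .to_owned();
        self.emitted += valid_len;
        Some(delta)
    }
}
```

A token that ends in the middle of a multi-byte character yields None, and the character is emitted whole once its remaining bytes arrive. The sketch only handles incomplete trailing sequences; bytes that can never form valid UTF-8 would stall it, so a real implementation needs a fallback for that case.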
2026-05-18 17:53:14 +03:00
parent 249c9442e8
commit 84f5662df1
5 changed files with 282 additions and 29 deletions
--- a/crates/neuron/tests/api.rs
+++ b/crates/neuron/tests/api.rs
@@ -273,10 +273,11 @@ async fn test_chat_completions_model_not_loaded() {
    assert_eq!(resp.status(), 404);
 }

-/// `/v1/chat/completions` with `stream: true` returns 501 until Stage 4
-/// wires up SSE.
+/// `/v1/chat/completions` with `stream: true` returns 404 when the
+/// model isn't loaded — same surface as the non-streaming path. The
+/// streaming code only kicks in once the model lookup succeeds.
 #[tokio::test]
-async fn test_chat_completions_streaming_not_yet_implemented() {
+async fn test_chat_completions_streaming_model_not_loaded() {
    use cortex_core::harness::HarnessConfig;
    use neuron::config::HarnessSettings;

@@ -306,12 +307,12 @@ async fn test_chat_completions_streaming_not_yet_implemented() {
    let resp = reqwest::Client::new()
        .post(format!("{url}/v1/chat/completions"))
        .json(&json!({
-            "model": "anything",
+            "model": "definitely/not-loaded",
            "messages": [{"role": "user", "content": "hi"}],
            "stream": true
        }))
        .send()
        .await
        .unwrap();
-    assert_eq!(resp.status(), 501);
+    assert_eq!(resp.status(), 404);
 }
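
The ordering this test pins down (model lookup first, streaming branch second) might look roughly like the sketch below. The handler itself is not part of this hunk, so route, Outcome, and the HashSet registry are hypothetical names used only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical summary of what the endpoint answers with.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// HTTP 404: the requested model is not loaded.
    NotFound,
    /// A single JSON body (non-streaming path).
    Json,
    /// A text/event-stream response (streaming path).
    EventStream,
}

/// Assumption: the set of loaded models can be modelled as a set of names.
fn route(loaded: &HashSet<String>, model: &str, stream: bool) -> Outcome {
    // The lookup runs before the `stream` flag is consulted, so both
    // paths share the same 404 surface.
    if !loaded.contains(model) {
        return Outcome::NotFound;
    }
    if stream {
        Outcome::EventStream
    } else {
        Outcome::Json
    }
}
```

With this ordering, a request for "definitely/not-loaded" maps to NotFound whether or not stream is set, which matches the 404 assertion above.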