feat(neuron): OpenAI-compatible SSE streaming chat completions

Stage 4 of the candle-native pivot. /v1/chat/completions now switches
to text/event-stream when the request sets stream: true, emitting one
chat.completion.chunk per generated token followed by the OpenAI
[DONE] terminator.

Pipeline:
- chat_completion_stream creates a bounded
  mpsc::channel::<ChatCompletionChunk>(32), sends the leading role
  chunk, then spawns a blocking task that acquires the per-model arch
  lock and runs the streaming generation loop.
- run_inference_streaming tracks a cumulative decoded prefix so each
  chunk's delta.content is the substring added since the last chunk.
  This stays safe across BPE byte-fallback boundaries that would
  otherwise split multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client
  disconnected), so generation stops when the SSE consumer hangs up.
- The final chunk carries finish_reason ("stop" on EOS, "length" on
  max_tokens). The handler appends data: [DONE] after the channel
  closes.
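The prefix-delta and abort-on-disconnect steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in this commit: `decode`, `next_delta`, `stream_deltas`, and `collect_deltas` are hypothetical names, raw bytes stand in for token ids, `std::sync::mpsc::sync_channel` stands in for the bounded tokio channel and `blocking_send`, and holding back a trailing U+FFFD is an assumed way of detecting an incomplete multi-byte character.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

/// Hypothetical stand-in for the tokenizer: each "token" is one raw byte,
/// decoded lossily the way byte-fallback pieces would be.
fn decode(tokens: &[u8]) -> String {
    String::from_utf8_lossy(tokens).into_owned()
}

/// Returns the text added since `emitted`, or None when nothing new is
/// ready. Assumption: a trailing U+FFFD means the last character is still
/// incomplete, so it is held back until more tokens arrive. (Text that
/// legitimately ends in U+FFFD would also be held back by this sketch.)
fn next_delta(tokens: &[u8], emitted: &mut String) -> Option<String> {
    let full = decode(tokens);
    if full.ends_with('\u{FFFD}') {
        return None;
    }
    let delta = full.strip_prefix(emitted.as_str())?.to_string();
    if delta.is_empty() {
        return None;
    }
    *emitted = full;
    Some(delta)
}

/// Generation loop: emits one delta per step and stops as soon as the
/// receiver is gone, mirroring the abort when a send to the client fails.
/// Returns how many deltas were delivered.
fn stream_deltas(tokens: &[u8], tx: SyncSender<String>) -> usize {
    let mut emitted = String::new();
    let mut sent = 0;
    for n in 1..=tokens.len() {
        if let Some(delta) = next_delta(&tokens[..n], &mut emitted) {
            if tx.send(delta).is_err() {
                break; // consumer hung up: stop generating
            }
            sent += 1;
        }
    }
    sent
}

/// Runs the loop on a worker thread behind a bounded channel of 32 and
/// collects everything the consumer side receives.
fn collect_deltas(tokens: &[u8]) -> Vec<String> {
    let (tx, rx) = sync_channel(32);
    let owned = tokens.to_vec();
    let worker = thread::spawn(move || stream_deltas(&owned, tx));
    let out: Vec<String> = rx.iter().collect();
    worker.join().unwrap();
    out
}
```

Feeding the bytes of "hé!" through `collect_deltas` yields three deltas, with the two bytes of "é" delivered together rather than split across chunks.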

The Stage 3 streaming 501 placeholder test is repurposed: with the
streaming path live, an unloaded model now hits the same 404 surface
as the non-streaming path (the model lookup happens first).

cortex-gateway's existing proxy is unchanged: it already forwards
SSE bytes verbatim from Phase 2 work, so the candle SSE format passes
through unmodified.
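For reference, the wire framing that the gateway forwards verbatim can be sketched as follows. This is an illustration of standard SSE framing with the [DONE] terminator described above, not the handler from this commit: `sse_event`, `sse_body`, and `read_payloads` are hypothetical helpers, and the JSON chunks are treated as opaque strings.

```rust
/// Frames one payload as a single SSE event: a `data:` line followed by
/// a blank line.
fn sse_event(payload: &str) -> String {
    format!("data: {payload}\n\n")
}

/// Builds a full response body: one event per serialized chunk, then the
/// `[DONE]` terminator once the chunk source is exhausted.
fn sse_body(json_chunks: &[&str]) -> String {
    let mut body: String = json_chunks.iter().map(|c| sse_event(c)).collect();
    body.push_str(&sse_event("[DONE]"));
    body
}

/// Client-side view: returns the payloads seen before `[DONE]`.
/// Only handles the single-line `data:` events this sketch produces.
fn read_payloads(body: &str) -> Vec<&str> {
    body.split("\n\n")
        .filter_map(|event| event.strip_prefix("data: "))
        .take_while(|payload| *payload != "[DONE]")
        .collect()
}
```

Because each event is self-delimiting, a proxy that copies bytes through without buffering whole responses does not need to understand the payloads.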

Neuron Cargo.toml gains futures + tokio-stream (both already in
workspace deps) for ReceiverStream and stream combinators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -273,10 +273,11 @@ async fn test_chat_completions_model_not_loaded() {
     assert_eq!(resp.status(), 404);
 }
 
-/// `/v1/chat/completions` with `stream: true` returns 501 until Stage 4
-/// wires up SSE.
+/// `/v1/chat/completions` with `stream: true` returns 404 when the
+/// model isn't loaded — same surface as the non-streaming path. The
+/// streaming code only kicks in once the model lookup succeeds.
 #[tokio::test]
-async fn test_chat_completions_streaming_not_yet_implemented() {
+async fn test_chat_completions_streaming_model_not_loaded() {
     use cortex_core::harness::HarnessConfig;
     use neuron::config::HarnessSettings;
 
@@ -306,12 +307,12 @@ async fn test_chat_completions_streaming_not_yet_implemented() {
     let resp = reqwest::Client::new()
         .post(format!("{url}/v1/chat/completions"))
         .json(&json!({
-            "model": "anything",
+            "model": "definitely/not-loaded",
             "messages": [{"role": "user", "content": "hi"}],
             "stream": true
         }))
         .send()
         .await
         .unwrap();
-    assert_eq!(resp.status(), 501);
+    assert_eq!(resp.status(), 404);
 }