feat(neuron): OpenAI-compatible SSE streaming chat completions

Stage 4 of the candle-native pivot. /v1/chat/completions now switches
to text/event-stream when the request sets stream: true, emitting one
chat.completion.chunk per generated token followed by the OpenAI
[DONE] terminator.
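
For illustration, the wire format described above can be sketched as
follows. This is a minimal std-only sketch, not the handler's actual
code: sse_data_frame, sse_body and SSE_DONE are hypothetical names, and
each chunk is assumed to be serialized to JSON already.

```rust
/// Terminator the OpenAI streaming format sends after the last chunk.
const SSE_DONE: &str = "data: [DONE]\n\n";

/// Frame one already-serialized JSON payload as a single SSE message.
/// An SSE message is one or more `data:` lines ended by a blank line.
fn sse_data_frame(json_payload: &str) -> String {
    format!("data: {json_payload}\n\n")
}

/// Build a complete response body: one frame per chunk, then [DONE].
/// (Hypothetical helper; a real handler would stream frames one by one.)
fn sse_body<I, S>(chunks: I) -> String
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut body = String::new();
    for chunk in chunks {
        body.push_str(&sse_data_frame(chunk.as_ref()));
    }
    body.push_str(SSE_DONE);
    body
}
```

A client reads frames until it sees the literal [DONE] payload, which
is not JSON and must be checked for before parsing.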

Pipeline:
- chat_completion_stream creates a bounded mpsc::channel::<ChatCompletionChunk>(32),
  sends the leading role chunk, then spawns a blocking task that
  acquires the per-model arch lock and runs the streaming generation
  loop.
- run_inference_streaming tracks a cumulative decoded prefix so each
  chunk's delta.content is the substring added since the last chunk.
  This stays safe across BPE byte-fallback boundaries that would
  otherwise split multi-byte UTF-8 chars.
- The blocking task aborts cleanly if blocking_send fails (client
  disconnected), so generation stops when the SSE consumer hangs up.
- Final chunk carries finish_reason ("stop" on EOS, "length" on
  max_tokens). The handler appends data: [DONE] after the channel
  closes.
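
The pipeline above can be sketched with the standard library alone.
This is an assumption-laden illustration, not the code in this commit:
std::sync::mpsc::sync_channel and a plain thread stand in for the tokio
bounded channel and blocking task, raw byte slices stand in for
tokenizer output, and DeltaTracker, decodable_prefix and stream_deltas
are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Longest prefix of `bytes` that is valid UTF-8. A trailing partial
/// multi-byte sequence is held back until its remaining bytes arrive.
/// (Sketch only: a truly invalid byte would stall output here.)
fn decodable_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap_or(""),
    }
}

/// Tracks everything generated so far plus how much was already emitted.
struct DeltaTracker {
    bytes: Vec<u8>,
    emitted: usize,
}

impl DeltaTracker {
    fn new() -> Self {
        DeltaTracker { bytes: Vec::new(), emitted: 0 }
    }

    /// Append one token's bytes and return the newly decodable text, if any.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.bytes.extend_from_slice(token_bytes);
        let decoded = decodable_prefix(&self.bytes);
        if decoded.len() > self.emitted {
            // `emitted` is the length of an earlier valid prefix, so it
            // always falls on a char boundary of `decoded`.
            let delta = decoded[self.emitted..].to_string();
            self.emitted = decoded.len();
            Some(delta)
        } else {
            None
        }
    }
}

/// Run the "generation loop" on a worker thread, sending deltas through a
/// bounded channel. Returns the receiver and a handle that yields the
/// number of tokens processed before finishing or aborting.
fn stream_deltas(
    tokens: Vec<Vec<u8>>,
    capacity: usize,
) -> (Receiver<String>, thread::JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<String>(capacity);
    let handle = thread::spawn(move || {
        let mut tracker = DeltaTracker::new();
        let mut processed = 0;
        for token in &tokens {
            processed += 1;
            if let Some(delta) = tracker.push(token) {
                // send fails once the receiver is dropped: stop generating.
                if tx.send(delta).is_err() {
                    break;
                }
            }
        }
        processed
    });
    (rx, handle)
}
```

In this sketch a character split across two tokens produces no delta
for the first token and the whole character for the second, and
dropping the receiver makes the worker stop after at most a few more
tokens because the bounded channel applies backpressure.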

The Stage 3 streaming 501 placeholder test is repurposed: with the
streaming path live, an unloaded model now hits the same 404 surface
as the non-streaming path (the model lookup happens first).

cortex-gateway's existing proxy is unchanged: it has forwarded SSE
bytes verbatim since the Phase 2 work, so the candle SSE format passes
through unmodified.

Neuron Cargo.toml gains futures + tokio-stream (both already in
workspace deps) for ReceiverStream and stream combinators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rob thijssen

2026-05-18 17:53:14 +03:00

parent 249c9442e8

commit 84f5662df1

5 changed files with 282 additions and 29 deletions

Cargo.lock (generated): 2 changed lines

@@ -2114,6 +2114,7 @@ dependencies = [
  "clap",
  "cortex-core",
  "figment",
+ "futures",
  "hf-hub",
  "reqwest",
  "serde",
@@ -2121,6 +2122,7 @@ dependencies = [
  "thiserror 2.0.18",
  "tokenizers",
  "tokio",
+ "tokio-stream",
  "toml",
  "tracing",
  "tracing-subscriber",