refactor(neuron): introduce InferenceEvent + wire projection layer

Step 1 of the OpenAI Responses API rollout. Pure refactor: no new
endpoints, no behaviour change on the wire. Lays the seam for emitting
Responses-shaped streaming events from the same harness output as chat
completions in Step 2.

- New `neuron::wire` module tree:
  - `wire::event::InferenceEvent`: format-agnostic enum (Start,
    TextDelta, ReasoningDelta, Finish) the candle harness now emits as
    its native streaming currency.
  - `wire::event::FinishReason`: typed reason that maps cleanly onto
    OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic
    `stop_reason` strings.
  - `wire::openai_chat::project_chat_stream`: async task that consumes
    an InferenceEvent receiver and produces a ChatCompletionChunk
    receiver, stamping per-request metadata (id, created, model_id)
    onto every chunk. Output matches the pre-refactor wire shape
    bit-for-bit.
- candle.rs refactored to emit InferenceEvent on its internal channel
  through all three streaming paths (CPU run_inference_streaming, CUDA
  single-GPU stream_inference_via_worker, CUDA TP
  chat_completion_tp_stream). The streaming functions lost their
  id/created/model_id parameters since wire-format metadata now lives
  in the projector.
- emit_delta + emit_delta_blocking simplified to single-purpose
  TextDelta emitters with no wire-format coupling.
- chat_completion_stream wraps the InferenceEvent receiver in
  wire_chat::project_chat_stream before returning so the
  /v1/chat/completions HTTP handler keeps consuming
  ChatCompletionChunks unchanged. External signature preserved.

Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated onto
a single crate-wide path_util::ENV_LOCK.

122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:30:17 +03:00
parent df0abfe4d4
commit 302ccfb982
7 changed files with 491 additions and 194 deletions
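
The commit message says `FinishReason` maps onto three wire vocabularies, but the diff only ships `as_openai_str`. A minimal sketch of the other two mappings follows, using the values from the doc-comment table in wire/event.rs. The enum is re-declared here so the snippet stands alone, and `as_responses_status` and `as_anthropic_str` are hypothetical names that are not part of this commit.

```rust
/// Stand-alone copy of the enum added in wire/event.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

impl FinishReason {
    /// Same mapping as `as_openai_str` in the diff.
    pub fn as_openai_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
            FinishReason::ToolCalls => "tool_calls",
        }
    }

    /// Hypothetical helper: OpenAI Responses `status`, following the
    /// table in the doc comment. `Stop` and `ToolCalls` both count as
    /// a completed response; only `Length` is incomplete.
    pub fn as_responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop | FinishReason::ToolCalls => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    /// Hypothetical helper: Anthropic `stop_reason`, following the
    /// same table.
    pub fn as_anthropic_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
            FinishReason::ToolCalls => "tool_use",
        }
    }
}
```

Keeping each vocabulary in its own exhaustive `match` means a new `FinishReason` variant fails to compile until every wire format has decided how to render it.
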
--- a/crates/neuron/src/wire/event.rs
+++ b/crates/neuron/src/wire/event.rs
@@ -0,0 +1,99 @@
+//! Format-agnostic inference event stream.
+//!
+//! The candle harness emits a sequence of these for every streaming
+//! request. Wire-format projections in sibling modules
+//! ([`super::openai_chat`], the eventual `openai_responses` /
+//! `anthropic_messages` projections) read this stream and produce
+//! the chunks / events their HTTP clients expect.
+//!
+//! Design notes:
+//!
+//! - [`Start`] carries no token of its own. It only signals "the
+//!   model has accepted the prompt and is about to begin emitting
+//!   text". OpenAI chat materialises this as a `role: assistant`
+//!   chunk; OpenAI Responses as the `response.created` +
+//!   `response.output_item.added` pair; Anthropic as
+//!   `message_start`. All three of those would otherwise have to
+//!   peek at the *first* token to know when to emit, which couples
+//!   the wire layer to the producer's pacing.
+//! - [`TextDelta`] is *visible* output. Reasoning / `<think>`
+//!   blocks are reserved for [`InferenceEvent::ReasoningDelta`],
+//!   which the harness will emit once it learns to split them
+//!   (today they pass through as plain text inside `TextDelta`;
+//!   helexa-acp picks them apart on the consumer side).
+//! - [`Finish`] is the only place a stream is allowed to end
+//!   cleanly. Projections rely on this to emit final usage
+//!   bookkeeping; absence means the producer crashed and the
+//!   consumer should treat the stream as truncated.
+//!
+//! [`Start`]: InferenceEvent::Start
+//! [`TextDelta`]: InferenceEvent::TextDelta
+//! [`Finish`]: InferenceEvent::Finish
+
+/// One unit of output from the inference loop.
+///
+/// Producers send these on an `mpsc::Sender<InferenceEvent>`;
+/// projection layers in sibling modules consume them and emit
+/// wire-format-specific frames downstream.
+#[derive(Debug, Clone)]
+pub enum InferenceEvent {
+    /// The producer has accepted the prompt and is about to emit
+    /// the first token. Sent at most once per stream.
+    Start,
+    /// A piece of visible assistant text. Multiple deltas
+    /// concatenate into the complete reply.
+    TextDelta(String),
+    /// Reasoning / scratchpad text the model emitted inside a
+    /// `<think>` block (or equivalent). Producers that don't
+    /// surface reasoning separately use [`InferenceEvent::TextDelta`]
+    /// for everything; the future split lives here.
+    ///
+    /// Not yet emitted by the candle harness — present so future
+    /// stages (qwen3 `<think>` routing, OpenAI o-series reasoning)
+    /// have a typed home without breaking the existing
+    /// projections.
+    #[allow(dead_code)]
+    ReasoningDelta(String),
+    /// The stream is complete. Carries the reason so wire formats
+    /// that use it (OpenAI's `finish_reason`, Anthropic's
+    /// `stop_reason`) can render it without re-parsing.
+    Finish { reason: FinishReason },
+}
+
+/// Why a stream stopped. Stays small on purpose — anything that
+/// doesn't map cleanly to one of these collapses to [`Stop`].
+///
+/// Mappings to wire formats:
+///
+/// | variant     | OpenAI `finish_reason` | OpenAI Responses `status` | Anthropic `stop_reason` |
+/// |-------------|------------------------|---------------------------|-------------------------|
+/// | `Stop`      | `"stop"`               | `"completed"`             | `"end_turn"`            |
+/// | `Length`    | `"length"`             | `"incomplete"`            | `"max_tokens"`          |
+/// | `ToolCalls` | `"tool_calls"`         | `"completed"`             | `"tool_use"`            |
+///
+/// [`Stop`]: FinishReason::Stop
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum FinishReason {
+    /// Model emitted EOS naturally.
+    Stop,
+    /// Hit `max_tokens` before EOS.
+    Length,
+    /// Stopped because the model called a tool and is waiting for
+    /// the result. Not yet emitted by the candle harness —
+    /// reserved for the day tool-call extraction lands.
+    #[allow(dead_code)]
+    ToolCalls,
+}
+
+impl FinishReason {
+    /// String form used by OpenAI chat completions and OpenAI
+    /// completions. Wire modules can call this directly or do their
+    /// own mapping for non-string formats.
+    pub fn as_openai_str(self) -> &'static str {
+        match self {
+            FinishReason::Stop => "stop",
+            FinishReason::Length => "length",
+            FinishReason::ToolCalls => "tool_calls",
+        }
+    }
+}
--- a/crates/neuron/src/wire/mod.rs
+++ b/crates/neuron/src/wire/mod.rs
@@ -0,0 +1,23 @@
+//! Wire-format projection layer.
+//!
+//! The candle harness produces a single, format-agnostic stream of
+//! [`InferenceEvent`]s. Each wire format (OpenAI chat completions,
+//! OpenAI Responses, Anthropic messages, …) lives in its own module
+//! under `wire::` and projects that event stream into the chunks /
+//! events its HTTP clients expect.
+//!
+//! The benefit over translating *between* wire shapes (OpenAI chat
+//! → Anthropic, etc.) is that we never have to reason about a
+//! wire-N → wire-M conversion: every translation is wire-N ↔ the
+//! internal event currency, and the projections are independent. A
+//! new wire format adds a new file under `wire::`; nothing else
+//! needs to know about it.
+//!
+//! Today: [`openai_chat`]. Step 2 adds `openai_responses`. Step 3
+//! could add a native Anthropic projection that replaces the
+//! gateway-side translation.
+
+pub mod event;
+pub mod openai_chat;
+
+pub use event::{FinishReason, InferenceEvent};
--- a/crates/neuron/src/wire/openai_chat.rs
+++ b/crates/neuron/src/wire/openai_chat.rs
@@ -0,0 +1,241 @@
+//! OpenAI chat completions projection.
+//!
+//! Reads [`InferenceEvent`]s from a receiver and produces
+//! [`ChatCompletionChunk`]s in the shape `POST /v1/chat/completions`
+//! clients expect on its streaming SSE response. The HTTP handler in
+//! [`crate::api`] wraps the resulting receiver in axum's
+//! `Sse::new(...)` adapter; nothing in this module touches HTTP
+//! framing or `data:` lines.
+//!
+//! Per the OpenAI streaming spec, three chunk shapes appear:
+//!
+//! 1. **Role chunk** — `delta: { "role": "assistant" }`, no content,
+//!    sent once at stream start. We emit this on [`InferenceEvent::Start`].
+//! 2. **Content chunks** — `delta: { "content": "<text>" }`, one per
+//!    [`InferenceEvent::TextDelta`].
+//! 3. **Final chunk** — empty `delta`, `finish_reason` populated.
+//!    Emitted on [`InferenceEvent::Finish`].
+//!
+//! `usage` stays `None` on every chunk; the legacy candle paths
+//! never surfaced usage on the streaming endpoint and we keep that
+//! behaviour bit-for-bit so existing clients see no diff.
+//!
+//! Back-pressure: the projection task awaits both `rx.recv()` and
+//! `tx.send()`. A slow consumer fills the output channel → the
+//! task blocks on send → it stops reading from the input → the
+//! producer blocks on its own send. The bounded channels
+//! propagate back-pressure without us writing any logic.
+
+use cortex_core::openai::{ChatCompletionChunk, ChunkChoice};
+use serde_json::json;
+use tokio::sync::mpsc;
+
+use super::event::{FinishReason, InferenceEvent};
+
+/// Output channel buffer size. Mirrors the input side's bound; one
+/// event maps to at most one chunk, so equal capacity keeps the
+/// two ends in sync without surprising memory growth.
+const CHUNK_CHANNEL_CAPACITY: usize = 32;
+
+/// Project an [`InferenceEvent`] receiver into a
+/// [`ChatCompletionChunk`] receiver. Spawns one tokio task that
+/// owns the input receiver for the stream's lifetime and exits
+/// when either side closes.
+///
+/// `id`, `created`, and `model_id` are stamped into every emitted
+/// chunk so the producer can stay generic (decoupled from
+/// per-request metadata).
+pub fn project_chat_stream(
+    mut rx: mpsc::Receiver<InferenceEvent>,
+    id: String,
+    created: u64,
+    model_id: String,
+) -> mpsc::Receiver<ChatCompletionChunk> {
+    let (tx, out_rx) = mpsc::channel::<ChatCompletionChunk>(CHUNK_CHANNEL_CAPACITY);
+
+    tokio::spawn(async move {
+        while let Some(event) = rx.recv().await {
+            let chunks = match event {
+                InferenceEvent::Start => vec![role_chunk(&id, created, &model_id)],
+                InferenceEvent::TextDelta(text) => {
+                    if text.is_empty() {
+                        // DecodeStream is buffering a multi-byte
+                        // codepoint; don't bother sending an empty
+                        // chunk downstream.
+                        continue;
+                    }
+                    vec![content_chunk(&id, created, &model_id, &text)]
+                }
+                InferenceEvent::ReasoningDelta(_) => {
+                    // Reasoning isn't representable in OpenAI chat
+                    // streaming today. The o-series uses a separate
+                    // `summary` event but it's gated by the
+                    // Responses API; chat-completions just drops it.
+                    continue;
+                }
+                InferenceEvent::Finish { reason } => {
+                    vec![final_chunk(&id, created, &model_id, reason)]
+                }
+            };
+            for chunk in chunks {
+                if tx.send(chunk).await.is_err() {
+                    // Consumer hung up; nothing more to do.
+                    return;
+                }
+            }
+        }
+    });
+
+    out_rx
+}
+
+fn role_chunk(id: &str, created: u64, model_id: &str) -> ChatCompletionChunk {
+    ChatCompletionChunk {
+        id: id.into(),
+        object: "chat.completion.chunk".into(),
+        created,
+        model: model_id.into(),
+        choices: vec![ChunkChoice {
+            index: 0,
+            delta: json!({ "role": "assistant" }),
+            finish_reason: None,
+            extra: serde_json::Value::Object(Default::default()),
+        }],
+        usage: None,
+        extra: serde_json::Value::Object(Default::default()),
+    }
+}
+
+fn content_chunk(id: &str, created: u64, model_id: &str, text: &str) -> ChatCompletionChunk {
+    ChatCompletionChunk {
+        id: id.into(),
+        object: "chat.completion.chunk".into(),
+        created,
+        model: model_id.into(),
+        choices: vec![ChunkChoice {
+            index: 0,
+            delta: json!({ "content": text }),
+            finish_reason: None,
+            extra: serde_json::Value::Object(Default::default()),
+        }],
+        usage: None,
+        extra: serde_json::Value::Object(Default::default()),
+    }
+}
+
+fn final_chunk(
+    id: &str,
+    created: u64,
+    model_id: &str,
+    reason: FinishReason,
+) -> ChatCompletionChunk {
+    ChatCompletionChunk {
+        id: id.into(),
+        object: "chat.completion.chunk".into(),
+        created,
+        model: model_id.into(),
+        choices: vec![ChunkChoice {
+            index: 0,
+            delta: serde_json::Value::Object(Default::default()),
+            finish_reason: Some(reason.as_openai_str().to_string()),
+            extra: serde_json::Value::Object(Default::default()),
+        }],
+        usage: None,
+        extra: serde_json::Value::Object(Default::default()),
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    /// Drain the projection's output into a Vec for assertion.
+    async fn collect(mut rx: mpsc::Receiver<ChatCompletionChunk>) -> Vec<ChatCompletionChunk> {
+        let mut out = Vec::new();
+        while let Some(chunk) = rx.recv().await {
+            out.push(chunk);
+        }
+        out
+    }
+
+    #[tokio::test]
+    async fn empty_event_stream_yields_no_chunks() {
+        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
+        drop(tx);
+        let out = collect(project_chat_stream(rx, "id-1".into(), 1700, "m".into())).await;
+        assert!(out.is_empty());
+    }
+
+    #[tokio::test]
+    async fn start_text_finish_produces_three_chunks() {
+        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
+        let out_rx = project_chat_stream(rx, "id-1".into(), 1700, "m".into());
+
+        tx.send(InferenceEvent::Start).await.unwrap();
+        tx.send(InferenceEvent::TextDelta("hello".into()))
+            .await
+            .unwrap();
+        tx.send(InferenceEvent::Finish {
+            reason: FinishReason::Stop,
+        })
+        .await
+        .unwrap();
+        drop(tx);
+
+        let out = collect(out_rx).await;
+        assert_eq!(out.len(), 3);
+        assert_eq!(out[0].choices[0].delta["role"], "assistant");
+        assert_eq!(out[1].choices[0].delta["content"], "hello");
+        assert_eq!(out[2].choices[0].finish_reason.as_deref(), Some("stop"));
+        // Every chunk carries the stamped metadata.
+        for chunk in &out {
+            assert_eq!(chunk.id, "id-1");
+            assert_eq!(chunk.created, 1700);
+            assert_eq!(chunk.model, "m");
+            assert_eq!(chunk.object, "chat.completion.chunk");
+        }
+    }
+
+    #[tokio::test]
+    async fn empty_text_delta_is_dropped() {
+        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
+        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
+        tx.send(InferenceEvent::TextDelta(String::new()))
+            .await
+            .unwrap();
+        drop(tx);
+        let out = collect(out_rx).await;
+        assert!(out.is_empty(), "empty deltas must not produce chunks");
+    }
+
+    #[tokio::test]
+    async fn finish_length_maps_to_openai_string() {
+        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
+        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
+        tx.send(InferenceEvent::Finish {
+            reason: FinishReason::Length,
+        })
+        .await
+        .unwrap();
+        drop(tx);
+        let out = collect(out_rx).await;
+        assert_eq!(out.len(), 1);
+        assert_eq!(out[0].choices[0].finish_reason.as_deref(), Some("length"));
+    }
+
+    #[tokio::test]
+    async fn reasoning_delta_is_dropped_in_chat_projection() {
+        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
+        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
+        tx.send(InferenceEvent::ReasoningDelta("<think>".into()))
+            .await
+            .unwrap();
+        tx.send(InferenceEvent::TextDelta("real".into()))
+            .await
+            .unwrap();
+        drop(tx);
+        let out = collect(out_rx).await;
+        assert_eq!(out.len(), 1);
+        assert_eq!(out[0].choices[0].delta["content"], "real");
+    }
+}
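
The projection pattern in wire/openai_chat.rs above can be tried without tokio or cortex_core. The sketch below is a simplified stand-in, not the code from this commit: it swaps in `std::sync::mpsc::sync_channel` and an OS thread for the tokio task, and `Event`, `Chunk`, `project`, and `run` are hypothetical names invented for illustration. It keeps the same rules: one chunk per event at most, empty text deltas and reasoning deltas dropped, metadata stamped by the projector.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Simplified stand-in for `InferenceEvent` (assumption, not the real type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start,
    TextDelta(String),
    ReasoningDelta(String),
    Finish(&'static str),
}

/// Simplified stand-in for `ChatCompletionChunk` (assumption).
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    id: String,
    role: Option<&'static str>,
    content: Option<String>,
    finish_reason: Option<&'static str>,
}

/// Project an event receiver into a chunk receiver. Both channels are
/// bounded, so a slow consumer eventually blocks the producer.
fn project(rx: Receiver<Event>, id: String) -> Receiver<Chunk> {
    let (tx, out_rx) = sync_channel::<Chunk>(32);
    thread::spawn(move || {
        // Iteration ends once every sender on the input side is dropped.
        for event in rx {
            let blank = Chunk { id: id.clone(), role: None, content: None, finish_reason: None };
            let chunk = match event {
                Event::Start => Chunk { role: Some("assistant"), ..blank },
                // Empty deltas and reasoning text produce no chunk.
                Event::TextDelta(text) if text.is_empty() => continue,
                Event::ReasoningDelta(_) => continue,
                Event::TextDelta(text) => Chunk { content: Some(text), ..blank },
                Event::Finish(reason) => Chunk { finish_reason: Some(reason), ..blank },
            };
            if tx.send(chunk).is_err() {
                // Consumer hung up; stop projecting.
                return;
            }
        }
    });
    out_rx
}

/// Feed `events` from a separate thread and collect every projected chunk.
fn run(events: Vec<Event>) -> Vec<Chunk> {
    let (tx, rx) = sync_channel::<Event>(32);
    let out_rx = project(rx, "id-1".to_string());
    thread::spawn(move || {
        for event in events {
            if tx.send(event).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the input side.
    });
    out_rx.into_iter().collect()
}
```

Calling `run` with a Start, an empty TextDelta, a ReasoningDelta, a TextDelta, and a Finish yields a role chunk, one content chunk, and a final chunk, mirroring what the tokio tests in the diff assert.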