refactor(neuron): introduce InferenceEvent + wire projection layer

Step 1 of the OpenAI Responses API rollout. Pure refactor: no new
endpoints, no behaviour change on the wire. Lays the seam for emitting
Responses-shaped streaming events from the same harness output as chat
completions in Step 2.

- New `neuron::wire` module tree:
  - `wire::event::InferenceEvent`: format-agnostic enum (Start,
    TextDelta, ReasoningDelta, Finish) the candle harness now emits as
    its native streaming currency.
  - `wire::event::FinishReason`: typed reason that maps cleanly onto
    OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic
    `stop_reason` strings.
  - `wire::openai_chat::project_chat_stream`: async task that consumes
    an InferenceEvent receiver and produces a ChatCompletionChunk
    receiver, stamping per-request metadata (id, created, model_id)
    onto every chunk. Output matches the pre-refactor wire shape
    bit-for-bit.
- candle.rs refactored to emit InferenceEvent on its internal channel
  through all three streaming paths (CPU run_inference_streaming, CUDA
  single-GPU stream_inference_via_worker, CUDA TP
  chat_completion_tp_stream). The streaming functions lost their
  id/created/model_id parameters since wire-format metadata now lives
  in the projector.
- emit_delta + emit_delta_blocking simplified to single-purpose
  TextDelta emitters with no wire-format coupling.
- chat_completion_stream wraps the InferenceEvent receiver in
  wire_chat::project_chat_stream before returning so the
  /v1/chat/completions HTTP handler keeps consuming
  ChatCompletionChunks unchanged. External signature preserved.

Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated onto
a single crate-wide path_util::ENV_LOCK.

122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:30:17 +03:00
parent df0abfe4d4
commit 302ccfb982
7 changed files with 491 additions and 194 deletions
--- a/crates/neuron/src/wire/event.rs
+++ b/crates/neuron/src/wire/event.rs
@@ -0,0 +1,99 @@
+//! Format-agnostic inference event stream.
+//!
+//! The candle harness emits a sequence of these for every streaming
+//! request. Wire-format projections in sibling modules
+//! ([`super::openai_chat`], the eventual `openai_responses` /
+//! `anthropic_messages` projections) read this stream and produce
+//! the chunks / events their HTTP clients expect.
+//!
+//! Design notes:
+//!
+//! - [`Start`] carries no token of its own. It only signals "the
+//!   model has accepted the prompt and is about to begin emitting
+//!   text". OpenAI chat materialises this as a `role: assistant`
+//!   chunk; OpenAI Responses as the `response.created` +
+//!   `response.output_item.added` pair; Anthropic as
+//!   `message_start`. All three of those would otherwise have to
+//!   peek at the *first* token to know when to emit, which couples
+//!   the wire layer to the producer's pacing.
+//! - [`TextDelta`] is *visible* output. Reasoning / `<think>`
+//!   blocks will move to [`InferenceEvent::ReasoningDelta`] once
+//!   the harness learns to split them (today they pass through as
+//!   plain text inside `TextDelta`; helexa-acp picks them apart on
+//!   the consumer side).
+//! - [`Finish`] is the only place a stream is allowed to end
+//!   cleanly. Projections rely on this to emit final usage
+//!   bookkeeping; absence means the producer crashed and the
+//!   consumer should treat the stream as truncated.
+//!
+//! [`Start`]: InferenceEvent::Start
+//! [`TextDelta`]: InferenceEvent::TextDelta
+//! [`Finish`]: InferenceEvent::Finish
+
+/// One unit of output from the inference loop.
+///
+/// Producers send these on an `mpsc::Sender<InferenceEvent>`;
+/// projection layers in sibling modules consume them and emit
+/// wire-format-specific frames downstream.
+#[derive(Debug, Clone)]
+pub enum InferenceEvent {
+    /// The producer has accepted the prompt and is about to emit
+    /// the first token. Sent at most once per stream.
+    Start,
+    /// A piece of visible assistant text. Multiple deltas
+    /// concatenate into the complete reply.
+    TextDelta(String),
+    /// Reasoning / scratchpad text the model emitted inside a
+    /// `<think>` block (or equivalent). Producers that don't surface
+    /// reasoning separately use [`InferenceEvent::TextDelta`] for
+    /// everything; the future split lives here.
+    ///
+    /// Not yet emitted by the candle harness — present so future
+    /// stages (qwen3 `<think>` routing, OpenAI o-series reasoning)
+    /// have a typed home without breaking the existing
+    /// projections.
+    #[allow(dead_code)]
+    ReasoningDelta(String),
+    /// The stream is complete. Carries the reason so wire formats
+    /// that use it (OpenAI's `finish_reason`, Anthropic's
+    /// `stop_reason`) can render it without re-parsing.
+    Finish { reason: FinishReason },
+}
+
+/// Why a stream stopped. Stays small on purpose — anything that
+/// doesn't map cleanly to one of these collapses to [`Stop`].
+///
+/// Mappings to wire formats:
+///
+/// | variant     | OpenAI `finish_reason` | OpenAI Responses `status` | Anthropic `stop_reason` |
+/// |-------------|------------------------|---------------------------|-------------------------|
+/// | `Stop`      | `"stop"`               | `"completed"`             | `"end_turn"`            |
+/// | `Length`    | `"length"`             | `"incomplete"`            | `"max_tokens"`          |
+/// | `ToolCalls` | `"tool_calls"`         | `"completed"`             | `"tool_use"`            |
+///
+/// [`Stop`]: FinishReason::Stop
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum FinishReason {
+    /// Model emitted EOS naturally.
+    Stop,
+    /// Hit `max_tokens` before EOS.
+    Length,
+    /// Stopped because the model called a tool and is waiting for
+    /// the result. Not yet emitted by the candle harness —
+    /// reserved for the day tool-call extraction lands.
+    #[allow(dead_code)]
+    ToolCalls,
+}
+
+impl FinishReason {
+    /// String form used by OpenAI chat completions and OpenAI
+    /// completions. Wire modules can call this directly or do their
+    /// own mapping for non-string formats.
+    pub fn as_openai_str(self) -> &'static str {
+        match self {
+            FinishReason::Stop => "stop",
+            FinishReason::Length => "length",
+            FinishReason::ToolCalls => "tool_calls",
+        }
+    }
+}
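
Reviewer sketch, not part of this commit: a minimal illustration of how a downstream projection might consume these types. It copies the two enums from the diff above and adds three hypothetical items: `as_responses_status` and `as_anthropic_str` (which follow the mapping table in the `FinishReason` docs), and a `drain` consumer with an `Outcome` type that treats a missing `Finish` as a truncated stream, as the module docs describe. It assumes `std::sync::mpsc` to stay dependency-free; the harness channel is presumably an async one, so the real projector would look different.

```rust
use std::sync::mpsc;

// Copied from the diff so the sketch is self-contained.
#[derive(Debug, Clone)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    ReasoningDelta(String),
    Finish { reason: FinishReason },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

impl FinishReason {
    // Same mapping as `as_openai_str` in the diff.
    pub fn as_openai_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
            FinishReason::ToolCalls => "tool_calls",
        }
    }

    // Hypothetical helper: the "OpenAI Responses `status`" column of the table.
    pub fn as_responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop | FinishReason::ToolCalls => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    // Hypothetical helper: the "Anthropic `stop_reason`" column of the table.
    pub fn as_anthropic_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
            FinishReason::ToolCalls => "tool_use",
        }
    }
}

// Hypothetical result type for a fully drained stream.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    // `Finish` arrived: the stream ended cleanly.
    Complete { text: String, reason: FinishReason },
    // The channel closed without `Finish`: the producer went away early.
    Truncated { text: String },
}

// Hypothetical consumer: concatenates visible text until `Finish` or channel close.
pub fn drain(rx: mpsc::Receiver<InferenceEvent>) -> Outcome {
    let mut text = String::new();
    // Iterating a Receiver blocks per event and stops once every Sender is dropped.
    for event in rx {
        match event {
            // `Start` carries no token; a wire projection would emit its opening frame here.
            InferenceEvent::Start => {}
            InferenceEvent::TextDelta(piece) => text.push_str(&piece),
            // Reasoning is not visible output, so this sketch skips it.
            InferenceEvent::ReasoningDelta(_) => {}
            InferenceEvent::Finish { reason } => return Outcome::Complete { text, reason },
        }
    }
    Outcome::Truncated { text }
}
```

Returning `Truncated` instead of silently yielding the partial text keeps the "absence of `Finish` means the producer crashed" rule visible in the type, so a projection cannot emit final usage bookkeeping by accident.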