refactor(neuron): introduce InferenceEvent + wire projection layer
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 3m28s
build-prerelease / Build neuron-blackwell (push) Failing after 6m4s
build-prerelease / Build neuron-ampere (push) Failing after 7m20s
CI / Test (push) Successful in 7m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ada (push) Failing after 4m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped

Step 1 of the OpenAI Responses API rollout. Pure refactor: no new
endpoints, no behaviour change on the wire. Lays the seam for Step 2,
which will emit Responses-shaped streaming events from the same
harness output as chat completions.

- New `neuron::wire` module tree:
  - `wire::event::InferenceEvent` — format-agnostic enum
    (Start, TextDelta, ReasoningDelta, Finish) the candle harness
    now emits as its native streaming currency.
  - `wire::event::FinishReason` — typed reason that maps cleanly
    onto OpenAI `finish_reason`, OpenAI Responses `status`, and
    Anthropic `stop_reason` strings.
  - `wire::openai_chat::project_chat_stream` — async task that
    consumes an InferenceEvent receiver and produces a
    ChatCompletionChunk receiver, stamping per-request metadata
    (id, created, model_id) onto every chunk. Output matches the
    pre-refactor wire shape bit-for-bit.

- candle.rs refactored to emit InferenceEvent on its internal
  channel through all three streaming paths (CPU
  run_inference_streaming, CUDA single-GPU stream_inference_via_worker,
  CUDA TP chat_completion_tp_stream). The streaming functions lost
  their id/created/model_id parameters since wire-format metadata
  now lives in the projector.

- emit_delta + emit_delta_blocking simplified to single-purpose
  TextDelta emitters with no wire-format coupling.

- chat_completion_stream wraps the InferenceEvent receiver in
  wire_chat::project_chat_stream before returning so the
  /v1/chat/completions HTTP handler keeps consuming
  ChatCompletionChunks unchanged. External signature preserved.

Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated
onto a single crate-wide path_util::ENV_LOCK.
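
A minimal sketch of the consolidated-lock pattern described above. Only the name `ENV_LOCK` comes from this commit; the `with_home` helper and its signature are hypothetical, added for illustration. The point is that every test mutating HOME takes the same static, so tests in different modules can no longer interleave.

```rust
use std::env;
use std::ffi::OsString;
use std::sync::Mutex;

/// Single process-wide lock guarding every test that mutates HOME.
/// (The commit names this `path_util::ENV_LOCK`; the rest of this
/// sketch is an assumption, not the crate's actual code.)
pub static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Hypothetical helper: run `f` with HOME set to `home`, then restore
/// the previous value. HOME is not restored if `f` panics.
pub fn with_home<T>(home: &str, f: impl FnOnce() -> T) -> T {
    // Recover from poisoning so one panicking test does not fail the rest.
    let _guard = ENV_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let previous: Option<OsString> = env::var_os("HOME");

    // `set_var` / `remove_var` are `unsafe` in edition 2024; on older
    // editions the `unsafe` blocks are merely redundant.
    unsafe { env::set_var("HOME", home) };
    let result = f();
    unsafe {
        match previous {
            Some(value) => env::set_var("HOME", value),
            None => env::remove_var("HOME"),
        }
    }
    result
}
```

With one lock per module, two tests in different modules could each hold "their" lock while both rewriting HOME; a single static removes that possibility.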

122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:30:17 +03:00
parent df0abfe4d4
commit 302ccfb982
7 changed files with 491 additions and 194 deletions


@@ -0,0 +1,99 @@
//! Format-agnostic inference event stream.
//!
//! The candle harness emits a sequence of these for every streaming
//! request. Wire-format projections in sibling modules
//! ([`super::openai_chat`], the eventual `openai_responses` /
//! `anthropic_messages` projections) read this stream and produce
//! the chunks / events their HTTP clients expect.
//!
//! Design notes:
//!
//! - [`Start`] carries no token of its own. It only signals "the
//! model has accepted the prompt and is about to begin emitting
//! text". OpenAI chat materialises this as a `role: assistant`
//! chunk; OpenAI Responses as the `response.created` +
//! `response.output_item.added` pair; Anthropic as
//! `message_start`. All three of those would otherwise have to
//! peek at the *first* token to know when to emit, which couples
//! the wire layer to the producer's pacing.
//! - [`TextDelta`] is *visible* output. Reasoning / `<think>`
//! blocks will go through the [`ReasoningDelta`] variant once
//! the harness learns to split them (today they pass through as
//! plain text inside `TextDelta`; helexa-acp picks them apart on
//! the consumer side).
//! - [`Finish`] is the only place a stream is allowed to end
//! cleanly. Projections rely on this to emit final usage
//! bookkeeping; absence means the producer crashed and the
//! consumer should treat the stream as truncated.
//!
//! [`Start`]: InferenceEvent::Start
//! [`TextDelta`]: InferenceEvent::TextDelta
//! [`ReasoningDelta`]: InferenceEvent::ReasoningDelta
//! [`Finish`]: InferenceEvent::Finish

/// One unit of output from the inference loop.
///
/// Producers send these on an `mpsc::Sender<InferenceEvent>`;
/// projection layers in sibling modules consume them and emit
/// wire-format-specific frames downstream.
#[derive(Debug, Clone)]
pub enum InferenceEvent {
    /// The producer has accepted the prompt and is about to emit
    /// the first token. Sent at most once per stream.
    Start,
    /// A piece of visible assistant text. Multiple deltas
    /// concatenate into the complete reply.
    TextDelta(String),
    /// Reasoning / scratchpad text the model emitted inside a
    /// `<think>` block (or equivalent). Producers that don't
    /// surface reasoning separately use [`InferenceEvent::TextDelta`]
    /// for everything; the future split lives here.
    ///
    /// Not yet emitted by the candle harness; present so future
    /// stages (qwen3 `<think>` routing, OpenAI o-series reasoning)
    /// have a typed home without breaking the existing
    /// projections.
    #[allow(dead_code)]
    ReasoningDelta(String),
    /// The stream is complete. Carries the reason so wire formats
    /// that use it (OpenAI's `finish_reason`, Anthropic's
    /// `stop_reason`) can render it without re-parsing.
    Finish { reason: FinishReason },
}

/// Why a stream stopped. Stays small on purpose: anything that
/// doesn't map cleanly to one of these collapses to [`Stop`].
///
/// Mappings to wire formats:
///
/// | variant     | OpenAI `finish_reason` | OpenAI Responses `status` | Anthropic `stop_reason` |
/// |-------------|------------------------|---------------------------|-------------------------|
/// | `Stop`      | `"stop"`               | `"completed"`             | `"end_turn"`            |
/// | `Length`    | `"length"`             | `"incomplete"`            | `"max_tokens"`          |
/// | `ToolCalls` | `"tool_calls"`         | `"completed"`             | `"tool_use"`            |
///
/// [`Stop`]: FinishReason::Stop
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    /// Model emitted EOS naturally.
    Stop,
    /// Hit `max_tokens` before EOS.
    Length,
    /// Stopped because the model called a tool and is waiting for
    /// the result. Not yet emitted by the candle harness;
    /// reserved for the day tool-call extraction lands.
    #[allow(dead_code)]
    ToolCalls,
}

impl FinishReason {
    /// String form used by OpenAI chat completions and OpenAI
    /// completions. Wire modules can call this directly or do their
    /// own mapping for non-string formats.
    pub fn as_openai_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
            FinishReason::ToolCalls => "tool_calls",
        }
    }
}
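
The mapping table in the `FinishReason` docs could be rendered in code roughly as follows. This is a sketch: `as_openai_str` exists in this commit, while `as_responses_status` and `as_anthropic_str` are hypothetical names for mappings that later projections might add. The enum is redeclared so the block stands alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

impl FinishReason {
    /// OpenAI chat completions `finish_reason` (present in this commit).
    pub fn as_openai_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
            FinishReason::ToolCalls => "tool_calls",
        }
    }

    /// Hypothetical: OpenAI Responses `status`, following the doc table.
    /// `Stop` and `ToolCalls` both count as a completed response.
    pub fn as_responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop | FinishReason::ToolCalls => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    /// Hypothetical: Anthropic `stop_reason`, following the doc table.
    pub fn as_anthropic_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
            FinishReason::ToolCalls => "tool_use",
        }
    }
}
```

Keeping each mapping as an exhaustive `match` means a new variant fails compilation in every projection until it is handled, rather than silently collapsing to a default.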


@@ -0,0 +1,23 @@
//! Wire-format projection layer.
//!
//! The candle harness produces a single, format-agnostic stream of
//! [`InferenceEvent`]s. Each wire format (OpenAI chat completions,
//! OpenAI Responses, Anthropic messages, …) lives in its own module
//! under `wire::` and projects that event stream into the chunks /
//! events its HTTP clients expect.
//!
//! The benefit over translating *between* wire shapes (OpenAI chat
//! → Anthropic, etc.) is that we never have to reason about a
//! wire-N → wire-M conversion: every translation is wire-N ↔ the
//! internal event currency, and the projections are independent. A
//! new wire format adds a new file under `wire::`; nothing else
//! needs to know about it.
//!
//! Today: [`openai_chat`]. Step 2 adds `openai_responses`. Step 3
//! could add a native Anthropic projection that replaces the
//! gateway-side translation.

pub mod event;
pub mod openai_chat;

pub use event::{FinishReason, InferenceEvent};
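
To illustrate why independent projections avoid wire-to-wire conversions, here is a dependency-free sketch in which two formats read the same event list. The `Projection` trait, `ChatLike`, `AnthropicLike`, and `project_all` are all hypothetical; the real projections are async tasks over channels, and the frame strings are simplified labels, not real payloads. The enum is a cut-down copy of `InferenceEvent`.

```rust
/// Cut-down copy of the event enum so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    Finish,
}

/// Hypothetical trait: one implementation per wire format.
pub trait Projection {
    /// Map one event to at most one simplified frame label.
    fn project(&self, event: &InferenceEvent) -> Option<String>;
}

pub struct ChatLike;
pub struct AnthropicLike;

impl Projection for ChatLike {
    fn project(&self, event: &InferenceEvent) -> Option<String> {
        match event {
            InferenceEvent::Start => Some("role:assistant".to_string()),
            // Empty deltas produce no frame, as in the chat projection.
            InferenceEvent::TextDelta(t) if t.is_empty() => None,
            InferenceEvent::TextDelta(t) => Some(format!("content:{t}")),
            InferenceEvent::Finish => Some("finish_reason:stop".to_string()),
        }
    }
}

impl Projection for AnthropicLike {
    fn project(&self, event: &InferenceEvent) -> Option<String> {
        match event {
            InferenceEvent::Start => Some("message_start".to_string()),
            InferenceEvent::TextDelta(t) if t.is_empty() => None,
            InferenceEvent::TextDelta(t) => Some(format!("content_block_delta:{t}")),
            InferenceEvent::Finish => Some("message_stop".to_string()),
        }
    }
}

/// Run every event through one projection, keeping the emitted frames.
pub fn project_all(p: &dyn Projection, events: &[InferenceEvent]) -> Vec<String> {
    events.iter().filter_map(|e| p.project(e)).collect()
}
```

Neither implementation refers to the other, so adding a third format would not touch the existing two.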


@@ -0,0 +1,241 @@
//! OpenAI chat completions projection.
//!
//! Reads [`InferenceEvent`]s from a receiver and produces
//! [`ChatCompletionChunk`]s in the shape `POST /v1/chat/completions`
//! clients expect on its streaming SSE response. The HTTP handler in
//! [`crate::api`] wraps the resulting receiver in axum's
//! `Sse::new(...)` adapter; nothing in this module touches HTTP
//! framing or `data:` lines.
//!
//! Per the OpenAI streaming spec, three chunk shapes appear:
//!
//! 1. **Role chunk** — `delta: { "role": "assistant" }`, no content,
//! sent once at stream start. We emit this on [`InferenceEvent::Start`].
//! 2. **Content chunks** — `delta: { "content": "<text>" }`, one per
//! [`InferenceEvent::TextDelta`].
//! 3. **Final chunk** — empty `delta`, `finish_reason` populated.
//! Emitted on [`InferenceEvent::Finish`].
//!
//! `usage` stays `None` on every chunk; the legacy candle paths
//! never surfaced usage on the streaming endpoint and we keep that
//! behaviour bit-for-bit so existing clients see no diff.
//!
//! Back-pressure: the projection task awaits both `rx.recv()` and
//! `tx.send()`. A slow consumer fills the output channel → the
//! task blocks on send → it stops reading from the input → the
//! producer blocks on its own send. The bounded channels
//! propagate back-pressure end to end without any extra logic.

use cortex_core::openai::{ChatCompletionChunk, ChunkChoice};
use serde_json::json;
use tokio::sync::mpsc;

use super::event::{FinishReason, InferenceEvent};

/// Output channel buffer size. Mirrors the input side's bound; one
/// event maps to at most one chunk, so equal capacity keeps the
/// two ends in sync without surprising memory growth.
const CHUNK_CHANNEL_CAPACITY: usize = 32;

/// Project an [`InferenceEvent`] receiver into a
/// [`ChatCompletionChunk`] receiver. Spawns one tokio task that
/// owns the input receiver for the stream's lifetime and exits
/// when either side closes.
///
/// `id`, `created`, and `model_id` are stamped into every emitted
/// chunk so the event producer can stay generic (decoupled from
/// per-request metadata).
pub fn project_chat_stream(
    mut rx: mpsc::Receiver<InferenceEvent>,
    id: String,
    created: u64,
    model_id: String,
) -> mpsc::Receiver<ChatCompletionChunk> {
    let (tx, out_rx) = mpsc::channel::<ChatCompletionChunk>(CHUNK_CHANNEL_CAPACITY);

    tokio::spawn(async move {
        while let Some(event) = rx.recv().await {
            let chunks = match event {
                InferenceEvent::Start => vec![role_chunk(&id, created, &model_id)],
                InferenceEvent::TextDelta(text) => {
                    if text.is_empty() {
                        // DecodeStream is buffering a multi-byte
                        // codepoint; don't bother sending an empty
                        // chunk downstream.
                        continue;
                    }
                    vec![content_chunk(&id, created, &model_id, &text)]
                }
                InferenceEvent::ReasoningDelta(_) => {
                    // Reasoning isn't representable in OpenAI chat
                    // streaming today. The o-series uses a separate
                    // `summary` event but it's gated by the
                    // Responses API; chat-completions just drops it.
                    continue;
                }
                InferenceEvent::Finish { reason } => {
                    vec![final_chunk(&id, created, &model_id, reason)]
                }
            };
            for chunk in chunks {
                if tx.send(chunk).await.is_err() {
                    // Consumer hung up; nothing more to do.
                    return;
                }
            }
        }
    });

    out_rx
}

fn role_chunk(id: &str, created: u64, model_id: &str) -> ChatCompletionChunk {
    ChatCompletionChunk {
        id: id.into(),
        object: "chat.completion.chunk".into(),
        created,
        model: model_id.into(),
        choices: vec![ChunkChoice {
            index: 0,
            delta: json!({ "role": "assistant" }),
            finish_reason: None,
            extra: serde_json::Value::Object(Default::default()),
        }],
        usage: None,
        extra: serde_json::Value::Object(Default::default()),
    }
}

fn content_chunk(id: &str, created: u64, model_id: &str, text: &str) -> ChatCompletionChunk {
    ChatCompletionChunk {
        id: id.into(),
        object: "chat.completion.chunk".into(),
        created,
        model: model_id.into(),
        choices: vec![ChunkChoice {
            index: 0,
            delta: json!({ "content": text }),
            finish_reason: None,
            extra: serde_json::Value::Object(Default::default()),
        }],
        usage: None,
        extra: serde_json::Value::Object(Default::default()),
    }
}

fn final_chunk(
    id: &str,
    created: u64,
    model_id: &str,
    reason: FinishReason,
) -> ChatCompletionChunk {
    ChatCompletionChunk {
        id: id.into(),
        object: "chat.completion.chunk".into(),
        created,
        model: model_id.into(),
        choices: vec![ChunkChoice {
            index: 0,
            delta: serde_json::Value::Object(Default::default()),
            finish_reason: Some(reason.as_openai_str().to_string()),
            extra: serde_json::Value::Object(Default::default()),
        }],
        usage: None,
        extra: serde_json::Value::Object(Default::default()),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Drain the projection's output into a Vec for assertion.
    async fn collect(mut rx: mpsc::Receiver<ChatCompletionChunk>) -> Vec<ChatCompletionChunk> {
        let mut out = Vec::new();
        while let Some(chunk) = rx.recv().await {
            out.push(chunk);
        }
        out
    }

    #[tokio::test]
    async fn empty_event_stream_yields_no_chunks() {
        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
        drop(tx);
        let out = collect(project_chat_stream(rx, "id-1".into(), 1700, "m".into())).await;
        assert!(out.is_empty());
    }

    #[tokio::test]
    async fn start_text_finish_produces_three_chunks() {
        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
        let out_rx = project_chat_stream(rx, "id-1".into(), 1700, "m".into());
        tx.send(InferenceEvent::Start).await.unwrap();
        tx.send(InferenceEvent::TextDelta("hello".into()))
            .await
            .unwrap();
        tx.send(InferenceEvent::Finish {
            reason: FinishReason::Stop,
        })
        .await
        .unwrap();
        drop(tx);

        let out = collect(out_rx).await;
        assert_eq!(out.len(), 3);
        assert_eq!(out[0].choices[0].delta["role"], "assistant");
        assert_eq!(out[1].choices[0].delta["content"], "hello");
        assert_eq!(out[2].choices[0].finish_reason.as_deref(), Some("stop"));
        // Every chunk carries the stamped metadata.
        for chunk in &out {
            assert_eq!(chunk.id, "id-1");
            assert_eq!(chunk.created, 1700);
            assert_eq!(chunk.model, "m");
            assert_eq!(chunk.object, "chat.completion.chunk");
        }
    }

    #[tokio::test]
    async fn empty_text_delta_is_dropped() {
        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
        tx.send(InferenceEvent::TextDelta(String::new()))
            .await
            .unwrap();
        drop(tx);
        let out = collect(out_rx).await;
        assert!(out.is_empty(), "empty deltas must not produce chunks");
    }

    #[tokio::test]
    async fn finish_length_maps_to_openai_string() {
        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
        tx.send(InferenceEvent::Finish {
            reason: FinishReason::Length,
        })
        .await
        .unwrap();
        drop(tx);
        let out = collect(out_rx).await;
        assert_eq!(out.len(), 1);
        assert_eq!(out[0].choices[0].finish_reason.as_deref(), Some("length"));
    }

    #[tokio::test]
    async fn reasoning_delta_is_dropped_in_chat_projection() {
        let (tx, rx) = mpsc::channel::<InferenceEvent>(4);
        let out_rx = project_chat_stream(rx, "id".into(), 1, "m".into());
        tx.send(InferenceEvent::ReasoningDelta("<think>".into()))
            .await
            .unwrap();
        tx.send(InferenceEvent::TextDelta("real".into()))
            .await
            .unwrap();
        drop(tx);
        let out = collect(out_rx).await;
        assert_eq!(out.len(), 1);
        assert_eq!(out[0].choices[0].delta["content"], "real");
    }
}
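
The back-pressure note in the module docs relies on bounded channels refusing new items once full. The sketch below shows that property with the standard library's `sync_channel` instead of tokio's `mpsc`, purely to stay dependency-free; with tokio, `send().await` suspends the sender where `try_send` here reports `Full`. Both function names are hypothetical, and 32 in the usage mirrors `CHUNK_CHANNEL_CAPACITY`.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Push up to `n` items into a channel bounded at `capacity` while no
/// consumer is reading; returns how many were accepted before the
/// channel reported it was full.
pub fn accepted_before_full(capacity: usize, n: usize) -> usize {
    // `_rx` stays alive so the channel is full, not disconnected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Fill a channel (capacity must be at least 1), confirm the next send
/// is refused, drain one item, and report whether a send then succeeds.
pub fn slot_frees_after_recv(capacity: usize) -> bool {
    let (tx, rx) = sync_channel::<usize>(capacity);
    for i in 0..capacity {
        tx.try_send(i).expect("channel should have room while filling");
    }
    let was_full = matches!(tx.try_send(0), Err(TrySendError::Full(_)));
    // The consumer taking one item is what releases the producer.
    let _ = rx.recv();
    was_full && tx.try_send(0).is_ok()
}
```

Chaining two such channels through a task that only reads when it can write gives the end-to-end stall the module docs describe: a full output channel stops the projection, which in turn lets the input channel fill.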