refactor(neuron): introduce InferenceEvent + wire projection layer
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 3m28s
build-prerelease / Build neuron-blackwell (push) Failing after 6m4s
build-prerelease / Build neuron-ampere (push) Failing after 7m20s
CI / Test (push) Successful in 7m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ada (push) Failing after 4m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Step 1 of the OpenAI Responses API rollout. Pure refactor — no new
endpoints, no behaviour change on the wire. Lays the seam for
emitting Responses-shaped streaming events from the same harness
output as chat completions in Step 2.
- New `neuron::wire` module tree:
  - `wire::event::InferenceEvent`: format-agnostic enum
    (Start, TextDelta, ReasoningDelta, Finish) the candle harness
    now emits as its native streaming currency.
  - `wire::event::FinishReason`: typed reason that maps cleanly
    onto OpenAI `finish_reason`, OpenAI Responses `status`, and
    Anthropic `stop_reason` strings.
  - `wire::openai_chat::project_chat_stream`: async task that
    consumes an InferenceEvent receiver and produces a
    ChatCompletionChunk receiver, stamping per-request metadata
    (id, created, model_id) onto every chunk. Output matches the
    pre-refactor wire shape bit-for-bit.
- candle.rs refactored to emit InferenceEvent on its internal
  channel through all three streaming paths (CPU
  run_inference_streaming, CUDA single-GPU stream_inference_via_worker,
  CUDA TP chat_completion_tp_stream). The streaming functions lost
  their id/created/model_id parameters since wire-format metadata
  now lives in the projector.
- emit_delta + emit_delta_blocking simplified to single-purpose
  TextDelta emitters with no wire-format coupling.
- chat_completion_stream wraps the InferenceEvent receiver in
  wire_chat::project_chat_stream before returning so the
  /v1/chat/completions HTTP handler keeps consuming
  ChatCompletionChunks unchanged. External signature preserved.
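For readers skimming the message, the projection seam described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: it uses `std::sync::mpsc` and a plain thread instead of the async task the message describes, `Chunk` and `RequestMeta` are hypothetical and much smaller than the real ChatCompletionChunk, and the `ReasoningDelta` / `ToolCalls` variants are left out for brevity.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Reduced copy of the finish reasons named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
}

/// Reduced copy of the format-agnostic event enum.
#[derive(Debug, Clone)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    Finish { reason: FinishReason },
}

/// Hypothetical per-request metadata; the producer never sees it.
#[derive(Debug, Clone)]
pub struct RequestMeta {
    pub id: String,
    pub created: u64,
    pub model_id: String,
}

/// Hypothetical, simplified chunk; not the real ChatCompletionChunk.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub id: String,
    pub created: u64,
    pub model: String,
    pub role: Option<&'static str>,
    pub content: Option<String>,
    pub finish_reason: Option<&'static str>,
}

/// Consume events and emit chunks, stamping the metadata onto each one.
pub fn project_chat_stream(events: Receiver<InferenceEvent>, meta: RequestMeta) -> Receiver<Chunk> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // The loop ends when the producer drops its sender.
        for event in events {
            let mut chunk = Chunk {
                id: meta.id.clone(),
                created: meta.created,
                model: meta.model_id.clone(),
                role: None,
                content: None,
                finish_reason: None,
            };
            match event {
                InferenceEvent::Start => chunk.role = Some("assistant"),
                InferenceEvent::TextDelta(text) => chunk.content = Some(text),
                InferenceEvent::Finish { reason } => {
                    chunk.finish_reason = Some(match reason {
                        FinishReason::Stop => "stop",
                        FinishReason::Length => "length",
                    });
                }
            }
            // Stop projecting if the HTTP side has gone away.
            if tx.send(chunk).is_err() {
                break;
            }
        }
    });
    rx
}
```

The point of the shape is that the producer only ever sends `InferenceEvent` values, so a second projection (Responses, Anthropic) could presumably read the same stream without touching the harness.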
Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated
onto a single crate-wide path_util::ENV_LOCK.
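A minimal sketch of the single-lock pattern, for context. Only the name `ENV_LOCK` comes from the message; the `with_home` helper is hypothetical and the real `path_util` module may look quite different.

```rust
use std::env;
use std::sync::{Mutex, PoisonError};

/// One lock for every test in the crate that reads or mutates HOME.
/// Per-module locks do not exclude each other, which is the race
/// the message describes.
pub static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Hypothetical helper: run `f` with HOME set to `home`, then restore
/// the previous value. Restoration is skipped if `f` panics.
#[allow(unused_unsafe)] // set_var/remove_var are only `unsafe` in edition 2024
pub fn with_home<T>(home: &str, f: impl FnOnce() -> T) -> T {
    // A panicking test poisons the mutex; recover the guard so the
    // remaining tests can still serialise on it.
    let _guard = ENV_LOCK.lock().unwrap_or_else(PoisonError::into_inner);
    let previous = env::var_os("HOME");
    unsafe { env::set_var("HOME", home) };
    let result = f();
    unsafe {
        match previous {
            Some(value) => env::set_var("HOME", value),
            None => env::remove_var("HOME"),
        }
    }
    result
}
```

Tests that only read HOME would need to take the same lock, otherwise they can still observe another test's temporary value.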
122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
99
crates/neuron/src/wire/event.rs
Normal file
@@ -0,0 +1,99 @@
//! Format-agnostic inference event stream.
//!
//! The candle harness emits a sequence of these for every streaming
//! request. Wire-format projections in sibling modules
//! ([`super::openai_chat`], the eventual `openai_responses` /
//! `anthropic_messages` projections) read this stream and produce
//! the chunks / events their HTTP clients expect.
//!
//! Design notes:
//!
//! - [`Start`] carries no token of its own. It only signals "the
//!   model has accepted the prompt and is about to begin emitting
//!   text". OpenAI chat materialises this as a `role: assistant`
//!   chunk; OpenAI Responses as the `response.created` +
//!   `response.output_item.added` pair; Anthropic as
//!   `message_start`. All three of those would otherwise have to
//!   peek at the *first* token to know when to emit, which couples
//!   the wire layer to the producer's pacing.
//! - [`TextDelta`] is *visible* output. Reasoning / `<think>`
//!   blocks go through a future [`ReasoningDelta`] variant once
//!   the harness learns to split them (today they pass through as
//!   plain text inside `TextDelta`; helexa-acp picks them apart on
//!   the consumer side).
//! - [`Finish`] is the only place a stream is allowed to end
//!   cleanly. Projections rely on this to emit final usage
//!   bookkeeping; absence means the producer crashed and the
//!   consumer should treat the stream as truncated.
//!
//! [`Start`]: InferenceEvent::Start
//! [`TextDelta`]: InferenceEvent::TextDelta
//! [`Finish`]: InferenceEvent::Finish

/// One unit of output from the inference loop.
///
/// Producers send these on an `mpsc::Sender<InferenceEvent>`;
/// projection layers in sibling modules consume them and emit
/// wire-format-specific frames downstream.
#[derive(Debug, Clone)]
pub enum InferenceEvent {
    /// The producer has accepted the prompt and is about to emit
    /// the first token. Sent at most once per stream.
    Start,
    /// A piece of visible assistant text. Multiple deltas
    /// concatenate into the complete reply.
    TextDelta(String),
    /// Reasoning / scratchpad text the model emitted inside a
    /// `<think>` block (or equivalent). Producers that don't
    /// surface reasoning separately use [`TextDelta`] for
    /// everything; future split lives here.
    ///
    /// Not yet emitted by the candle harness; present so future
    /// stages (qwen3 `<think>` routing, OpenAI o-series reasoning)
    /// have a typed home without breaking the existing
    /// projections.
    #[allow(dead_code)]
    ReasoningDelta(String),
    /// The stream is complete. Carries the reason so wire formats
    /// that use it (OpenAI's `finish_reason`, Anthropic's
    /// `stop_reason`) can render it without re-parsing.
    Finish { reason: FinishReason },
}

/// Why a stream stopped. Stays small on purpose: anything that
/// doesn't map cleanly to one of these collapses to [`Stop`].
///
/// Mappings to wire formats:
///
/// | variant | OpenAI `finish_reason` | OpenAI Responses `status` | Anthropic `stop_reason` |
/// |---------|------------------------|---------------------------|-------------------------|
/// | `Stop` | `"stop"` | `"completed"` | `"end_turn"` |
/// | `Length`| `"length"` | `"incomplete"` | `"max_tokens"` |
/// | `ToolCalls` | `"tool_calls"` | `"completed"` | `"tool_use"` |
///
/// [`Stop`]: FinishReason::Stop
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    /// Model emitted EOS naturally.
    Stop,
    /// Hit `max_tokens` before EOS.
    Length,
    /// Stopped because the model called a tool and is waiting for
    /// the result. Not yet emitted by the candle harness;
    /// reserved for the day tool-call extraction lands.
    #[allow(dead_code)]
    ToolCalls,
}

impl FinishReason {
    /// String form used by OpenAI chat completions and OpenAI
    /// completions. Wire modules can call this directly or do their
    /// own mapping for non-string formats.
    pub fn as_openai_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
            FinishReason::ToolCalls => "tool_calls",
        }
    }
}
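Review note: the other two columns of the `FinishReason` doc table could be rendered the same way as `as_openai_str`. The sketch below is a guess at what later steps might add; the method names `as_responses_status` and `as_anthropic_str` are hypothetical and do not exist in this commit, while the string values are taken from the table in the doc comment.

```rust
/// Standalone copy of the enum from this diff so the sketch compiles alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
}

impl FinishReason {
    /// Hypothetical: OpenAI Responses `status` column of the doc table.
    pub fn as_responses_status(self) -> &'static str {
        match self {
            // A tool call still counts as a completed response.
            FinishReason::Stop | FinishReason::ToolCalls => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    /// Hypothetical: Anthropic `stop_reason` column of the doc table.
    pub fn as_anthropic_str(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
            FinishReason::ToolCalls => "tool_use",
        }
    }
}
```

Keeping each mapping as an exhaustive `match` means a new variant fails to compile until every wire format has decided how to render it.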