feat(neuron): OpenAI Responses API + ci cuda-check runner label

Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.

## Surface

- `cortex-core::responses` envelope types: `ResponsesRequest`,
  `ResponsesInput` (text | items), `ResponsesInputItem` (message |
  function_call | function_call_output | reasoning),
  `ResponsesContentPart` (input_text | input_image | output_text),
  `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
  an `events::*` constant module so the projector and the wire shape
  stay in sync without string typos.

- `neuron::wire::openai_responses`:
  - `request_to_chat(req)` flattens Responses input + instructions
    into a `ChatCompletionRequest` the candle harness already
    understands. Text-only Parts collapse to a string; mixed
    text+image Parts go to chat's content-array shape; reasoning
    items drop; function_call / function_call_output round-trip
    via tool_calls / tool_call_id metadata so the surface is
    consistent for the day the harness emits tool calls.
  - `project_responses_stream(rx, meta)` reads InferenceEvents
    and emits the eight named events that compose a Responses
    stream: response.created → output_item.added → content_part.added
    → output_text.delta×N → output_text.done → content_part.done
    → output_item.done → response.completed. Synthesises start
    frames if the producer skips Start (poisoned model, early
    disconnect) so the stream stays coherent.
  - `build_response(meta, text, reason, usage)` for the
    non-streaming path.

- `CandleHarness::inference_stream(req)` extracted from
  `chat_completion_stream`, returning a typed `InferenceStream`
  (event receiver + id/created/model_id metadata). Both
  `chat_completion_stream` and the new `responses_stream` are now
  thin wrappers that pick their wire projection. TP path got the
  same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).

- `POST /v1/responses` route on neuron. Non-streaming returns one
  buffered `ResponsesResponse`; streaming returns axum SSE with
  both event names and JSON data per frame (Responses, unlike
  chat completions, uses named `event:` lines). The
  `inference_error_response` helper is hoisted out so the chat and
  responses handlers share the InferenceError → HTTP mapping.
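
As a rough illustration of the projector's ordering contract, the
sketch below replays a simplified event stream into the named frames.
It is not the code from `neuron::wire::openai_responses`: the
three-variant `InferenceEvent`, the `Frame` struct and `project` are
hypothetical stand-ins, a std `mpsc` channel replaces the async one,
and the event names are spelled with the `response.` prefix the OpenAI
API uses. Closing the stream when the producer disconnects without a
finish event is an assumption of the sketch.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical, simplified stand-in for the harness's event enum.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    Delta(String),
    Finish,
}

/// One named SSE frame. JSON payload serialisation is left out; `text`
/// carries only the delta or the final text where one applies.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    pub event: &'static str,
    pub text: Option<String>,
}

const START_FRAMES: [&str; 3] = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
];

const DONE_FRAMES: [&str; 3] = [
    "response.content_part.done",
    "response.output_item.done",
    "response.completed",
];

/// Emits the three opening frames exactly once.
fn push_start(out: &mut Vec<Frame>, started: &mut bool) {
    if !*started {
        out.extend(START_FRAMES.iter().map(|&event| Frame { event, text: None }));
        *started = true;
    }
}

/// Drains the receiver and returns the frames in wire order.
pub fn project(rx: Receiver<InferenceEvent>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut started = false;
    let mut full_text = String::new();

    for ev in rx {
        match ev {
            InferenceEvent::Start => push_start(&mut out, &mut started),
            InferenceEvent::Delta(chunk) => {
                // Synthesise the opening frames if the producer skipped Start.
                push_start(&mut out, &mut started);
                full_text.push_str(&chunk);
                out.push(Frame {
                    event: "response.output_text.delta",
                    text: Some(chunk),
                });
            }
            InferenceEvent::Finish => break,
        }
    }

    // Even an empty or truncated stream gets a coherent open/close envelope.
    push_start(&mut out, &mut started);
    out.push(Frame {
        event: "response.output_text.done",
        text: Some(full_text),
    });
    out.extend(DONE_FRAMES.iter().map(|&event| Frame { event, text: None }));
    out
}
```

Feeding two deltas and a finish event without a preceding `Start`
yields nine frames: the three synthesised opening frames, two deltas,
`response.output_text.done` carrying the concatenated text, and the
three closing frames.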

## CI

Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: runners labelled `rpm` don't ship the CUDA toolkit,
so cudarc's nvcc-version build script failed. Switched to
`runs-on: cuda-13.0`, one of the existing runner labels.

## Scope cuts (documented in the modules)

- `previous_response_id` rejected at translate time with 400
  (`code: chained_conversation_not_supported`) — stateful chained
  conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
  `function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
  and the web_search / image_generation event families are
  out-of-scope.
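
The `previous_response_id` cut could look roughly like the guard
below. This is a sketch under assumptions, not the translator's actual
code: the trimmed `ResponsesRequest`, the `ApiError` struct and the
`check_stateless` name are hypothetical; only the 400 status and the
`chained_conversation_not_supported` code come from this commit.

```rust
/// Trimmed, hypothetical view of the request envelope.
#[derive(Debug, Default)]
pub struct ResponsesRequest {
    pub model: String,
    pub previous_response_id: Option<String>,
}

/// Hypothetical error shape the HTTP layer would map to a response body.
#[derive(Debug, PartialEq)]
pub struct ApiError {
    pub status: u16,
    pub code: &'static str,
    pub message: String,
}

/// Rejects chained conversations before any inference work starts.
pub fn check_stateless(req: &ResponsesRequest) -> Result<(), ApiError> {
    match &req.previous_response_id {
        Some(prev) => Err(ApiError {
            status: 400,
            code: "chained_conversation_not_supported",
            message: format!(
                "previous_response_id `{}` cannot be resolved: no response store exists",
                prev
            ),
        }),
        None => Ok(()),
    }
}
```

Rejecting at translate time keeps the failure cheap: no model lookup
or stream setup happens for a request that cannot be served.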

22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:13:44 +03:00
parent 1859777332
commit 957f704efa
8 changed files with 1635 additions and 22 deletions


@@ -1595,6 +1595,49 @@ impl CandleHarness {
         &self,
         request: ChatCompletionRequest,
     ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+        let stream = self.inference_stream(request).await?;
+        Ok(wire_chat::project_chat_stream(
+            stream.events,
+            stream.id,
+            stream.created,
+            stream.model_id,
+        ))
+    }
+
+    /// Streaming OpenAI Responses API entry point. Same harness
+    /// output as [`Self::chat_completion_stream`], projected into
+    /// the named-event SSE frames the Responses API client wants.
+    /// `response_id` and `message_item_id` are stamped into every
+    /// frame so the consumer can correlate.
+    pub async fn responses_stream(
+        &self,
+        request: ChatCompletionRequest,
+        response_id: String,
+        message_item_id: String,
+    ) -> Result<mpsc::Receiver<crate::wire::openai_responses::ResponseStreamFrame>, InferenceError>
+    {
+        let stream = self.inference_stream(request).await?;
+        let meta = crate::wire::openai_responses::ResponseMeta {
+            response_id,
+            created_at: stream.created,
+            model_id: stream.model_id,
+            message_item_id,
+        };
+        Ok(crate::wire::openai_responses::project_responses_stream(
+            stream.events,
+            meta,
+        ))
+    }
+
+    /// Format-agnostic streaming inference. Returns the raw
+    /// [`InferenceEvent`] receiver plus the per-request metadata
+    /// wire projectors stamp onto their frames. Lets every wire
+    /// format land on the same harness output without duplicating
+    /// setup / dispatch / spawn logic.
+    async fn inference_stream(
+        &self,
+        request: ChatCompletionRequest,
+    ) -> Result<InferenceStream, InferenceError> {
         let handle = {
             let models = self.models.read().await;
             models.get(&request.model).cloned()
@@ -1608,7 +1651,7 @@ impl CandleHarness {
             LoadedHandle::Single(m) => m,
             #[cfg(feature = "cuda")]
             LoadedHandle::Tp(m) => {
-                return self.chat_completion_tp_stream(m, request).await;
+                return self.inference_tp_stream(m, request).await;
             }
         };
@@ -1807,16 +1850,39 @@ impl CandleHarness {
             )));
         }

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps receiving
-        // ChatCompletionChunks bit-for-bit identical to before.
-        // The id/created/model_id snapshot taken at request setup
-        // gets stamped into every emitted chunk.
-        let rx = wire_chat::project_chat_stream(event_rx, id, created, model_id);
-        Ok(rx)
+        // Hand the raw event channel back to the public entry
+        // points (chat_completion_stream / responses_stream); they
+        // pick the wire projection.
+        Ok(InferenceStream {
+            events: event_rx,
+            id,
+            created,
+            model_id,
+        })
     }
 }

+/// The seam between inference (one shape, always) and wire formats
+/// (many shapes, projector-per-format). Public so the format
+/// projectors live outside the harness and the harness's
+/// streaming-inference internals stay encapsulated.
+pub struct InferenceStream {
+    /// Stream of model-output events. Producers (the various
+    /// inference loops) emit on this; consumers (wire projectors)
+    /// read from it.
+    pub events: mpsc::Receiver<InferenceEvent>,
+    /// Request id stamped into every wire-format frame
+    /// (`chatcmpl-…` for chat completions; the Responses path
+    /// makes its own `resp_…` id separately and ignores this one).
+    pub id: String,
+    /// Unix seconds when inference began. Same field threads into
+    /// every wire format's `created` / `created_at` slot.
+    pub created: u64,
+    /// Local model id (no endpoint prefix). Stamped into every
+    /// wire-format frame so consumers can correlate.
+    pub model_id: String,
+}
+
 #[async_trait]
 impl Harness for CandleHarness {
     fn name(&self) -> &str {
@@ -2234,11 +2300,11 @@ impl CandleHarness {
     /// So we `tokio::spawn` the orchestration task and use plain
     /// `Sender::send`.
     #[cfg(feature = "cuda")]
-    async fn chat_completion_tp_stream(
+    async fn inference_tp_stream(
         &self,
         tp: Arc<TpLoadedModel>,
         request: ChatCompletionRequest,
-    ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+    ) -> Result<InferenceStream, InferenceError> {
         if tp.poisoned.load(Ordering::Acquire) {
             return Err(poisoned_error(&request.model));
         }
@@ -2542,14 +2608,16 @@ impl CandleHarness {
             .instrument(span),
         );

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps consuming
-        // ChatCompletionChunks unchanged. Uses the clones we
-        // stashed before the spawn — the originals were moved
+        // Hand the raw event channel back to the public entry
+        // points; they pick the wire projection. Uses the clones
+        // we stashed before the spawn — the originals were moved
         // into the orchestration task above.
-        let rx =
-            wire_chat::project_chat_stream(event_rx, projector_id, created, projector_model_id);
-        Ok(rx)
+        Ok(InferenceStream {
+            events: event_rx,
+            id: projector_id,
+            created,
+            model_id: projector_model_id,
+        })
     }
 }
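
The shape of the refactor in this diff, one format-agnostic producer
with a thin wrapper per wire format, can be sketched in isolation as
follows. Everything here is a simplified assumption: a std `mpsc`
channel stands in for the async one, the two-variant `InferenceEvent`
and the word-splitting "model" are hypothetical, and the wrappers
return plain strings rather than real chunk or frame types. Only the
`InferenceStream` field names follow the struct added above.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical, simplified stand-in for the harness's event enum.
pub enum InferenceEvent {
    Delta(String),
    Finish,
}

/// Same field names as the struct in the diff, std channel instead of async.
pub struct InferenceStream {
    pub events: Receiver<InferenceEvent>,
    pub id: String,
    pub created: u64,
    pub model_id: String,
}

/// Format-agnostic producer: a fake model that emits one delta per word.
fn inference_stream(model_id: &str, prompt: &str) -> InferenceStream {
    let (tx, rx) = channel();
    for word in prompt.split_whitespace() {
        tx.send(InferenceEvent::Delta(word.to_string())).unwrap();
    }
    tx.send(InferenceEvent::Finish).unwrap();
    InferenceStream {
        events: rx,
        id: "chatcmpl-demo".to_string(),
        created: 0,
        model_id: model_id.to_string(),
    }
}

/// Chat-style wrapper: stamps the stream's own id into every chunk.
pub fn chat_completion_stream(model_id: &str, prompt: &str) -> Vec<String> {
    let stream = inference_stream(model_id, prompt);
    let mut out = Vec::new();
    for ev in stream.events {
        match ev {
            InferenceEvent::Delta(text) => {
                out.push(format!("{} {} {}", stream.id, stream.model_id, text));
            }
            InferenceEvent::Finish => break,
        }
    }
    out
}

/// Responses-style wrapper: ignores `stream.id`, stamps the caller's
/// response id, and pairs each payload with an event name.
pub fn responses_stream(
    model_id: &str,
    prompt: &str,
    response_id: &str,
) -> Vec<(&'static str, String)> {
    let stream = inference_stream(model_id, prompt);
    let mut out = Vec::new();
    for ev in stream.events {
        match ev {
            InferenceEvent::Delta(text) => out.push((
                "response.output_text.delta",
                format!("{} {} {}", response_id, stream.model_id, text),
            )),
            InferenceEvent::Finish => break,
        }
    }
    out
}
```

Both wrappers share setup and differ only in how they project events,
which is why a further wire format would need a new projector rather
than a new inference loop.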