feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.
## Surface
- `cortex-core::responses` envelope types: `ResponsesRequest`,
`ResponsesInput` (text | items), `ResponsesInputItem` (message |
function_call | function_call_output | reasoning),
`ResponsesContentPart` (input_text | input_image | output_text),
`ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
a `events::*` constant module so the projector and the wire shape
stay in sync without string-typos.
- `neuron::wire::openai_responses`:
  - `request_to_chat(req)` flattens Responses input + instructions
    into a `ChatCompletionRequest` the candle harness already
    understands. Text-only Parts collapse to a string; mixed
    text+image Parts go to chat's content-array shape; reasoning
    items drop; function_call / function_call_output round-trip
    via tool_calls / tool_call_id metadata so the surface is
    consistent for the day the harness emits tool calls.
  - `project_responses_stream(rx, meta)` reads InferenceEvents
    and emits the eight named events that compose a Responses
    stream: response.created → output_item.added → content_part.added
    → output_text.delta×N → output_text.done → content_part.done
    → output_item.done → response.completed. Synthesises start
    frames if the producer skips Start (poisoned model, early
    disconnect) so the stream stays coherent.
  - `build_response(meta, text, reason, usage)` for the
    non-streaming path.
- `CandleHarness::inference_stream(req)` extracted from
`chat_completion_stream`, returning a typed `InferenceStream`
(event receiver + id/created/model_id metadata). Both
`chat_completion_stream` and the new `responses_stream` are now
thin wrappers that pick their wire projection. TP path got the
same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
buffered `ResponsesResponse`; streaming returns axum SSE with
both event names and JSON data per frame (Responses, unlike
chat completions, uses named `event:` lines). Reused
`inference_error_response` helper hoisted out so the chat and
responses handlers share the InferenceError → HTTP mapping.
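
The projector's event ordering and the named `event:` framing described above can be sketched as follows. This is a minimal illustration, not the code in `neuron::wire::openai_responses`: the `InferenceEvent` variants, the `Frame` type, the helper names and the JSON payloads are all simplified assumptions, and a blocking `std::sync::mpsc` receiver stands in for the async channel. Event names are written with the full `response.` prefix; the list above abbreviates some of them.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the harness's `InferenceEvent`; the real enum
/// is not shown in this commit and may carry more variants and fields.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    Delta(String),
    Finish,
}

/// One SSE frame: an event name plus a simplified placeholder JSON payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    pub event: &'static str,
    pub data: String,
}

impl Frame {
    /// Render as SSE wire text: `event:` line, `data:` line, blank line.
    pub fn to_sse(&self) -> String {
        format!("event: {}\ndata: {}\n\n", self.event, self.data)
    }
}

/// Minimal JSON string escaping, enough for this sketch.
fn json_str(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn bare(event: &'static str) -> Frame {
    Frame { event, data: format!("{{\"type\":{}}}", json_str(event)) }
}

fn with_text(event: &'static str, key: &str, text: &str) -> Frame {
    let data = format!("{{\"type\":{},{}:{}}}", json_str(event), json_str(key), json_str(text));
    Frame { event, data }
}

fn push_opening(out: &mut Vec<Frame>) {
    out.push(bare("response.created"));
    out.push(bare("response.output_item.added"));
    out.push(bare("response.content_part.added"));
}

/// Collects the whole stream into a Vec for simplicity; a real projector
/// would forward frames as they arrive.
pub fn project_frames(rx: Receiver<InferenceEvent>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut started = false;
    let mut text = String::new();
    for event in rx {
        // Synthesise the opening frames if the producer skipped `Start`.
        if !started {
            push_opening(&mut out);
            started = true;
        }
        match event {
            InferenceEvent::Start => {}
            InferenceEvent::Delta(delta) => {
                out.push(with_text("response.output_text.delta", "delta", &delta));
                text.push_str(&delta);
            }
            InferenceEvent::Finish => break,
        }
    }
    // The producer hung up before sending anything: still open the stream.
    if !started {
        push_opening(&mut out);
    }
    out.push(with_text("response.output_text.done", "text", &text));
    out.push(bare("response.content_part.done"));
    out.push(bare("response.output_item.done"));
    out.push(bare("response.completed"));
    out
}
```

Feeding this a channel that carries a single `Delta` and no `Start` still yields the eight frames in order. Closing every stream with `response.completed`, even an empty one, is a simplification of this sketch; how the real projector reports a failed run is not stated in this commit.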
## CI
Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: runners labelled `rpm` don't ship the CUDA toolkit,
so cudarc's nvcc-version build script failed under `runs-on: rpm`.
Switched to `runs-on: cuda-13.0`, per the existing labels.
## Scope cuts (documented in the modules)
- `previous_response_id` rejected at translate time with 400
(`code: chained_conversation_not_supported`) — stateful chained
conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
`function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
and the web_search / image_generation event families are
out-of-scope.
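
The first two scope cuts can be illustrated with a trimmed-down translator. This is a sketch under stated assumptions, not the real `request_to_chat`: every type here is a hypothetical stand-in for the `cortex-core::responses` envelopes, and mapping `instructions` to a `system` message and bare text input to a `user` message is an assumption about the flattening, which the commit message does not spell out.

```rust
/// Hypothetical, trimmed-down input item; the real `ResponsesInputItem`
/// also has function_call and function_call_output variants.
#[derive(Debug, Clone, PartialEq)]
pub enum InputItem {
    Message { role: String, text: String },
    Reasoning { summary: String },
}

/// Hypothetical stand-in for `ResponsesInput` (text | items).
#[derive(Debug, Clone, PartialEq)]
pub enum Input {
    Text(String),
    Items(Vec<InputItem>),
}

/// Hypothetical stand-in for `ResponsesRequest`.
#[derive(Debug, Clone, PartialEq)]
pub struct ResponsesRequest {
    pub instructions: Option<String>,
    pub input: Input,
    pub previous_response_id: Option<String>,
}

/// Hypothetical flattened chat message.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Hypothetical error carrying the HTTP status and the error code.
#[derive(Debug, Clone, PartialEq)]
pub struct TranslateError {
    pub status: u16,
    pub code: &'static str,
}

pub fn request_to_chat_sketch(req: ResponsesRequest) -> Result<Vec<ChatMessage>, TranslateError> {
    // Chained conversations need persistence, so reject them up front.
    if req.previous_response_id.is_some() {
        return Err(TranslateError {
            status: 400,
            code: "chained_conversation_not_supported",
        });
    }
    let mut messages = Vec::new();
    // Assumption: instructions become a leading system message.
    if let Some(instructions) = req.instructions {
        messages.push(ChatMessage { role: "system".to_string(), content: instructions });
    }
    match req.input {
        // Assumption: bare text input is treated as a single user message.
        Input::Text(text) => messages.push(ChatMessage { role: "user".to_string(), content: text }),
        Input::Items(items) => {
            for item in items {
                match item {
                    InputItem::Message { role, text } => {
                        messages.push(ChatMessage { role, content: text })
                    }
                    // Reasoning items are dropped rather than forwarded.
                    InputItem::Reasoning { .. } => {}
                }
            }
        }
    }
    Ok(messages)
}
```

Rejecting at translate time means the handler can return the 400 before any model lookup or inference setup happens.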
22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1595,6 +1595,49 @@ impl CandleHarness {
         &self,
         request: ChatCompletionRequest,
     ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+        let stream = self.inference_stream(request).await?;
+        Ok(wire_chat::project_chat_stream(
+            stream.events,
+            stream.id,
+            stream.created,
+            stream.model_id,
+        ))
+    }
+
+    /// Streaming OpenAI Responses API entry point. Same harness
+    /// output as [`Self::chat_completion_stream`], projected into
+    /// the named-event SSE frames the Responses API client wants.
+    /// `response_id` and `message_item_id` are stamped into every
+    /// frame so the consumer can correlate.
+    pub async fn responses_stream(
+        &self,
+        request: ChatCompletionRequest,
+        response_id: String,
+        message_item_id: String,
+    ) -> Result<mpsc::Receiver<crate::wire::openai_responses::ResponseStreamFrame>, InferenceError>
+    {
+        let stream = self.inference_stream(request).await?;
+        let meta = crate::wire::openai_responses::ResponseMeta {
+            response_id,
+            created_at: stream.created,
+            model_id: stream.model_id,
+            message_item_id,
+        };
+        Ok(crate::wire::openai_responses::project_responses_stream(
+            stream.events,
+            meta,
+        ))
+    }
+
+    /// Format-agnostic streaming inference. Returns the raw
+    /// [`InferenceEvent`] receiver plus the per-request metadata
+    /// wire projectors stamp onto their frames. Lets every wire
+    /// format land on the same harness output without duplicating
+    /// setup / dispatch / spawn logic.
+    async fn inference_stream(
+        &self,
+        request: ChatCompletionRequest,
+    ) -> Result<InferenceStream, InferenceError> {
         let handle = {
             let models = self.models.read().await;
             models.get(&request.model).cloned()
@@ -1608,7 +1651,7 @@ impl CandleHarness {
             LoadedHandle::Single(m) => m,
             #[cfg(feature = "cuda")]
             LoadedHandle::Tp(m) => {
-                return self.chat_completion_tp_stream(m, request).await;
+                return self.inference_tp_stream(m, request).await;
             }
         };
@@ -1807,16 +1850,39 @@ impl CandleHarness {
             )));
         }

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps receiving
-        // ChatCompletionChunks bit-for-bit identical to before.
-        // The id/created/model_id snapshot taken at request setup
-        // gets stamped into every emitted chunk.
-        let rx = wire_chat::project_chat_stream(event_rx, id, created, model_id);
-        Ok(rx)
+        // Hand the raw event channel back to the public entry
+        // points (chat_completion_stream / responses_stream); they
+        // pick the wire projection.
+        Ok(InferenceStream {
+            events: event_rx,
+            id,
+            created,
+            model_id,
+        })
     }
 }

+/// The seam between inference (one shape, always) and wire formats
+/// (many shapes, projector-per-format). Public so the format
+/// projectors live outside the harness and the harness's
+/// streaming-inference internals stay encapsulated.
+pub struct InferenceStream {
+    /// Stream of model-output events. Producers (the various
+    /// inference loops) emit on this; consumers (wire projectors)
+    /// read from it.
+    pub events: mpsc::Receiver<InferenceEvent>,
+    /// Request id stamped into every wire-format frame
+    /// (`chatcmpl-…` for chat completions; the Responses path
+    /// makes its own `resp_…` id separately and ignores this one).
+    pub id: String,
+    /// Unix seconds when inference began. Same field threads into
+    /// every wire format's `created` / `created_at` slot.
+    pub created: u64,
+    /// Local model id (no endpoint prefix). Stamped into every
+    /// wire-format frame so consumers can correlate.
+    pub model_id: String,
+}
+
 #[async_trait]
 impl Harness for CandleHarness {
     fn name(&self) -> &str {
@@ -2234,11 +2300,11 @@ impl CandleHarness {
     /// So we `tokio::spawn` the orchestration task and use plain
     /// `Sender::send`.
     #[cfg(feature = "cuda")]
-    async fn chat_completion_tp_stream(
+    async fn inference_tp_stream(
         &self,
         tp: Arc<TpLoadedModel>,
         request: ChatCompletionRequest,
-    ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+    ) -> Result<InferenceStream, InferenceError> {
         if tp.poisoned.load(Ordering::Acquire) {
             return Err(poisoned_error(&request.model));
         }
@@ -2542,14 +2608,16 @@ impl CandleHarness {
             .instrument(span),
         );

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps consuming
-        // ChatCompletionChunks unchanged. Uses the clones we
-        // stashed before the spawn — the originals were moved
+        // Hand the raw event channel back to the public entry
+        // points; they pick the wire projection. Uses the clones
+        // we stashed before the spawn — the originals were moved
         // into the orchestration task above.
-        let rx =
-            wire_chat::project_chat_stream(event_rx, projector_id, created, projector_model_id);
-        Ok(rx)
+        Ok(InferenceStream {
+            events: event_rx,
+            id: projector_id,
+            created,
+            model_id: projector_model_id,
+        })
     }
 }
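
The shape of this refactor, one format-agnostic producer with thin wrappers that each pick a wire projection, can be sketched in isolation. This is an illustrative reduction, not the harness code: the synchronous functions, the `std::sync::mpsc` channel, the string "frames", the `tokens` parameter, and the fixed `chatcmpl-demo` id and timestamp are all assumptions made to keep the example self-contained.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical stand-in for the harness's `InferenceEvent`.
pub enum InferenceEvent {
    Delta(String),
    Finish,
}

/// Mirrors the fields of the `InferenceStream` struct in the diff.
pub struct InferenceStream {
    pub events: Receiver<InferenceEvent>,
    pub id: String,
    pub created: u64,
    pub model_id: String,
}

/// Format-agnostic producer. The real one runs a model; this one replays
/// the given tokens with a placeholder id and timestamp.
fn inference_stream(model_id: &str, tokens: &[&str]) -> InferenceStream {
    let (tx, rx) = channel();
    for token in tokens {
        tx.send(InferenceEvent::Delta((*token).to_string()))
            .expect("receiver is still alive");
    }
    tx.send(InferenceEvent::Finish).expect("receiver is still alive");
    InferenceStream {
        events: rx,
        id: "chatcmpl-demo".to_string(),
        created: 1_700_000_000,
        model_id: model_id.to_string(),
    }
}

/// Chat wrapper: stamps the stream's own id into every chunk.
pub fn chat_completion_stream(model_id: &str, tokens: &[&str]) -> Vec<String> {
    let stream = inference_stream(model_id, tokens);
    let mut chunks = Vec::new();
    for event in stream.events.iter() {
        match event {
            InferenceEvent::Delta(text) => chunks.push(format!(
                "{} {} {}: {}",
                stream.id, stream.created, stream.model_id, text
            )),
            InferenceEvent::Finish => break,
        }
    }
    chunks
}

/// Responses wrapper: ignores `stream.id` and stamps the caller's id.
pub fn responses_stream(model_id: &str, tokens: &[&str], response_id: &str) -> Vec<String> {
    let stream = inference_stream(model_id, tokens);
    let mut frames = Vec::new();
    for event in stream.events.iter() {
        match event {
            InferenceEvent::Delta(text) => frames.push(format!(
                "{} {} {}: {}",
                response_id, stream.created, stream.model_id, text
            )),
            InferenceEvent::Finish => break,
        }
    }
    frames
}
```

Both wrappers consume the same event sequence and differ only in what they stamp onto each frame, which is why adding a third wire format should not require touching the producer.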