feat(neuron): OpenAI Responses API + ci cuda-check runner label

Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named SSE
event family. No gateway-side translation.

## Surface

- `cortex-core::responses` envelope types: `ResponsesRequest`,
  `ResponsesInput` (text | items), `ResponsesInputItem` (message |
  function_call | function_call_output | reasoning),
  `ResponsesContentPart` (input_text | input_image | output_text),
  `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
  an `events::*` constant module so the projector and the wire shape
  stay in sync without string-typos.
- `neuron::wire::openai_responses`:
  - `request_to_chat(req)` flattens Responses input + instructions
    into a `ChatCompletionRequest` the candle harness already
    understands. Text-only Parts collapse to a string; mixed
    text+image Parts go to chat's content-array shape; reasoning
    items drop; function_call / function_call_output round-trip via
    tool_calls / tool_call_id metadata so the surface is consistent
    for the day the harness emits tool calls.
  - `project_responses_stream(rx, meta)` reads InferenceEvents and
    emits the eight named events that compose a Responses stream:
    response.created → output_item.added → content_part.added →
    output_text.delta×N → output_text.done → content_part.done →
    output_item.done → response.completed. Synthesises start frames
    if the producer skips Start (poisoned model, early disconnect) so
    the stream stays coherent.
  - `build_response(meta, text, reason, usage)` for the non-streaming
    path.
- `CandleHarness::inference_stream(req)` extracted from
  `chat_completion_stream`, returning a typed `InferenceStream`
  (event receiver + id/created/model_id metadata). Both
  `chat_completion_stream` and the new `responses_stream` are now
  thin wrappers that pick their wire projection. TP path got the same
  treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
  buffered `ResponsesResponse`; streaming returns axum SSE with both
  event names and JSON data per frame (Responses, unlike chat
  completions, uses named `event:` lines). Reused
  `inference_error_response` helper hoisted out so the chat and
  responses handlers share the InferenceError → HTTP mapping.

## CI

Also bundles the `cuda-check` runner-label fix from feedback on commit
1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's
nvcc-version build script blew up. Switched to `runs-on: cuda-13.0`
per the existing labels.

## Scope cuts (documented in the modules)

- `previous_response_id` rejected at translate time with 400
  (`code: chained_conversation_not_supported`): stateful chained
  conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
  `function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress` and the
  web_search / image_generation event families are out of scope.

22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming builder),
4 in neuron's tests/api.rs (route surface: 503 when no candle, 400 on
previous_response_id, 404 on missing model for both stream and
non-stream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
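
For readers who want the frame ordering in one place, here is a minimal sketch of the projection the message describes. It is not the code in `neuron::wire::openai_responses`: the `InferenceEvent` variants, the `Frame` struct, and the `project` function are hypothetical stand-ins, the payloads are placeholder strings rather than the real JSON envelopes, and it works on an in-memory iterator instead of a channel. It only illustrates the eight-event order and the "synthesise start frames if Start was skipped" rule. Handling of a producer that ends without a terminal event is left out.

```rust
/// Hypothetical stand-in for the harness event type; the real enum is not
/// part of this diff.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    Delta(String),
    Done,
}

/// One named SSE frame. `data` is a placeholder string here; the real
/// frames carry JSON.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    pub event: &'static str,
    pub data: String,
}

/// Frames that open a Responses stream, in order.
const START_FRAMES: [&str; 3] = [
    "response.created",
    "response.output_item.added",
    "response.content_part.added",
];

/// Frames that close a Responses stream, in order.
const END_FRAMES: [&str; 4] = [
    "response.output_text.done",
    "response.content_part.done",
    "response.output_item.done",
    "response.completed",
];

/// Project a sequence of inference events into named frames.
pub fn project(events: impl IntoIterator<Item = InferenceEvent>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut started = false;
    let mut text = String::new();

    for ev in events {
        // Emit the start frames on the first event of any kind, so a
        // producer that skipped `Start` still yields a coherent stream.
        if !started {
            started = true;
            for &name in START_FRAMES.iter() {
                out.push(Frame { event: name, data: String::new() });
            }
        }
        match ev {
            // Start frames were already emitted above.
            InferenceEvent::Start => {}
            InferenceEvent::Delta(piece) => {
                text.push_str(&piece);
                out.push(Frame { event: "response.output_text.delta", data: piece });
            }
            InferenceEvent::Done => {
                // Closing frames carry the accumulated text in this sketch.
                for &name in END_FRAMES.iter() {
                    out.push(Frame { event: name, data: text.clone() });
                }
                break;
            }
        }
    }
    out
}
```

Feeding it `[Delta("hi"), Done]` with no leading `Start` still produces all eight event names in order, which is the behaviour the message attributes to the real projector.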
2026-05-31 11:13:44 +03:00
parent 1859777332
commit 957f704efa
8 changed files with 1635 additions and 22 deletions
--- a/crates/neuron/src/harness/candle.rs
+++ b/crates/neuron/src/harness/candle.rs
@@ -1595,6 +1595,49 @@ impl CandleHarness {
        &self,
        request: ChatCompletionRequest,
    ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+        let stream = self.inference_stream(request).await?;
+        Ok(wire_chat::project_chat_stream(
+            stream.events,
+            stream.id,
+            stream.created,
+            stream.model_id,
+        ))
+    }
+
+    /// Streaming OpenAI Responses API entry point. Same harness
+    /// output as [`Self::chat_completion_stream`], projected into
+    /// the named-event SSE frames the Responses API client wants.
+    /// `response_id` and `message_item_id` are stamped into every
+    /// frame so the consumer can correlate.
+    pub async fn responses_stream(
+        &self,
+        request: ChatCompletionRequest,
+        response_id: String,
+        message_item_id: String,
+    ) -> Result<mpsc::Receiver<crate::wire::openai_responses::ResponseStreamFrame>, InferenceError>
+    {
+        let stream = self.inference_stream(request).await?;
+        let meta = crate::wire::openai_responses::ResponseMeta {
+            response_id,
+            created_at: stream.created,
+            model_id: stream.model_id,
+            message_item_id,
+        };
+        Ok(crate::wire::openai_responses::project_responses_stream(
+            stream.events,
+            meta,
+        ))
+    }
+
+    /// Format-agnostic streaming inference. Returns the raw
+    /// [`InferenceEvent`] receiver plus the per-request metadata
+    /// wire projectors stamp onto their frames. Lets every wire
+    /// format land on the same harness output without duplicating
+    /// setup / dispatch / spawn logic.
+    async fn inference_stream(
+        &self,
+        request: ChatCompletionRequest,
+    ) -> Result<InferenceStream, InferenceError> {
        let handle = {
            let models = self.models.read().await;
            models.get(&request.model).cloned()
@@ -1608,7 +1651,7 @@ impl CandleHarness {
            LoadedHandle::Single(m) => m,
            #[cfg(feature = "cuda")]
            LoadedHandle::Tp(m) => {
-                return self.chat_completion_tp_stream(m, request).await;
+                return self.inference_tp_stream(m, request).await;
            }
        };

@@ -1807,16 +1850,39 @@ impl CandleHarness {
            )));
        }

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps receiving
-        // ChatCompletionChunks bit-for-bit identical to before.
-        // The id/created/model_id snapshot taken at request setup
-        // gets stamped into every emitted chunk.
-        let rx = wire_chat::project_chat_stream(event_rx, id, created, model_id);
-        Ok(rx)
+        // Hand the raw event channel back to the public entry
+        // points (chat_completion_stream / responses_stream); they
+        // pick the wire projection.
+        Ok(InferenceStream {
+            events: event_rx,
+            id,
+            created,
+            model_id,
+        })
    }
 }

+/// The seam between inference (one shape, always) and wire formats
+/// (many shapes, projector-per-format). Public so the format
+/// projectors live outside the harness and the harness's
+/// streaming-inference internals stay encapsulated.
+pub struct InferenceStream {
+    /// Stream of model-output events. Producers (the various
+    /// inference loops) emit on this; consumers (wire projectors)
+    /// read from it.
+    pub events: mpsc::Receiver<InferenceEvent>,
+    /// Request id stamped into every wire-format frame
+    /// (`chatcmpl-…` for chat completions; the Responses path
+    /// makes its own `resp_…` id separately and ignores this one).
+    pub id: String,
+    /// Unix seconds when inference began. Same field threads into
+    /// every wire format's `created` / `created_at` slot.
+    pub created: u64,
+    /// Local model id (no endpoint prefix). Stamped into every
+    /// wire-format frame so consumers can correlate.
+    pub model_id: String,
+}
+
 #[async_trait]
 impl Harness for CandleHarness {
    fn name(&self) -> &str {
@@ -2234,11 +2300,11 @@ impl CandleHarness {
    /// So we `tokio::spawn` the orchestration task and use plain
    /// `Sender::send`.
    #[cfg(feature = "cuda")]
-    async fn chat_completion_tp_stream(
+    async fn inference_tp_stream(
        &self,
        tp: Arc<TpLoadedModel>,
        request: ChatCompletionRequest,
-    ) -> Result<mpsc::Receiver<ChatCompletionChunk>, InferenceError> {
+    ) -> Result<InferenceStream, InferenceError> {
        if tp.poisoned.load(Ordering::Acquire) {
            return Err(poisoned_error(&request.model));
        }
@@ -2542,14 +2608,16 @@ impl CandleHarness {
            .instrument(span),
        );

-        // Wrap the InferenceEvent receiver in the OpenAI chat
-        // projection so the HTTP handler keeps consuming
-        // ChatCompletionChunks unchanged. Uses the clones we
-        // stashed before the spawn — the originals were moved
+        // Hand the raw event channel back to the public entry
+        // points; they pick the wire projection. Uses the clones
+        // we stashed before the spawn — the originals were moved
        // into the orchestration task above.
-        let rx =
-            wire_chat::project_chat_stream(event_rx, projector_id, created, projector_model_id);
-        Ok(rx)
+        Ok(InferenceStream {
+            events: event_rx,
+            id: projector_id,
+            created,
+            model_id: projector_model_id,
+        })
    }
 }
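
The seam this diff introduces (one inference stream, one projector per wire format) can be sketched in isolation. This is an illustrative reduction, not the harness code: it uses `std::sync::mpsc` and a thread where the real code uses an async channel and `tokio::spawn`, and the two-variant `InferenceEvent`, the `inference_stream` signature, the `chatcmpl-demo` id, and the string-formatted frames are all assumptions made for the example.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical, simplified event type.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Delta(String),
    Done,
}

/// Same field layout as `InferenceStream` in the diff, over a blocking
/// std channel instead of the async one.
pub struct InferenceStream {
    pub events: mpsc::Receiver<InferenceEvent>,
    pub id: String,
    pub created: u64,
    pub model_id: String,
}

/// Stand-in for the shared setup/spawn path: starts a producer and
/// returns the raw event receiver plus per-request metadata.
pub fn inference_stream(model_id: &str, pieces: Vec<String>) -> InferenceStream {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for piece in pieces {
            // Stop producing if the consumer went away.
            if tx.send(InferenceEvent::Delta(piece)).is_err() {
                return;
            }
        }
        let _ = tx.send(InferenceEvent::Done);
    });
    InferenceStream {
        events: rx,
        id: "chatcmpl-demo".to_string(),
        created: 0,
        model_id: model_id.to_string(),
    }
}

/// Chat-style projection: stamps the stream's own id onto each frame.
pub fn project_chat(stream: InferenceStream) -> Vec<String> {
    let mut out = Vec::new();
    for ev in stream.events.iter() {
        match ev {
            InferenceEvent::Delta(t) => {
                out.push(format!("{} {} {}", stream.id, stream.model_id, t));
            }
            InferenceEvent::Done => break,
        }
    }
    out
}

/// Responses-style projection: ignores `stream.id` and uses a
/// caller-supplied response id, emitting a named event per delta.
pub fn project_responses(stream: InferenceStream, response_id: &str) -> Vec<String> {
    let mut out = Vec::new();
    for ev in stream.events.iter() {
        match ev {
            InferenceEvent::Delta(t) => out.push(format!(
                "response.output_text.delta {} {} {}",
                response_id, stream.model_id, t
            )),
            InferenceEvent::Done => break,
        }
    }
    out
}
```

Both projectors consume the same `InferenceStream` value, so adding another wire format in this shape would mean adding another projector function rather than touching the producer.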