feat(neuron): OpenAI Responses API + ci cuda-check runner label

Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit 1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:13:44 +03:00
parent 1859777332
commit 957f704efa
8 changed files with 1635 additions and 22 deletions
--- a/crates/neuron/src/api.rs
+++ b/crates/neuron/src/api.rs
@@ -4,6 +4,7 @@ use crate::activation::ActivationTracker;
 use crate::harness::HarnessRegistry;
 use crate::harness::candle::{CandleHarness, InferenceError};
 use crate::health::HealthCache;
+use crate::wire::openai_responses;
 use axum::Router;
 use axum::extract::{Path, State};
 use axum::http::StatusCode;
@@ -12,11 +13,13 @@ use axum::response::{IntoResponse, Json};
 use axum::routing::{get, post};
 use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
 use cortex_core::harness::ModelSpec;
-use cortex_core::openai::ChatCompletionRequest;
+use cortex_core::openai::{ChatCompletionRequest, MessageContent};
+use cortex_core::responses::{ResponsesRequest, ResponsesUsage};
 use futures::stream::{self, StreamExt};
 use serde_json::{Value, json};
 use std::convert::Infallible;
 use std::sync::Arc;
+use std::time::{SystemTime, UNIX_EPOCH};
 use tokio::sync::RwLock;
 use tokio_stream::wrappers::ReceiverStream;

@@ -44,6 +47,7 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
        .route("/models/unload", post(unload_model))
        .route("/models/{model_id}/endpoint", get(model_endpoint))
        .route("/v1/chat/completions", post(chat_completions))
+        .route("/v1/responses", post(responses))
 }

 async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
@@ -246,3 +250,187 @@ async fn chat_completions(
        }
    }
 }
+
+/// OpenAI Responses API (`POST /v1/responses`). Translates the
+/// Responses-shaped request into a chat-completions one the candle
+/// harness already understands, then re-projects the harness's
+/// event stream into the Responses event family.
+async fn responses(
+    State(state): State<Arc<NeuronState>>,
+    Json(req): Json<ResponsesRequest>,
+) -> impl IntoResponse {
+    let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
+        return (
+            StatusCode::SERVICE_UNAVAILABLE,
+            Json(json!({"error": "candle harness not enabled on this neuron"})),
+        )
+            .into_response();
+    };
+
+    let stream_requested = req.stream;
+    let model_id = req.model.clone();
+    let response_id = mint_response_id();
+    let message_item_id = mint_message_item_id();
+
+    // Translate Responses → chat completions. The only failure
+    // mode today is `previous_response_id` set, which we reject
+    // with 400 — stateful conversations need a persistence layer
+    // we haven't built.
+    let mut chat_req = match openai_responses::request_to_chat(req) {
+        Ok(r) => r,
+        Err(openai_responses::TranslateError::ChainedConversationNotSupported) => {
+            return (
+                StatusCode::BAD_REQUEST,
+                Json(json!({
+                    "error": "previous_response_id is not supported on this neuron",
+                    "code": "chained_conversation_not_supported"
+                })),
+            )
+                .into_response();
+        }
+    };
+    chat_req.stream = Some(stream_requested);
+
+    if stream_requested {
+        match candle
+            .responses_stream(chat_req, response_id, message_item_id)
+            .await
+        {
+            Ok(rx) => {
+                // Each ResponseStreamFrame → one SSE event carrying
+                // both an event name and JSON data. The Responses
+                // API doesn't use a `[DONE]` terminator — clients
+                // see the `response.completed` event as the end of
+                // the stream.
+                let body_stream = ReceiverStream::new(rx).map(|frame| {
+                    let body = serde_json::to_string(&frame.data).unwrap_or_else(|_| "{}".into());
+                    Ok::<_, Infallible>(Event::default().event(frame.event_name).data(body))
+                });
+                Sse::new(body_stream)
+                    .keep_alive(KeepAlive::default())
+                    .into_response()
+            }
+            Err(e) => inference_error_response(e),
+        }
+    } else {
+        // Non-streaming: drive the existing chat completion path
+        // and translate the result. We don't currently re-tokenise
+        // to compute usage; the harness returns it via the chat
+        // response and we pass it through.
+        match candle.chat_completion(chat_req).await {
+            Ok(chat_resp) => {
+                // Extract the assistant text (chat completions
+                // always emits one choice on the candle path).
+                let text = chat_resp
+                    .choices
+                    .first()
+                    .map(|c| match &c.message.content {
+                        MessageContent::Text(t) => t.clone(),
+                        MessageContent::Parts(_) => {
+                            // Candle output is always text today;
+                            // a Parts response would be surprising.
+                            // Empty-string fallback is safer than
+                            // a panic.
+                            String::new()
+                        }
+                    })
+                    .unwrap_or_default();
+                let finish = chat_resp
+                    .choices
+                    .first()
+                    .and_then(|c| c.finish_reason.as_deref())
+                    .map(finish_reason_from_str)
+                    .unwrap_or(crate::wire::FinishReason::Stop);
+                let usage = chat_resp.usage.as_ref().map(|u| ResponsesUsage {
+                    input_tokens: u.prompt_tokens,
+                    output_tokens: u.completion_tokens,
+                    total_tokens: u.prompt_tokens + u.completion_tokens,
+                });
+                let meta = openai_responses::ResponseMeta {
+                    response_id: mint_response_id(),
+                    created_at: unix_now_secs(),
+                    model_id,
+                    message_item_id: mint_message_item_id(),
+                };
+                let _ = chat_resp; // make the borrow-checker happy if `text` consumed it
+                let resp = openai_responses::build_response(&meta, text, finish, usage);
+                Json(resp).into_response()
+            }
+            Err(e) => inference_error_response(e),
+        }
+    }
+}
+
+fn finish_reason_from_str(s: &str) -> crate::wire::FinishReason {
+    use crate::wire::FinishReason;
+    match s {
+        "length" => FinishReason::Length,
+        "tool_calls" => FinishReason::ToolCalls,
+        _ => FinishReason::Stop,
+    }
+}
+
+/// Centralised mapping from [`InferenceError`] to an HTTP response.
+/// Lifted out so the chat-completions and responses handlers stay
+/// readable and changes to error-code semantics happen in one spot.
+fn inference_error_response(err: InferenceError) -> axum::response::Response {
+    match err {
+        InferenceError::ModelNotLoaded(id) => (
+            StatusCode::NOT_FOUND,
+            Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
+        )
+            .into_response(),
+        InferenceError::PromptTooLong { prompt_len, max } => (
+            StatusCode::BAD_REQUEST,
+            Json(json!({
+                "error": format!("prompt has {prompt_len} tokens but max is {max}"),
+                "code": "prompt_too_long",
+                "prompt_len": prompt_len,
+                "max": max,
+            })),
+        )
+            .into_response(),
+        InferenceError::InsufficientVram {
+            free_mb,
+            required_mb,
+        } => (
+            StatusCode::SERVICE_UNAVAILABLE,
+            Json(json!({
+                "error": format!(
+                    "insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
+                ),
+                "code": "insufficient_vram",
+                "free_mb": free_mb,
+                "required_mb": required_mb,
+            })),
+        )
+            .into_response(),
+        InferenceError::Other(e) => (
+            StatusCode::INTERNAL_SERVER_ERROR,
+            Json(json!({"error": format!("{e:#}")})),
+        )
+            .into_response(),
+    }
+}
+
+fn mint_response_id() -> String {
+    format!("resp_{:x}", unix_subsec_nanos())
+}
+
+fn mint_message_item_id() -> String {
+    format!("msg_{:x}", unix_subsec_nanos())
+}
+
+fn unix_now_secs() -> u64 {
+    SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .map(|d| d.as_secs())
+        .unwrap_or(0)
+}
+
+fn unix_subsec_nanos() -> u64 {
+    SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .map(|d| d.as_nanos() as u64)
+        .unwrap_or(0)
+}