feat(neuron): OpenAI Responses API + ci cuda-check runner label

Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit 1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 11:13:44 +03:00
parent 1859777332
commit 957f704efa
8 changed files with 1635 additions and 22 deletions
--- a/crates/neuron/tests/api.rs
+++ b/crates/neuron/tests/api.rs
@@ -322,3 +322,168 @@ async fn test_chat_completions_streaming_model_not_loaded() {
        .unwrap();
    assert_eq!(resp.status(), 404);
 }
+
+// ── /v1/responses ────────────────────────────────────────────────────
+
+/// `/v1/responses` returns 503 when no candle harness is registered —
+/// matches the chat-completions error shape so a client can swap
+/// endpoints without re-handling 503s.
+#[tokio::test]
+async fn test_responses_no_candle_harness() {
+    let registry = HarnessRegistry::new();
+    let health_cache = Arc::new(HealthCache::new());
+    let state = Arc::new(NeuronState {
+        discovery: fake_discovery(),
+        health_cache,
+        registry: RwLock::new(registry),
+        candle: None,
+        activation: Arc::new(ActivationTracker::new(&[])),
+    });
+    let app = api::neuron_routes().with_state(state);
+    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
+    let addr = listener.local_addr().unwrap();
+    tokio::spawn(async move {
+        axum::serve(listener, app).await.unwrap();
+    });
+    let url = format!("http://{addr}");
+
+    let resp = reqwest::Client::new()
+        .post(format!("{url}/v1/responses"))
+        .json(&json!({"model": "anything", "input": "hi"}))
+        .send()
+        .await
+        .unwrap();
+    assert_eq!(resp.status(), 503);
+}
+
+/// `previous_response_id` is rejected at translate time with 400 —
+/// we don't store responses server-side yet, so chained
+/// conversations can't be honoured.
+#[tokio::test]
+async fn test_responses_rejects_previous_response_id() {
+    use cortex_core::harness::HarnessConfig;
+    use neuron::config::HarnessSettings;
+
+    let registry = HarnessRegistry::from_configs(
+        &[HarnessConfig {
+            name: "candle".into(),
+        }],
+        "http://localhost:0",
+        &HarnessSettings::default(),
+    );
+    let candle = registry.candle();
+    let health_cache = Arc::new(HealthCache::new());
+    let state = Arc::new(NeuronState {
+        discovery: fake_discovery(),
+        health_cache,
+        registry: RwLock::new(registry),
+        candle,
+        activation: Arc::new(ActivationTracker::new(&[])),
+    });
+    let app = api::neuron_routes().with_state(state);
+    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
+    let addr = listener.local_addr().unwrap();
+    tokio::spawn(async move {
+        axum::serve(listener, app).await.unwrap();
+    });
+    let url = format!("http://{addr}");
+
+    let resp = reqwest::Client::new()
+        .post(format!("{url}/v1/responses"))
+        .json(&json!({
+            "model": "anything",
+            "input": "hi",
+            "previous_response_id": "resp_prev_42"
+        }))
+        .send()
+        .await
+        .unwrap();
+    assert_eq!(resp.status(), 400);
+    let body: serde_json::Value = resp.json().await.unwrap();
+    assert_eq!(body["code"], "chained_conversation_not_supported");
+}
+
+/// `/v1/responses` returns 404 when the model isn't loaded — same
+/// surface as chat completions.
+#[tokio::test]
+async fn test_responses_model_not_loaded() {
+    use cortex_core::harness::HarnessConfig;
+    use neuron::config::HarnessSettings;
+
+    let registry = HarnessRegistry::from_configs(
+        &[HarnessConfig {
+            name: "candle".into(),
+        }],
+        "http://localhost:0",
+        &HarnessSettings::default(),
+    );
+    let candle = registry.candle();
+    let health_cache = Arc::new(HealthCache::new());
+    let state = Arc::new(NeuronState {
+        discovery: fake_discovery(),
+        health_cache,
+        registry: RwLock::new(registry),
+        candle,
+        activation: Arc::new(ActivationTracker::new(&[])),
+    });
+    let app = api::neuron_routes().with_state(state);
+    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
+    let addr = listener.local_addr().unwrap();
+    tokio::spawn(async move {
+        axum::serve(listener, app).await.unwrap();
+    });
+    let url = format!("http://{addr}");
+
+    let resp = reqwest::Client::new()
+        .post(format!("{url}/v1/responses"))
+        .json(&json!({"model": "not-loaded", "input": "hi"}))
+        .send()
+        .await
+        .unwrap();
+    assert_eq!(resp.status(), 404);
+}
+
+/// Same model-not-loaded surface on the streaming path. The
+/// stream is opened only after model lookup succeeds, so a
+/// missing model fails fast with a non-SSE 404 response.
+#[tokio::test]
+async fn test_responses_streaming_model_not_loaded() {
+    use cortex_core::harness::HarnessConfig;
+    use neuron::config::HarnessSettings;
+
+    let registry = HarnessRegistry::from_configs(
+        &[HarnessConfig {
+            name: "candle".into(),
+        }],
+        "http://localhost:0",
+        &HarnessSettings::default(),
+    );
+    let candle = registry.candle();
+    let health_cache = Arc::new(HealthCache::new());
+    let state = Arc::new(NeuronState {
+        discovery: fake_discovery(),
+        health_cache,
+        registry: RwLock::new(registry),
+        candle,
+        activation: Arc::new(ActivationTracker::new(&[])),
+    });
+    let app = api::neuron_routes().with_state(state);
+    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
+    let addr = listener.local_addr().unwrap();
+    tokio::spawn(async move {
+        axum::serve(listener, app).await.unwrap();
+    });
+    let url = format!("http://{addr}");
+
+    let resp = reqwest::Client::new()
+        .post(format!("{url}/v1/responses"))
+        .json(&json!({
+            "model": "not-loaded",
+            "input": "hi",
+            "stream": true
+        }))
+        .send()
+        .await
+        .unwrap();
+    assert_eq!(resp.status(), 404);
+}