feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled

Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.

## Surface

- `cortex-core::responses` envelope types: `ResponsesRequest`,
  `ResponsesInput` (text | items), `ResponsesInputItem` (message |
  function_call | function_call_output | reasoning),
  `ResponsesContentPart` (input_text | input_image | output_text),
  `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
  a `events::*` constant module so the projector and the wire shape
  stay in sync without string-typos.

- `neuron::wire::openai_responses`:
  - `request_to_chat(req)` flattens Responses input + instructions
    into a `ChatCompletionRequest` the candle harness already
    understands. Text-only Parts collapse to a string; mixed
    text+image Parts go to chat's content-array shape; reasoning
    items drop; function_call / function_call_output round-trip
    via tool_calls / tool_call_id metadata so the surface is
    consistent for the day the harness emits tool calls.
  - `project_responses_stream(rx, meta)` reads InferenceEvents
    and emits the eight named events that compose a Responses
    stream: response.created → output_item.added → content_part.added
    → output_text.delta×N → output_text.done → content_part.done
    → output_item.done → response.completed. Synthesises start
    frames if the producer skips Start (poisoned model, early
    disconnect) so the stream stays coherent.
  - `build_response(meta, text, reason, usage)` for the
    non-streaming path.

- `CandleHarness::inference_stream(req)` extracted from
  `chat_completion_stream`, returning a typed `InferenceStream`
  (event receiver + id/created/model_id metadata). Both
  `chat_completion_stream` and the new `responses_stream` are now
  thin wrappers that pick their wire projection. TP path got the
  same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).

- `POST /v1/responses` route on neuron. Non-streaming returns one
  buffered `ResponsesResponse`; streaming returns axum SSE with
  both event names and JSON data per frame (Responses, unlike
  chat completions, uses named `event:` lines). Reused
  `inference_error_response` helper hoisted out so the chat and
  responses handlers share the InferenceError → HTTP mapping.

## CI

Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so
cudarc's nvcc-version build script blew up. Switched to
`runs-on: cuda-13.0` per the existing labels.

## Scope cuts (documented in the modules)

- `previous_response_id` rejected at translate time with 400
  (`code: chained_conversation_not_supported`) — stateful chained
  conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
  `function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
  and the web_search / image_generation event families are
  out-of-scope.

22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-31 11:13:44 +03:00
parent 1859777332
commit 957f704efa
8 changed files with 1635 additions and 22 deletions

View File

@@ -4,6 +4,7 @@ use crate::activation::ActivationTracker;
use crate::harness::HarnessRegistry;
use crate::harness::candle::{CandleHarness, InferenceError};
use crate::health::HealthCache;
use crate::wire::openai_responses;
use axum::Router;
use axum::extract::{Path, State};
use axum::http::StatusCode;
@@ -12,11 +13,13 @@ use axum::response::{IntoResponse, Json};
use axum::routing::{get, post};
use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
use cortex_core::harness::ModelSpec;
use cortex_core::openai::ChatCompletionRequest;
use cortex_core::openai::{ChatCompletionRequest, MessageContent};
use cortex_core::responses::{ResponsesRequest, ResponsesUsage};
use futures::stream::{self, StreamExt};
use serde_json::{Value, json};
use std::convert::Infallible;
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};
use tokio::sync::RwLock;
use tokio_stream::wrappers::ReceiverStream;
@@ -44,6 +47,7 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
.route("/models/unload", post(unload_model))
.route("/models/{model_id}/endpoint", get(model_endpoint))
.route("/v1/chat/completions", post(chat_completions))
.route("/v1/responses", post(responses))
}
async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
@@ -246,3 +250,187 @@ async fn chat_completions(
}
}
}
/// OpenAI Responses API (`POST /v1/responses`). Translates the
/// Responses-shaped request into a chat-completions one the candle
/// harness already understands, then re-projects the harness's
/// event stream into the Responses event family.
async fn responses(
State(state): State<Arc<NeuronState>>,
Json(req): Json<ResponsesRequest>,
) -> impl IntoResponse {
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
return (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({"error": "candle harness not enabled on this neuron"})),
)
.into_response();
};
let stream_requested = req.stream;
let model_id = req.model.clone();
let response_id = mint_response_id();
let message_item_id = mint_message_item_id();
// Translate Responses → chat completions. The only failure
// mode today is `previous_response_id` set, which we reject
// with 400 — stateful conversations need a persistence layer
// we haven't built.
let mut chat_req = match openai_responses::request_to_chat(req) {
Ok(r) => r,
Err(openai_responses::TranslateError::ChainedConversationNotSupported) => {
return (
StatusCode::BAD_REQUEST,
Json(json!({
"error": "previous_response_id is not supported on this neuron",
"code": "chained_conversation_not_supported"
})),
)
.into_response();
}
};
chat_req.stream = Some(stream_requested);
if stream_requested {
match candle
.responses_stream(chat_req, response_id, message_item_id)
.await
{
Ok(rx) => {
// Each ResponseStreamFrame → one SSE event carrying
// both an event name and JSON data. The Responses
// API doesn't use a `[DONE]` terminator — clients
// see the `response.completed` event as the end of
// the stream.
let body_stream = ReceiverStream::new(rx).map(|frame| {
let body = serde_json::to_string(&frame.data).unwrap_or_else(|_| "{}".into());
Ok::<_, Infallible>(Event::default().event(frame.event_name).data(body))
});
Sse::new(body_stream)
.keep_alive(KeepAlive::default())
.into_response()
}
Err(e) => inference_error_response(e),
}
} else {
// Non-streaming: drive the existing chat completion path
// and translate the result. We don't currently re-tokenise
// to compute usage; the harness returns it via the chat
// response and we pass it through.
match candle.chat_completion(chat_req).await {
Ok(chat_resp) => {
// Extract the assistant text (chat completions
// always emits one choice on the candle path).
let text = chat_resp
.choices
.first()
.map(|c| match &c.message.content {
MessageContent::Text(t) => t.clone(),
MessageContent::Parts(_) => {
// Candle output is always text today;
// a Parts response would be surprising.
// Empty-string fallback is safer than
// a panic.
String::new()
}
})
.unwrap_or_default();
let finish = chat_resp
.choices
.first()
.and_then(|c| c.finish_reason.as_deref())
.map(finish_reason_from_str)
.unwrap_or(crate::wire::FinishReason::Stop);
let usage = chat_resp.usage.as_ref().map(|u| ResponsesUsage {
input_tokens: u.prompt_tokens,
output_tokens: u.completion_tokens,
total_tokens: u.prompt_tokens + u.completion_tokens,
});
let meta = openai_responses::ResponseMeta {
response_id: mint_response_id(),
created_at: unix_now_secs(),
model_id,
message_item_id: mint_message_item_id(),
};
let _ = chat_resp; // make the borrow-checker happy if `text` consumed it
let resp = openai_responses::build_response(&meta, text, finish, usage);
Json(resp).into_response()
}
Err(e) => inference_error_response(e),
}
}
}
fn finish_reason_from_str(s: &str) -> crate::wire::FinishReason {
use crate::wire::FinishReason;
match s {
"length" => FinishReason::Length,
"tool_calls" => FinishReason::ToolCalls,
_ => FinishReason::Stop,
}
}
/// Centralised mapping from [`InferenceError`] to an HTTP response.
/// Lifted out so the chat-completions and responses handlers stay
/// readable and changes to error-code semantics happen in one spot.
fn inference_error_response(err: InferenceError) -> axum::response::Response {
match err {
InferenceError::ModelNotLoaded(id) => (
StatusCode::NOT_FOUND,
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
)
.into_response(),
InferenceError::PromptTooLong { prompt_len, max } => (
StatusCode::BAD_REQUEST,
Json(json!({
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
"code": "prompt_too_long",
"prompt_len": prompt_len,
"max": max,
})),
)
.into_response(),
InferenceError::InsufficientVram {
free_mb,
required_mb,
} => (
StatusCode::SERVICE_UNAVAILABLE,
Json(json!({
"error": format!(
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
),
"code": "insufficient_vram",
"free_mb": free_mb,
"required_mb": required_mb,
})),
)
.into_response(),
InferenceError::Other(e) => (
StatusCode::INTERNAL_SERVER_ERROR,
Json(json!({"error": format!("{e:#}")})),
)
.into_response(),
}
}
fn mint_response_id() -> String {
format!("resp_{:x}", unix_subsec_nanos())
}
fn mint_message_item_id() -> String {
format!("msg_{:x}", unix_subsec_nanos())
}
fn unix_now_secs() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_secs())
.unwrap_or(0)
}
fn unix_subsec_nanos() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_nanos() as u64)
.unwrap_or(0)
}