feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.
## Surface
- `cortex-core::responses` envelope types: `ResponsesRequest`,
`ResponsesInput` (text | items), `ResponsesInputItem` (message |
function_call | function_call_output | reasoning),
`ResponsesContentPart` (input_text | input_image | output_text),
`ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
a `events::*` constant module so the projector and the wire shape
stay in sync without string-typos.
- `neuron::wire::openai_responses`:
- `request_to_chat(req)` flattens Responses input + instructions
into a `ChatCompletionRequest` the candle harness already
understands. Text-only Parts collapse to a string; mixed
text+image Parts go to chat's content-array shape; reasoning
items drop; function_call / function_call_output round-trip
via tool_calls / tool_call_id metadata so the surface is
consistent for the day the harness emits tool calls.
- `project_responses_stream(rx, meta)` reads InferenceEvents
and emits the eight named events that compose a Responses
stream: response.created → output_item.added → content_part.added
→ output_text.delta×N → output_text.done → content_part.done
→ output_item.done → response.completed. Synthesises start
frames if the producer skips Start (poisoned model, early
disconnect) so the stream stays coherent.
- `build_response(meta, text, reason, usage)` for the
non-streaming path.
- `CandleHarness::inference_stream(req)` extracted from
`chat_completion_stream`, returning a typed `InferenceStream`
(event receiver + id/created/model_id metadata). Both
`chat_completion_stream` and the new `responses_stream` are now
thin wrappers that pick their wire projection. TP path got the
same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
buffered `ResponsesResponse`; streaming returns axum SSE with
both event names and JSON data per frame (Responses, unlike
chat completions, uses named `event:` lines). Reused
`inference_error_response` helper hoisted out so the chat and
responses handlers share the InferenceError → HTTP mapping.
## CI
Also bundles the `cuda-check` runner-label fix from feedback on
commit 1859777: `runs-on: rpm` doesn't ship the CUDA toolkit so
cudarc's nvcc-version build script blew up. Switched to
`runs-on: cuda-13.0` per the existing labels.
## Scope cuts (documented in the modules)
- `previous_response_id` rejected at translate time with 400
(`code: chained_conversation_not_supported`) — stateful chained
conversations need a persistence layer we haven't built.
- Reasoning items dropped (no Qwen3 `<think>` routing yet).
- Single output item per response (one `"message"` carrying text);
`function_call` items reserved but not synthesised.
- Streaming events cover the core set; `response.in_progress`
and the web_search / image_generation event families are
out-of-scope.
22 new tests: 5 in cortex-core (envelope round-trips), 13 in
neuron::wire (request translator + projector + non-streaming
builder), 4 in neuron's tests/api.rs (route surface — 503 when no
candle, 400 on previous_response_id, 404 on missing model for
both stream and non-stream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -4,6 +4,7 @@ use crate::activation::ActivationTracker;
|
||||
use crate::harness::HarnessRegistry;
|
||||
use crate::harness::candle::{CandleHarness, InferenceError};
|
||||
use crate::health::HealthCache;
|
||||
use crate::wire::openai_responses;
|
||||
use axum::Router;
|
||||
use axum::extract::{Path, State};
|
||||
use axum::http::StatusCode;
|
||||
@@ -12,11 +13,13 @@ use axum::response::{IntoResponse, Json};
|
||||
use axum::routing::{get, post};
|
||||
use cortex_core::discovery::{DiscoveryResponse, HealthResponse};
|
||||
use cortex_core::harness::ModelSpec;
|
||||
use cortex_core::openai::ChatCompletionRequest;
|
||||
use cortex_core::openai::{ChatCompletionRequest, MessageContent};
|
||||
use cortex_core::responses::{ResponsesRequest, ResponsesUsage};
|
||||
use futures::stream::{self, StreamExt};
|
||||
use serde_json::{Value, json};
|
||||
use std::convert::Infallible;
|
||||
use std::sync::Arc;
|
||||
use std::time::{SystemTime, UNIX_EPOCH};
|
||||
use tokio::sync::RwLock;
|
||||
use tokio_stream::wrappers::ReceiverStream;
|
||||
|
||||
@@ -44,6 +47,7 @@ pub fn neuron_routes() -> Router<Arc<NeuronState>> {
|
||||
.route("/models/unload", post(unload_model))
|
||||
.route("/models/{model_id}/endpoint", get(model_endpoint))
|
||||
.route("/v1/chat/completions", post(chat_completions))
|
||||
.route("/v1/responses", post(responses))
|
||||
}
|
||||
|
||||
async fn discovery_handler(State(state): State<Arc<NeuronState>>) -> Json<DiscoveryResponse> {
|
||||
@@ -246,3 +250,187 @@ async fn chat_completions(
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// OpenAI Responses API (`POST /v1/responses`). Translates the
|
||||
/// Responses-shaped request into a chat-completions one the candle
|
||||
/// harness already understands, then re-projects the harness's
|
||||
/// event stream into the Responses event family.
|
||||
async fn responses(
|
||||
State(state): State<Arc<NeuronState>>,
|
||||
Json(req): Json<ResponsesRequest>,
|
||||
) -> impl IntoResponse {
|
||||
let Some(candle) = state.candle.as_ref().map(Arc::clone) else {
|
||||
return (
|
||||
StatusCode::SERVICE_UNAVAILABLE,
|
||||
Json(json!({"error": "candle harness not enabled on this neuron"})),
|
||||
)
|
||||
.into_response();
|
||||
};
|
||||
|
||||
let stream_requested = req.stream;
|
||||
let model_id = req.model.clone();
|
||||
let response_id = mint_response_id();
|
||||
let message_item_id = mint_message_item_id();
|
||||
|
||||
// Translate Responses → chat completions. The only failure
|
||||
// mode today is `previous_response_id` set, which we reject
|
||||
// with 400 — stateful conversations need a persistence layer
|
||||
// we haven't built.
|
||||
let mut chat_req = match openai_responses::request_to_chat(req) {
|
||||
Ok(r) => r,
|
||||
Err(openai_responses::TranslateError::ChainedConversationNotSupported) => {
|
||||
return (
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(json!({
|
||||
"error": "previous_response_id is not supported on this neuron",
|
||||
"code": "chained_conversation_not_supported"
|
||||
})),
|
||||
)
|
||||
.into_response();
|
||||
}
|
||||
};
|
||||
chat_req.stream = Some(stream_requested);
|
||||
|
||||
if stream_requested {
|
||||
match candle
|
||||
.responses_stream(chat_req, response_id, message_item_id)
|
||||
.await
|
||||
{
|
||||
Ok(rx) => {
|
||||
// Each ResponseStreamFrame → one SSE event carrying
|
||||
// both an event name and JSON data. The Responses
|
||||
// API doesn't use a `[DONE]` terminator — clients
|
||||
// see the `response.completed` event as the end of
|
||||
// the stream.
|
||||
let body_stream = ReceiverStream::new(rx).map(|frame| {
|
||||
let body = serde_json::to_string(&frame.data).unwrap_or_else(|_| "{}".into());
|
||||
Ok::<_, Infallible>(Event::default().event(frame.event_name).data(body))
|
||||
});
|
||||
Sse::new(body_stream)
|
||||
.keep_alive(KeepAlive::default())
|
||||
.into_response()
|
||||
}
|
||||
Err(e) => inference_error_response(e),
|
||||
}
|
||||
} else {
|
||||
// Non-streaming: drive the existing chat completion path
|
||||
// and translate the result. We don't currently re-tokenise
|
||||
// to compute usage; the harness returns it via the chat
|
||||
// response and we pass it through.
|
||||
match candle.chat_completion(chat_req).await {
|
||||
Ok(chat_resp) => {
|
||||
// Extract the assistant text (chat completions
|
||||
// always emits one choice on the candle path).
|
||||
let text = chat_resp
|
||||
.choices
|
||||
.first()
|
||||
.map(|c| match &c.message.content {
|
||||
MessageContent::Text(t) => t.clone(),
|
||||
MessageContent::Parts(_) => {
|
||||
// Candle output is always text today;
|
||||
// a Parts response would be surprising.
|
||||
// Empty-string fallback is safer than
|
||||
// a panic.
|
||||
String::new()
|
||||
}
|
||||
})
|
||||
.unwrap_or_default();
|
||||
let finish = chat_resp
|
||||
.choices
|
||||
.first()
|
||||
.and_then(|c| c.finish_reason.as_deref())
|
||||
.map(finish_reason_from_str)
|
||||
.unwrap_or(crate::wire::FinishReason::Stop);
|
||||
let usage = chat_resp.usage.as_ref().map(|u| ResponsesUsage {
|
||||
input_tokens: u.prompt_tokens,
|
||||
output_tokens: u.completion_tokens,
|
||||
total_tokens: u.prompt_tokens + u.completion_tokens,
|
||||
});
|
||||
let meta = openai_responses::ResponseMeta {
|
||||
response_id: mint_response_id(),
|
||||
created_at: unix_now_secs(),
|
||||
model_id,
|
||||
message_item_id: mint_message_item_id(),
|
||||
};
|
||||
let _ = chat_resp; // make the borrow-checker happy if `text` consumed it
|
||||
let resp = openai_responses::build_response(&meta, text, finish, usage);
|
||||
Json(resp).into_response()
|
||||
}
|
||||
Err(e) => inference_error_response(e),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn finish_reason_from_str(s: &str) -> crate::wire::FinishReason {
|
||||
use crate::wire::FinishReason;
|
||||
match s {
|
||||
"length" => FinishReason::Length,
|
||||
"tool_calls" => FinishReason::ToolCalls,
|
||||
_ => FinishReason::Stop,
|
||||
}
|
||||
}
|
||||
|
||||
/// Centralised mapping from [`InferenceError`] to an HTTP response.
|
||||
/// Lifted out so the chat-completions and responses handlers stay
|
||||
/// readable and changes to error-code semantics happen in one spot.
|
||||
fn inference_error_response(err: InferenceError) -> axum::response::Response {
|
||||
match err {
|
||||
InferenceError::ModelNotLoaded(id) => (
|
||||
StatusCode::NOT_FOUND,
|
||||
Json(json!({"error": format!("model '{id}' not loaded on this neuron")})),
|
||||
)
|
||||
.into_response(),
|
||||
InferenceError::PromptTooLong { prompt_len, max } => (
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(json!({
|
||||
"error": format!("prompt has {prompt_len} tokens but max is {max}"),
|
||||
"code": "prompt_too_long",
|
||||
"prompt_len": prompt_len,
|
||||
"max": max,
|
||||
})),
|
||||
)
|
||||
.into_response(),
|
||||
InferenceError::InsufficientVram {
|
||||
free_mb,
|
||||
required_mb,
|
||||
} => (
|
||||
StatusCode::SERVICE_UNAVAILABLE,
|
||||
Json(json!({
|
||||
"error": format!(
|
||||
"insufficient free VRAM: {free_mb} MiB free, need at least {required_mb} MiB"
|
||||
),
|
||||
"code": "insufficient_vram",
|
||||
"free_mb": free_mb,
|
||||
"required_mb": required_mb,
|
||||
})),
|
||||
)
|
||||
.into_response(),
|
||||
InferenceError::Other(e) => (
|
||||
StatusCode::INTERNAL_SERVER_ERROR,
|
||||
Json(json!({"error": format!("{e:#}")})),
|
||||
)
|
||||
.into_response(),
|
||||
}
|
||||
}
|
||||
|
||||
fn mint_response_id() -> String {
|
||||
format!("resp_{:x}", unix_subsec_nanos())
|
||||
}
|
||||
|
||||
fn mint_message_item_id() -> String {
|
||||
format!("msg_{:x}", unix_subsec_nanos())
|
||||
}
|
||||
|
||||
fn unix_now_secs() -> u64 {
|
||||
SystemTime::now()
|
||||
.duration_since(UNIX_EPOCH)
|
||||
.map(|d| d.as_secs())
|
||||
.unwrap_or(0)
|
||||
}
|
||||
|
||||
fn unix_subsec_nanos() -> u64 {
|
||||
SystemTime::now()
|
||||
.duration_since(UNIX_EPOCH)
|
||||
.map(|d| d.as_nanos() as u64)
|
||||
.unwrap_or(0)
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user