Responses API: surface Qwen3 <think> blocks as reasoning items
#5
Scope cut from Step 2 (commit 957f704)
InferenceEvent::ReasoningDelta (crates/neuron/src/wire/event.rs) is defined but never emitted by the candle harness. As a result:
response.reasoning.* event family that should carry this).
Qwen3 emits <think>…</think> blocks inline in its content stream today, and the harness folds them into plain InferenceEvent::TextDeltas indistinguishable from real assistant output. Downstream consumers (helexa-acp, the Responses client) parse the marker tags themselves.
See crates/neuron/src/harness/candle.rs::run_inference_streaming (CPU), stream_inference_via_worker (CUDA single-GPU), and chat_completion_tp_stream (CUDA TP) for the three sites that emit TextDelta.
Why it was cut
Splitting <think> from regular content requires a tag-state machine inside the inference loop (or one layer up, before emit_delta). The existing detokeniser doesn't know about model-specific markers. Adding it touched three hot loops in candle.rs that we wanted to leave alone for Step 1's refactor and Step 2's new surface.
What implementation looks like
neuron::wire::event: a small struct that takes incremental text chunks and emits a stream of (Visible | Reasoning, String) tuples. Handles tag boundaries split across decoder steps (the same byte-streaming concern tokenizers::DecodeStream already solves for UTF-8).
emit_delta / emit_delta_blocking: instead of always emitting TextDelta, route each parser output through the right InferenceEvent variant.
<think>); future models may use different markers (e.g. o-series uses internal tokens). Either:
<think> parser, others → passthrough).
reasoning_tag_pattern: Option<&str> field.
project_responses_stream to emit response.reasoning_summary_part.added, response.reasoning_summary_text.delta, etc. when it sees ReasoningDelta. The exact event names need to be cross-referenced against OpenAI's docs (the family lives alongside output_text.*).
<think> tags itself (crates/helexa-acp/src/qwen3.rs::ThinkParser) and just consume the typed event.
Acceptance
TextDelta events (final answer) and ReasoningDelta events (the <think> body) on the InferenceEvent stream.
response.reasoning_* events for the thought content and response.output_text.* for the final answer.
ThinkParser becomes redundant for cortex-backed sessions; tests that exercise it still pass because the parser is the fallback for other endpoints.
Tracking
Cosmetic improvement for Responses API consumers (they can render the thinking spinner properly). Reduces duplicate parsing in helexa-acp. Depends on nothing currently blocked.
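The tag-aware parser described under "What implementation looks like" could be sketched roughly as below. This is a minimal illustration, not the code in neuron: the names ThinkSplitter, Segment, push, finish, and partial_suffix_len are all hypothetical, and the sketch works on already-decoded text chunks rather than on the harness's real delta path. It shows one way to hold back a possible partial tag at the end of a chunk so that a marker split across decoder steps is still recognised.

```rust
/// Which channel a piece of decoded text belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Segment {
    Visible,
    Reasoning,
}

/// Hypothetical streaming splitter; the name and API are illustrative only.
pub struct ThinkSplitter {
    open: String,
    close: String,
    in_reasoning: bool,
    /// Text received but not yet emitted because it may be the start of a tag.
    pending: String,
}

/// Length of the longest proper prefix of `tag` that `text` ends with.
fn partial_suffix_len(text: &str, tag: &str) -> usize {
    (1..tag.len())
        .rev()
        .filter(|&n| tag.is_char_boundary(n))
        .find(|&n| text.ends_with(&tag[..n]))
        .unwrap_or(0)
}

impl ThinkSplitter {
    pub fn new(open: &str, close: &str) -> Self {
        Self {
            open: open.to_string(),
            close: close.to_string(),
            in_reasoning: false,
            pending: String::new(),
        }
    }

    /// Feed one decoded chunk; returns the segments that are now unambiguous.
    pub fn push(&mut self, chunk: &str) -> Vec<(Segment, String)> {
        self.pending.push_str(chunk);
        let mut out = Vec::new();
        loop {
            // Outside reasoning we look for the open tag, inside for the close tag.
            let (tag, kind) = if self.in_reasoning {
                (self.close.as_str(), Segment::Reasoning)
            } else {
                (self.open.as_str(), Segment::Visible)
            };
            if let Some(pos) = self.pending.find(tag) {
                if pos > 0 {
                    out.push((kind, self.pending[..pos].to_string()));
                }
                // Drop the emitted text and the tag itself, then flip state.
                self.pending.replace_range(..pos + tag.len(), "");
                self.in_reasoning = !self.in_reasoning;
                continue;
            }
            // No complete tag: emit everything except a possible partial tag.
            let keep = partial_suffix_len(&self.pending, tag);
            let emit_len = self.pending.len() - keep;
            if emit_len > 0 {
                out.push((kind, self.pending[..emit_len].to_string()));
                self.pending.replace_range(..emit_len, "");
            }
            return out;
        }
    }

    /// Flush held-back text at end of stream (an unfinished tag is plain text).
    pub fn finish(&mut self) -> Option<(Segment, String)> {
        if self.pending.is_empty() {
            return None;
        }
        let kind = if self.in_reasoning {
            Segment::Reasoning
        } else {
            Segment::Visible
        };
        Some((kind, std::mem::take(&mut self.pending)))
    }
}
```

In this sketch a caller would feed each decoded chunk to push, map Visible to TextDelta and Reasoning to ReasoningDelta, and call finish once generation ends. Holding back only the longest suffix that could still become a tag keeps added latency to at most one tag length.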
Closing in favour of a model-agnostic reframe — see #8 (strip reasoning content by default on chat completions) and #9 (chat_template_kwargs passthrough).
Why this issue is wrong as written
The original proposal, "route Qwen3 <think> to ReasoningDelta", assumed Qwen3-specific tag parsing. Investigating the actual leak (Zed's commit-message generator showing <think> blocks in the field) surfaced two problems:
crates/open_ai/src/completion.rs: no chat_template_kwargs, no Responses-API capability detection). The wire format has no slot for reasoning, so anything inside <think> arrives as plain content. #5's proposed fix wouldn't help that path because there's no reasoning-event family in chat completions to route to.
What replaces it
The leak fixes cleanly with a model-agnostic seam: at model load time, probe the tokenizer's added_tokens for any token whose content matches a known reasoning-marker convention. Store the open/close token IDs on LoadedModel (or None for non-reasoning models). The inference loop's token-level state machine routes between TextDelta and ReasoningDelta without any hardcoded model knowledge.
The chat-completions projector then drops ReasoningDelta by default (matching the wire format's lack of a reasoning slot), opt-in via header for callers like helexa-acp that want the markers back.
That's #8. Companion is #9 (pass chat_template_kwargs through to the chat template at tokenisation), which gives clients a request-side lever to suppress thinking at generation time; it is also model-agnostic since neuron doesn't interpret the kwarg, just forwards it.
The Responses-API mapping (ReasoningDelta → response.reasoning_summary_text.delta) is still worth doing eventually but only matters once a Responses-API consumer of cortex exists; tracking under #7's reasoning sub-bullet rather than as a separate issue today.
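A rough sketch of the seam described above, under several assumptions: the InferenceEvent enum here is a two-variant stand-in for the real one, and ReasoningMarkers, probe_markers, TokenRouter, and project_chat_content are hypothetical names. The added tokens are modelled as plain (content, id) pairs rather than the tokenizer's own type, the list of marker conventions is passed in by the caller, and the opt-in is reduced to a boolean that passes reasoning text through (how the markers themselves would be restored is not specified in this thread).

```rust
/// Stand-in for the two InferenceEvent variants discussed here (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InferenceEvent {
    TextDelta(String),
    ReasoningDelta(String),
}

/// Open/close marker token IDs resolved at load time (hypothetical struct).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReasoningMarkers {
    pub open: u32,
    pub close: u32,
}

fn id_of(added_tokens: &[(&str, u32)], content: &str) -> Option<u32> {
    added_tokens
        .iter()
        .find(|(c, _)| *c == content)
        .map(|(_, id)| *id)
}

/// Return markers for the first convention whose open and close tokens both exist.
pub fn probe_markers(
    added_tokens: &[(&str, u32)],
    conventions: &[(&str, &str)],
) -> Option<ReasoningMarkers> {
    conventions.iter().find_map(|(open, close)| {
        Some(ReasoningMarkers {
            open: id_of(added_tokens, open)?,
            close: id_of(added_tokens, close)?,
        })
    })
}

/// Token-level state machine; with `None` markers everything is TextDelta.
pub struct TokenRouter {
    markers: Option<ReasoningMarkers>,
    in_reasoning: bool,
}

impl TokenRouter {
    pub fn new(markers: Option<ReasoningMarkers>) -> Self {
        Self { markers, in_reasoning: false }
    }

    /// Route one generated token; the marker tokens themselves emit nothing.
    pub fn route(&mut self, token_id: u32, text: &str) -> Option<InferenceEvent> {
        if let Some(m) = self.markers {
            if token_id == m.open {
                self.in_reasoning = true;
                return None;
            }
            if token_id == m.close {
                self.in_reasoning = false;
                return None;
            }
        }
        Some(if self.in_reasoning {
            InferenceEvent::ReasoningDelta(text.to_string())
        } else {
            InferenceEvent::TextDelta(text.to_string())
        })
    }
}

/// Chat-completions projection: reasoning is dropped unless the caller opts in.
pub fn project_chat_content(events: &[InferenceEvent], include_reasoning: bool) -> String {
    let mut out = String::new();
    for ev in events {
        match ev {
            InferenceEvent::TextDelta(t) => out.push_str(t),
            InferenceEvent::ReasoningDelta(t) if include_reasoning => out.push_str(t),
            InferenceEvent::ReasoningDelta(_) => {}
        }
    }
    out
}
```

Because the router compares token IDs rather than text, it needs no buffering for split tags, and a model without marker tokens takes the passthrough path with no extra configuration.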