Responses API: surface Qwen3 <think> blocks as reasoning items #5

Closed
opened 2026-05-31 08:18:54 +00:00 by grenade · 1 comment
Owner

Scope cut from Step 2 (commit 957f704)

InferenceEvent::ReasoningDelta (crates/neuron/src/wire/event.rs) is defined but never emitted by the candle harness. As a result:

  • The OpenAI chat projector drops it (correct — chat completions has no reasoning slot).
  • The Responses projector drops it (incorrect, since Responses has a response.reasoning_* event family, e.g. response.reasoning_summary_text.delta, that should carry this).

Qwen3 emits <think>…</think> blocks inline in its content stream today, and the harness folds them into plain InferenceEvent::TextDelta events that are indistinguishable from real assistant output. Downstream consumers (helexa-acp, the Responses client) have to parse the marker tags themselves.

See crates/neuron/src/harness/candle.rs::run_inference_streaming (CPU), stream_inference_via_worker (CUDA single-GPU), and chat_completion_tp_stream (CUDA TP) for the three sites that emit TextDelta.

Why it was cut

Splitting <think> from regular content requires a tag-state machine inside the inference loop (or one layer up, before emit_delta). The existing detokeniser doesn't know about model-specific markers. Adding it would have touched three hot loops in candle.rs that we wanted to leave alone for Step 1's refactor and Step 2's new surface.

What implementation looks like

  1. Tag parser in neuron::wire::event — a small struct that takes incremental text chunks and emits a stream of (Visible | Reasoning, String) tuples. Handles tag boundaries split across decoder steps (the same byte-streaming concern tokenizers::DecodeStream already solves for UTF-8).
  2. Wire the parser into emit_delta / emit_delta_blocking — instead of always emitting TextDelta, route each parser output through the right InferenceEvent variant.
  3. Per-model toggle — the tag set is qwen3-specific (<think>); future models may use different markers (e.g. o-series uses internal tokens). Either:
    • Detect by model_id prefix (qwen3 → <think> parser, others → passthrough).
    • Config-driven: each loaded model carries a reasoning_tag_pattern: Option<&str> field.
  4. OpenAI Responses projection — extend project_responses_stream to emit response.reasoning_summary_part.added, response.reasoning_summary_text.delta, etc. when it sees ReasoningDelta. The exact event names need to be cross-referenced against OpenAI's docs (the family lives alongside output_text.*).
  5. helexa-acp side — once neuron surfaces reasoning natively, helexa-acp's openai_chat provider can stop parsing <think> tags itself (crates/helexa-acp/src/qwen3.rs::ThinkParser) and just consume the typed event.

Acceptance

  • A prompt that triggers Qwen3 thinking produces both TextDelta events (final answer) and ReasoningDelta events (the <think> body) on the InferenceEvent stream.
  • The Responses projector emits response.reasoning_* events for the thought content and response.output_text.* for the final answer.
  • helexa-acp's ThinkParser becomes redundant for cortex-backed sessions; tests that exercise it still pass because the parser is the fallback for other endpoints.

Tracking

Cosmetic improvement for Responses API consumers (they can render the thinking spinner properly). Reduces duplicate parsing in helexa-acp. Has no dependencies that are currently blocked.

Author
Owner

Closing in favour of a model-agnostic reframe — see #8 (strip reasoning content by default on chat completions) and #9 (chat_template_kwargs passthrough).

Why this issue is wrong as written

The original proposal — "route Qwen3 <think> to ReasoningDelta" — assumed Qwen3-specific tag parsing. Investigating the actual leak (Zed's commit-message generator showing <think> blocks in the field) surfaced two problems:

  1. Zed's client doesn't know about reasoning at all on the chat-completions surface (confirmed against their crates/open_ai/src/completion.rs: no chat_template_kwargs, no Responses-API capability detection). The wire format has no slot for reasoning, so anything inside <think> arrives as plain content. #5's proposed fix wouldn't help that path because there's no reasoning-event family in chat completions to route to.
  2. A model-specific tag parser in the candle harness's hot loops creates a coupling we don't want — DeepSeek-R1, Mistral Magistral, gpt-oss, and future reasoning models all use different markers. Per-model parser config is the wrong shape.

What replaces it

The leak fixes cleanly with a model-agnostic seam: at model load time, probe the tokenizer's added_tokens for any token whose content matches a known reasoning-marker convention. Store the open/close token IDs on LoadedModel (or None for non-reasoning models). The inference loop's token-level state machine routes between TextDelta and ReasoningDelta without any hardcoded model knowledge.
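
A rough sketch of that seam, under stated assumptions: the tokenizer's added tokens are represented here as plain (id, content) pairs rather than any real tokenizer type, the KNOWN_MARKERS table holds a single illustrative entry, and ReasoningMarkers, probe_markers, Channel and TokenRouter are hypothetical names. The token IDs in the usage note are made up.

```rust
/// Open/close token IDs for a reasoning block (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReasoningMarkers {
    pub open: u32,
    pub close: u32,
}

/// Known marker conventions. Assumption: an illustrative list, not exhaustive.
const KNOWN_MARKERS: &[(&str, &str)] = &[("<think>", "</think>")];

/// Probe the added tokens once at load time; `None` means a non-reasoning model.
pub fn probe_markers(added_tokens: &[(u32, &str)]) -> Option<ReasoningMarkers> {
    KNOWN_MARKERS.iter().find_map(|&(open, close)| {
        let id_of = |needle: &str| {
            added_tokens
                .iter()
                .find(|(_, content)| *content == needle)
                .map(|(id, _)| *id)
        };
        // Both markers must be present for the convention to count.
        Some(ReasoningMarkers {
            open: id_of(open)?,
            close: id_of(close)?,
        })
    })
}

/// Destination for a generated token's text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Text,
    Reasoning,
}

/// Token-level state machine; it compares IDs only and knows no model names.
pub struct TokenRouter {
    markers: Option<ReasoningMarkers>,
    in_reasoning: bool,
}

impl TokenRouter {
    pub fn new(markers: Option<ReasoningMarkers>) -> Self {
        Self { markers, in_reasoning: false }
    }

    /// Returns the channel for this token, or `None` when the token is a
    /// marker itself and should be swallowed rather than emitted.
    pub fn route(&mut self, token_id: u32) -> Option<Channel> {
        if let Some(m) = self.markers {
            if token_id == m.open {
                self.in_reasoning = true;
                return None;
            }
            if token_id == m.close {
                self.in_reasoning = false;
                return None;
            }
        }
        Some(if self.in_reasoning {
            Channel::Reasoning
        } else {
            Channel::Text
        })
    }
}
```

With added tokens (100, "<think>") and (101, "</think>"), the token sequence 100, 5, 101, 6 routes as swallowed, Reasoning, swallowed, Text. Because routing happens on token IDs, no text buffering for split tags is needed, unlike a string-level parser.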

The chat-completions projector then drops ReasoningDelta by default (matching the wire format's lack of a reasoning slot), with an opt-in header for callers like helexa-acp that want the markers back.
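
One way the projector behaviour could look, as a sketch only. The InferenceEvent enum below is a local stand-in with assumed String payloads, not the real type from crates/neuron/src/wire/event.rs; ChatProjector and the include_reasoning flag are hypothetical, and how the flag is derived from a request header is left out. Re-wrapping with a literal <think> pair on opt-in is also an assumption; a real version might restore the model's own marker text.

```rust
/// Local stand-in for the real event type (payload shapes are assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InferenceEvent {
    TextDelta(String),
    ReasoningDelta(String),
}

/// Hypothetical chat-completions projector for content deltas.
pub struct ChatProjector {
    /// Opt-in flag, e.g. derived from a request header (name not specified here).
    include_reasoning: bool,
    /// True while an opted-in reasoning block is open in the output.
    in_reasoning: bool,
}

impl ChatProjector {
    pub fn new(include_reasoning: bool) -> Self {
        Self { include_reasoning, in_reasoning: false }
    }

    /// Returns the content delta to send, or `None` when the event is dropped.
    pub fn project(&mut self, event: &InferenceEvent) -> Option<String> {
        match event {
            InferenceEvent::ReasoningDelta(text) => {
                if !self.include_reasoning {
                    // Default: chat completions has no reasoning slot, so drop.
                    return None;
                }
                let mut out = String::new();
                if !self.in_reasoning {
                    out.push_str("<think>");
                    self.in_reasoning = true;
                }
                out.push_str(text);
                Some(out)
            }
            InferenceEvent::TextDelta(text) => {
                let mut out = String::new();
                if self.in_reasoning {
                    // Close the block before the first visible text.
                    out.push_str("</think>");
                    self.in_reasoning = false;
                }
                out.push_str(text);
                Some(out)
            }
        }
    }
}
```

For the event sequence ReasoningDelta("plan"), TextDelta("answer"), the default projector produces "answer", while the opted-in projector produces "<think>plan</think>answer".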

That's #8. Companion is #9 (pass chat_template_kwargs through to the chat template at tokenisation), which gives clients a request-side lever to suppress thinking at generation time — also model-agnostic since neuron doesn't interpret the kwarg, just forwards it.

The Responses-API mapping (ReasoningDelta → response.reasoning_summary_text.delta) is still worth doing eventually but only matters once a Responses-API consumer of cortex exists; tracking under #7's reasoning sub-bullet rather than as a separate issue today.


Reference: helexa/cortex#5