Strip reasoning content from chat-completions output by default; opt-in via header #8

Closed
opened 2026-05-31 14:43:14 +00:00 by grenade · 0 comments
Owner

Problem

When a reasoning-capable model (Qwen3, DeepSeek-R1, Mistral Magistral, gpt-oss, …) is loaded on neuron and a client hits /v1/chat/completions, the model's <think>…</think> markers — and the body between them — arrive at the client as ordinary delta.content. The chat-completions wire format has no slot for reasoning, so consumers that don't know about reasoning conventions (Zed's commit-message generator, any vanilla OpenAI client) write the entire stream into their UI field.

Reproduced (cortex 0.1.16-0.1.20260531…, neuron Qwen3-8B on benjy): Zed's git-panel "generate commit message" feature populates the commit message field with:

<think>
Okay, let's tackle this commit message...
[hundreds of words of reasoning]
</think>

feat: Add assignment_confidence column and indexes
…

The <think> block is the model's internal scratchpad, not output the user wants in their commit message.

Why a server-side fix

  • The wire format has no place to put reasoning. There is nothing to "do right" client-side except parse the tags out, which every consumer would have to implement independently.
  • Zed's chat-completions client (verified against crates/open_ai/src/completion.rs in zed-industries/zed) has no chat_template_kwargs field, no Responses-API capability probe, no reasoning-tag awareness. Clients aren't going to fix this.
  • helexa-acp's existing client-side ThinkParser works because helexa-acp authored it. Every other client needs neuron to behave correctly by default.

Why a model-agnostic shape

Reasoning conventions are not Qwen3-specific:

  • Qwen3: <think> / </think> declared as added_tokens in tokenizer.json.
  • DeepSeek-R1: same markers.
  • gpt-oss: no <think> tags; reasoning is emitted on a separate analysis channel of its harmony response format.
  • Mistral Magistral: [THINK] / [/THINK].

Modern HF tokenizers declare these as named special tokens in added_tokens. neuron should read what the tokenizer declares, not hardcode the syntax.
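
As a rough illustration of reading the markers from the tokenizer rather than hardcoding them, the sketch below scans (id, content) pairs from an added-tokens table and returns the first open/close pair whose name is on a small allow-list. The helper names (classify_marker, find_reasoning_pair), the allow-list contents, and the token IDs in the usage note are assumptions made for illustration; how the table is obtained from tokenizers::Tokenizer is deliberately left out.

```rust
/// Whether a marker opens or closes a reasoning span.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Marker {
    Open,
    Close,
}

/// Hypothetical allow-list of marker names, compared case-insensitively.
const MARKER_NAMES: [&str; 2] = ["think", "reasoning"];

/// Classify one added-token string such as "<think>" or "[/THINK]".
/// Returns the marker kind and its lowercased name, or None if the
/// token is not a recognised reasoning marker.
fn classify_marker(content: &str) -> Option<(Marker, String)> {
    // Accept <...> or [...]; mismatched brackets are rejected.
    let inner = content
        .strip_prefix('<')
        .and_then(|s| s.strip_suffix('>'))
        .or_else(|| content.strip_prefix('[').and_then(|s| s.strip_suffix(']')))?;
    // A leading '/' marks the closing form.
    let (kind, name) = match inner.strip_prefix('/') {
        Some(rest) => (Marker::Close, rest),
        None => (Marker::Open, inner),
    };
    let name = name.to_ascii_lowercase();
    let known = MARKER_NAMES.contains(&name.as_str());
    if known {
        Some((kind, name))
    } else {
        None
    }
}

/// Walk (token id, token content) pairs and return (open_id, close_id)
/// for the first marker name that has both an open and a close form.
fn find_reasoning_pair<'a, I>(added_tokens: I) -> Option<(u32, u32)>
where
    I: IntoIterator<Item = (u32, &'a str)>,
{
    let mut opens: Vec<(String, u32)> = Vec::new();
    let mut closes: Vec<(String, u32)> = Vec::new();
    for (id, content) in added_tokens {
        match classify_marker(content) {
            Some((Marker::Open, name)) => opens.push((name, id)),
            Some((Marker::Close, name)) => closes.push((name, id)),
            None => {}
        }
    }
    // An open marker without a matching close (or vice versa) is ignored.
    opens.iter().find_map(|(name, open_id)| {
        closes
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, close_id)| (*open_id, *close_id))
    })
}
```

Under these assumptions, a table containing (10, "<think>") and (11, "</think>") yields Some((10, 11)), a table containing only [THINK] / [/THINK] yields that pair's IDs, and a table with no reasoning markers (or with only one half of a pair) yields None, which is the pass-through case.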

Proposed implementation

  1. At model load time, walk the loaded tokenizers::Tokenizer's added-tokens table and look for tokens whose content matches a known reasoning-marker pattern (case-insensitive think, reasoning, etc., bracketed by <> / []). When a matched pair (open + close) is found, stash the token IDs on LoadedModel.reasoning_token_pair: Option<(u32, u32)>. Models without declared reasoning markers get None and pass through unchanged.

  2. At inference time, route at the token-sample step rather than after detokenisation. Before feeding next_token to the DecodeStream:

    • If Some(open) == next_token → set reasoning state, do NOT feed the marker through the decoder, continue.
    • If Some(close) == next_token → unset reasoning state, do NOT feed through, continue.
    • Otherwise → feed to decoder; emit InferenceEvent::ReasoningDelta when in reasoning state, TextDelta when not.

    This sidesteps the byte-buffering complexity of text-level parsing: a marker declared in added_tokens is always a single token, so crossing a boundary is an exact token-ID comparison.

    Sites to update: run_inference_streaming (CPU), stream_inference_via_worker (CUDA single-GPU), chat_completion_tp_stream (CUDA TP). All three thread LoadedModel already, so the token-pair is reachable.

  3. In wire::openai_chat::project_chat_stream: by default, drop InferenceEvent::ReasoningDelta. Add a per-request header so callers can opt in to receiving reasoning content. When the opt-in is set, the projector re-wraps reasoning content in the model's own marker strings (<think> / </think> for Qwen3, looked up from the same tokenizer source so the syntax matches what the model actually emits) and emits the result as TextDelta. This keeps helexa-acp's existing ThinkParser working unchanged, provided helexa-acp adds x-include-thinking: true to its requests.

    Header naming open to bikeshedding; x-include-thinking is the candidate. Could also be a request-body extension like extra: { include_thinking: true }.
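
Steps 2 and 3 can be sketched together as a token-level router followed by a projection pass. This is a minimal sketch under stated assumptions, not neuron's actual code: ReasoningRouter, Route, route_tokens, project_content and include_thinking_requested are hypothetical names, the InferenceEvent enum is reduced to the two variants named above, and the decode closure stands in for the stateful DecodeStream.

```rust
#[derive(Debug, PartialEq, Eq, Clone)]
enum InferenceEvent {
    TextDelta(String),
    ReasoningDelta(String),
}

/// What to do with a freshly sampled token.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Route {
    Skip,      // marker token: do not feed it to the decoder
    Reasoning, // decode and emit as ReasoningDelta
    Text,      // decode and emit as TextDelta
}

/// Tracks reasoning state from token IDs alone (step 2).
struct ReasoningRouter {
    pair: Option<(u32, u32)>,
    in_reasoning: bool,
}

impl ReasoningRouter {
    fn new(pair: Option<(u32, u32)>) -> Self {
        Self { pair, in_reasoning: false }
    }

    fn route(&mut self, next_token: u32) -> Route {
        if let Some((open, close)) = self.pair {
            if next_token == open {
                self.in_reasoning = true;
                return Route::Skip;
            }
            if next_token == close {
                self.in_reasoning = false;
                return Route::Skip;
            }
        }
        if self.in_reasoning { Route::Reasoning } else { Route::Text }
    }
}

/// Drive the router over a token sequence; `decode` is a stand-in decoder.
fn route_tokens(
    tokens: &[u32],
    pair: Option<(u32, u32)>,
    decode: impl Fn(u32) -> String,
) -> Vec<InferenceEvent> {
    let mut router = ReasoningRouter::new(pair);
    let mut events = Vec::new();
    for &t in tokens {
        match router.route(t) {
            Route::Skip => {}
            Route::Reasoning => events.push(InferenceEvent::ReasoningDelta(decode(t))),
            Route::Text => events.push(InferenceEvent::TextDelta(decode(t))),
        }
    }
    events
}

/// Opt-in check for the candidate header; absent or non-"true" means off.
fn include_thinking_requested(header_value: Option<&str>) -> bool {
    header_value
        .map(|v| v.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

/// Project events to chat-completions content deltas (step 3).
/// Reasoning is dropped by default; when opted in, each reasoning run is
/// re-wrapped in the model's own (open, close) marker strings.
fn project_content(
    events: &[InferenceEvent],
    include_thinking: bool,
    markers: (&str, &str),
) -> Vec<String> {
    let mut out = Vec::new();
    let mut wrapper_open = false;
    for ev in events {
        match ev {
            InferenceEvent::ReasoningDelta(s) => {
                if !include_thinking {
                    continue;
                }
                if !wrapper_open {
                    out.push(markers.0.to_string());
                    wrapper_open = true;
                }
                out.push(s.clone());
            }
            InferenceEvent::TextDelta(s) => {
                if wrapper_open {
                    out.push(markers.1.to_string());
                    wrapper_open = false;
                }
                out.push(s.clone());
            }
        }
    }
    // Close a reasoning run that was still open when the stream ended.
    if wrapper_open {
        out.push(markers.1.to_string());
    }
    out
}
```

Because the router compares token IDs and never inspects text, a model loaded with pair = None produces only TextDelta events and its output is unchanged. The projection step is the only place that knows about the opt-in, so the three inference sites would not need to see the request header.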

Acceptance

  • Zed's commit-message generator, running as an unconfigured chat-completions client, produces clean commit messages (no <think> block).
  • helexa-acp continues to render thinking in Zed's thought UI (opt-in path).
  • Models without reasoning tokens declared in their tokenizer pass through unchanged.
  • The implementation contains zero hardcoded references to "qwen3" or any specific model.

Related

  • Replaces #5 (which proposed Qwen3-specific routing).
  • Companion: #9 (chat_template_kwargs passthrough, a request-side lever to suppress reasoning at generation time).
  • Future: Responses API projector should map ReasoningDelta → response.reasoning_summary_text.delta so Responses consumers get typed reasoning events. Tracked under #7.
Reference: helexa/cortex#8