Strip reasoning content from chat-completions output by default; opt-in via header #8
Problem
When a reasoning-capable model (Qwen3, DeepSeek-R1, Mistral Magistral, gpt-oss, …) is loaded on neuron and a client hits `/v1/chat/completions`, the model's `<think>…</think>` markers, and the body between them, arrive at the client as ordinary `delta.content`. The chat-completions wire format has no slot for reasoning, so consumers that don't know about reasoning conventions (Zed's commit-message generator, any vanilla OpenAI client) write the entire stream into their UI field.

Reproduced (cortex `0.1.16-0.1.20260531…`, neuron Qwen3-8B on benjy): Zed's git-panel "generate commit message" feature populates the commit message field with the raw stream, `<think>` block included. The `<think>` block is the model's internal scratchpad, not output the user wants in their commit message.

Why a server-side fix

- Zed's OpenAI client (`crates/open_ai/src/completion.rs` in zed-industries/zed) has no `chat_template_kwargs` field, no Responses-API capability probe, and no reasoning-tag awareness. Clients aren't going to fix this.
- helexa-acp's `ThinkParser` works because helexa-acp authored it. Every other client needs neuron to behave correctly by default.

Why a model-agnostic shape
Reasoning conventions are not Qwen3-specific:
- `<think>`/`</think>` declared as added_tokens in `tokenizer.json`.
- `<think>` per its system card.
- `[THINK]`/`[/THINK]`.

Modern HF tokenizers declare these as named special tokens in `added_tokens`. neuron should read what the tokenizer declares, not hardcode the syntax.

Proposed implementation
At model load time, walk the loaded `tokenizers::Tokenizer`'s added-tokens table and look for tokens whose content matches a known reasoning-marker pattern (case-insensitive `think`, `reasoning`, etc., bracketed by `<>`/`[]`). When a matched pair (open + close) is found, stash the token IDs on `LoadedModel.reasoning_token_pair: Option<(u32, u32)>`. Models without declared reasoning markers get `None` and pass through unchanged.

At inference time, route at the token-sample step rather than after detokenisation. Before feeding `next_token` to the `DecodeStream`:

- `Some(open) == next_token` → set reasoning state, do NOT feed the marker through the decoder, continue.
- `Some(close) == next_token` → unset reasoning state, do NOT feed through, continue.
- Otherwise, emit `InferenceEvent::ReasoningDelta` when in reasoning state, `TextDelta` when not.

This sidesteps the byte-buffering complexity of text-level parsing: the markers are always single special tokens by definition, so we know when we cross a boundary.
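The load-time detection and token-level routing described above could be sketched roughly as follows. This is a minimal, std-only illustration, not neuron's actual code: the added-tokens table is modelled as a plain `(id, content)` slice instead of the real `tokenizers::Tokenizer` API, and `Marker`, `classify_marker`, `find_reasoning_token_pair`, `Route` and `ReasoningRouter` are hypothetical names introduced for this example. It also assumes an open marker only pairs with a close marker that uses the same bracket style and name.

```rust
/// Whether an added token opens or closes a reasoning block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Marker {
    Open,
    Close,
}

/// Classify an added token's content. Returns the marker kind plus a key
/// (bracket + lowercased name) so `<think>` only pairs with `</think>`.
/// The accepted names ("think", "reasoning") are an assumed starting set.
fn classify_marker(content: &str) -> Option<(Marker, String)> {
    let (bracket, inner) =
        if let Some(s) = content.strip_prefix('<').and_then(|s| s.strip_suffix('>')) {
            ('<', s)
        } else if let Some(s) = content.strip_prefix('[').and_then(|s| s.strip_suffix(']')) {
            ('[', s)
        } else {
            return None;
        };
    let (kind, name) = match inner.strip_prefix('/') {
        Some(rest) => (Marker::Close, rest),
        None => (Marker::Open, inner),
    };
    let name = name.to_ascii_lowercase();
    if name == "think" || name == "reasoning" {
        Some((kind, format!("{}{}", bracket, name)))
    } else {
        None
    }
}

/// Scan an added-tokens table (hypothetical `(id, content)` view) for a
/// matched open/close pair. Returns `None` when no pair is declared.
pub fn find_reasoning_token_pair(added_tokens: &[(u32, &str)]) -> Option<(u32, u32)> {
    for &(open_id, open_content) in added_tokens {
        if let Some((Marker::Open, key)) = classify_marker(open_content) {
            for &(close_id, close_content) in added_tokens {
                if classify_marker(close_content) == Some((Marker::Close, key.clone())) {
                    return Some((open_id, close_id));
                }
            }
        }
    }
    None
}

/// What the sampling loop should do with the token it just sampled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    /// A reasoning marker: do not feed it to the decoder, emit nothing.
    Marker,
    /// Decode and emit as a reasoning delta.
    Reasoning,
    /// Decode and emit as a text delta.
    Text,
}

/// Per-request state machine that sits in front of the decoder.
pub struct ReasoningRouter {
    pair: Option<(u32, u32)>,
    in_reasoning: bool,
}

impl ReasoningRouter {
    pub fn new(pair: Option<(u32, u32)>) -> Self {
        Self { pair, in_reasoning: false }
    }

    /// Decide how to handle `next_token` before it reaches the decoder.
    pub fn route(&mut self, next_token: u32) -> Route {
        if let Some((open, close)) = self.pair {
            if next_token == open {
                self.in_reasoning = true;
                return Route::Marker;
            }
            if next_token == close {
                self.in_reasoning = false;
                return Route::Marker;
            }
        }
        if self.in_reasoning { Route::Reasoning } else { Route::Text }
    }
}
```

In this sketch the caller would skip the decoder on `Route::Marker` and otherwise wrap the decoded piece in the matching event variant. A router built with `None` never leaves the `Text` route, which corresponds to the pass-through behaviour for models without declared markers.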
Sites to update: `run_inference_streaming` (CPU), `stream_inference_via_worker` (CUDA single-GPU), `chat_completion_tp_stream` (CUDA TP). All three thread `LoadedModel` already, so the token pair is reachable.

In `wire::openai_chat::project_chat_stream`: by default, drop `InferenceEvent::ReasoningDelta`. Add a per-request header path so callers can opt in to receiving reasoning content. When opt-in is set, the projector re-wraps reasoning content with the literal `<think>`/`</think>` strings (looked up from the same tokenizer source so the syntax matches the model's actual markers) and emits it as `TextDelta`. This keeps helexa-acp's existing `ThinkParser` working unchanged: helexa-acp adds `x-include-thinking: true` to its requests.

Header naming is open to bikeshedding; `x-include-thinking` is the candidate. It could also be a request-body extension like `extra: { include_thinking: true }`.

Acceptance
- `<think>` block) against an unconfigured chat-completions client.

Related

- `ReasoningDelta` → `response.reasoning_summary_text.delta` so Responses consumers get typed reasoning events. Tracked under #7.
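
For the chat-completions side of the proposal (drop reasoning by default, re-wrap it on opt-in), a rough sketch of the projection step might look like the following. It is illustrative only: `InferenceEvent` here is a simplified stand-in carrying already-decoded strings, `project_chat_deltas` and `wants_thinking` are hypothetical helpers, the header value is assumed to be the literal `true`, and the marker strings are passed in by the caller instead of being looked up from the tokenizer.

```rust
/// Simplified stand-in for the inference event stream (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InferenceEvent {
    TextDelta(String),
    ReasoningDelta(String),
}

/// Interpret the candidate `x-include-thinking` header value.
/// Anything other than a case-insensitive "true" means "not opted in".
pub fn wants_thinking(header_value: Option<&str>) -> bool {
    matches!(header_value.map(str::trim), Some(v) if v.eq_ignore_ascii_case("true"))
}

/// Project events onto chat-completions `delta.content` strings.
/// Without opt-in, reasoning deltas are dropped. With opt-in, each run of
/// reasoning deltas is wrapped in the model's literal open/close markers.
pub fn project_chat_deltas(
    events: &[InferenceEvent],
    include_thinking: bool,
    (open, close): (&str, &str),
) -> Vec<String> {
    let mut out = Vec::new();
    let mut in_reasoning = false;
    for event in events {
        match event {
            InferenceEvent::ReasoningDelta(text) => {
                if !include_thinking {
                    continue; // default: reasoning never reaches the client
                }
                if !in_reasoning {
                    out.push(open.to_string());
                    in_reasoning = true;
                }
                out.push(text.clone());
            }
            InferenceEvent::TextDelta(text) => {
                if in_reasoning {
                    out.push(close.to_string());
                    in_reasoning = false;
                }
                out.push(text.clone());
            }
        }
    }
    // Close a reasoning run that was still open when the stream ended.
    if in_reasoning {
        out.push(close.to_string());
    }
    out
}
```

Under these assumptions an unconfigured client only ever sees the text deltas, while a client that opts in receives the same marker-wrapped stream it gets today, so an existing text-level parser such as `ThinkParser` would not need to change.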