feat(neuron): strip reasoning from chat completions by default · 7733eecba5 - cortex

Some checks failed

- CI / CUDA type-check (push) Failing after 18s
- build-prerelease / Resolve version stamps (push) Successful in 32s
- CI / Format (push) Successful in 32s
- CI / Clippy (push) Successful in 2m36s
- build-prerelease / Build cortex binary (push) Successful in 4m29s
- CI / Test (push) Successful in 5m19s
- CI / Build cortex SRPM (push) Has been skipped
- CI / Publish cortex to COPR (push) Has been skipped
- CI / Build neuron SRPM (push) Has been skipped
- CI / Publish neuron to COPR (push) Has been skipped
- CI / Bump version in source (push) Has been skipped
- build-prerelease / Build neuron-blackwell (push) Successful in 5m56s
- build-prerelease / Package cortex RPM (push) Successful in 1m21s
- build-prerelease / Build neuron-ampere (push) Successful in 7m45s
- build-prerelease / Build neuron-ada (push) Successful in 5m24s
- build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
- build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
- build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
- build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s

Closes #8.

Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral
Magistral, …) emit `<think>...</think>` blocks inline in their
content stream. The chat-completions wire format has no slot for
reasoning, so until this change each consumer either parsed the
markers itself (helexa-acp) or wrote the raw scratchpad content
into its UI (Zed's commit-message generator, where the reasoning
block leaked into every commit message generated against benjy's
Qwen3-8B).

## Implementation, model-agnostic by design

The neuron side now does token-level routing without any
hardcoded model knowledge:

1. **At load time** (`detect_reasoning_token_pair` in
   `wire::event`), probe the tokenizer's vocabulary for a known
   reasoning-marker pair: `<think>` / `</think>` (Qwen3,
   DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral
   Magistral), and a couple of derivatives. Each marker must
   resolve to a single token id; if both open and close resolve,
   stash on `LoadedModel.reasoning_tokens` (similarly
   `TpLoadedModel`). Non-reasoning models get `None` and pass
   through unchanged.

2. **At inference time**, the three streaming paths
   (`run_inference_streaming` CPU, `stream_inference_via_worker`
   CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now
   check each sampled token against the pair via the new
   `handle_reasoning_marker` helper before feeding it to the
   detokeniser. Open marker → set `in_reasoning = true`, drop
   the marker. Close marker → unset, drop. Other tokens go
   through `emit_delta(_blocking)` which now picks
   `ReasoningDelta` or `TextDelta` based on state. Markers
   never appear in the streamed output.

3. **In `wire::openai_chat`**, the projector splits into:
   - `project_chat_stream` (unchanged signature; default
     behaviour — drops `ReasoningDelta`)
   - `project_chat_stream_with(rx, …, ChatProjectionConfig)` —
     when `include_thinking: true` and `reasoning_markers:
     Some(_)`, re-wraps reasoning content with the literal
     open/close marker text and emits as content deltas.
     Preserves the on-the-wire shape that helexa-acp's
     `ThinkParser` expects.

4. **HTTP handler** reads the `x-include-thinking` request header
   (truthy values, matched case-insensitively: `1`, `true`, `yes`)
   and threads it into the projection config. cortex-gateway already
   forwards arbitrary headers verbatim, so the opt-in works
   end-to-end without gateway changes.

5. **helexa-acp's `openai_chat` provider** sets
   `x-include-thinking: true` on every request so its existing
   `ThinkParser` keeps receiving the marked content stream.
   `ThinkParser` itself is unchanged — needed for endpoints that
   aren't reasoning-aware (OpenRouter, OpenAI directly, etc.).
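
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `detect_pair`, `route`, `route_all`, and `Delta` are hypothetical names, the vocabulary is modelled as a plain `HashMap` rather than a real tokenizer, only two of the marker conventions are listed, and deltas carry raw token ids instead of detokenised text.

```rust
use std::collections::HashMap;

/// Token ids of an open/close reasoning marker pair (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ReasoningTokenPair {
    open: u32,
    close: u32,
}

/// Marker conventions to probe for, in priority order (illustrative subset).
const CANDIDATES: &[(&str, &str)] = &[("<think>", "</think>"), ("[THINK]", "[/THINK]")];

/// Probe a vocabulary (token text -> id) for a known marker pair.
/// A direct map lookup only succeeds when the marker is a single token,
/// and a pair is returned only when both open and close resolve.
fn detect_pair(vocab: &HashMap<String, u32>) -> Option<ReasoningTokenPair> {
    CANDIDATES.iter().find_map(|(open, close)| {
        Some(ReasoningTokenPair {
            open: *vocab.get(*open)?,
            close: *vocab.get(*close)?,
        })
    })
}

/// What the router emits for a non-marker token.
#[derive(Debug, PartialEq, Eq)]
enum Delta {
    Reasoning(u32),
    Text(u32),
}

/// Route one sampled token. Markers flip the state and are dropped (None);
/// every other token is tagged according to the current state.
fn route(pair: Option<ReasoningTokenPair>, in_reasoning: &mut bool, token: u32) -> Option<Delta> {
    if let Some(p) = pair {
        if token == p.open {
            *in_reasoning = true;
            return None;
        }
        if token == p.close {
            *in_reasoning = false;
            return None;
        }
    }
    Some(if *in_reasoning {
        Delta::Reasoning(token)
    } else {
        Delta::Text(token)
    })
}

/// Run the router over a whole token sequence, as a streaming loop would.
fn route_all(pair: Option<ReasoningTokenPair>, tokens: &[u32]) -> Vec<Delta> {
    let mut in_reasoning = false;
    tokens
        .iter()
        .filter_map(|&t| route(pair, &mut in_reasoning, t))
        .collect()
}
```

With `pair` set to `None` (a non-reasoning model), `route_all` tags every token as `Delta::Text`, which mirrors the pass-through behaviour described in step 1.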

## Acceptance

- Zed's commit-message generator (vanilla chat-completions
  client, no `x-include-thinking`) gets clean commit messages
  with no `<think>` block.
- helexa-acp sessions continue to render thinking in Zed's
  thought UI via the opt-in path.
- Models without reasoning tokens declared in their tokenizer
  pass through unchanged.
- Implementation contains zero references to "qwen3" or any
  specific model — entirely driven by tokenizer metadata.
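
The first two criteria hinge on the header opt-in and the projection step. The sketch below shows one way the two could fit together. The `Event` enum, the field types of `ChatProjectionConfig`, and the functions `parse_include_thinking` and `project` are illustrative stand-ins, not the signatures in this commit. Dropping reasoning when opted in but no markers are known is an assumption about the fallback.

```rust
/// Hypothetical stand-in for the inference event stream.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    ReasoningDelta(String),
    TextDelta(String),
    Finish,
}

/// Assumed shape of the projection config named in the commit message.
struct ChatProjectionConfig {
    include_thinking: bool,
    /// Literal (open, close) marker text, if the model has a pair.
    reasoning_markers: Option<(String, String)>,
}

/// Interpret an `x-include-thinking` header value: `1`, `true`, or `yes`,
/// compared case-insensitively. A missing header means "off".
fn parse_include_thinking(value: Option<&str>) -> bool {
    matches!(
        value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes")
    )
}

/// Project events to the content strings a chat-completions client sees.
fn project(events: &[Event], cfg: &ChatProjectionConfig) -> Vec<String> {
    // Re-wrap only when opted in AND marker text is known; otherwise drop
    // reasoning (assumed fallback).
    let markers = if cfg.include_thinking {
        cfg.reasoning_markers.as_ref()
    } else {
        None
    };
    let mut out = Vec::new();
    let mut open = false; // true while inside a re-wrapped reasoning block
    for ev in events {
        // Leaving a reasoning block (text or finish): emit the close marker.
        if open && !matches!(ev, Event::ReasoningDelta(_)) {
            if let Some((_, close_marker)) = markers {
                out.push(close_marker.clone());
            }
            open = false;
        }
        match (ev, markers) {
            (Event::ReasoningDelta(text), Some((open_marker, _))) => {
                if !open {
                    out.push(open_marker.clone());
                    open = true;
                }
                out.push(text.clone());
            }
            (Event::ReasoningDelta(_), None) => {} // default path: dropped
            (Event::TextDelta(text), _) => out.push(text.clone()),
            (Event::Finish, _) => {}
        }
    }
    out
}
```

In this sketch a vanilla client (no header) gets only the `TextDelta` content, while an opted-in client gets the reasoning re-wrapped in the literal markers, including a close marker when the stream finishes mid-reasoning.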

## Tests

9 new tests in `wire::event` (token-pair detection across 4
marker conventions, edge cases) and `wire::openai_chat` (default
drop, opt-in re-wrap with multi-chunk reasoning, close-marker on
Finish, fallback when markers absent, off-switch with markers
present). All 213 workspace tests pass; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rob thijssen

2026-05-31 17:55:04 +03:00

parent fdc0adb738

commit 7733eecba5

6 changed files with 645 additions and 67 deletions

									
crates/neuron/src/wire/mod.rs

@@ -21,4 +21,4 @@ pub mod event;
 pub mod openai_chat;
 pub mod openai_responses;
-pub use event::{FinishReason, InferenceEvent};
+pub use event::{FinishReason, InferenceEvent, ReasoningTokenPair, detect_reasoning_token_pair};
