Pass through chat_template_kwargs to the chat template at tokenization #9

Closed
opened 2026-05-31 14:43:38 +00:00 by grenade · 0 comments
Owner

Problem

Some open-weight model families expose generation-time toggles through their chat template's kwargs interface:

  • Qwen3: chat_template_kwargs: { enable_thinking: false } injects /no_think into the conversation, telling the model not to emit <think>...</think> blocks at all. Saves tokens, latency, and avoids the cleanup downstream.
  • Other models will have their own kwargs as they ship — there isn't a standard set, but every modern HF chat template supports arbitrary kwargs.

Today, cortex's ChatCompletionRequest.extra (#[serde(flatten)] extra: Value) captures these fields when clients send them, but neuron's format_qwen3_prompt (crates/neuron/src/harness/candle.rs) is hardcoded to one template shape and ignores the request's chat_template_kwargs. Even if a client does send the kwarg (helexa-acp could, a future Zed feature might), it gets dropped.

Why model-agnostic

We don't want neuron to interpret enable_thinking (it's a Qwen3 token-template concept; another model would have something else). What neuron should do is forward the kwarg dict to the tokenizer's chat-template application and let the model's own chat template react.

The Rust tokenizers crate's apply_chat_template supports kwargs — it just passes them to the underlying Jinja-like template the tokenizer ships. We're missing the wiring, not the capability.

Proposed implementation

  1. Stop using format_qwen3_prompt as the only path. Replace it with a chat-template-driven format_prompt(messages, kwargs) that calls the loaded tokenizer's apply_chat_template (or equivalent) so the prompt formatting comes from the model's own template rather than a hardcoded Qwen3 string-glue function.

  2. Extract chat_template_kwargs from the request's extra field if present; pass it through to step 1.

  3. Fall back to the current format_qwen3_prompt behaviour when a model's tokenizer doesn't ship a chat template (older GGUFs without tokenizer.chat_template, etc.). Log at debug level so the operator can tell which path ran.

Acceptance

  • A client sending chat_template_kwargs: { enable_thinking: false } against a Qwen3 model gets a clean response with no <think> block.
  • The same payload against a non-Qwen3 model whose chat template doesn't define enable_thinking is silently ignored by the template (no error — Jinja's default() filter handles this gracefully).
  • The implementation contains zero hardcoded knowledge of the enable_thinking kwarg or any other model-specific kwarg.

Tradeoffs

  • Replacing format_qwen3_prompt with chat-template application means we rely on the model's chat template being correct. Pre-trained tokenisers from HF have been the source of subtle bugs (extra BOS tokens, wrong system role names). Mitigate with an integration test per supported model that asserts the assembled prompt matches a fixture.
  • For some quantised models (GGUFs that strip the chat template), we keep the hardcoded fallback. Document which models fall into that bucket as they appear.
  • Replaces #5 (Qwen3-specific reasoning routing); composes with #8 (server-side reasoning strip) — clients that want clean output can use either lever depending on what their endpoint supports.
  • Independent of #4 (chained conversations), #6 (tool-call extraction), #7 (in_progress event).
## Problem Some open-weight model families expose generation-time toggles through their chat template's kwargs interface: - **Qwen3**: `chat_template_kwargs: { enable_thinking: false }` injects `/no_think` into the conversation, telling the model not to emit `<think>...</think>` blocks at all. Saves tokens, latency, and avoids the cleanup downstream. - **Other models** will have their own kwargs as they ship — there isn't a standard set, but every modern HF chat template supports arbitrary kwargs. Today, cortex's `ChatCompletionRequest.extra` (`#[serde(flatten)] extra: Value`) captures these fields when clients send them, but neuron's `format_qwen3_prompt` (`crates/neuron/src/harness/candle.rs`) is hardcoded to one template shape and ignores the request's `chat_template_kwargs`. Even if a client *does* send the kwarg (helexa-acp could, a future Zed feature might), it gets dropped. ## Why model-agnostic We don't want neuron to interpret `enable_thinking` (it's a Qwen3 token-template concept; another model would have something else). What neuron should do is **forward the kwarg dict to the tokenizer's chat-template application** and let the model's own chat template react. The Rust `tokenizers` crate's `apply_chat_template` supports kwargs — it just passes them to the underlying Jinja-like template the tokenizer ships. We're missing the wiring, not the capability. ## Proposed implementation 1. **Stop using `format_qwen3_prompt`** as the only path. Replace it with a chat-template-driven `format_prompt(messages, kwargs)` that calls the loaded tokenizer's `apply_chat_template` (or equivalent) so the prompt formatting comes from the model's own template rather than a hardcoded Qwen3 string-glue function. 2. **Extract `chat_template_kwargs`** from the request's `extra` field if present; pass it through to step 1. 3. **Fall back** to the current `format_qwen3_prompt` behaviour when a model's tokenizer doesn't ship a chat template (older GGUFs without `tokenizer.chat_template`, etc.). Log at debug level so the operator can tell which path ran. ## Acceptance - A client sending `chat_template_kwargs: { enable_thinking: false }` against a Qwen3 model gets a clean response with no `<think>` block. - The same payload against a non-Qwen3 model whose chat template doesn't define `enable_thinking` is silently ignored by the template (no error — Jinja's `default()` filter handles this gracefully). - The implementation contains zero hardcoded knowledge of the `enable_thinking` kwarg or any other model-specific kwarg. ## Tradeoffs - Replacing `format_qwen3_prompt` with chat-template application means we rely on the model's chat template being correct. Pre-trained tokenisers from HF have been the source of subtle bugs (extra BOS tokens, wrong system role names). Mitigate with an integration test per supported model that asserts the assembled prompt matches a fixture. - For some quantised models (GGUFs that strip the chat template), we keep the hardcoded fallback. Document which models fall into that bucket as they appear. ## Related - Replaces #5 (Qwen3-specific reasoning routing); composes with #8 (server-side reasoning strip) — clients that want clean output can use either lever depending on what their endpoint supports. - Independent of #4 (chained conversations), #6 (tool-call extraction), #7 (in_progress event).
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: helexa/cortex#9