Pass through chat_template_kwargs to the chat template at tokenization #9

New Issue

grenade · 2026-05-31T14:43:38Z

grenade commented

2026-05-31 14:43:38 +00:00

Problem

Some open-weight model families expose generation-time toggles through their chat template's kwargs interface:

Qwen3: chat_template_kwargs: { enable_thinking: false } injects /no_think into the conversation, telling the model not to emit <think>...</think> blocks at all. Saves tokens, latency, and avoids the cleanup downstream.
Other models will have their own kwargs as they ship — there isn't a standard set, but every modern HF chat template supports arbitrary kwargs.

Today, cortex's ChatCompletionRequest.extra (#[serde(flatten)] extra: Value) captures these fields when clients send them, but neuron's format_qwen3_prompt (crates/neuron/src/harness/candle.rs) is hardcoded to one template shape and ignores the request's chat_template_kwargs. Even if a client does send the kwarg (helexa-acp could, a future Zed feature might), it gets dropped.

Why model-agnostic

We don't want neuron to interpret enable_thinking (it's a Qwen3 token-template concept; another model would have something else). What neuron should do is forward the kwarg dict to the tokenizer's chat-template application and let the model's own chat template react.

The Rust tokenizers crate's apply_chat_template supports kwargs — it just passes them to the underlying Jinja-like template the tokenizer ships. We're missing the wiring, not the capability.

Proposed implementation

Stop using format_qwen3_prompt as the only path. Replace it with a chat-template-driven format_prompt(messages, kwargs) that calls the loaded tokenizer's apply_chat_template (or equivalent) so the prompt formatting comes from the model's own template rather than a hardcoded Qwen3 string-glue function.
Extract chat_template_kwargs from the request's extra field if present; pass it through to step 1.
Fall back to the current format_qwen3_prompt behaviour when a model's tokenizer doesn't ship a chat template (older GGUFs without tokenizer.chat_template, etc.). Log at debug level so the operator can tell which path ran.

Acceptance

A client sending chat_template_kwargs: { enable_thinking: false } against a Qwen3 model gets a clean response with no <think> block.
The same payload against a non-Qwen3 model whose chat template doesn't define enable_thinking is silently ignored by the template (no error — Jinja's default() filter handles this gracefully).
The implementation contains zero hardcoded knowledge of the enable_thinking kwarg or any other model-specific kwarg.

Tradeoffs

Replacing format_qwen3_prompt with chat-template application means we rely on the model's chat template being correct. Pre-trained tokenisers from HF have been the source of subtle bugs (extra BOS tokens, wrong system role names). Mitigate with an integration test per supported model that asserts the assembled prompt matches a fixture.
For some quantised models (GGUFs that strip the chat template), we keep the hardcoded fallback. Document which models fall into that bucket as they appear.

Replaces #5 (Qwen3-specific reasoning routing); composes with #8 (server-side reasoning strip) — clients that want clean output can use either lever depending on what their endpoint supports.
Independent of #4 (chained conversations), #6 (tool-call extraction), #7 (in_progress event).

## Problem Some open-weight model families expose generation-time toggles through their chat template's kwargs interface: - **Qwen3**: `chat_template_kwargs: { enable_thinking: false }` injects `/no_think` into the conversation, telling the model not to emit `<think>...</think>` blocks at all. Saves tokens, latency, and avoids the cleanup downstream. - **Other models** will have their own kwargs as they ship — there isn't a standard set, but every modern HF chat template supports arbitrary kwargs. Today, cortex's `ChatCompletionRequest.extra` (`#[serde(flatten)] extra: Value`) captures these fields when clients send them, but neuron's `format_qwen3_prompt` (`crates/neuron/src/harness/candle.rs`) is hardcoded to one template shape and ignores the request's `chat_template_kwargs`. Even if a client *does* send the kwarg (helexa-acp could, a future Zed feature might), it gets dropped. ## Why model-agnostic We don't want neuron to interpret `enable_thinking` (it's a Qwen3 token-template concept; another model would have something else). What neuron should do is **forward the kwarg dict to the tokenizer's chat-template application** and let the model's own chat template react. The Rust `tokenizers` crate's `apply_chat_template` supports kwargs — it just passes them to the underlying Jinja-like template the tokenizer ships. We're missing the wiring, not the capability. ## Proposed implementation 1. **Stop using `format_qwen3_prompt`** as the only path. Replace it with a chat-template-driven `format_prompt(messages, kwargs)` that calls the loaded tokenizer's `apply_chat_template` (or equivalent) so the prompt formatting comes from the model's own template rather than a hardcoded Qwen3 string-glue function. 2. **Extract `chat_template_kwargs`** from the request's `extra` field if present; pass it through to step 1. 3. **Fall back** to the current `format_qwen3_prompt` behaviour when a model's tokenizer doesn't ship a chat template (older GGUFs without `tokenizer.chat_template`, etc.). Log at debug level so the operator can tell which path ran. ## Acceptance - A client sending `chat_template_kwargs: { enable_thinking: false }` against a Qwen3 model gets a clean response with no `<think>` block. - The same payload against a non-Qwen3 model whose chat template doesn't define `enable_thinking` is silently ignored by the template (no error — Jinja's `default()` filter handles this gracefully). - The implementation contains zero hardcoded knowledge of the `enable_thinking` kwarg or any other model-specific kwarg. ## Tradeoffs - Replacing `format_qwen3_prompt` with chat-template application means we rely on the model's chat template being correct. Pre-trained tokenisers from HF have been the source of subtle bugs (extra BOS tokens, wrong system role names). Mitigate with an integration test per supported model that asserts the assembled prompt matches a fixture. - For some quantised models (GGUFs that strip the chat template), we keep the hardcoded fallback. Document which models fall into that bucket as they appear. ## Related - Replaces #5 (Qwen3-specific reasoning routing); composes with #8 (server-side reasoning strip) — clients that want clean output can use either lever depending on what their endpoint supports. - Independent of #4 (chained conversations), #6 (tool-call extraction), #7 (in_progress event).

grenade referenced this issue from a commit

2026-05-31 20:43:16 +00:00

feat(neuron): render the model's chat_template with chat_template_kwargs

grenade closed this issue

2026-05-31 20:43:16 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: helexa/cortex#9

Pass through chat_template_kwargs to the chat template at tokenization #9

Problem

Why model-agnostic

Proposed implementation

Acceptance

Tradeoffs

Related

Pass through `chat_template_kwargs` to the chat template at tokenization #9