Pass through chat_template_kwargs to the chat template at tokenization
#9
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
Some open-weight model families expose generation-time toggles through their chat template's kwargs interface:
chat_template_kwargs: { enable_thinking: false }injects/no_thinkinto the conversation, telling the model not to emit<think>...</think>blocks at all. Saves tokens, latency, and avoids the cleanup downstream.Today, cortex's
ChatCompletionRequest.extra(#[serde(flatten)] extra: Value) captures these fields when clients send them, but neuron'sformat_qwen3_prompt(crates/neuron/src/harness/candle.rs) is hardcoded to one template shape and ignores the request'schat_template_kwargs. Even if a client does send the kwarg (helexa-acp could, a future Zed feature might), it gets dropped.Why model-agnostic
We don't want neuron to interpret
enable_thinking(it's a Qwen3 token-template concept; another model would have something else). What neuron should do is forward the kwarg dict to the tokenizer's chat-template application and let the model's own chat template react.The Rust
tokenizerscrate'sapply_chat_templatesupports kwargs — it just passes them to the underlying Jinja-like template the tokenizer ships. We're missing the wiring, not the capability.Proposed implementation
Stop using
format_qwen3_promptas the only path. Replace it with a chat-template-drivenformat_prompt(messages, kwargs)that calls the loaded tokenizer'sapply_chat_template(or equivalent) so the prompt formatting comes from the model's own template rather than a hardcoded Qwen3 string-glue function.Extract
chat_template_kwargsfrom the request'sextrafield if present; pass it through to step 1.Fall back to the current
format_qwen3_promptbehaviour when a model's tokenizer doesn't ship a chat template (older GGUFs withouttokenizer.chat_template, etc.). Log at debug level so the operator can tell which path ran.Acceptance
chat_template_kwargs: { enable_thinking: false }against a Qwen3 model gets a clean response with no<think>block.enable_thinkingis silently ignored by the template (no error — Jinja'sdefault()filter handles this gracefully).enable_thinkingkwarg or any other model-specific kwarg.Tradeoffs
format_qwen3_promptwith chat-template application means we rely on the model's chat template being correct. Pre-trained tokenisers from HF have been the source of subtle bugs (extra BOS tokens, wrong system role names). Mitigate with an integration test per supported model that asserts the assembled prompt matches a fixture.Related