feat(neuron): render the model's chat_template with chat_template_kwargs
Some checks failed
CI / CUDA type-check (push) Failing after 58s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
build-prerelease / Build neuron-ampere (push) Failing after 1s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m47s
CI / Test (push) Successful in 6m13s
build-prerelease / Build neuron-blackwell (push) Failing after 5m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ada (push) Failing after 7m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped

Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with
`minijinja`-driven rendering of the model's own `chat_template`
from `tokenizer_config.json`. The request's `chat_template_kwargs`
flow into the Jinja context so model-specific levers
(Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)` — probes
    `tokenizer_config.json` in the same hf-hub snapshot directory.
    Supports both the canonical string-form `chat_template` and
    the array-form some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)` —
    renders via `minijinja`. Messages flatten into the
    `[{role, content}]` shape HF templates iterate, with
    per-message extras (`tool_calls`, `tool_call_id`) preserved.
    `tools` and `kwargs` add into the Jinja context so templates
    that reference them work without us interpreting their shape.
  - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE`
    (default true). Falsy values force the fallback path
    everywhere — a kill switch for emergency rollback without a
    rebuild.

- `LoadedModel.chat_template: Option<String>` and the TP
  equivalent are populated once at load time. `None` (no
  tokenizer_config.json, parse error, missing field) routes the
  fallback path silently; logs go through `tracing::debug`/`warn`
  per condition.

- New `build_prompt_for_request(chat_template, request)` wraps
  the decision: when both the template is present AND the kill
  switch is off, render with kwargs from `request.extra` (looks
  up `chat_template_kwargs` and `tools` lazily). On render error
  → warn + fallback to `format_qwen3_prompt`. Wired into all four
  current prompt-build sites (single-GPU stream + non-stream, TP
  stream + non-stream).

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde`
features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used
internally by HF's `tokenizers-rs` for its own chat templating;
the API surface we touch (`Environment::add_template` +
`Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against
`format_qwen3_prompt` for live models without GPU (CI doesn't
have one). The fallback path and kill switch are the mitigations
— a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the
neuron service env if the chat template renders surprisingly on
Qwen3-8B in production. The legacy formatter stays the
fail-closed default.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt
  injection into the chat_template's native tools block is
  deferred. Today the request's `tools` array threads into the
  Jinja context, but helexa-acp continues to inject Hermes-format
  tool descriptions into the system prompt for backwards-compat
  with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy /
falsy / unset), template loading (string form, array form,
missing file, unparseable JSON, missing field), rendering
(basic conversation threading, kwargs forwarding, message-extras
threading for tool_calls).

215 workspace tests pass; clippy + fmt clean across all workspace
features (default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-31 23:43:11 +03:00
parent 44008358c5
commit cb303832bc
5 changed files with 517 additions and 4 deletions

View File

@@ -167,6 +167,15 @@ pub struct LoadedModel {
/// through as plain text in that case and the consumer parses
/// the markers itself if it knows how.
pub tool_call_tokens: Option<ToolCallTokenPair>,
/// Raw Jinja `chat_template` string loaded from this model's
/// `tokenizer_config.json` at load time. `None` when the file
/// is absent / unparseable / lacks the field. When `Some`,
/// the prompt-build path renders it through `minijinja` with
/// `chat_template_kwargs` from the request body; when `None`,
/// the hardcoded Qwen3 ChatML fallback (`format_qwen3_prompt`)
/// is used. The `NEURON_USE_CHAT_TEMPLATE=false` env var
/// forces the fallback path even when `Some`.
pub chat_template: Option<String>,
}
impl LoadedModel {
@@ -229,6 +238,8 @@ pub struct TpLoadedModel {
pub reasoning_tokens: Option<ReasoningTokenPair>,
/// Same shape as [`LoadedModel::tool_call_tokens`].
pub tool_call_tokens: Option<ToolCallTokenPair>,
/// Same shape as [`LoadedModel::chat_template`].
pub chat_template: Option<String>,
}
#[cfg(feature = "cuda")]
@@ -1397,7 +1408,7 @@ impl CandleHarness {
let _inference_guard = loaded.inference_lock.lock().await;
let result = async {
let prompt = format_qwen3_prompt(&request.messages);
let prompt = build_prompt_for_request(loaded.chat_template.as_deref(), &request);
let encoding = loaded
.tokenizer
@@ -1702,7 +1713,7 @@ impl CandleHarness {
}
};
let prompt = format_qwen3_prompt(&request.messages);
let prompt = build_prompt_for_request(loaded.chat_template.as_deref(), &request);
let encoding = loaded
.tokenizer
.encode(prompt.as_str(), true)
@@ -2081,6 +2092,19 @@ impl Harness for CandleHarness {
"tool-call markers detected — streaming will emit structured ToolCall events"
);
}
// Probe `tokenizer_config.json` in the same snapshot dir.
// When present and non-empty, the inference path renders
// this Jinja template with the request's
// `chat_template_kwargs` instead of using the hardcoded
// ChatML formatter. Best-effort: missing or unparseable
// configs silently fall through to the legacy path.
let chat_template = super::chat_template::load_chat_template_alongside(&tokenizer_path);
if chat_template.is_some() {
tracing::info!(
model = %spec.model_id,
"chat_template loaded from tokenizer_config.json — prompt assembly will use the model's own template"
);
}
let loaded = Arc::new(LoadedModel {
model_id: spec.model_id.clone(),
@@ -2095,6 +2119,7 @@ impl Harness for CandleHarness {
inference_lock: tokio::sync::Mutex::new(()),
reasoning_tokens,
tool_call_tokens,
chat_template,
});
let mut models = self.models.write().await;
@@ -2288,6 +2313,13 @@ impl CandleHarness {
"TP load: tool-call markers detected"
);
}
let chat_template = super::chat_template::load_chat_template_alongside(&tokenizer_path);
if chat_template.is_some() {
tracing::info!(
model = %spec.model_id,
"TP load: chat_template loaded from tokenizer_config.json"
);
}
let tp_loaded = StdArc::new(TpLoadedModel {
model_id: spec.model_id.clone(),
@@ -2303,6 +2335,7 @@ impl CandleHarness {
worker: leader_worker,
reasoning_tokens,
tool_call_tokens,
chat_template,
});
let mut models = self.models.write().await;
@@ -2429,7 +2462,7 @@ impl CandleHarness {
return Err(poisoned_error(&request.model));
}
let prompt = format_qwen3_prompt(&request.messages);
let prompt = build_prompt_for_request(tp.chat_template.as_deref(), &request);
let encoding = tp
.tokenizer
.encode(prompt.as_str(), true)
@@ -2893,7 +2926,7 @@ async fn chat_completion_tp_inner(
let req_start = std::time::Instant::now();
let model_id = request.model.clone();
let prompt = format_qwen3_prompt(&request.messages);
let prompt = build_prompt_for_request(tp.chat_template.as_deref(), &request);
let encoding = tp
.tokenizer
.encode(prompt.as_str(), true)
@@ -3242,6 +3275,66 @@ pub enum InferenceError {
Other(#[from] anyhow::Error),
}
/// Build the model's prompt from a [`ChatCompletionRequest`].
///
/// Prefers the model's own `chat_template` when one was loaded
/// from `tokenizer_config.json` at startup and the
/// `NEURON_USE_CHAT_TEMPLATE` kill switch isn't tripped. The
/// request's `chat_template_kwargs` (e.g.
/// `{"enable_thinking": false}` on Qwen3) and `tools` array flow
/// into the template's Jinja context so model-specific behaviour
/// like reasoning-suppression-at-generation works.
///
/// Falls back to [`format_qwen3_prompt`] (the legacy hardcoded
/// ChatML glue) on any of:
///
/// - no `chat_template` loaded for this model (older quantised
/// variants, fallback-only models)
/// - the env kill switch is set to a falsy value
/// - the template rendered to an error (caller can flip the env
/// var to force fallback while debugging the template)
///
/// Failures are logged at `warn` so an operator running with
/// `RUST_LOG=neuron=debug` sees which path each request took.
fn build_prompt_for_request(
chat_template: Option<&str>,
request: &ChatCompletionRequest,
) -> String {
if !super::chat_template::chat_templates_enabled() {
return format_qwen3_prompt(&request.messages);
}
let Some(tmpl) = chat_template else {
return format_qwen3_prompt(&request.messages);
};
// Pull `chat_template_kwargs` and `tools` from the request's
// catch-all `extra` field. Both are optional; absent fields
// become `Value::Null`, which the renderer skips inserting
// into the Jinja context.
let kwargs = request
.extra
.get("chat_template_kwargs")
.cloned()
.unwrap_or(serde_json::Value::Null);
let tools = request
.extra
.get("tools")
.cloned()
.unwrap_or(serde_json::Value::Null);
match super::chat_template::render_chat_template(tmpl, &request.messages, &tools, &kwargs) {
Ok(prompt) => prompt,
Err(e) => {
tracing::warn!(
model = %request.model,
error = %format!("{e:#}"),
"chat_template render failed; falling back to format_qwen3_prompt"
);
format_qwen3_prompt(&request.messages)
}
}
}
/// Apply the Qwen3 chat template:
///
/// ```text