feat(neuron): render the model's chat_template with chat_template_kwargs

Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)`: probes `tokenizer_config.json` in the same hf-hub snapshot directory. Supports both the canonical string-form `chat_template` and the array form that some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)`: renders via `minijinja`. Messages are flattened into the `[{role, content}]` shape that HF templates iterate over, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` are added to the Jinja context so templates that reference them work without us interpreting their shape.
  - `chat_templates_enabled()`: reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere, which gives a kill switch for emergency rollback without a rebuild.
- `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no `tokenizer_config.json`, parse error, missing field) routes to the fallback path without failing the load; each condition is logged through `tracing::debug` or `tracing::warn`.
- New `build_prompt_for_request(chat_template, request)` wraps the decision: when the template is present AND the kill switch is not tripped, render with kwargs from `request.extra` (looks up `chat_template_kwargs` and `tools` lazily). On render error: warn, then fall back to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream + non-stream, TP stream + non-stream).

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without a GPU (CI doesn't have one). The fallback path and kill switch are the mitigations: a deploy can set `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders surprisingly on Qwen3-8B in production. The legacy formatter stays in place as the fallback whenever template rendering fails or is disabled.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array threads into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards compatibility with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for tool_calls).

215 workspace tests pass; clippy + fmt clean across all workspace features (default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
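
The kill-switch semantics described in the commit message (enabled by default, falsy values force the fallback) could look roughly like the sketch below. The helper name `flag_enabled` and the exact set of falsy spellings (`0`, `false`, `no`, `off`) are assumptions made for illustration; the real `chat_templates_enabled()` may accept a different set.

```rust
use std::env;

/// Hypothetical helper: interpret an optional env value as a boolean flag
/// that defaults to `true` when the variable is unset.
fn flag_enabled(raw: Option<&str>) -> bool {
    match raw {
        // Unset: templates stay enabled (the default named in the commit).
        None => true,
        Some(v) => {
            let v = v.trim().to_ascii_lowercase();
            // Assumed falsy spellings; any other value counts as enabled.
            !matches!(v.as_str(), "0" | "false" | "no" | "off")
        }
    }
}

/// Read the switch from the process environment.
fn chat_templates_enabled_sketch() -> bool {
    flag_enabled(env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())
}

fn main() {
    println!("chat templates enabled: {}", chat_templates_enabled_sketch());
}
```

Keeping the parsing in a function that takes `Option<&str>` makes the truthy / falsy / unset matrix testable without mutating the process environment.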
2026-05-31 23:43:11 +03:00
parent 44008358c5
commit cb303832bc
5 changed files with 517 additions and 4 deletions
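
For readers unfamiliar with the two `chat_template` shapes mentioned in the commit message, the sketch below models them with a plain enum and picks one template. It is an illustration only: the `ChatTemplateField` type, the `select_template` helper, and the policy of preferring an entry named `default` (otherwise the first entry) are assumptions, not the module's confirmed behaviour, and real code would deserialize the field from `tokenizer_config.json` rather than build values by hand.

```rust
/// Hypothetical model of the `chat_template` field in `tokenizer_config.json`.
#[derive(Debug)]
enum ChatTemplateField {
    /// Canonical form: a single Jinja template string.
    Single(String),
    /// Array form: named templates as (name, template) pairs.
    Named(Vec<(String, String)>),
}

/// Pick one template, returning `None` when nothing usable is present.
fn select_template(field: &ChatTemplateField) -> Option<&str> {
    let chosen = match field {
        ChatTemplateField::Single(s) => Some(s.as_str()),
        ChatTemplateField::Named(entries) => entries
            .iter()
            // Assumed policy: prefer the entry named "default"...
            .find(|(name, _)| name.as_str() == "default")
            // ...otherwise fall back to the first entry, if any.
            .or_else(|| entries.first())
            .map(|(_, tmpl)| tmpl.as_str()),
    };
    // Treat a blank template as absent so the caller takes the fallback path.
    chosen.filter(|t| !t.trim().is_empty())
}

fn main() {
    let single = ChatTemplateField::Single("{{ messages }}".to_string());
    let named = ChatTemplateField::Named(vec![
        ("tool_use".to_string(), "TOOLS".to_string()),
        ("default".to_string(), "DEFAULT".to_string()),
    ]);
    println!("{:?}", select_template(&single));
    println!("{:?}", select_template(&named));
}
```

Returning `Option<&str>` mirrors the `Option<String>` field on `LoadedModel`: every unusable case collapses to `None`, and the prompt builder only has to distinguish "template available" from "use the legacy formatter".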
--- a/crates/neuron/src/harness/candle.rs
+++ b/crates/neuron/src/harness/candle.rs
@@ -167,6 +167,15 @@ pub struct LoadedModel {
    /// through as plain text in that case and the consumer parses
    /// the markers itself if it knows how.
    pub tool_call_tokens: Option<ToolCallTokenPair>,
+    /// Raw Jinja `chat_template` string loaded from this model's
+    /// `tokenizer_config.json` at load time. `None` when the file
+    /// is absent / unparseable / lacks the field. When `Some`,
+    /// the prompt-build path renders it through `minijinja` with
+    /// `chat_template_kwargs` from the request body; when `None`,
+    /// the hardcoded Qwen3 ChatML fallback (`format_qwen3_prompt`)
+    /// is used. The `NEURON_USE_CHAT_TEMPLATE=false` env var
+    /// forces the fallback path even when `Some`.
+    pub chat_template: Option<String>,
 }

 impl LoadedModel {
@@ -229,6 +238,8 @@ pub struct TpLoadedModel {
    pub reasoning_tokens: Option<ReasoningTokenPair>,
    /// Same shape as [`LoadedModel::tool_call_tokens`].
    pub tool_call_tokens: Option<ToolCallTokenPair>,
+    /// Same shape as [`LoadedModel::chat_template`].
+    pub chat_template: Option<String>,
 }

 #[cfg(feature = "cuda")]
@@ -1397,7 +1408,7 @@ impl CandleHarness {
        let _inference_guard = loaded.inference_lock.lock().await;

        let result = async {
-            let prompt = format_qwen3_prompt(&request.messages);
+            let prompt = build_prompt_for_request(loaded.chat_template.as_deref(), &request);

            let encoding = loaded
                .tokenizer
@@ -1702,7 +1713,7 @@ impl CandleHarness {
            }
        };

-        let prompt = format_qwen3_prompt(&request.messages);
+        let prompt = build_prompt_for_request(loaded.chat_template.as_deref(), &request);
        let encoding = loaded
            .tokenizer
            .encode(prompt.as_str(), true)
@@ -2081,6 +2092,19 @@ impl Harness for CandleHarness {
                "tool-call markers detected — streaming will emit structured ToolCall events"
            );
        }
+        // Probe `tokenizer_config.json` in the same snapshot dir.
+        // When present and non-empty, the inference path renders
+        // this Jinja template with the request's
+        // `chat_template_kwargs` instead of using the hardcoded
+        // ChatML formatter. Best-effort: missing or unparseable
+        // configs silently fall through to the legacy path.
+        let chat_template = super::chat_template::load_chat_template_alongside(&tokenizer_path);
+        if chat_template.is_some() {
+            tracing::info!(
+                model = %spec.model_id,
+                "chat_template loaded from tokenizer_config.json — prompt assembly will use the model's own template"
+            );
+        }

        let loaded = Arc::new(LoadedModel {
            model_id: spec.model_id.clone(),
@@ -2095,6 +2119,7 @@ impl Harness for CandleHarness {
            inference_lock: tokio::sync::Mutex::new(()),
            reasoning_tokens,
            tool_call_tokens,
+            chat_template,
        });

        let mut models = self.models.write().await;
@@ -2288,6 +2313,13 @@ impl CandleHarness {
                "TP load: tool-call markers detected"
            );
        }
+        let chat_template = super::chat_template::load_chat_template_alongside(&tokenizer_path);
+        if chat_template.is_some() {
+            tracing::info!(
+                model = %spec.model_id,
+                "TP load: chat_template loaded from tokenizer_config.json"
+            );
+        }

        let tp_loaded = StdArc::new(TpLoadedModel {
            model_id: spec.model_id.clone(),
@@ -2303,6 +2335,7 @@ impl CandleHarness {
            worker: leader_worker,
            reasoning_tokens,
            tool_call_tokens,
+            chat_template,
        });

        let mut models = self.models.write().await;
@@ -2429,7 +2462,7 @@ impl CandleHarness {
            return Err(poisoned_error(&request.model));
        }

-        let prompt = format_qwen3_prompt(&request.messages);
+        let prompt = build_prompt_for_request(tp.chat_template.as_deref(), &request);
        let encoding = tp
            .tokenizer
            .encode(prompt.as_str(), true)
@@ -2893,7 +2926,7 @@ async fn chat_completion_tp_inner(
    let req_start = std::time::Instant::now();
    let model_id = request.model.clone();

-    let prompt = format_qwen3_prompt(&request.messages);
+    let prompt = build_prompt_for_request(tp.chat_template.as_deref(), &request);
    let encoding = tp
        .tokenizer
        .encode(prompt.as_str(), true)
@@ -3242,6 +3275,66 @@ pub enum InferenceError {
    Other(#[from] anyhow::Error),
 }

+/// Build the model's prompt from a [`ChatCompletionRequest`].
+///
+/// Prefers the model's own `chat_template` when one was loaded
+/// from `tokenizer_config.json` at model load and the
+/// `NEURON_USE_CHAT_TEMPLATE` kill switch isn't tripped. The
+/// request's `chat_template_kwargs` (e.g.
+/// `{"enable_thinking": false}` on Qwen3) and `tools` array flow
+/// into the template's Jinja context so model-specific behaviour
+/// like reasoning-suppression-at-generation works.
+///
+/// Falls back to [`format_qwen3_prompt`] (the legacy hardcoded
+/// ChatML glue) on any of:
+///
+/// - no `chat_template` loaded for this model (older quantised
+///   variants, fallback-only models)
+/// - the env kill switch is set to a falsy value
+/// - rendering the template returned an error (an operator can
+///   set the env var to force fallback while debugging it)
+///
+/// Render failures are logged at `warn`, so an operator can see
+/// in the logs when a request fell back after a template error.
+fn build_prompt_for_request(
+    chat_template: Option<&str>,
+    request: &ChatCompletionRequest,
+) -> String {
+    if !super::chat_template::chat_templates_enabled() {
+        return format_qwen3_prompt(&request.messages);
+    }
+    let Some(tmpl) = chat_template else {
+        return format_qwen3_prompt(&request.messages);
+    };
+
+    // Pull `chat_template_kwargs` and `tools` from the request's
+    // catch-all `extra` field. Both are optional; absent fields
+    // become `Value::Null`, which the renderer skips inserting
+    // into the Jinja context.
+    let kwargs = request
+        .extra
+        .get("chat_template_kwargs")
+        .cloned()
+        .unwrap_or(serde_json::Value::Null);
+    let tools = request
+        .extra
+        .get("tools")
+        .cloned()
+        .unwrap_or(serde_json::Value::Null);
+
+    match super::chat_template::render_chat_template(tmpl, &request.messages, &tools, &kwargs) {
+        Ok(prompt) => prompt,
+        Err(e) => {
+            tracing::warn!(
+                model = %request.model,
+                error = %format!("{e:#}"),
+                "chat_template render failed; falling back to format_qwen3_prompt"
+            );
+            format_qwen3_prompt(&request.messages)
+        }
+    }
+}
+
 /// Apply the Qwen3 chat template:
 ///
 /// ```text