Files
cortex/crates/neuron/Cargo.toml
rob thijssen cb303832bc
Some checks failed
CI / CUDA type-check (push) Failing after 58s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
build-prerelease / Build neuron-ampere (push) Failing after 1s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m47s
CI / Test (push) Successful in 6m13s
build-prerelease / Build neuron-blackwell (push) Failing after 5m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ada (push) Failing after 7m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
feat(neuron): render the model's chat_template with chat_template_kwargs
Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with
`minijinja`-driven rendering of the model's own `chat_template`
from `tokenizer_config.json`. The request's `chat_template_kwargs`
flow into the Jinja context so model-specific levers
(Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)` — probes
    `tokenizer_config.json` in the same hf-hub snapshot directory.
    Supports both the canonical string-form `chat_template` and
    the array-form some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)` —
    renders via `minijinja`. Messages flatten into the
    `[{role, content}]` shape HF templates iterate, with
    per-message extras (`tool_calls`, `tool_call_id`) preserved.
    `tools` and `kwargs` add into the Jinja context so templates
    that reference them work without us interpreting their shape.
  - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE`
    (default true). Falsy values force the fallback path
    everywhere — a kill switch for emergency rollback without a
    rebuild.

- `LoadedModel.chat_template: Option<String>` and the TP
  equivalent are populated once at load time. `None` (no
  tokenizer_config.json, parse error, missing field) routes the
  fallback path silently; logs go through `tracing::debug`/`warn`
  per condition.

- New `build_prompt_for_request(chat_template, request)` wraps
  the decision: when both the template is present AND the kill
  switch is off, render with kwargs from `request.extra` (looks
  up `chat_template_kwargs` and `tools` lazily). On render error
  → warn + fallback to `format_qwen3_prompt`. Wired into all four
  current prompt-build sites (single-GPU stream + non-stream, TP
  stream + non-stream).

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde`
features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used
internally by HF's `tokenizers-rs` for its own chat templating;
the API surface we touch (`Environment::add_template` +
`Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against
`format_qwen3_prompt` for live models without GPU (CI doesn't
have one). The fallback path and kill switch are the mitigations
— a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the
neuron service env if the chat template renders surprisingly on
Qwen3-8B in production. The legacy formatter stays the
fail-closed default.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt
  injection into the chat_template's native tools block is
  deferred. Today the request's `tools` array threads into the
  Jinja context, but helexa-acp continues to inject Hermes-format
  tool descriptions into the system prompt for backwards-compat
  with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy /
falsy / unset), template loading (string form, array form,
missing file, unparseable JSON, missing field), rendering
(basic conversation threading, kwargs forwarding, message-extras
threading for tool_calls).

215 workspace tests pass; clippy + fmt clean across all workspace
features (default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:43:11 +03:00

107 lines
3.6 KiB
TOML

[package]
name = "neuron"
version.workspace = true
edition.workspace = true
license.workspace = true
[lib]
name = "neuron"
path = "src/lib.rs"
[[bin]]
name = "neuron"
path = "src/main.rs"
[features]
default = []
# Enables CUDA acceleration in candle and the cudarc/nccl bindings the
# TP worker pool uses. Without this feature, candle compiles for CPU
# only, Device::new_cuda calls fall back to CPU, and TP Init/sanity
# requests return Error{kind="cuda_feature_not_enabled"}.
cuda = [
"candle-core/cuda",
"candle-core/nccl",
"candle-nn/cuda",
"candle-transformers/cuda",
"dep:cudarc",
"dep:half",
"dep:cudaforge",
]
# Use cuDNN for convolution / attention kernels. Requires CUDA.
cudnn = [
"cuda",
"candle-core/cudnn",
"candle-nn/cudnn",
"candle-transformers/cudnn",
]
# FlashAttention kernels. Requires CUDA.
flash-attn = [
"cuda",
"candle-transformers/flash-attn",
]
# Reserved for GPU-only integration tests in later stages.
cuda-integration = ["cuda"]
[dependencies]
cortex-core.workspace = true
tokio.workspace = true
axum.workspace = true
serde.workspace = true
serde_json.workspace = true
reqwest.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
anyhow.workspace = true
async-trait.workspace = true
clap.workspace = true
thiserror.workspace = true
futures.workspace = true
tokio-stream.workspace = true
figment.workspace = true
toml.workspace = true
# candle for in-process inference. CUDA support is gated behind the
# crate's `cuda` feature (default off) so the workspace builds on
# non-CUDA hosts and CI runners.
candle-core = "0.10.2"
candle-nn = "0.10.2"
candle-transformers = "0.10.2"
# Direct dep on cudarc (matching candle's transitive version) so the
# TP worker pool can call cudarc::nccl::{Comm, Id} directly. Gated on
# the `cuda` feature; same toolchain requirement as candle's CUDA path.
cudarc = { version = "0.19", optional = true, default-features = false, features = ["nccl", "cuda-version-from-build-system"] }
# Used by the AllReduce CustomOp1 to type-dispatch on bf16/f16 candle
# storages. Matches candle-core's pinned major version to avoid double-
# compiling the `half` crate at conflicting versions.
half = { version = "2.5", optional = true }
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
hf-hub = { version = "0.4", features = ["tokio"] }
# Jinja-compatible template renderer for the model's
# `tokenizer_config.json::chat_template`. Hugging Face's chat
# templates use a strict subset of Jinja2 that minijinja supports
# out of the box. ~80KB compiled; pure Rust, no async surface.
# Features: `builtins` for the `is defined` / `default` filters HF
# templates use; `json` for `tojson` (some Qwen3 templates emit
# tool definitions via tojson); `serde` so we can hand it a
# serde_json::Value as the context.
minijinja = { version = "2", features = ["builtins", "json", "serde"] }
# Direct dep on `safetensors` (re-exported by candle but its `TensorView`
# / `slice::IndexOp` types are public-but-not-re-exported). Used by the
# tp `fused_load` module to read per-rank slices of fused QKV tensors
# without materialising the full tensor on device.
safetensors = "0.7"
[dev-dependencies]
tokio = { workspace = true, features = ["test-util"] }
reqwest.workspace = true
[build-dependencies]
# Used by `build.rs` to compile `src/cuda/*.cu` into `libneuroncuda.a`
# under the `cuda` feature. Matches mistralrs's upstream build setup
# (their `mistralrs-core/build.rs` uses the same constructor).
cudaforge = { version = "0.1", optional = true }
[package.metadata.docs.rs]
# Skip the CUDA path on docs.rs (it lacks nvcc).
no-default-features = true