Some checks failed
CI / CUDA type-check (push) Failing after 58s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
build-prerelease / Build neuron-ampere (push) Failing after 1s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m47s
CI / Test (push) Successful in 6m13s
build-prerelease / Build neuron-blackwell (push) Failing after 5m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ada (push) Failing after 7m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Closes #9. Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect. ## Implementation - New `harness::chat_template` module with three entry points: - `load_chat_template_alongside(tokenizer_json_path)` — probes `tokenizer_config.json` in the same hf-hub snapshot directory. Supports both the canonical string-form `chat_template` and the array-form some tokenizers ship (multi-template models). - `render_chat_template(template, messages, tools, kwargs)` — renders via `minijinja`. Messages flatten into the `[{role, content}]` shape HF templates iterate, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` add into the Jinja context so templates that reference them work without us interpreting their shape. - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere — a kill switch for emergency rollback without a rebuild. - `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no tokenizer_config.json, parse error, missing field) routes the fallback path silently; logs go through `tracing::debug`/`warn` per condition. - New `build_prompt_for_request(chat_template, request)` wraps the decision: when both the template is present AND the kill switch is off, render with kwargs from `request.extra` (looks up `chat_template_kwargs` and `tools` lazily). On render error → warn + fallback to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream + non-stream, TP stream + non-stream). ## Dependency `minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable. ## Validation strategy I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without GPU (CI doesn't have one). The fallback path and kill switch are the mitigations — a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders surprisingly on Qwen3-8B in production. The legacy formatter stays the fail-closed default. ## Scope cuts (documented in module header) - Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array threads into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards-compat with non-cortex endpoints. ## Tests 9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for tool_calls). 215 workspace tests pass; clippy + fmt clean across all workspace features (default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
107 lines
3.6 KiB
TOML
107 lines
3.6 KiB
TOML
[package]
|
|
name = "neuron"
|
|
version.workspace = true
|
|
edition.workspace = true
|
|
license.workspace = true
|
|
|
|
[lib]
|
|
name = "neuron"
|
|
path = "src/lib.rs"
|
|
|
|
[[bin]]
|
|
name = "neuron"
|
|
path = "src/main.rs"
|
|
|
|
[features]
|
|
default = []
|
|
# Enables CUDA acceleration in candle and the cudarc/nccl bindings the
|
|
# TP worker pool uses. Without this feature, candle compiles for CPU
|
|
# only, Device::new_cuda calls fall back to CPU, and TP Init/sanity
|
|
# requests return Error{kind="cuda_feature_not_enabled"}.
|
|
cuda = [
|
|
"candle-core/cuda",
|
|
"candle-core/nccl",
|
|
"candle-nn/cuda",
|
|
"candle-transformers/cuda",
|
|
"dep:cudarc",
|
|
"dep:half",
|
|
"dep:cudaforge",
|
|
]
|
|
# Use cuDNN for convolution / attention kernels. Requires CUDA.
|
|
cudnn = [
|
|
"cuda",
|
|
"candle-core/cudnn",
|
|
"candle-nn/cudnn",
|
|
"candle-transformers/cudnn",
|
|
]
|
|
# FlashAttention kernels. Requires CUDA.
|
|
flash-attn = [
|
|
"cuda",
|
|
"candle-transformers/flash-attn",
|
|
]
|
|
# Reserved for GPU-only integration tests in later stages.
|
|
cuda-integration = ["cuda"]
|
|
|
|
[dependencies]
|
|
cortex-core.workspace = true
|
|
tokio.workspace = true
|
|
axum.workspace = true
|
|
serde.workspace = true
|
|
serde_json.workspace = true
|
|
reqwest.workspace = true
|
|
tracing.workspace = true
|
|
tracing-subscriber.workspace = true
|
|
anyhow.workspace = true
|
|
async-trait.workspace = true
|
|
clap.workspace = true
|
|
thiserror.workspace = true
|
|
futures.workspace = true
|
|
tokio-stream.workspace = true
|
|
figment.workspace = true
|
|
toml.workspace = true
|
|
|
|
# candle for in-process inference. CUDA support is gated behind the
|
|
# crate's `cuda` feature (default off) so the workspace builds on
|
|
# non-CUDA hosts and CI runners.
|
|
candle-core = "0.10.2"
|
|
candle-nn = "0.10.2"
|
|
candle-transformers = "0.10.2"
|
|
# Direct dep on cudarc (matching candle's transitive version) so the
|
|
# TP worker pool can call cudarc::nccl::{Comm, Id} directly. Gated on
|
|
# the `cuda` feature; same toolchain requirement as candle's CUDA path.
|
|
cudarc = { version = "0.19", optional = true, default-features = false, features = ["nccl", "cuda-version-from-build-system"] }
|
|
# Used by the AllReduce CustomOp1 to type-dispatch on bf16/f16 candle
|
|
# storages. Matches candle-core's pinned major version to avoid double-
|
|
# compiling the `half` crate at conflicting versions.
|
|
half = { version = "2.5", optional = true }
|
|
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
|
|
hf-hub = { version = "0.4", features = ["tokio"] }
|
|
# Jinja-compatible template renderer for the model's
|
|
# `tokenizer_config.json::chat_template`. Hugging Face's chat
|
|
# templates use a strict subset of Jinja2 that minijinja supports
|
|
# out of the box. ~80KB compiled; pure Rust, no async surface.
|
|
# Features: `builtins` for the `is defined` / `default` filters HF
|
|
# templates use; `json` for `tojson` (some Qwen3 templates emit
|
|
# tool definitions via tojson); `serde` so we can hand it a
|
|
# serde_json::Value as the context.
|
|
minijinja = { version = "2", features = ["builtins", "json", "serde"] }
|
|
# Direct dep on `safetensors` (re-exported by candle but its `TensorView`
|
|
# / `slice::IndexOp` types are public-but-not-re-exported). Used by the
|
|
# tp `fused_load` module to read per-rank slices of fused QKV tensors
|
|
# without materialising the full tensor on device.
|
|
safetensors = "0.7"
|
|
|
|
[dev-dependencies]
|
|
tokio = { workspace = true, features = ["test-util"] }
|
|
reqwest.workspace = true
|
|
|
|
[build-dependencies]
|
|
# Used by `build.rs` to compile `src/cuda/*.cu` into `libneuroncuda.a`
|
|
# under the `cuda` feature. Matches mistralrs's upstream build setup
|
|
# (their `mistralrs-core/build.rs` uses the same constructor).
|
|
cudaforge = { version = "0.1", optional = true }
|
|
|
|
[package.metadata.docs.rs]
|
|
# Skip the CUDA path on docs.rs (it lacks nvcc).
|
|
no-default-features = true
|