Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)`: probes `tokenizer_config.json` in the same hf-hub snapshot directory. Supports both the canonical string-form `chat_template` and the array form that some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)`: renders via `minijinja`. Messages are flattened into the `[{role, content}]` shape that HF templates iterate over, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` are added to the Jinja context so templates that reference them work without us interpreting their shape.
  - `chat_templates_enabled()`: reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere, giving a kill switch for emergency rollback without a rebuild.
- `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no `tokenizer_config.json`, parse error, missing field) routes to the fallback path without raising an error; each condition is logged through `tracing::debug`/`warn`.
- New `build_prompt_for_request(chat_template, request)` wraps the decision: when the template is present and chat templates are enabled (the kill switch is not engaged), it renders with kwargs from `request.extra` (looking up `chat_template_kwargs` and `tools` lazily). On a render error it warns and falls back to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream and non-stream, TP stream and non-stream).

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without a GPU (CI doesn't have one). The fallback path and the kill switch are the mitigations: a deploy can set `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders unexpectedly on Qwen3-8B in production. The legacy formatter stays in place as the fail-safe fallback.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array is threaded into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards compatibility with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for `tool_calls`).

215 workspace tests pass; clippy + fmt clean across all workspace features (default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
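
The kill switch and the render-or-fallback decision described under Implementation can be sketched with the standard library alone. This is an illustration, not the code in the PR: `flag_enabled` and `build_prompt` are hypothetical names, the set of falsy spellings is an assumption, and the renderer and legacy formatter are passed in as closures so the sketch does not depend on `minijinja`.

```rust
/// Interpret the raw value of an env var such as `NEURON_USE_CHAT_TEMPLATE`.
/// Unset means enabled, matching the stated default of true. The exact set of
/// falsy spellings below is an assumption, not taken from the PR.
fn flag_enabled(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        None => true,
        Some(v) => !matches!(v.as_str(), "0" | "false" | "no" | "off"),
    }
}

/// Choose a prompt: render the template when one is present and the switch is
/// on; otherwise, or when rendering fails, use the legacy formatter.
fn build_prompt<R, F>(template: Option<&str>, enabled: bool, render: R, fallback: F) -> String
where
    R: FnOnce(&str) -> Result<String, String>,
    F: FnOnce() -> String,
{
    match template {
        Some(t) if enabled => match render(t) {
            Ok(prompt) => prompt,
            Err(err) => {
                // The PR logs through `tracing::warn`; stderr stands in here.
                eprintln!("chat template render failed, using fallback: {}", err);
                fallback()
            }
        },
        // No template loaded, or the kill switch is engaged.
        _ => fallback(),
    }
}
```

A caller would typically pass `flag_enabled(std::env::var("NEURON_USE_CHAT_TEMPLATE").ok().as_deref())` as `enabled`. Keeping the decision in one function means all prompt-build sites share the same fallback behaviour.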
cortex
A Rust reverse-proxy and fleet management layer for multi-node GPU inference
clusters. Cortex sits in front of one or more neuron daemons (each running
candle-based inference on a local GPU host) and presents a unified OpenAI +
Anthropic compatible API surface.
Problem
Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:
- Presents a single `/v1/models` catalogue merging every model that can be served by any neuron in the fleet.
- Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
- Manages model lifecycle (load on demand, unload cold models, pin critical ones) by calling each neuron's `/models/{load,unload}` API.
- Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
- Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.
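
The routing requirement above (prefer a node where the model is already loaded, otherwise one that can load it) can be sketched as follows. This is a minimal illustration under assumptions: `Node`, `Route`, and `route` are hypothetical names and do not reflect the actual types in `cortex-gateway`, and real routing would also weigh VRAM headroom and eviction.

```rust
/// Outcome of a routing decision (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Route<'a> {
    /// The model is already resident on this node.
    Warm(&'a str),
    /// The model is not resident, but this node is able to load it.
    ColdLoad(&'a str),
    /// No node in the fleet can serve the model.
    Unavailable,
}

/// Simplified view of one neuron as seen by the gateway (assumed shape).
struct Node {
    name: String,
    loaded: Vec<String>,
    loadable: Vec<String>,
}

/// Prefer a node that already has the model loaded; otherwise pick the first
/// node that could load it; otherwise report that nothing can serve it.
fn route<'a>(nodes: &'a [Node], model: &str) -> Route<'a> {
    if let Some(n) = nodes.iter().find(|n| n.loaded.iter().any(|m| m.as_str() == model)) {
        return Route::Warm(&n.name);
    }
    if let Some(n) = nodes.iter().find(|n| n.loadable.iter().any(|m| m.as_str() == model)) {
        return Route::ColdLoad(&n.name);
    }
    Route::Unavailable
}
```

Returning an enum rather than a bare node name lets the caller decide whether to trigger a load before proxying the request.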
Architecture
┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│ Claude Code │ │ Zed/IDE │ │ Tidal / mm │ │ curl / etc │
└──────┬───────┘ └─────┬────┘ └──────┬─────┘ └──────┬─────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
┌──────────▼──────────┐
│ cortex │
│ (cortex-gateway) │
│ │
│ Router · Metrics │
│ Evictor · Translate│
└──┬──────┬────────┬──┘
│ │ │
┌──────────▼┐ ┌──▼─────┐ ┌▼──────────┐
│ neuron │ │ neuron │ │ neuron │
│ :13131 │ │ :13131 │ │ :13131 │
│ candle │ │ candle │ │ candle │
└───────────┘ └────────┘ └───────────┘
private network (.internal)
Crates
| Crate | Purpose |
|---|---|
| `cortex-core` | Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types |
| `cortex-gateway` | Axum HTTP server: proxy, router, evictor, poller, metrics exporter |
| `neuron` | Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API |
| `cortex-cli` | CLI entrypoint (`cortex serve`, `cortex status`, etc.) |
Node setup
Each GPU node runs neuron (listening on :13131). Neuron uses
huggingface/candle for in-process inference — there is no external
inference subprocess to manage.
Inside the daemon, every CUDA device gets one dedicated OS thread
(named cuda-dev-N) that owns the device's CUDA context for the
daemon's lifetime. Model loads, forward passes, KV-cache resets,
NCCL collectives, VRAM queries, and unloads all route through that
thread via a job channel; tensors never escape it alive. This pins
context binding to a known thread, makes the CUDA Drop contract
structurally safe, and isolates driver-error poisoning to one worker
rather than the whole process. See CLAUDE.md for the design
rationale and crates/neuron/src/harness/device_worker/ for the code.
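
The one-thread-per-device pattern described above can be sketched with `std::thread` and `std::sync::mpsc`. This is an illustration only: `DeviceWorker`, `DeviceState`, and `Job` are hypothetical names, and a plain struct stands in for the CUDA context and tensors that the real worker owns.

```rust
use std::sync::mpsc;
use std::thread;

/// State that lives on the worker thread for its whole lifetime. In the real
/// daemon this would hold the CUDA context and tensors (assumed stand-in).
struct DeviceState {
    ordinal: usize,
    jobs_run: u64,
}

/// A unit of work executed on the device thread.
type Job = Box<dyn FnOnce(&mut DeviceState) + Send + 'static>;

/// Handle used by other threads to submit work to one device.
struct DeviceWorker {
    tx: mpsc::Sender<Job>,
}

impl DeviceWorker {
    /// Spawn a named OS thread that owns the device state and drains jobs
    /// until every sender has been dropped.
    fn spawn(ordinal: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        thread::Builder::new()
            .name(format!("cuda-dev-{}", ordinal))
            .spawn(move || {
                let mut state = DeviceState { ordinal, jobs_run: 0 };
                for job in rx {
                    job(&mut state);
                    state.jobs_run += 1;
                }
            })?;
        Ok(Self { tx })
    }

    /// Run `f` on the device thread and wait for its result. Only the return
    /// value crosses threads; the state itself never leaves the worker.
    /// Returns `None` if the worker has stopped.
    fn run<T, F>(&self, f: F) -> Option<T>
    where
        T: Send + 'static,
        F: FnOnce(&mut DeviceState) -> T + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        let job: Job = Box::new(move |state: &mut DeviceState| {
            let _ = reply_tx.send(f(state));
        });
        self.tx.send(job).ok()?;
        reply_rx.recv().ok()
    }
}
```

Because jobs are processed strictly in order on a single thread, callers never need to reason about which thread a device context is bound to, and a failure stays confined to that one worker.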
The neuron RPM (helexa-neuron) ships a systemd unit:
dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron
Gateway config
# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"
[eviction]
strategy = "lru" # lru | priority
defrag_after_cycles = 50
[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"
[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"
Model placement profiles live in models.toml — see models.example.toml.
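
As a rough illustration of what `strategy = "lru"` implies, the sketch below picks the least recently used model that is not pinned. It is an assumption-laden example, not the evictor's actual code: `Resident` and `lru_victim` are hypothetical names, and a monotonic tick counter stands in for whatever timestamp the gateway records.

```rust
/// A model currently resident on a node (hypothetical shape).
struct Resident {
    id: String,
    /// Monotonic counter value at the model's most recent request.
    last_used_tick: u64,
    /// Pinned models are never eviction candidates.
    pinned: bool,
}

/// Return the id of the least recently used unpinned model, if any.
/// On ties, the earliest entry in the slice wins.
fn lru_victim(models: &[Resident]) -> Option<&str> {
    models
        .iter()
        .filter(|m| !m.pinned)
        .min_by_key(|m| m.last_used_tick)
        .map(|m| m.id.as_str())
}
```

A `priority` strategy would presumably replace the `min_by_key` criterion with a configured rank; that variant is not shown because the source does not describe it.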
Building
cargo build --release
CI
Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:
cargo fmt --check --all # must be clean
cargo clippy --workspace -- -D warnings # warnings are errors
cargo test --workspace # all tests must pass
Tagged releases (v*) additionally build SRPMs for both cortex and
helexa-neuron and publish to COPR.
Running
# start the gateway
cortex serve --config /etc/cortex/cortex.toml
# check fleet status
cortex status
# list all models across nodes
curl http://localhost:31313/v1/models
License
GPL-3.0