rob thijssen cb303832bc
Some checks failed
CI / CUDA type-check (push) Failing after 58s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 40s
build-prerelease / Build neuron-ampere (push) Failing after 1s
CI / Clippy (push) Successful in 2m37s
build-prerelease / Build cortex binary (push) Successful in 4m47s
CI / Test (push) Successful in 6m13s
build-prerelease / Build neuron-blackwell (push) Failing after 5m34s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ada (push) Failing after 7m20s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
feat(neuron): render the model's chat_template with chat_template_kwargs
Closes #9.

Replaces the hardcoded `format_qwen3_prompt` ChatML glue with
`minijinja`-driven rendering of the model's own `chat_template`
from `tokenizer_config.json`. The request's `chat_template_kwargs`
flow into the Jinja context so model-specific levers
(Qwen3's `enable_thinking: false`, etc.) actually take effect.

## Implementation

- New `harness::chat_template` module with three entry points:
  - `load_chat_template_alongside(tokenizer_json_path)` — probes
    `tokenizer_config.json` in the same hf-hub snapshot directory.
    Supports both the canonical string-form `chat_template` and
    the array-form some tokenizers ship (multi-template models).
  - `render_chat_template(template, messages, tools, kwargs)` —
    renders via `minijinja`. Messages flatten into the
    `[{role, content}]` shape HF templates iterate, with
    per-message extras (`tool_calls`, `tool_call_id`) preserved.
    `tools` and `kwargs` add into the Jinja context so templates
    that reference them work without us interpreting their shape.
  - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE`
    (default true). Falsy values force the fallback path
    everywhere — a kill switch for emergency rollback without a
    rebuild.

- `LoadedModel.chat_template: Option<String>` and the TP
  equivalent are populated once at load time. `None` (no
  tokenizer_config.json, parse error, missing field) routes the
  fallback path silently; logs go through `tracing::debug`/`warn`
  per condition.

- New `build_prompt_for_request(chat_template, request)` wraps
  the decision: when both the template is present AND the kill
  switch is off, render with kwargs from `request.extra` (looks
  up `chat_template_kwargs` and `tools` lazily). On render error
  → warn + fallback to `format_qwen3_prompt`. Wired into all four
  current prompt-build sites (single-GPU stream + non-stream, TP
  stream + non-stream).

## Dependency

`minijinja = "2"` with the `builtins`, `json`, and `serde`
features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used
internally by HF's `tokenizers-rs` for its own chat templating;
the API surface we touch (`Environment::add_template` +
`Template::render(serde_value)`) is stable.

## Validation strategy

I can't byte-compare the new path's output against
`format_qwen3_prompt` for live models without GPU (CI doesn't
have one). The fallback path and kill switch are the mitigations
— a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the
neuron service env if the chat template renders surprisingly on
Qwen3-8B in production. The legacy formatter stays the
fail-closed default.

## Scope cuts (documented in module header)

- Tool-definition lifting from helexa-acp's system-prompt
  injection into the chat_template's native tools block is
  deferred. Today the request's `tools` array threads into the
  Jinja context, but helexa-acp continues to inject Hermes-format
  tool descriptions into the system prompt for backwards-compat
  with non-cortex endpoints.

## Tests

9 unit tests in `chat_template`: kill-switch matrix (truthy /
falsy / unset), template loading (string form, array form,
missing file, unparseable JSON, missing field), rendering
(basic conversation threading, kwargs forwarding, message-extras
threading for tool_calls).

215 workspace tests pass; clippy + fmt clean across all workspace
features (default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 23:43:11 +03:00

cortex

A Rust reverse-proxy and fleet management layer for multi-node GPU inference clusters. Cortex sits in front of one or more neuron daemons (each running candle-based inference on a local GPU host) and presents a unified OpenAI + Anthropic compatible API surface.

Problem

Running local LLMs across multiple GPU nodes (different VRAM tiers, different model affinities) requires a unified API surface that:

  • Presents a single /v1/models catalogue merging every model that can be served by any neuron in the fleet.
  • Routes requests to the correct node based on where a model is loaded (or can be loaded), handling cold-load and eviction transparently.
  • Manages model lifecycle — load on demand, unload cold models, pin critical ones — by calling each neuron's /models/{load,unload} API.
  • Translates between OpenAI and Anthropic request/response envelopes so every client speaks whichever dialect it prefers.
  • Captures per-request metrics (tokens, tok/s, TTFT, latency) and exposes them as Prometheus counters/histograms.

Architecture

┌──────────────┐  ┌──────────┐  ┌────────────┐  ┌────────────┐
│ Claude Code  │  │ Zed/IDE  │  │ Tidal / mm │  │ curl / etc │
└──────┬───────┘  └─────┬────┘  └──────┬─────┘  └──────┬─────┘
       │                │              │               │
       └────────────────┴──────┬───────┴───────────────┘
                               │
                    ┌──────────▼──────────┐
                    │      cortex         │
                    │  (cortex-gateway)   │
                    │                     │
                    │  Router · Metrics   │
                    │  Evictor · Translate│
                    └──┬──────┬────────┬──┘
                       │      │        │
            ┌──────────▼┐  ┌──▼─────┐  ┌▼──────────┐
            │  neuron   │  │ neuron │  │  neuron   │
            │  :13131   │  │ :13131 │  │  :13131   │
            │  candle   │  │ candle │  │  candle   │
            └───────────┘  └────────┘  └───────────┘
                  private network (.internal)

Crates

Crate Purpose
cortex-core Shared types: config, node/model state, metrics, OpenAI/Anthropic envelopes, harness trait, discovery types
cortex-gateway Axum HTTP server: proxy, router, evictor, poller, metrics exporter
neuron Per-node daemon: GPU discovery, in-process candle inference, model lifecycle API
cortex-cli CLI entrypoint (cortex serve, cortex status, etc.)

Node setup

Each GPU node runs neuron (listening on :13131). Neuron uses huggingface/candle for in-process inference — there is no external inference subprocess to manage.

Inside the daemon, every CUDA device gets one dedicated OS thread (named cuda-dev-N) that owns the device's CUDA context for the daemon's lifetime. Model loads, forward passes, KV-cache resets, NCCL collectives, VRAM queries, and unloads all route through that thread via a job channel; tensors never escape it alive. This pins context binding to a known thread, makes the CUDA Drop contract structurally safe, and isolates driver-error poisoning to one worker rather than the whole process. See CLAUDE.md for the design rationale and crates/neuron/src/harness/device_worker/ for the code.

The neuron RPM (helexa-neuron) ships a systemd unit:

dnf copr enable helexa/helexa
dnf install helexa-neuron
systemctl enable --now neuron

Gateway config

# /etc/cortex/cortex.toml
[gateway]
listen = "0.0.0.0:31313"
metrics_listen = "0.0.0.0:31314"

[eviction]
strategy = "lru"        # lru | priority
defrag_after_cycles = 50

[[neurons]]
name = "beast"
endpoint = "http://beast.internal:13131"

[[neurons]]
name = "benjy"
endpoint = "http://benjy.internal:13131"

Model placement profiles live in models.toml — see models.example.toml.

Building

cargo build --release

CI

Every push triggers format, lint, and test checks. Ensure these pass locally before pushing:

cargo fmt --check --all                    # must be clean
cargo clippy --workspace -- -D warnings   # warnings are errors
cargo test --workspace                     # all tests must pass

Tagged releases (v*) additionally build SRPMs for both cortex and helexa-neuron and publish to COPR.

Running

# start the gateway
cortex serve --config /etc/cortex/cortex.toml

# check fleet status
cortex status

# list all models across nodes
curl http://localhost:31313/v1/models

License

GPL-3.0

Description
No description provided
Readme GPL-3.0 5.2 MiB
Languages
Rust 96.9%
Cuda 1.7%
Shell 1.1%
Python 0.3%