Files
cortex/crates/neuron
rob thijssen 44ae927e38
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 38s
CI / Test (push) Failing after 45s
CI / Clippy (push) Successful in 2m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5
Replaces the per-token Rust delta-rule loop in
`arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single
dispatch to the `gated_delta_rule_recurrence` kernel imported from
mistralrs in 1ebbe87.

The kernel is V-tiled with compile-time BK (one block per (V-tile,
batch*head), one thread per V-column, BK state floats in registers).
For Qwen3.6's per-rank `(B=1, H=24, D_k=128, D_v=128)` shape this
collapses ~6 candle tensor-op launches per token per layer (each
~50µs CUDA dispatch overhead, so ~300µs/token/layer × 48 linear-
attention layers = 14ms in launch overhead alone) to a single
kernel launch with full ILP / register residency.

New free function `run_delta_rule`:
- cuda branch (when q is on a CUDA device): flattens
  `(B, H, ...)` → `(BH, ...)`, dispatches the kernel via
  `crate::cuda::gdn::gated_delta_rule_recurrence_cuda`, reshapes
  outputs back to `(B, H, L, D_v)` and state to `(B, H, D_k, D_v)`.
- cpu fallback: the original per-token Rust loop, unchanged. Keeps
  cargo test --workspace passing on hosts without cuda.

Dispatch decision lives in the wrapper (`q.device().is_cuda()`).

Build: `cargo build -p neuron --features cuda` compiles + links;
clippy clean on both CPU and cuda paths. 32 lib tests still pass
(none of them exercise this code path on cuda; smoke test for the
TP variant is the deployed Tbilisi probe).

Stage 8d-3 wires the conv1d kernels; 8d-4 the chunked prefill;
8d-5 the same wiring for `tp/tp_qwen3_5.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:39:30 +03:00
..