Files
cortex/crates/neuron/build.rs
rob thijssen 1ebbe87651
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m23s
build-prerelease / Build neuron-blackwell (push) Has started running
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only
Stage 8d (new): port the Gated DeltaNet CUDA kernels from
EricLBuehler/mistral.rs to close the ~500x decode performance gap
we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path
vs ~37 T/s in mistralrs on the same hardware).

This commit lays the build infrastructure with zero behavioural
change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the
qwen3_5 architecture and TP variant.

Added:
- `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile
  every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda`
  feature, then links it + `cudart`. Mirrors mistralrs's
  `mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same
  sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from
  upstream:
    * `gated_delta_rule_recurrence` (V-tiled per-token decode)
    * `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
    * `causal_conv1d_update` (single-token conv decode)
    * `causal_conv1d_full` (multi-token conv prefill)
    * `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) *
      softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels,
  cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of
  upstream's ffi.rs covering only the five GDN kernels; MoE / SSM /
  top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs` — re-exports + module docs.

Cargo wiring: `cudaforge` added as an optional build-dep, activated
by the `cuda` feature. CPU build is unchanged (the `cuda/` module is
fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the
patched container compiles `gdn.cu` (1 of 1 kernels) and links
clean.

Licensing: upstream files preserve their MIT origin via per-file
comment banners pointing to the mistralrs path. No behaviour-relevant
edits to the .cu kernels — local diff against upstream is just the
banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream;
their structure (module path `crate::cuda::ffi::*`) matches identically
so future kernel imports drop in unchanged.

CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean
inside the runner container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:34:11 +03:00

2.4 KiB