Research: Prototype a quantization kernel in pure Rust via cuda-oxide #2

New Issue

grenade · 2026-05-27T15:30:16Z

grenade commented

2026-05-27 15:30:16 +00:00

Context

cuda-oxide is a custom rustc codegen backend (released by NVlabs 2026-04-22) that compiles #[kernel] functions directly to PTX. End-to-end pure Rust: HTTP handler → kernel arg types → device code, one type system, one cargo build, no C++ layer.

Neuron currently uses zero custom CUDA kernels — all GPU code comes from candle's kernel set (candle-kernels crate, hand-written CUDA C, plus cuBLAS for matmul). The kernels we care about (Q6K / Q5K / Q8_0 ISQ quantization, full-attention prefill, sampling) live upstream in candle. When they're suboptimal — e.g. Q6K from_float is single-threaded per-block despite the block work being embarrassingly parallel (#1) — our only fix path is a candle PR.

The hypothesis

Replacing candle's CUDA kernels with pure-Rust equivalents via cuda-oxide would make neuron the first end-to-end pure-Rust multi-node LLM inference stack. The compounding wins:

Audit story: GPU code is Rust we can read. No "the kernel is somewhere in candle-kernels C++".
Optimisation story: hot-path bottlenecks (like the Q6K parallelisation gap) become local Rust refactors instead of upstream PR cycles.
Type-safety story: the same Rust type system that owns the CUDA context (via the per-device worker pattern) would own the kernel argument types. cuda-oxide monomorphises generic kernels per concrete type, so quant dtype dispatch becomes an impl QuantKernel for Q6K problem instead of a runtime branch.
Differentiator: every other Rust ML stack (candle, burn, tch-rs, mistral.rs) is "Rust on host, foreign-language kernels on device". A fully-Rust kernel layer is, to our knowledge, unprecedented in production-grade open-source LLM inference.

Why this is a research ticket, not a roadmap item

cuda-oxide is 5 weeks old, alpha quality, NVlabs research project (could be archived, rewritten, or productionised — unpredictable at this stage). Realistic full replacement of candle's kernel surface is a 6–12 month rewrite. Doing this without validating the tooling first is reckless.

The right shape is a bounded experiment: pick one kernel, prove the toolchain works, measure delta vs candle, decide based on data.

Concrete starting target: Q6K ISQ quantization

Reasons it's the right first kernel:

It's a documented bottleneck (issue #1 lays out exactly why it's slow).
It's a tight, well-defined block-encoding operation — no attention masking, no cuBLAS interop, no NCCL collectives.
It's CPU-bound today, so a GPU implementation has a clear "is it faster?" answer (it should be, dramatically).
Failure to integrate doesn't break anything — feature-gated, off by default.

Proposed experiment

New crate: helexa-kernels (separate from neuron, separate from cortex-core).
Implement Q6K from_float as a cuda-oxide #[kernel]: input bf16 tensor, output GGML Q6K block layout, one block per thread-block (or warp, depending on block size of 256 elements).
Wire it via a Cargo feature flag (helexa-kernels) into neuron's ISQ load path. Default OFF.
Benchmark on beast (5090 / Blackwell), benjy (4090 / Ada), quadbrat (3060 / Ampere):
- Quantization wall-clock for Qwen3.6-27B q6k load.
- Numerical equivalence (max abs error) vs candle's CPU implementation.
- PTX size and register pressure.
Document findings, including which hardware tiers the cuda-oxide-generated kernel actually runs on.

Validation gates

Proceed to next kernel ONLY if all four pass on the first prototype:

Compiles cleanly with cuda-oxide nightly toolchain in our existing Cargo workspace (no fork of cuda-oxide required).
Runs without segfault on all three hardware tiers (Blackwell 5090, Ada 4090, Ampere 3060).
Numerical output is bit-exact (or within F32-rounding tolerance) of candle's QTensor::quantize.
At least 5× faster than candle's single-threaded CPU path for the Qwen3.6-27B Q6K quantization wall-clock.

If any gate fails, the prototype gets parked and we revisit when cuda-oxide is more mature. The experiment is cheap (1–2 weekends) compared to the wrong-direction cost.

Known risks

Toolchain churn: nightly rustc, LLVM 21+, CUDA 12.x+ — extra moving parts in our COPR build.
Blackwell bleeding edge: cuda-oxide's TMA / tcgen05 / WGMMA intrinsics need LLVM 21+ and we'd be the test case.
Performance ceiling: a custom kernel competes with what we write, not with what NVIDIA's cutlass team wrote. For matmul-shaped ops we'd still want cuBLAS — meaning "pure Rust" leaks at the BLAS boundary in any realistic stack. Quantization is not matmul-shaped, so this kernel is a clean test of the model.
Maintenance overhead: another crate, another build path, another set of hardware-specific bugs to chase.

Out of scope for this ticket

Replacing all of candle's kernel set.
Replacing cuBLAS or NCCL.
Touching the host runtime — we keep cudarc + candle's host side; the experiment is purely about the kernel layer.

Priority

Low. Neuron currently handles real workloads (agent-zero session held up for >20 K tokens). The case for this work is "novel differentiator" rather than "unblocks a user". Defer until either:

Higher-priority work (Claude Code / opencode backend, more model coverage, more node types) is in a stable place, OR
cuda-oxide hits 1.0 / loses the "experimental" label, lowering the toolchain risk, OR
A concrete kernel-level bottleneck we can't fix any other way pushes this up the priority list.

Issue #1 — Q6K parallelisation bottleneck. This experiment's success would unlock the fix locally.

## Context [cuda-oxide](https://github.com/NVlabs/cuda-oxide) is a custom rustc codegen backend (released by NVlabs 2026-04-22) that compiles `#[kernel]` functions directly to PTX. End-to-end pure Rust: HTTP handler → kernel arg types → device code, one type system, one cargo build, no C++ layer. Neuron currently uses **zero custom CUDA kernels** — all GPU code comes from candle's kernel set (`candle-kernels` crate, hand-written CUDA C, plus cuBLAS for matmul). The kernels we care about (Q6K / Q5K / Q8_0 ISQ quantization, full-attention prefill, sampling) live upstream in candle. When they're suboptimal — e.g. Q6K `from_float` is single-threaded per-block despite the block work being embarrassingly parallel ([#1](https://git.lair.cafe/helexa/cortex/issues/1)) — our only fix path is a candle PR. ## The hypothesis Replacing candle's CUDA kernels with pure-Rust equivalents via cuda-oxide would make neuron the first end-to-end pure-Rust multi-node LLM inference stack. The compounding wins: - **Audit story**: GPU code is Rust we can read. No "the kernel is somewhere in candle-kernels C++". - **Optimisation story**: hot-path bottlenecks (like the Q6K parallelisation gap) become local Rust refactors instead of upstream PR cycles. - **Type-safety story**: the same Rust type system that owns the CUDA context (via the per-device worker pattern) would own the kernel argument types. cuda-oxide monomorphises generic kernels per concrete type, so quant dtype dispatch becomes an `impl QuantKernel for Q6K` problem instead of a runtime branch. - **Differentiator**: every other Rust ML stack (candle, burn, tch-rs, mistral.rs) is "Rust on host, foreign-language kernels on device". A fully-Rust kernel layer is, to our knowledge, unprecedented in production-grade open-source LLM inference. ## Why this is a research ticket, not a roadmap item cuda-oxide is **5 weeks old, alpha quality, NVlabs research project** (could be archived, rewritten, or productionised — unpredictable at this stage). Realistic full replacement of candle's kernel surface is a 6–12 month rewrite. Doing this without validating the tooling first is reckless. The right shape is a bounded experiment: pick one kernel, prove the toolchain works, measure delta vs candle, decide based on data. ## Concrete starting target: Q6K ISQ quantization Reasons it's the right first kernel: - It's a documented bottleneck (issue #1 lays out exactly why it's slow). - It's a tight, well-defined block-encoding operation — no attention masking, no cuBLAS interop, no NCCL collectives. - It's CPU-bound today, so a GPU implementation has a clear "is it faster?" answer (it should be, dramatically). - Failure to integrate doesn't break anything — feature-gated, off by default. ## Proposed experiment 1. New crate: `helexa-kernels` (separate from neuron, separate from cortex-core). 2. Implement Q6K `from_float` as a cuda-oxide `#[kernel]`: input bf16 tensor, output GGML Q6K block layout, one block per thread-block (or warp, depending on block size of 256 elements). 3. Wire it via a Cargo feature flag (`helexa-kernels`) into neuron's ISQ load path. Default OFF. 4. Benchmark on beast (5090 / Blackwell), benjy (4090 / Ada), quadbrat (3060 / Ampere): - Quantization wall-clock for Qwen3.6-27B q6k load. - Numerical equivalence (max abs error) vs candle's CPU implementation. - PTX size and register pressure. 5. Document findings, including which hardware tiers the cuda-oxide-generated kernel actually runs on. ## Validation gates Proceed to next kernel ONLY if all four pass on the first prototype: - [ ] Compiles cleanly with cuda-oxide nightly toolchain in our existing Cargo workspace (no fork of cuda-oxide required). - [ ] Runs without segfault on all three hardware tiers (Blackwell 5090, Ada 4090, Ampere 3060). - [ ] Numerical output is bit-exact (or within F32-rounding tolerance) of candle's `QTensor::quantize`. - [ ] At least 5× faster than candle's single-threaded CPU path for the Qwen3.6-27B Q6K quantization wall-clock. If any gate fails, the prototype gets parked and we revisit when cuda-oxide is more mature. The experiment is cheap (1–2 weekends) compared to the wrong-direction cost. ## Known risks - **Toolchain churn**: nightly rustc, LLVM 21+, CUDA 12.x+ — extra moving parts in our COPR build. - **Blackwell bleeding edge**: cuda-oxide's TMA / tcgen05 / WGMMA intrinsics need LLVM 21+ and we'd be the test case. - **Performance ceiling**: a custom kernel competes with what *we* write, not with what NVIDIA's cutlass team wrote. For matmul-shaped ops we'd still want cuBLAS — meaning "pure Rust" leaks at the BLAS boundary in any realistic stack. Quantization is *not* matmul-shaped, so this kernel is a clean test of the model. - **Maintenance overhead**: another crate, another build path, another set of hardware-specific bugs to chase. ## Out of scope for this ticket - Replacing all of candle's kernel set. - Replacing cuBLAS or NCCL. - Touching the host runtime — we keep cudarc + candle's host side; the experiment is purely about the kernel layer. ## Priority **Low.** Neuron currently handles real workloads (agent-zero session held up for >20 K tokens). The case for this work is "novel differentiator" rather than "unblocks a user". Defer until either: 1. Higher-priority work (Claude Code / opencode backend, more model coverage, more node types) is in a stable place, OR 2. cuda-oxide hits 1.0 / loses the "experimental" label, lowering the toolchain risk, OR 3. A concrete kernel-level bottleneck we can't fix any other way pushes this up the priority list. ## Related - Issue #1 — Q6K parallelisation bottleneck. This experiment's success would unlock the fix locally.

grenade commented

2026-06-12 08:57:46 +00:00

Closing as out-of-scope under the sharpened project positioning (README, 2026-06-12): helexa's niche is near-frontier models on consumer hardware, served predictably — not stack purity. "First end-to-end pure-Rust inference stack" is a language-identity goal, and adopting a custom rustc codegen backend (released eight weeks ago) is exactly the kind of foundational maintenance bet the lean-deps principle exists to refuse.

The practical motivation cited here is already served by cheaper paths the project uses today: the Q6K from_float bottleneck (#1) is CPU-side and fixable with rayon or a small candle patch, and when upstream is the obstacle we carry a pinned fork (see the cudarc nccl-comm-abort fork from #17) rather than replacing the layer.

If a hot path someday genuinely needs a custom device kernel that candle cannot express, that is a fresh, narrowly-scoped issue — written against the bottleneck, not the toolchain.

Closing as out-of-scope under the sharpened project positioning (README, 2026-06-12): helexa's niche is near-frontier models on consumer hardware, served predictably — not stack purity. "First end-to-end pure-Rust inference stack" is a language-identity goal, and adopting a custom rustc codegen backend (released eight weeks ago) is exactly the kind of foundational maintenance bet the lean-deps principle exists to refuse. The practical motivation cited here is already served by cheaper paths the project uses today: the Q6K `from_float` bottleneck (#1) is CPU-side and fixable with rayon or a small candle patch, and when upstream is the obstacle we carry a pinned fork (see the cudarc `nccl-comm-abort` fork from #17) rather than replacing the layer. If a hot path someday genuinely needs a custom device kernel that candle cannot express, that is a fresh, narrowly-scoped issue — written against the bottleneck, not the toolchain.

grenade added the out-of-scope label 2026-06-12 08:57:46 +00:00

grenade closed this issue

2026-06-12 08:57:46 +00:00

grenade referenced this issue

2026-06-12 08:58:13 +00:00

Reduce TP=2 Q6K cold-load time for Qwen3.6-27B (~5 min today) #1

grenade referenced this issue

2026-06-12 09:02:07 +00:00

tracking: prioritised path to closing every open issue #27

Sign in to join this conversation.