Research: Prototype a quantization kernel in pure Rust via cuda-oxide #2
Context
cuda-oxide is a custom rustc codegen backend (released by NVlabs 2026-04-22) that compiles #[kernel] functions directly to PTX. End-to-end pure Rust: HTTP handler → kernel arg types → device code, one type system, one cargo build, no C++ layer.
Neuron currently uses zero custom CUDA kernels: all GPU code comes from candle's kernel set (candle-kernels crate, hand-written CUDA C, plus cuBLAS for matmul). The kernels we care about (Q6K / Q5K / Q8_0 ISQ quantization, full-attention prefill, sampling) live upstream in candle. When they're suboptimal (e.g. Q6K from_float is single-threaded per-block despite the block work being embarrassingly parallel, see #1), our only fix path is a candle PR.
The hypothesis
Replacing candle's CUDA kernels with pure-Rust equivalents via cuda-oxide would make neuron the first end-to-end pure-Rust multi-node LLM inference stack. The compounding wins:
impl QuantKernel for Q6K problem instead of a runtime branch.
Why this is a research ticket, not a roadmap item
cuda-oxide is 5 weeks old, alpha quality, and an NVlabs research project (it could be archived, rewritten, or productionised; unpredictable at this stage). A realistic full replacement of candle's kernel surface is a 6–12 month rewrite. Doing this without validating the tooling first is reckless.
The right shape is a bounded experiment: pick one kernel, prove the toolchain works, measure delta vs candle, decide based on data.
Concrete starting target: Q6K ISQ quantization
Reasons it's the right first kernel:
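For orientation on this target, here is a minimal sketch of the block layout a Q6K quantization kernel would have to emit, assuming the standard GGML k-quant layout (256 weights per super-block). The field names follow GGML's block_q6_K; the Rust type name BlockQ6K and the helper bits_per_weight are hypothetical and not taken from candle or cuda-oxide.

```rust
// Sketch of the GGML Q6K block layout, assuming the standard k-quant format.
// `BlockQ6K` and `bits_per_weight` are illustrative names, not an existing API.
const QK_K: usize = 256; // weights per super-block

#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct BlockQ6K {
    ql: [u8; QK_K / 2],      // low 4 bits of each 6-bit quant, two per byte
    qh: [u8; QK_K / 4],      // high 2 bits of each 6-bit quant, four per byte
    scales: [i8; QK_K / 16], // one 8-bit scale per group of 16 weights
    d: u16,                  // super-block scale, stored as raw f16 bits
}

/// Storage cost in bits per weight for one packed block.
fn bits_per_weight() -> f64 {
    (std::mem::size_of::<BlockQ6K>() * 8) as f64 / QK_K as f64
}

fn main() {
    println!("block size: {} bytes", std::mem::size_of::<BlockQ6K>());
    println!("bits per weight: {}", bits_per_weight());
}
```

Each block is independent of its neighbours, which is what makes a one-block-per-thread-block (or per-warp) mapping plausible for a first prototype.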
Proposed experiment
helexa-kernels (separate from neuron, separate from cortex-core).
from_float as a cuda-oxide #[kernel]: input bf16 tensor, output GGML Q6K block layout, one block per thread-block (or warp, depending on block size of 256 elements).
helexa-kernels) into neuron's ISQ load path. Default OFF.
Validation gates
Proceed to next kernel ONLY if all four pass on the first prototype:
QTensor::quantize.
If any gate fails, the prototype gets parked and we revisit when cuda-oxide is more mature. The experiment is cheap (1–2 weekends) compared to the wrong-direction cost.
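The gate list is not fully reproduced here. As one possible shape for a correctness gate, assuming it compares the prototype's packed output byte-for-byte against the bytes produced by candle's QTensor::quantize, a host-side check could look like the sketch below. The function name first_mismatched_block and the 210-byte block size (taken from the GGML Q6K layout) are assumptions for illustration; a tolerance-based gate would need a different comparison.

```rust
// Hypothetical parity check between a reference buffer and a candidate buffer
// of packed Q6K blocks. Not part of candle or cuda-oxide.

/// Bytes per packed Q6K block (256 weights), assumed from the GGML layout.
const Q6K_BLOCK_BYTES: usize = 210;

/// Returns the index of the first block that differs, or `None` if the
/// two buffers are byte-identical.
fn first_mismatched_block(reference: &[u8], candidate: &[u8]) -> Option<usize> {
    let by_block = reference
        .chunks(Q6K_BLOCK_BYTES)
        .zip(candidate.chunks(Q6K_BLOCK_BYTES))
        .position(|(a, b)| a != b);
    match by_block {
        Some(i) => Some(i),
        // Common prefix matches but one buffer has extra blocks.
        None if reference.len() != candidate.len() => {
            Some(reference.len().min(candidate.len()) / Q6K_BLOCK_BYTES)
        }
        None => None,
    }
}

fn main() {
    let reference = vec![0u8; 3 * Q6K_BLOCK_BYTES];
    let mut candidate = reference.clone();
    candidate[Q6K_BLOCK_BYTES + 1] = 1; // corrupt one byte in block 1
    match first_mismatched_block(&reference, &candidate) {
        Some(i) => println!("first mismatch in block {i}"),
        None => println!("outputs are byte-identical"),
    }
}
```

Reporting the block index rather than a plain pass/fail makes it easier to pull the offending 256 input values and inspect them by hand.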
Known risks
Out of scope for this ticket
Priority
Low. Neuron currently handles real workloads (agent-zero session held up for >20K tokens). The case for this work is "novel differentiator" rather than "unblocks a user". Defer until either:
Related
Closing as out-of-scope under the sharpened project positioning (README, 2026-06-12): helexa's niche is near-frontier models on consumer hardware, served predictably — not stack purity. "First end-to-end pure-Rust inference stack" is a language-identity goal, and adopting a custom rustc codegen backend (released eight weeks ago) is exactly the kind of foundational maintenance bet the lean-deps principle exists to refuse.
The practical motivation cited here is already served by cheaper paths the project uses today: the Q6K from_float bottleneck (#1) is CPU-side and fixable with rayon or a small candle patch, and when upstream is the obstacle we carry a pinned fork (see the cudarc nccl-comm-abort fork from #17) rather than replacing the layer.
If a hot path someday genuinely needs a custom device kernel that candle cannot express, that is a fresh, narrowly-scoped issue, written against the bottleneck, not the toolchain.
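To illustrate the CPU-side route mentioned above, here is a minimal sketch of per-block quantization fanned out across threads. It makes several simplifying assumptions: it uses std::thread::scope instead of rayon to stay dependency-free, it uses the much simpler Q8_0 scheme (32 weights per block, scale = max |x| / 127) as a stand-in for Q6K, and it stores the scale as f32 where GGML stores f16. All type and function names are hypothetical and do not correspond to candle's API.

```rust
use std::thread;

const BLOCK: usize = 32; // Q8_0 block size in GGML

#[derive(Clone, Debug, PartialEq)]
struct BlockQ8 {
    d: f32,          // per-block scale (GGML stores f16; f32 keeps the sketch std-only)
    qs: [i8; BLOCK], // quantized weights
}

/// Quantize one block: scale so the largest magnitude maps to 127.
fn quantize_block(xs: &[f32]) -> BlockQ8 {
    let amax = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let d = amax / 127.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; BLOCK];
    for (q, x) in qs.iter_mut().zip(xs) {
        *q = (x * id).round() as i8;
    }
    BlockQ8 { d, qs }
}

/// Baseline: one thread walks every block in order.
fn quantize_serial(src: &[f32]) -> Vec<BlockQ8> {
    src.chunks_exact(BLOCK).map(quantize_block).collect()
}

/// Blocks are independent, so contiguous runs of blocks go to separate threads.
fn quantize_parallel(src: &[f32], workers: usize) -> Vec<BlockQ8> {
    let n_blocks = src.len() / BLOCK;
    let mut out = vec![BlockQ8 { d: 0.0, qs: [0; BLOCK] }; n_blocks];
    let per = n_blocks.div_ceil(workers.max(1)).max(1); // blocks per worker
    thread::scope(|s| {
        for (dst, chunk) in out.chunks_mut(per).zip(src.chunks(per * BLOCK)) {
            s.spawn(move || {
                for (o, xs) in dst.iter_mut().zip(chunk.chunks_exact(BLOCK)) {
                    *o = quantize_block(xs);
                }
            });
        }
    });
    out
}

fn main() {
    let src: Vec<f32> = (0..BLOCK * 64).map(|i| (i as f32 * 0.37).sin()).collect();
    let serial = quantize_serial(&src);
    let parallel = quantize_parallel(&src, 4);
    assert_eq!(serial, parallel);
    println!("{} blocks quantized identically", parallel.len());
}
```

Because each output block depends only on its own input values, the parallel result is identical to the serial one, which keeps this kind of fix easy to verify without touching the device layer.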