cortex/crates
rob thijssen 34f9b77d9d
feat(stage-8e-2d): route quantized matmul by M (prefill vs decode)
MaybeQuantLinear::forward picks between two QMatMul paths based on M,
the number of input rows (tokens) in the call:

- M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight
  once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost
  is fixed per call, so it's amortised across the M tokens.
- M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel
  on the quantized blocks directly. That kernel requires f32 inputs,
  so this arm still casts in and out at the boundary.
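
The routing rule above can be sketched in plain Rust. This is a minimal
illustration, not the code in this commit: MatmulPath, DECODE_MAX_M, rows
and choose_path are hypothetical names, and treating M as the product of
every dimension except the last is an assumption about how the input is
flattened. Only the threshold of 8 and the two paths come from the
description above.

```rust
/// Which kernel a quantized linear layer would use for one forward call.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatmulPath {
    /// Prefill: dequantise the weight to f16 once, then run a dense GEMM.
    DequantF16Gemm,
    /// Decode: run the GGUF GEMV kernel on the quantized blocks (f32 inputs).
    GgufGemv,
}

/// Largest M still routed to the GEMV path (the threshold named above).
pub const DECODE_MAX_M: usize = 8;

/// Assumption: M is the product of all dimensions except the last one,
/// e.g. [batch, seq, hidden] gives M = batch * seq.
pub fn rows(shape: &[usize]) -> usize {
    match shape.split_last() {
        // An empty `leading` slice yields 1, so a bare [hidden] vector has M = 1.
        Some((_, leading)) => leading.iter().product(),
        // A shape with no dimensions has no rows to multiply.
        None => 0,
    }
}

/// Pick the matmul path from the row count M.
pub fn choose_path(m: usize) -> MatmulPath {
    if m > DECODE_MAX_M {
        MatmulPath::DequantF16Gemm
    } else {
        MatmulPath::GgufGemv
    }
}
```

Under these assumptions a single-token decode step with shape [1, 1, 4096]
has M = 1 and takes the GEMV path, while a 512-token prompt with shape
[1, 512, 4096] has M = 512 and takes the f16 GEMM path.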

Earlier, 8e-2c sent everything through the GGUF GEMV kernel, which
is excellent at GEMV (decode) but has no real batched GEMM path, so
prefill regressed ~4x. This change restores prefill to roughly the
bf16 cuBLAS GEMM throughput while keeping the decode gain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:15:32 +03:00