cortex

helexa/cortex

rob thijssen ee663e5e99

fix(stage-8e-2e): bump quant prefill threshold to M > 64

The M > 8 threshold from 8e-2d activated forward_via_f16 on the test
case (M=30) and slightly regressed prefill (143 -> 133 T/s). The
dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms)
eats the cuBLAS GEMM speedup at small M.

Move the crossover to M > 64 so short prefills (typical for the
validate probe) stay on the GGUF GEMV kernel where per-call cost is
comparable but the dequant tax is zero. Long prefills still get the
dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the
fixed dequant cost.
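
The crossover described above can be sketched as a simple dispatch on M (the number of rows in the prefill batch). This is a minimal illustration, not the repository's code: `QUANT_PREFILL_THRESHOLD`, `LinearPath`, and `choose_path` are hypothetical names, and only the `M > 64` rule and the two paths (GGUF GEMV versus dequant to f16 followed by cuBLAS GEMM, i.e. `forward_via_f16`) come from the commit message.

```rust
/// Hypothetical constant name; the commit message only states "M > 64".
const QUANT_PREFILL_THRESHOLD: usize = 64;

/// Hypothetical enum naming the two paths the commit message describes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinearPath {
    /// Stay on the quantized GGUF GEMV kernel: no dequant cost.
    GgufGemv,
    /// Dequantize the weights to f16, then run a cuBLAS GEMM
    /// (the path the commit calls `forward_via_f16`).
    DequantF16Gemm,
}

/// Pick a path from the prefill row count `m`.
/// Strictly greater than the threshold takes the dequant path,
/// so M = 64 itself stays on the GEMV kernel.
fn choose_path(m: usize) -> LinearPath {
    if m > QUANT_PREFILL_THRESHOLD {
        LinearPath::DequantF16Gemm
    } else {
        LinearPath::GgufGemv
    }
}

fn main() {
    // M = 30 is the test case mentioned in the commit message.
    for m in [1, 30, 64, 65, 512] {
        println!("M = {:>3} -> {:?}", m, choose_path(m));
    }
}
```

Under this sketch the M=30 test case stays on the GEMV path, which is the behaviour the threshold bump is meant to restore.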

This doesn't close the gap to mistralrs's 423 T/s on Q5K prefill; that
needs either a dequant cache (which gives up the ISQ memory savings) or
a fused dequant+GEMM kernel. Both are larger projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-21 21:50:45 +03:00

src

fix(stage-8e-2e): bump quant prefill threshold to M > 64

2026-05-21 21:50:45 +03:00

tests

Stage 7a-ii: real NCCL handshake behind the worker pool

2026-05-19 16:40:01 +03:00

build.rs

feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only

2026-05-21 11:34:11 +03:00

Cargo.toml

feat(stage-8d-7): direct safetensors fused-region loader

2026-05-21 17:49:35 +03:00