feat(stage-8d-7): direct safetensors fused-region loader

Replaces load_fused_qkv_slice_2d/_3d with reads from a separate
MmapedSafetensors handle. Each per-rank fused tensor is built by
reading the three region byte-slices directly from the mmap,
concatenating them host-side, and uploading the result as one
device allocation, so the full fused tensor is never materialised
on the device.
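
The host-side gather described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the code in
this commit: `Region`, `gather_rank_shard`, the row-major layout, and the
even split of each region's rows across ranks are all hypothetical, and a
plain byte slice stands in for the mmap'd tensor data.

```rust
/// Hypothetical description of one region inside a fused, row-major tensor:
/// the first row it occupies and how many rows it spans.
#[derive(Clone, Copy, Debug)]
struct Region {
    first_row: usize,
    rows: usize,
}

/// Build one rank's fused shard by copying that rank's rows out of each of
/// the three regions and concatenating them host-side.
///
/// `data` stands in for the mmap'd bytes of the fused tensor (assumption);
/// `row_bytes` is the size of one row in bytes. The returned buffer is what
/// would then be uploaded as a single device allocation.
fn gather_rank_shard(
    data: &[u8],
    row_bytes: usize,
    regions: &[Region; 3],
    rank: usize,
    world_size: usize,
) -> Result<Vec<u8>, String> {
    if world_size == 0 || rank >= world_size {
        return Err(format!("invalid rank {} for world size {}", rank, world_size));
    }

    let mut out = Vec::new();
    for region in regions {
        // Assumption: every region splits evenly across ranks.
        if region.rows % world_size != 0 {
            return Err(format!(
                "region of {} rows is not divisible by world size {}",
                region.rows, world_size
            ));
        }
        let rows_per_rank = region.rows / world_size;
        let start = (region.first_row + rank * rows_per_rank) * row_bytes;
        let end = start + rows_per_rank * row_bytes;

        // Bounds-checked read of this rank's rows within the region.
        let slice = data
            .get(start..end)
            .ok_or_else(|| format!("byte range {}..{} is out of bounds", start, end))?;
        out.extend_from_slice(slice);
    }
    Ok(out)
}
```

Only the per-rank rows are ever copied, so the largest transient buffer is
the size of one rank's fused shard rather than the whole fused tensor.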

The prior approach allocated a ~100 MB transient device tensor
per linear-attention layer. On Qwen3.6-27B, with 48 linear-attn
layers, that is ~4.8 GB of allocator churn during load: enough
to fragment the CUDA caching allocator on a tight-VRAM 32 GB
consumer GPU, which is what triggered the layer-22 up_proj
OOM seen on beast.

Threading: the MmapedSafetensors handle is passed down through
worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load.
Both the leader (mod.rs) and the worker (worker.rs) construct their
own mmap; the Linux page cache shares the underlying pages, so the
file contents are not duplicated in RAM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 17:49:35 +03:00
parent 89d98d1fb2
commit 8d7b099b36
6 changed files with 282 additions and 97 deletions

Cargo.lock generated

@@ -2120,6 +2120,7 @@ dependencies = [
"half",
"hf-hub",
"reqwest",
"safetensors 0.7.0",
"serde",
"serde_json",
"thiserror 2.0.18",