feat(stage-8d-7): direct safetensors fused-region loader

Replaces load_fused_qkv_slice_2d/_3d with reads from a separate MmapedSafetensors handle. Each per-rank fused tensor is built by reading the three region byte-slices directly from the mmap, concatenating them host-side, and uploading the result as one device allocation. The full fused tensor is never materialised on device.

The prior approach allocated a ~100 MB transient device tensor per linear-attention layer. On Qwen3.6-27B, with 48 linear-attn layers, that is ~4.8 GB of allocator churn during load: enough to fragment the CUDA caching allocator on a tight-VRAM 32 GB consumer GPU, which is what triggered the layer-22 up_proj OOM seen on beast.

Plumbing: the MmapedSafetensors handle is passed down worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both the leader (mod.rs) and the worker (worker.rs) construct their own mmap; Linux's page cache shares the underlying pages between them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
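
The host-side assembly described above can be sketched roughly as follows. This is a std-only illustration, not the code in this commit: `rank_slice` and `build_rank_fused` are hypothetical names, a plain `&[u8]` stands in for the mmapped tensor bytes, and it assumes the three regions are stacked along the first dimension of a row-major tensor and that each region splits evenly across ranks.

```rust
use std::ops::Range;

/// Byte range of one rank's share of a single region inside a fused,
/// row-major tensor. Hypothetical helper; assumes the region's rows divide
/// evenly by `world_size`.
fn rank_slice(
    region_start_row: usize,
    region_rows: usize,
    row_bytes: usize,
    rank: usize,
    world_size: usize,
) -> Range<usize> {
    assert!(world_size > 0 && rank < world_size);
    assert_eq!(region_rows % world_size, 0, "region must split evenly");
    let per_rank = region_rows / world_size;
    let start = (region_start_row + rank * per_rank) * row_bytes;
    start..start + per_rank * row_bytes
}

/// Build the per-rank fused buffer on the host by copying this rank's slice
/// of each of the three regions and concatenating them in order. Only this
/// buffer would then be uploaded to the device.
fn build_rank_fused(
    data: &[u8],            // stand-in for the mmapped bytes of the fused tensor
    region_rows: [usize; 3], // row counts of the three regions, in file order
    row_bytes: usize,
    rank: usize,
    world_size: usize,
) -> Vec<u8> {
    let total_rows: usize = region_rows.iter().sum();
    assert_eq!(data.len(), total_rows * row_bytes, "unexpected tensor size");

    let mut out = Vec::with_capacity(data.len() / world_size);
    let mut start_row = 0;
    for &rows in region_rows.iter() {
        let range = rank_slice(start_row, rows, row_bytes, rank, world_size);
        out.extend_from_slice(&data[range]);
        start_row += rows;
    }
    out
}
```

The only transient allocation in this sketch is the host `Vec`, whose size is 1/world_size of the fused tensor; the device sees a single upload of the already-sliced data.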
2026-05-21 17:49:35 +03:00
parent 89d98d1fb2
commit 8d7b099b36
6 changed files with 282 additions and 97 deletions
--- a/crates/neuron/Cargo.toml
+++ b/crates/neuron/Cargo.toml
@@ -76,6 +76,11 @@ cudarc = { version = "0.19", optional = true, default-features = false, features
 half = { version = "2.5", optional = true }
 tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
 hf-hub = { version = "0.4", features = ["tokio"] }
+# Direct dep on `safetensors` (re-exported by candle but its `TensorView`
+# / `slice::IndexOp` types are public-but-not-re-exported). Used by the
+# tp `fused_load` module to read per-rank slices of fused QKV tensors
+# without materialising the full tensor on device.
+safetensors = "0.7"

 [dev-dependencies]
 tokio = { workspace = true, features = ["test-util"] }