cortex/Cargo.lock at 4aa71902d081d16176b6a84576e7568ff00d624d

helexa/cortex

rob thijssen 8d7b099b36

feat(stage-8d-7): direct safetensors fused-region loader

Replaces load_fused_qkv_slice_2d/_3d with reads from a separate
MmapedSafetensors handle. Each per-rank fused tensor is built by
reading the three region byte-slices directly from the mmap,
concatenating them host-side, and uploading the result as one
device allocation, so the full fused tensor is never
materialised on the device.
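
A minimal host-side sketch of the region read, not the code in this
commit. It assumes a row-major fused tensor whose three regions are
stacked along dim 0 and split evenly across ranks; rank_shard_bytes,
build_rank_fused and their parameters are hypothetical names, and a
plain byte slice stands in for the mmap:

```rust
use std::ops::Range;

/// Byte range of one rank's shard of a region inside a row-major fused
/// tensor. Hypothetical helper: assumes regions are stacked along dim 0
/// and each region's rows divide evenly across ranks.
fn rank_shard_bytes(
    region_start_row: usize,
    region_rows: usize,
    cols: usize,
    elem_size: usize,
    rank: usize,
    world_size: usize,
) -> Option<Range<usize>> {
    if world_size == 0 || rank >= world_size || region_rows % world_size != 0 {
        return None;
    }
    let rows_per_rank = region_rows / world_size;
    let row_bytes = cols * elem_size;
    let start = (region_start_row + rank * rows_per_rank) * row_bytes;
    Some(start..start + rows_per_rank * row_bytes)
}

/// Concatenate this rank's shard of each of the three regions on the
/// host. The returned buffer is what would be uploaded as a single
/// device allocation; `fused` stands in for the mmap'd tensor bytes.
fn build_rank_fused(
    fused: &[u8],
    region_rows: [usize; 3],
    cols: usize,
    elem_size: usize,
    rank: usize,
    world_size: usize,
) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut start_row = 0;
    for rows in region_rows {
        let range = rank_shard_bytes(start_row, rows, cols, elem_size, rank, world_size)?;
        // `get` returns None instead of panicking on a short buffer.
        out.extend_from_slice(fused.get(range)?);
        start_row += rows;
    }
    Some(out)
}
```

Only the per-rank bytes are copied, so the transient host buffer is
1/world_size of the fused tensor and nothing full-sized reaches the
device.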

The prior approach allocated a ~100 MB transient device tensor
per linear-attention layer; on Qwen3.6-27B with 48 linear-attn
layers that's ~4.8 GB (48 x ~100 MB) of allocator churn during
load. That is enough to fragment the CUDA caching allocator on
a tight-VRAM 32 GB consumer GPU, which is what triggered the
layer-22 up_proj OOM seen on beast.

Plumbing: the MmapedSafetensors handle is passed worker →
ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both
leader (mod.rs) and worker (worker.rs) construct their own mmap;
Linux's page cache shares the underlying pages, so the file is
not duplicated in RAM.
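
A sketch of that hand-off, assuming the handle is passed down by
shared reference. WeightsHandle and every signature below are
hypothetical stand-ins that only mirror the type names above; they
are not the real definitions:

```rust
/// Hypothetical stand-in for the read-only mmap handle.
struct WeightsHandle {
    bytes: Vec<u8>,
}

struct GatedDeltaNet {
    fused_len: usize,
}

impl GatedDeltaNet {
    /// The innermost loader is the only one that reads from the handle.
    fn load(weights: &WeightsHandle, layer: usize, layer_bytes: usize) -> Option<Self> {
        let start = layer * layer_bytes;
        let slice = weights.bytes.get(start..start + layer_bytes)?;
        Some(Self { fused_len: slice.len() })
    }
}

struct DecoderLayer {
    attn: GatedDeltaNet,
}

impl DecoderLayer {
    /// Intermediate levels just forward the borrowed handle.
    fn load(weights: &WeightsHandle, layer: usize, layer_bytes: usize) -> Option<Self> {
        Some(Self { attn: GatedDeltaNet::load(weights, layer, layer_bytes)? })
    }
}

struct Model {
    layers: Vec<DecoderLayer>,
}

impl Model {
    fn load(weights: &WeightsHandle, n_layers: usize, layer_bytes: usize) -> Option<Self> {
        // Collecting into Option<Vec<_>> stops at the first failed layer.
        let layers = (0..n_layers)
            .map(|i| DecoderLayer::load(weights, i, layer_bytes))
            .collect::<Option<Vec<_>>>()?;
        Some(Self { layers })
    }
}
```

Borrowing keeps a single owner per process, which matches leader and
worker each constructing their own mapping.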

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-21 17:49:35 +03:00

104 KiB
