cortex

Files

fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM

cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync
returns memory to the device's default mempool, not to the OS), so
mem_get_info under-reports free VRAM between requests. With
Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw
~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into
CUBLAS_STATUS_INTERNAL_ERROR depending on quant.

Add a cuda-gated trim_device_pool helper that, after each successful
clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool,
0) against the device's default mempool. Failures (no async-alloc
support, transient driver errors) are non-fatal and log at debug. The
before/after free-VRAM delta is logged so an operator can correlate the
trim with the next request's prefill VRAM.

ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors
correctly; the leak was strictly at the cudarc pool layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 12:36:13 +03:00

src

fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM

2026-05-27 12:36:13 +03:00

tests

refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker

2026-05-27 10:16:02 +03:00

build.rs

feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only

2026-05-21 11:34:11 +03:00

Cargo.toml

feat(stage-8d-7): direct safetensors fused-region loader

2026-05-21 17:49:35 +03:00