Files
cortex/crates/neuron
rob thijssen 1385979e3d feat(neuron,candle): log per-device VRAM at chat_completion start
Every "starting" log line now carries vram_free_mb / vram_total_mb for
the request's serving device (the leader device on TP). On the 2026-05-26
incident this would have made the 14k-token prefill OOM diagnosable from
the first log line: with ~412 MB free, that prompt was never going to
fit, and the operator could have caught the imbalance before the CUDA
context got poisoned.

`device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is
kept separate to avoid coupling the inference path to the TP module.
TpLoadedModel gains a `leader_device: Device` clone so the request
path reads the device without locking the leader model (which would
contend with an in-flight forward).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:26:23 +03:00
..