From c4954e0eedb6292f6da7339f590f75c5d66e6323 Mon Sep 17 00:00:00 2001
From: rob thijssen
Date: Wed, 27 May 2026 11:15:43 +0300
Subject: [PATCH] docs: per-device worker thread architecture (phase 5 of refactor)

Closes the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md.

CLAUDE.md:
- New "Per-device worker thread (neuron)" section under Key design
  decisions, covering the three load-bearing properties (context
  locality, drop safety, poisoning blast radius), the CPU-fallback
  exception, and pointers to the canonical narrative in
  crates/neuron/src/harness/device_worker/mod.rs's module doc-comment.
- New 2026-05-27 addendum dating the migration and naming the four PR
  commits (Phase 1: 081b532, Phase 2: b179204, Phase 3: 76ab24d,
  Phase 4: b4f3576). Same convention as the 2026-04-15 and 2026-05-18
  addenda.

README.md:
- One paragraph in "Node setup" noting the per-device thread pattern
  with a pointer to CLAUDE.md and the device_worker module.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CLAUDE.md | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md | 10 ++++++
 2 files changed, 106 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index 53e7857..385aded 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -84,6 +84,63 @@ Per-request: model, node, prompt_tokens, completion_tokens, total_tokens, tok_per_sec, time_to_first_token_ms, total_latency_ms. Exposed as Prometheus histograms/counters on a separate port.
+### Per-device worker thread (neuron)
+The neuron daemon dedicates one OS thread per CUDA device it loads
+onto. That thread binds the device's `CudaContext` once at startup and
+owns it for the daemon's lifetime; every model load, forward step,
+KV-cache reset, VRAM query, NCCL init/sanity, NCCL all_reduce, and
+model drop on that device routes through this thread via a
+`std::sync::mpsc` job channel. Replies cross back via
+`tokio::sync::oneshot`.
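The routing described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `device_worker/`: the `Job` variants, `spawn_worker`, `query_vram`, and `worker_thread_name` are hypothetical names, the VRAM numbers are placeholders, no CUDA call is made, and a one-shot `std::sync::mpsc` channel stands in for the `tokio::sync::oneshot` reply path so the sketch has no external dependencies.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical job type for illustration; the real `Job` enum is not shown
// in this document and may look quite different.
enum Job {
    // Ask for (free, total) VRAM. The sender stands in for a oneshot reply.
    VramQuery { reply: mpsc::Sender<(u64, u64)> },
    // Ask which OS thread is executing the job.
    WhoAmI { reply: mpsc::Sender<String> },
    Shutdown,
}

// Spawn one named thread for a device and return the job sender plus the
// join handle. A real worker would bind the CUDA context once, at the top
// of the closure, before entering the loop.
fn spawn_worker(device_index: usize) -> (mpsc::Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::Builder::new()
        .name(format!("cuda-dev-{device_index}"))
        .spawn(move || {
            // Every job for this device is executed here, on this thread.
            for job in rx {
                match job {
                    Job::VramQuery { reply } => {
                        // Placeholder values; this sketch never touches a GPU.
                        let _ = reply.send((1024, 4096));
                    }
                    Job::WhoAmI { reply } => {
                        let name = thread::current().name().unwrap_or("unnamed").to_string();
                        let _ = reply.send(name);
                    }
                    Job::Shutdown => break,
                }
            }
        })
        .expect("failed to spawn worker thread");
    (tx, handle)
}

// Submit a VRAM query and block until the worker replies.
// Returns None if the worker has already exited.
fn query_vram(tx: &mpsc::Sender<Job>) -> Option<(u64, u64)> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Job::VramQuery { reply: reply_tx }).ok()?;
    reply_rx.recv().ok()
}

// Ask the worker for the name of the thread that ran the job.
fn worker_thread_name(tx: &mpsc::Sender<Job>) -> Option<String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Job::WhoAmI { reply: reply_tx }).ok()?;
    reply_rx.recv().ok()
}
```

A caller would hold the `Sender<Job>` returned by `spawn_worker`, call `query_vram(&tx)` from any thread, and send `Job::Shutdown` before joining. Because only plain values cross the channel, nothing that depends on the thread-bound context can leave the worker.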
+
+Three properties this gives us, in order of weight:
+
+1. **Context locality.** cudarc binds the CUDA context per OS thread
+   via `cuCtxSetCurrent`. Before this refactor, ad-hoc
+   `tokio::task::spawn_blocking` calls bound the context onto a
+   different thread per request — and `device_vram_mb()` from an
+   async task bound it onto whichever tokio worker happened to be
+   running. Pinning the context to one named thread ends that.
+2. **Drop safety.** Every `CudaSlice` in a `Tensor`, every
+   `cudarc::nccl::Comm`, and the `CudaContext` itself call `cuMemFree` /
+   `ncclCommDestroy` / `cuCtxDestroy` during `Drop` — and require the
+   right context current. With the worker owning the model slab,
+   `Drop` always runs on the right thread. The cudarc Drop constraint
+   is structurally enforced.
+3. **Poisoning blast radius.** When a CUDA driver error makes the
+   context unrecoverable, the poison flag lives on the
+   `DeviceWorkerHandle` itself. Subsequent `submit()` calls fast-reject
+   at the channel boundary with a clear "device worker is poisoned"
+   error before any further CUDA work is attempted. The thread doesn't
+   exit (dropping the slab would re-touch the broken context) — it
+   enters a drain-only mode and replies error to everything until the
+   daemon restarts.
+
+Tensors never escape the worker thread alive. Inference replies carry
+`Vec` CPU-side logits; the async caller wraps them in a CPU
+candle tensor and runs `apply_repeat_penalty` + `LogitsProcessor::sample`
+without ever rebinding the device context. Sampled tokens come back as
+`u32`; VRAM queries as `(u64, u64)`. The opaque `ArchHandle(u64)` and
+`TpHandle(u64)` are the only "references" callers hold to loaded
+models — they're indices into the worker's state slab, not pointers.
+
+The TP worker subprocesses in `harness/tp/worker.rs` are the same
+pattern out-of-process — a dedicated context-owning process per
+non-zero NCCL rank. The in-process worker in `harness/device_worker/`
+brings the discipline to rank 0.
+
+CPU loads (`Device::Cpu` fallback when CUDA is unavailable) keep the
+legacy `tokio::task::spawn_blocking + Arc>` path —
+there's no context to own and the channel hop would only add latency.
+Four `spawn_blocking` references in `harness/candle.rs` are deliberate
+CPU fallback.
+
+Canonical narrative lives in
+`crates/neuron/src/harness/device_worker/mod.rs`'s module
+doc-comment; touch points (the `Job` enum, the dispatch handlers, the
+`DeviceWorkerState` struct) are in the sibling `jobs.rs` and
+`dispatch.rs`.
+
 ## Tech stack
 - **Rust 2024 edition** — workspace with 4 crates
@@ -658,3 +715,42 @@ longer in scope for helexa. ~~Originally planned to ship CUDA-versioned mistral.rs RPMs.~~ Replaced by the candle harness work in the 2026-05-18 addendum above. With mistral.rs out of the dependency tree, there is nothing to package.
+
+## 2026-05-27 addendum: per-device worker thread
+
+Replaced the ad-hoc `tokio::task::spawn_blocking` pattern that drove
+every leader-side CUDA op with one dedicated OS thread per CUDA device,
+permanently bound to that device's `CudaContext`. All leader-side
+inference work (GGUF + dense + TP shard load, forward, kv-cache clear,
+NCCL init/sanity, NCCL all_reduce, VRAM query, model drop) routes
+through the worker via a `std::sync::mpsc` channel; tensors never
+escape the worker thread alive. See "Per-device worker thread (neuron)"
+above and `crates/neuron/src/harness/device_worker/mod.rs` for the
+canonical narrative.
+
+Motivated by the 2026-05-26 silent-hang on beast: a CUDA OOM cascade
+poisoned the device context on whichever spawn_blocking thread caught
+it, and subsequent requests stalled invisibly on the pool lock. After
+the refactor, the same failure mode shows up in journalctl as
+`prefill sample failed; logits unhealthy nan: 248320/248320` followed
+by `failed, model marked poisoned`. The thread stays alive and rejects
+subsequent requests at the channel boundary.
+
+Landed in four PRs:
+
+- **Phase 1** (`081b532`) — device_worker module + 8 VRAM-query sites
+  route through the worker. CPU build only; smoke on beast confirmed
+  a persistent `cuda-dev-0` thread.
+- **Phase 2** (`b179204`) — single-GPU forward + clear_kv + drop via
+  the worker. `LoadedModel.arch_handle: Option` replaces
+  `Arc>` for CUDA loads. CPU keeps the legacy path.
+- **Phase 3** (`76ab24d`) — TP forward + NCCL init/sanity + leader
+  KV-clear routed through the worker. `WorkerPool.leader_nccl` moves
+  into the worker's state. `TpLoadedModel.leader_handle: TpHandle`
+  replaces `Arc>`. CUDA-only TP smoke deferred to
+  next deploy.
+- **Phase 4** (`b4f3576`) — GGUF + dense + TP shard loads move onto
+  the worker. The `Job::TransferIn` / `Job::CloneLeaderComm` bridges
+  from Phases 2/3 deleted; `SendComm` newtype no longer needed in the
+  load path. `grep -rn spawn_blocking crates/neuron/src/harness/`
+  returns only deliberate CPU-fallback hits after this PR.
diff --git a/README.md b/README.md
index 9425cda..affa586 100644
--- a/README.md
+++ b/README.md
@@ -61,6 +61,16 @@ Each GPU node runs `neuron` (listening on `:13131`). Neuron uses huggingface/candle for in-process inference — there is no external inference subprocess to manage.
+Inside the daemon, every CUDA device gets one dedicated OS thread
+(named `cuda-dev-N`) that owns the device's CUDA context for the
+daemon's lifetime. Model loads, forward passes, KV-cache resets,
+NCCL collectives, VRAM queries, and unloads all route through that
+thread via a job channel; tensors never escape it alive. This pins
+context binding to a known thread, makes the CUDA Drop contract
+structurally safe, and isolates driver-error poisoning to one worker
+rather than the whole process. See `CLAUDE.md` for the design
+rationale and `crates/neuron/src/harness/device_worker/` for the code.
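The poisoning isolation mentioned above can be pictured as a flag checked before anything is enqueued. The sketch below is an assumption-laden illustration rather than the actual `DeviceWorkerHandle`: `WorkerHandle`, `SubmitError`, `new_handle`, and `mark_poisoned` are hypothetical names, and a bare `u32` stands in for a real job.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;

// Hypothetical error type for illustration only.
#[derive(Debug, PartialEq)]
enum SubmitError {
    // The device context is unrecoverable; no CUDA work should be attempted.
    Poisoned,
    // The worker thread has exited and dropped its receiver.
    WorkerGone,
}

// Hypothetical handle: a job sender plus a poison flag shared with the worker.
struct WorkerHandle {
    jobs: mpsc::Sender<u32>,
    poisoned: Arc<AtomicBool>,
}

impl WorkerHandle {
    // Reject at the channel boundary when poisoned, so the job never
    // reaches the thread that owns the broken context.
    fn submit(&self, job: u32) -> Result<(), SubmitError> {
        if self.poisoned.load(Ordering::Acquire) {
            return Err(SubmitError::Poisoned);
        }
        self.jobs.send(job).map_err(|_| SubmitError::WorkerGone)
    }

    // In a real worker this would be set after an unrecoverable driver
    // error; here it is exposed directly to keep the sketch small.
    fn mark_poisoned(&self) {
        self.poisoned.store(true, Ordering::Release);
    }
}

// Build a handle and hand back the receiving end a worker thread would own.
fn new_handle() -> (WorkerHandle, mpsc::Receiver<u32>) {
    let (tx, rx) = mpsc::channel();
    let handle = WorkerHandle {
        jobs: tx,
        poisoned: Arc::new(AtomicBool::new(false)),
    };
    (handle, rx)
}
```

Keeping the flag per handle means one device's failure rejects only that device's jobs; handles for other devices keep accepting work.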
+
+The neuron RPM (`helexa-neuron`) ships a systemd unit:
 ```sh