docs: per-device worker thread architecture (phase 5 of refactor)

Closes the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md.

CLAUDE.md:
- New "Per-device worker thread (neuron)" section under Key design
  decisions, covering the three load-bearing properties (context
  locality, drop safety, poisoning blast radius), the CPU-fallback
  exception, and pointers to the canonical narrative in
  crates/neuron/src/harness/device_worker/mod.rs's module doc-comment.
- New 2026-05-27 addendum dating the migration and naming the four PR
  commits (Phase 1: 081b532, Phase 2: b179204, Phase 3: 76ab24d,
  Phase 4: b4f3576). Same convention as the 2026-04-15 and 2026-05-18
  addenda.

README.md:
- One paragraph in "Node setup" noting the per-device thread pattern
  with a pointer to CLAUDE.md and the device_worker module.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:15:43 +03:00
parent b4f3576d82
commit c4954e0eed
2 changed files with 106 additions and 0 deletions
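
For readers without the repository at hand, the per-device thread pattern that the README paragraph in this commit describes can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `crates/neuron/src/harness/device_worker/`: the names `DeviceWorker`, `Job`, `spawn`, and `run` are hypothetical, the generic state `S` stands in for whatever CUDA context and tensors the real worker holds, and no CUDA or candle calls are made. Only the `cuda-dev-N` thread name and the job-channel idea come from the commit text.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// A job is a boxed closure that runs on the worker thread with exclusive
/// access to that thread's state (hypothetical alias, for illustration).
type Job<S> = Box<dyn FnOnce(&mut S) + Send + 'static>;

/// Handle to one per-device worker thread (hypothetical name).
pub struct DeviceWorker<S> {
    tx: Option<Sender<Job<S>>>,
    handle: Option<JoinHandle<()>>,
}

impl<S: 'static> DeviceWorker<S> {
    /// Spawn a thread named `cuda-dev-N`. `init` runs on the new thread, so
    /// the state is created there and never moves to another thread. Note
    /// that `S` itself does not need to be `Send`.
    pub fn spawn<F>(ordinal: usize, init: F) -> std::io::Result<Self>
    where
        F: FnOnce() -> S + Send + 'static,
    {
        let (tx, rx) = mpsc::channel::<Job<S>>();
        let handle = thread::Builder::new()
            .name(format!("cuda-dev-{ordinal}"))
            .spawn(move || {
                let mut state = init();
                // Loop ends once every sender has been dropped.
                for job in rx {
                    job(&mut state);
                }
                // `state` is dropped here, on the thread that created it.
            })?;
        Ok(Self { tx: Some(tx), handle: Some(handle) })
    }

    /// Run `f` on the worker and block for its result. Returns `None` if
    /// the worker thread has died (for example, because a job panicked).
    pub fn run<R, F>(&self, f: F) -> Option<R>
    where
        R: Send + 'static,
        F: FnOnce(&mut S) -> R + Send + 'static,
    {
        let (rtx, rrx) = mpsc::channel();
        let job: Job<S> = Box::new(move |s: &mut S| {
            // Only the result crosses back; the state stays on the worker.
            let _ = rtx.send(f(s));
        });
        self.tx.as_ref()?.send(job).ok()?;
        rrx.recv().ok()
    }
}

impl<S> Drop for DeviceWorker<S> {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop; then wait for exit so
        // the state is fully dropped before the handle goes away.
        drop(self.tx.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

In this sketch, a call such as `worker.run(|state| ...)` executes the closure on the `cuda-dev-N` thread and returns only the closure's result, which loosely mirrors the "tensors never escape it alive" property. A job that panics takes down that one worker, after which `run` returns `None` for it while other workers are unaffected. How the real module detects and reports a poisoned worker is not shown in this commit.
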
--- a/README.md
+++ b/README.md
@@ -61,6 +61,16 @@ Each GPU node runs `neuron` (listening on `:13131`). Neuron uses
 huggingface/candle for in-process inference — there is no external
 inference subprocess to manage.

+Inside the daemon, every CUDA device gets one dedicated OS thread
+(named `cuda-dev-N`) that owns the device's CUDA context for the
+daemon's lifetime. Model loads, forward passes, KV-cache resets,
+NCCL collectives, VRAM queries, and unloads all route through that
+thread via a job channel; tensors never escape it alive. This pins
+context binding to a known thread, makes the CUDA Drop contract
+structurally safe, and isolates driver-error poisoning to one worker
+rather than the whole process. See `CLAUDE.md` for the design
+rationale and `crates/neuron/src/harness/device_worker/` for the code.
+
 The neuron RPM (`helexa-neuron`) ships a systemd unit:

 ```sh