refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker

Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key. - `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl*` / `Job::Tp*` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. - `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:16:02 +03:00
parent b179204fd3
commit 76ab24d98c
8 changed files with 676 additions and 138 deletions
--- a/crates/neuron/tests/tp_worker_lifecycle.rs
+++ b/crates/neuron/tests/tp_worker_lifecycle.rs
@@ -5,6 +5,7 @@
 //! `Init` and `NcclSanityCheck` are stubbed in 7a-i, so this test
 //! runs on any host the workspace builds on.

+use neuron::harness::device_worker::DeviceWorkerHandle;
 use neuron::harness::tp::{WorkerPool, rpc::WorkerResponse};

 /// Path to the neuron binary built by cargo for this test process.
@@ -19,7 +20,8 @@ const NEURON_BIN: &str = env!("CARGO_BIN_EXE_neuron");
 async fn test_spawn_ping_shutdown() {
    // cuda_devices: rank 0 → device 0 (leader, unused here),
    //               rank 1 → device 1 (worker; not actually opened in 7a-i).
-    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 2, &[0, 1])
+    let leader_worker = DeviceWorkerHandle::spawn(0).expect("spawn device worker");
+    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 2, &[0, 1], leader_worker)
        .await
        .expect("spawn worker pool");

@@ -44,7 +46,8 @@ async fn test_spawn_ping_shutdown() {
 /// Three workers — exercise the loop in `ping_all` / `shutdown`.
 #[tokio::test]
 async fn test_spawn_three_workers() {
-    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 3, &[0, 1, 2])
+    let leader_worker = DeviceWorkerHandle::spawn(0).expect("spawn device worker");
+    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 3, &[0, 1, 2], leader_worker)
        .await
        .expect("spawn worker pool");

--- a/crates/neuron/tests/tp_worker_lifecycle_cuda.rs
+++ b/crates/neuron/tests/tp_worker_lifecycle_cuda.rs
@@ -25,7 +25,9 @@ async fn test_init_and_sanity_check_two_ranks() {
        .try_init();

    // 2 ranks: leader = rank 0 on device 0, worker = rank 1 on device 1.
-    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 2, &[0, 1])
+    let leader_worker = neuron::harness::device_worker::DeviceWorkerHandle::spawn(0)
+        .expect("spawn leader device worker");
+    let mut pool = WorkerPool::spawn(NEURON_BIN.as_ref(), 2, &[0, 1], leader_worker)
        .await
        .expect("spawn worker pool");