Stage 7a-ii: real NCCL handshake behind the worker pool

Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"}, so a misconfigured
build fails visibly instead of silently doing nothing.

The NCCL state machine (harness/tp/nccl_state.rs) is shared between
the worker process and the leader's pool:

- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
  opens a CudaContext on the configured device, and calls
  Comm::from_rank with the supplied (rank, world_size, id). NCCL
  blocks until every rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
  asserts every rank reports observed_sum == world_size.
- NCCL handles are serialised under a Mutex; an unsafe impl Send/Sync
  lets the Comm cross spawn_blocking boundaries (NCCL is move-safe;
  only concurrent op issuance is unsafe).

WorkerPool::init_nccl orchestrates the rendezvous:

1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
   concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.

WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.

Worker.send_only / Worker.recv_only are split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls. This is required because NCCL blocks
during init.

Tests:

- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
  "cuda_feature_not_enabled" on the local dev box (no CUDA), or
  accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers the
  real init + sanity round-trip; gated on the cuda-integration feature
  so default CI doesn't try to run NCCL.

Verifiable on beast (2× RTX 5090):

  cargo test -p neuron --features cuda-integration \
    --test tp_worker_lifecycle_cuda

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
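
The wire encoding mentioned in the message (256 hex chars for a 128-byte id) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from harness/tp/nccl_state.rs: the helper names id_to_hex and hex_to_id and the ID_BYTES constant are hypothetical, and the real implementation may validate or report errors differently.

```rust
use std::ffi::c_char;

/// Size of the raw id buffer in bytes; two hex chars per byte gives the
/// 256-char wire form. (Constant name is an assumption for this sketch.)
const ID_BYTES: usize = 128;

/// Encode the raw id bytes as lowercase hex (hypothetical helper).
fn id_to_hex(id: &[c_char; ID_BYTES]) -> String {
    let mut out = String::with_capacity(ID_BYTES * 2);
    for &b in id.iter() {
        // c_char is i8 on some targets and u8 on others; go through u8
        // so each byte always prints as exactly two hex digits.
        out.push_str(&format!("{:02x}", b as u8));
    }
    out
}

/// Decode exactly 256 hex chars back into the raw id bytes
/// (hypothetical helper).
fn hex_to_id(hex: &str) -> Result<[c_char; ID_BYTES], String> {
    if hex.len() != ID_BYTES * 2 {
        return Err(format!(
            "expected {} hex chars, got {}",
            ID_BYTES * 2,
            hex.len()
        ));
    }
    // Reject anything that is not an ASCII hex digit up front. This also
    // guarantees the two-byte slices below fall on char boundaries.
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("comm id contains a non-hex character".to_string());
    }
    let mut id = [0 as c_char; ID_BYTES];
    for (i, slot) in id.iter_mut().enumerate() {
        let pair = &hex[2 * i..2 * i + 2];
        let byte = u8::from_str_radix(pair, 16)
            .map_err(|e| format!("bad hex at byte {i}: {e}"))?;
        *slot = byte as c_char;
    }
    Ok(id)
}
```

In this sketch the leader would pass the result of id_to_hex in the Init message, and each worker would call hex_to_id before building its communicator. Checking the length and the character set before decoding means a truncated or corrupted id is rejected before any NCCL call is made.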
2026-05-19 16:40:01 +03:00
parent 2a7ede0232
commit da068ded6d
7 changed files with 498 additions and 29 deletions
--- a/crates/neuron/Cargo.toml
+++ b/crates/neuron/Cargo.toml
@@ -14,12 +14,16 @@ path = "src/main.rs"

 [features]
 default = []
-# Enables CUDA acceleration in candle. Without this feature, candle
-# compiles for CPU only and Device::new_cuda calls fall back to CPU.
+# Enables CUDA acceleration in candle and the cudarc/nccl bindings the
+# TP worker pool uses. Without this feature, candle compiles for CPU
+# only, Device::new_cuda calls fall back to CPU, and TP Init/sanity
+# requests return Error{kind="cuda_feature_not_enabled"}.
 cuda = [
    "candle-core/cuda",
+    "candle-core/nccl",
    "candle-nn/cuda",
    "candle-transformers/cuda",
+    "dep:cudarc",
 ]
 # Use cuDNN for convolution / attention kernels. Requires CUDA.
 cudnn = [
@@ -60,6 +64,10 @@ toml.workspace = true
 candle-core = "0.10.2"
 candle-nn = "0.10.2"
 candle-transformers = "0.10.2"
+# Direct dep on cudarc (matching candle's transitive version) so the
+# TP worker pool can call cudarc::nccl::{Comm, Id} directly. Gated on
+# the `cuda` feature; same toolchain requirement as candle's CUDA path.
+cudarc = { version = "0.19", optional = true, default-features = false, features = ["nccl", "cuda-version-from-build-system"] }
 tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
 hf-hub = { version = "0.4", features = ["tokio"] }