Stage 7a-ii: real NCCL handshake behind the worker pool
Some checks failed
CI / Format (push) Failing after 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build neuron-blackwell (push) Failing after 3m33s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Failing after 4m19s
build-prerelease / Build neuron-ada (push) Failing after 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"} so a misconfigured
build is obvious instead of being a silent no-op.
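
A minimal sketch of how such a feature gate could look, assuming two cfg-selected functions with the same signature. TpError and the free function init_nccl are hypothetical names for illustration; only the kind string "cuda_feature_not_enabled" comes from this commit.

```rust
/// Hypothetical error shape; only the `kind` value is taken from the commit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TpError {
    pub kind: &'static str,
    pub message: String,
}

/// Real path: compiled only when the crate is built with `--features cuda`.
#[cfg(feature = "cuda")]
pub fn init_nccl(_rank: usize, _world_size: usize, _comm_id_hex: &str) -> Result<(), TpError> {
    // Placeholder: the actual cudarc::nccl handshake would live here.
    unimplemented!("stand-in for the real NCCL handshake")
}

/// Fallback path: same signature, but it always reports the missing feature,
/// so callers see an explicit error rather than a silent no-op.
#[cfg(not(feature = "cuda"))]
pub fn init_nccl(_rank: usize, _world_size: usize, _comm_id_hex: &str) -> Result<(), TpError> {
    Err(TpError {
        kind: "cuda_feature_not_enabled",
        message: "rebuild with --features cuda to enable NCCL".to_string(),
    })
}
```

Keeping both variants behind one signature means call sites need no cfg attributes of their own.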
The NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
  opens a CudaContext on the configured device, and calls Comm::from_rank
  with the supplied (rank, world_size, id). NCCL blocks until every
  rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
  asserts every rank reports observed_sum == world_size.
- NCCL handles are serialised under a Mutex; unsafe impl Send/Sync lets
  the Comm cross spawn_blocking boundaries (NCCL is move-safe; only
  concurrent op issuance is unsafe).
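
The wire encoding described above (128 ID bytes as 256 hex characters) can be sketched as follows. This is an illustration under assumptions, not the code in nccl_state.rs: encode_comm_id_hex and decode_comm_id_hex are hypothetical helper names, and the String error type is a simplification.

```rust
use std::os::raw::c_char;

/// Size of the NCCL unique ID payload in bytes (matches [c_char; 128]).
const ID_BYTES: usize = 128;

/// Encode the raw ID bytes as 256 lowercase hex characters.
pub fn encode_comm_id_hex(id: &[c_char; ID_BYTES]) -> String {
    let mut out = String::with_capacity(ID_BYTES * 2);
    for &b in id.iter() {
        // c_char is i8 or u8 depending on the platform; the cast keeps the bits.
        out.push_str(&format!("{:02x}", b as u8));
    }
    out
}

/// Decode exactly 256 hex characters back into the raw ID bytes.
pub fn decode_comm_id_hex(hex: &str) -> Result<[c_char; ID_BYTES], String> {
    if hex.len() != ID_BYTES * 2 {
        return Err(format!("expected {} hex chars, got {}", ID_BYTES * 2, hex.len()));
    }
    // Reject anything that is not an ASCII hex digit (this also makes the
    // byte-offset slicing below safe).
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("comm id contains a non-hex character".to_string());
    }
    let mut out = [0 as c_char; ID_BYTES];
    for (i, slot) in out.iter_mut().enumerate() {
        let pair = &hex[2 * i..2 * i + 2];
        let byte = u8::from_str_radix(pair, 16)
            .map_err(|e| format!("bad hex at byte {i}: {e}"))?;
        *slot = byte as c_char;
    }
    Ok(out)
}
```

The decoded array is the shape Id::uninit expects, so a worker can rebuild the leader's ID from the string it receives on stdin.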
WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
   concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
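
The ordering of these four steps can be simulated without CUDA. In the sketch below, std::sync::Barrier stands in for the blocking Comm::from_rank call, threads stand in for worker processes, and channels stand in for stdin/stdout; simulated_rendezvous is a hypothetical name, and none of this is the actual WorkerPool code.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;

/// Simulate the rendezvous; returns the sorted ranks that reported InitOk.
pub fn simulated_rendezvous(world_size: usize) -> Vec<usize> {
    assert!(world_size >= 1, "need at least the leader rank");
    // The barrier releases only once all `world_size` ranks have arrived,
    // mimicking NCCL blocking until every rank has joined.
    let barrier = Arc::new(Barrier::new(world_size));
    let (ok_tx, ok_rx) = mpsc::channel::<usize>();
    let mut init_txs = Vec::new();
    let mut handles = Vec::new();

    for rank in 1..world_size {
        let (init_tx, init_rx) = mpsc::channel::<String>();
        init_txs.push(init_tx);
        let barrier = Arc::clone(&barrier);
        let ok_tx = ok_tx.clone();
        handles.push(thread::spawn(move || {
            let _comm_id = init_rx.recv().expect("leader hung up");
            barrier.wait(); // stand-in for the worker's Comm::from_rank
            ok_tx.send(rank).expect("leader hung up");
        }));
    }
    drop(ok_tx); // only the workers hold senders now

    // Step 1: send Init to every worker without waiting for any reply.
    for tx in &init_txs {
        tx.send("comm-id-hex".to_string()).expect("worker hung up");
    }
    // Steps 2 and 3: the leader joins as rank 0; everyone unblocks together.
    barrier.wait();
    // Step 4: collect InitOk from every worker.
    let mut joined: Vec<usize> = ok_rx.iter().collect();
    for h in handles {
        h.join().expect("worker panicked");
    }
    joined.sort_unstable();
    joined
}
```

If the leader instead waited for each worker's reply before sending the next Init, the first worker would block in the barrier forever, which is the deadlock the send-first ordering avoids.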
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
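
As a sketch of the leader-side check: each rank contributes 1u32 to a Sum all_reduce, so every rank should observe a sum equal to world_size. The function name validate_sanity and the slice-of-sums input are assumptions made for illustration.

```rust
/// Check that every rank observed `world_size` as the all_reduce sum.
/// `observed_sums[i]` is the value reported by rank `i`.
pub fn validate_sanity(world_size: u32, observed_sums: &[u32]) -> Result<(), String> {
    if observed_sums.len() != world_size as usize {
        return Err(format!(
            "expected reports from {} ranks, got {}",
            world_size,
            observed_sums.len()
        ));
    }
    for (rank, &sum) in observed_sums.iter().enumerate() {
        if sum != world_size {
            return Err(format!(
                "rank {rank} observed sum {sum}, expected {world_size}"
            ));
        }
    }
    Ok(())
}
```

A mismatch on any single rank fails the whole check, since a partially connected communicator is not usable for tensor parallelism.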
Worker.send_only / Worker.recv_only split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls — required because NCCL blocks during
init.
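
A simplified, synchronous sketch of that split, assuming newline-delimited messages over generic reader/writer handles. The real Worker is presumably async and its framing is not shown in this commit, so the struct layout and line-based protocol here are illustrative assumptions.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical worker handle over any writer (stdin) and reader (stdout).
pub struct Worker<W: Write, R: BufRead> {
    to_worker: W,
    from_worker: R,
}

impl<W: Write, R: BufRead> Worker<W, R> {
    pub fn new(to_worker: W, from_worker: R) -> Self {
        Self { to_worker, from_worker }
    }

    /// Write one request line and return immediately, without reading a reply.
    pub fn send_only(&mut self, msg: &str) -> io::Result<()> {
        writeln!(self.to_worker, "{msg}")?;
        self.to_worker.flush()
    }

    /// Read one reply line; an empty read means the worker closed its output.
    pub fn recv_only(&mut self) -> io::Result<String> {
        let mut line = String::new();
        let n = self.from_worker.read_line(&mut line)?;
        if n == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "worker closed its output",
            ));
        }
        Ok(line.trim_end().to_string())
    }

    /// The previous monolithic call, now just the two halves back to back.
    pub fn request(&mut self, msg: &str) -> io::Result<String> {
        self.send_only(msg)?;
        self.recv_only()
    }
}
```

With the halves separated, the leader can call send_only on every worker, run its own blocking NCCL call, and only then call recv_only on each.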
Tests:
- 4 hex round-trip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
  "cuda_feature_not_enabled" on the local dev box (no CUDA), or
  accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
  the real init + sanity round-trip; gated on the cuda-integration
  feature so default CI doesn't try to initialise NCCL.
Verifiable on beast (2× RTX 5090):
    cargo test -p neuron --features cuda-integration \
        --test tp_worker_lifecycle_cuda
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -14,12 +14,16 @@ path = "src/main.rs"
 
 [features]
 default = []
-# Enables CUDA acceleration in candle. Without this feature, candle
-# compiles for CPU only and Device::new_cuda calls fall back to CPU.
+# Enables CUDA acceleration in candle and the cudarc/nccl bindings the
+# TP worker pool uses. Without this feature, candle compiles for CPU
+# only, Device::new_cuda calls fall back to CPU, and TP Init/sanity
+# requests return Error{kind="cuda_feature_not_enabled"}.
 cuda = [
     "candle-core/cuda",
+    "candle-core/nccl",
     "candle-nn/cuda",
     "candle-transformers/cuda",
+    "dep:cudarc",
 ]
 # Use cuDNN for convolution / attention kernels. Requires CUDA.
 cudnn = [
@@ -60,6 +64,10 @@ toml.workspace = true
 candle-core = "0.10.2"
 candle-nn = "0.10.2"
 candle-transformers = "0.10.2"
+# Direct dep on cudarc (matching candle's transitive version) so the
+# TP worker pool can call cudarc::nccl::{Comm, Id} directly. Gated on
+# the `cuda` feature; same toolchain requirement as candle's CUDA path.
+cudarc = { version = "0.19", optional = true, default-features = false, features = ["nccl", "cuda-version-from-build-system"] }
 tokenizers = { version = "0.22", default-features = false, features = ["onig"] }
 hf-hub = { version = "0.4", features = ["tokio"] }
 