cortex/Cargo.lock at 5436af9c7360d5a5789dec5ca27160793a71b512

helexa/cortex

rob thijssen da068ded6d

CI / Format (push) Failing after 38s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build neuron-blackwell (push) Failing after 3m33s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Failing after 4m19s
build-prerelease / Build neuron-ada (push) Failing after 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Stage 7a-ii: real NCCL handshake behind the worker pool

Wires cudarc::nccl into the TP worker lifecycle introduced in 7a-i.
With --features cuda the leader and its workers now establish a live
NCCL communicator end-to-end; without the feature the same code paths
return Error{kind="cuda_feature_not_enabled"}, so a misconfigured
build fails visibly instead of silently doing nothing.
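
A minimal sketch of the feature gate described above, assuming a hypothetical
TpError type and init_nccl function (only the kind string and the cuda feature
name come from this commit message; the cuda arm is a placeholder, not the
real cudarc call):

```rust
/// Hypothetical error shape; only the `kind` value is taken from the commit message.
#[derive(Debug, PartialEq)]
pub struct TpError {
    pub kind: &'static str,
}

/// Built with `--features cuda`: a real implementation would create the
/// NCCL communicator here. This placeholder just reports success.
#[cfg(feature = "cuda")]
pub fn init_nccl(_rank: usize, _world_size: usize) -> Result<(), TpError> {
    Ok(())
}

/// Built without the feature: same signature, but it always returns a
/// descriptive error so the caller can see the build is misconfigured.
#[cfg(not(feature = "cuda"))]
pub fn init_nccl(_rank: usize, _world_size: usize) -> Result<(), TpError> {
    Err(TpError { kind: "cuda_feature_not_enabled" })
}

fn main() {
    match init_nccl(0, 2) {
        Ok(()) => println!("communicator ready"),
        Err(e) => println!("init failed: kind={}", e.kind),
    }
}
```

Keeping both arms behind one signature means callers and tests do not need
their own cfg switches; they only inspect the returned kind.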

NCCL state machine (harness/tp/nccl_state.rs) is shared between the
worker process and the leader's pool:
- generate_comm_id_hex() mints an Id::new() on the leader.
- NcclState::init parses 256 hex chars → [c_char; 128] → Id::uninit,
  opens a CudaContext on the configured device, calls Comm::from_rank
  with the supplied (rank, world_size, id). NCCL blocks until every
  rank has joined.
- NcclState::sanity_check runs one all_reduce(1u32, Sum); the leader
  asserts every rank reports observed_sum == world_size.
- NCCL handles are serialised under a Mutex; unsafe impl Send/Sync lets
  the Comm cross spawn_blocking boundaries (NCCL is move-safe; only
  concurrent op issuance is unsafe).
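
The wire encoding above (128 raw id bytes as 256 hex characters) can be
sketched in plain Rust without cudarc. The function names encode_id_hex and
decode_id_hex are hypothetical, and the real code in nccl_state.rs may differ;
the decoded array is what would be handed to Id::uninit:

```rust
use std::ffi::c_char;

/// Size of the NCCL unique id in bytes, per the commit message.
const ID_LEN: usize = 128;

/// Encode the raw id bytes as lowercase hex (two characters per byte).
fn encode_id_hex(id: &[c_char; ID_LEN]) -> String {
    let mut out = String::with_capacity(ID_LEN * 2);
    for &b in id {
        // c_char is i8 or u8 depending on the target; go through u8 either way.
        out.push_str(&format!("{:02x}", b as u8));
    }
    out
}

/// Decode exactly 256 hex characters back into the raw id bytes.
fn decode_id_hex(hex: &str) -> Result<[c_char; ID_LEN], String> {
    if hex.len() != ID_LEN * 2 {
        return Err(format!("expected {} hex chars, got {}", ID_LEN * 2, hex.len()));
    }
    if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("non-hex character in comm id".to_string());
    }
    let mut out = [0 as c_char; ID_LEN];
    for (i, pair) in hex.as_bytes().chunks(2).enumerate() {
        // Safe to treat as UTF-8: every byte was checked to be an ASCII hex digit.
        let pair = std::str::from_utf8(pair).map_err(|e| e.to_string())?;
        let byte = u8::from_str_radix(pair, 16).map_err(|e| e.to_string())?;
        out[i] = byte as c_char;
    }
    Ok(out)
}

fn main() {
    // Arbitrary test pattern standing in for a real NCCL id.
    let mut id = [0 as c_char; ID_LEN];
    for (i, slot) in id.iter_mut().enumerate() {
        *slot = (i as u8).wrapping_mul(3) as c_char;
    }
    let hex = encode_id_hex(&id);
    assert_eq!(hex.len(), 256);
    assert_eq!(decode_id_hex(&hex).unwrap(), id);
    println!("roundtrip ok");
}
```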

WorkerPool::init_nccl orchestrates the rendezvous:
1. Write Init { comm_id } to every worker's stdin (no await yet).
2. Leader rank 0 calls its own Comm::from_rank in spawn_blocking,
   concurrently with workers.
3. NCCL handshake completes for all ranks simultaneously.
4. Leader collects InitOk responses.
WorkerPool::nccl_sanity_check follows the same pattern over
all_reduce, validating world_size == observed_sum on every rank.
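
The ordering in the steps above can be modelled without a GPU. In this sketch,
which is an illustration and not the project's code, a std::sync::Barrier
stands in for the blocking Comm::from_rank, an atomic counter stands in for
all_reduce(Sum), and channels stand in for worker stdin/stdout. FakeComm and
run_rendezvous are hypothetical names, and init and sanity check are collapsed
into one worker step for brevity:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Barrier};
use std::thread;

/// Stand-in communicator: the barrier models the blocking handshake,
/// the shared counter models an all-reduce sum.
struct FakeComm {
    barrier: Arc<Barrier>,
    sum: Arc<AtomicU32>,
}

impl FakeComm {
    /// Blocks until every rank has called this, like the real handshake.
    fn from_rank(barrier: Arc<Barrier>, sum: Arc<AtomicU32>) -> Self {
        barrier.wait();
        FakeComm { barrier, sum }
    }

    /// Each rank contributes `value`; all ranks then read the same total.
    fn all_reduce_sum(&self, value: u32) -> u32 {
        self.sum.fetch_add(value, Ordering::SeqCst);
        self.barrier.wait(); // every contribution is in after this point
        self.sum.load(Ordering::SeqCst)
    }
}

/// Returns the sum observed by each rank, indexed by rank.
fn run_rendezvous(world_size: usize) -> Vec<u32> {
    let barrier = Arc::new(Barrier::new(world_size));
    let sum = Arc::new(AtomicU32::new(0));
    let (reply_tx, reply_rx) = mpsc::channel::<(usize, u32)>();

    let mut init_txs = Vec::new();
    let mut handles = Vec::new();
    for rank in 1..world_size {
        let (init_tx, init_rx) = mpsc::channel::<()>();
        let (b, s, tx) = (barrier.clone(), sum.clone(), reply_tx.clone());
        handles.push(thread::spawn(move || {
            init_rx.recv().expect("leader went away");
            let comm = FakeComm::from_rank(b, s);
            tx.send((rank, comm.all_reduce_sum(1))).expect("leader went away");
        }));
        init_txs.push(init_tx);
    }

    // Step 1: tell every worker to start, without waiting for replies.
    for tx in &init_txs {
        tx.send(()).expect("worker went away");
    }
    // Steps 2-3: the leader joins as rank 0; the barrier releases all ranks together.
    let comm = FakeComm::from_rank(barrier, sum);
    let mut observed = vec![0u32; world_size];
    observed[0] = comm.all_reduce_sum(1);
    // Step 4: only now collect the workers' replies.
    for _ in 1..world_size {
        let (rank, value) = reply_rx.recv().expect("worker went away");
        observed[rank] = value;
    }
    for h in handles {
        h.join().expect("worker panicked");
    }
    observed
}

fn main() {
    let world_size = 4;
    let observed = run_rendezvous(world_size);
    assert!(observed.iter().all(|&v| v as usize == world_size));
    println!("every rank observed sum == world_size");
}
```

If the leader waited for each reply before joining the barrier itself, no rank
could ever complete, which is why the writes in step 1 are not awaited.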

Worker.send_only / Worker.recv_only are split out from the previous
monolithic Worker.request so the leader can interleave its own NCCL
work with the worker calls. This is required because NCCL blocks
during init.
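
One possible shape for that split, sketched with in-process channels instead
of child-process stdin/stdout. The method names match the commit message, but
the signatures, the String messages, and the helper broadcast_then_collect are
assumptions made for illustration:

```rust
use std::error::Error;
use std::sync::mpsc::{self, Receiver, RecvError, SendError, Sender};
use std::thread;

/// Hypothetical in-process worker handle; the real one wraps a child process.
struct Worker {
    to_worker: Sender<String>,
    from_worker: Receiver<String>,
}

impl Worker {
    /// Spawn a thread that answers every message with "<msg>_ok rank=<rank>".
    fn spawn(rank: usize) -> Worker {
        let (to_worker, inbox) = mpsc::channel::<String>();
        let (outbox, from_worker) = mpsc::channel::<String>();
        thread::spawn(move || {
            for msg in inbox {
                if outbox.send(format!("{msg}_ok rank={rank}")).is_err() {
                    break;
                }
            }
        });
        Worker { to_worker, from_worker }
    }

    /// Write the request and return immediately.
    fn send_only(&self, msg: &str) -> Result<(), SendError<String>> {
        self.to_worker.send(msg.to_string())
    }

    /// Block until the next reply arrives.
    fn recv_only(&self) -> Result<String, RecvError> {
        self.from_worker.recv()
    }

    /// The old monolithic call is now just the two halves back to back.
    fn request(&self, msg: &str) -> Result<String, Box<dyn Error>> {
        self.send_only(msg)?;
        Ok(self.recv_only()?)
    }
}

/// Send to every worker first, do the leader's own work, then collect replies.
fn broadcast_then_collect(world_size: usize) -> Vec<String> {
    let workers: Vec<Worker> = (1..world_size).map(Worker::spawn).collect();
    for w in &workers {
        w.send_only("init").expect("worker went away");
    }
    // The leader's own blocking work (its rank-0 handshake) would run here.
    workers
        .iter()
        .map(|w| w.recv_only().expect("worker went away"))
        .collect()
}

fn main() {
    for reply in broadcast_then_collect(3) {
        println!("{reply}");
    }
    let w = Worker::spawn(7);
    println!("{}", w.request("ping").expect("request failed"));
}
```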

Tests:
- 4 hex roundtrip unit tests for the wire encoding.
- The 7a-i "not implemented" expectation now reads
  "cuda_feature_not_enabled" on the local dev box (no CUDA), or
  accepts InitOk on a cuda-built test binary.
- New cuda-integration test in tp_worker_lifecycle_cuda.rs covers
  the real init + sanity round-trip; gated on the cuda-integration
  feature so default CI doesn't try to run NCCL.

Verifiable on beast (2× RTX 5090):
  cargo test -p neuron --features cuda-integration \
        --test tp_worker_lifecycle_cuda

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 16:40:01 +03:00

104 KiB
