build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)

The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag, so abort-then-Drop can't double-free. This is strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged, so the watchdog call site is unaffected.
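
The idempotent-abort idea can be illustrated with a minimal sketch. This is
not the fork's code: `FakeComm`, its `releases` counter, and the use of an
`AtomicBool` for the `aborted` flag are all assumptions made for
illustration, and the real `Comm` releases an NCCL handle rather than
bumping a counter.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a communicator handle (not cudarc's `Comm`).
struct FakeComm {
    /// Set once the underlying resource has been released.
    aborted: AtomicBool,
    /// Counts releases so the example can observe a would-be double free.
    releases: Arc<AtomicUsize>,
}

impl FakeComm {
    fn new(releases: Arc<AtomicUsize>) -> Self {
        Self { aborted: AtomicBool::new(false), releases }
    }

    /// Releases the resource at most once, however often it is called.
    fn abort(&self) {
        // `swap` returns the previous value, so only the first caller
        // observes `false` and performs the release.
        if !self.aborted.swap(true, Ordering::AcqRel) {
            self.releases.fetch_add(1, Ordering::SeqCst);
        }
    }
}

impl Drop for FakeComm {
    fn drop(&mut self) {
        // A no-op when `abort` has already run.
        self.abort();
    }
}

/// Aborts twice, then drops; returns how many releases happened.
fn releases_after_abort_then_drop() -> usize {
    let releases = Arc::new(AtomicUsize::new(0));
    let comm = FakeComm::new(Arc::clone(&releases));
    comm.abort();
    comm.abort();
    drop(comm);
    releases.load(Ordering::SeqCst)
}

/// Drops without an explicit abort; returns how many releases happened.
fn releases_after_drop_only() -> usize {
    let releases = Arc::new(AtomicUsize::new(0));
    drop(FakeComm::new(Arc::clone(&releases)));
    releases.load(Ordering::SeqCst)
}
```

In this sketch `abort` takes `&self`, so a caller such as a watchdog can
invoke it through a shared handle without changing how the handle is
dropped later.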

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers get `Send`/`Sync` automatically as auto traits, which
conflicts (E0119) with neuron's manual `unsafe impl`s. Remove the four
now-redundant impls; the safety assertion lives upstream in cudarc, where
it belongs. The conflict is in cuda-gated code, so only the CUDA
type-check catches it (the non-cuda build, clippy, and tests stay green).
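
How the auto traits propagate through the wrappers can be sketched as
follows. The `Comm` below is a hypothetical stand-in holding a raw pointer,
not cudarc's type, and `move_across_thread` is an illustrative helper. Only
the innermost type carries an `unsafe impl`; the newtype has none.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a handle type wrapping a raw pointer.
#[allow(dead_code)]
struct Comm {
    raw: *mut u8,
}

// Raw pointers are neither `Send` nor `Sync`, so the owning library
// asserts thread safety once, at the type that holds the pointer.
unsafe impl Send for Comm {}
unsafe impl Sync for Comm {}

/// Newtype with no manual impls: `Arc<Comm>` is `Send + Sync` because
/// `Comm` is, and the auto traits carry over to this wrapper.
struct SendComm(Arc<Comm>);

impl SendComm {
    fn into_inner(self) -> Arc<Comm> {
        self.0
    }
}

/// Moves the wrapper into another thread; this compiles only because
/// `SendComm` is `Send`. Returns the `Arc` strong count seen there.
fn move_across_thread() -> usize {
    let wrapped = SendComm(Arc::new(Comm { raw: std::ptr::null_mut() }));
    let handle = thread::spawn(move || {
        let inner = wrapped.into_inner();
        Arc::strong_count(&inner)
    });
    handle.join().expect("worker thread panicked")
}
```

Keeping the `unsafe impl` on the pointer-holding type means the safety
argument is written once, next to the code that can actually uphold it.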

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:14 +03:00
parent 7945240646
commit 60f5598542
3 changed files with 12 additions and 27 deletions

Cargo.lock generated

@@ -905,7 +905,7 @@ dependencies = [
 [[package]]
 name = "cudarc"
 version = "0.19.7"
-source = "git+https://github.com/grenade/cudarc?rev=dbc425aa865c178f38a3ec838f1f7a4da3146358#dbc425aa865c178f38a3ec838f1f7a4da3146358"
+source = "git+https://github.com/grenade/cudarc?rev=63327a256059f8252641ae46c6bb9eefe707f382#63327a256059f8252641ae46c6bb9eefe707f382"
 dependencies = [
  "float8",
  "half",


@@ -69,4 +69,4 @@ cortex-gateway = { path = "crates/cortex-gateway" }
 # rebuild the comm). Pinned to a fork revision pending upstream review
 # (grenade/cudarc @ nccl-comm-abort).
 [patch.crates-io]
-cudarc = { git = "https://github.com/grenade/cudarc", rev = "dbc425aa865c178f38a3ec838f1f7a4da3146358" }
+cudarc = { git = "https://github.com/grenade/cudarc", rev = "63327a256059f8252641ae46c6bb9eefe707f382" }


@@ -119,40 +119,25 @@ mod cuda_impl {
         }
     }
-    /// `Arc<Comm>` doesn't impl `Send` because `Comm` wraps a raw
-    /// `ncclComm_t` pointer. The NCCL contract is "operations against a
-    /// given comm must be serialised", not "the handle must stay on the
-    /// thread that created it" — so it's safe to move an `Arc<Comm>`
-    /// across threads as long as no concurrent ops are issued. The
-    /// pool's outer Mutex serialises us into `spawn_blocking`, so this
-    /// wrapper at the move boundary is the only thing missing.
+    /// Thin newtype over `Arc<Comm>`, kept for call-site clarity — it marks
+    /// the points where a comm handle is intentionally moved across threads
+    /// (e.g. cached async-side for the TP step watchdog's `ncclCommAbort`).
     ///
-    /// `Sync` is also marked safe because the `Arc<Comm>` clones held
-    /// by the row-parallel layers are only used from the
-    /// `spawn_blocking` thread driving the forward pass; concurrent
-    /// access from another thread would still be a bug.
+    /// `Send`/`Sync` are provided upstream by `cudarc`'s `Comm` (which
+    /// asserts the NCCL thread-safety invariant, including aborting from a
+    /// different thread than one inside a collective), so this type derives
+    /// them automatically — no manual `unsafe impl` here.
     pub struct SendComm(pub Arc<Comm>);
-    // SAFETY: see the doc-comment above; the invariant is enforced at
-    // the call site (pool Mutex + single spawn_blocking thread), not at
-    // the type level.
-    unsafe impl Send for SendComm {}
-    unsafe impl Sync for SendComm {}
     impl SendComm {
         pub fn into_inner(self) -> Arc<Comm> {
             self.0
         }
     }
-    // SAFETY: `cudarc::nccl::Comm` contains a raw `ncclComm_t` pointer
-    // (libnccl-allocated state). NCCL requires that operations against
-    // one Comm be issued one at a time; we serialise access by storing
-    // NcclState behind a Mutex in `WorkerPool`. The Comm itself is
-    // move-safe — NCCL doesn't track the calling OS thread, only the
-    // stream the operations are dispatched against.
-    unsafe impl Send for NcclState {}
-    unsafe impl Sync for NcclState {}
+    // `NcclState`'s `Send`/`Sync` are auto-derived: its `Arc<Comm>` and
+    // `Arc<CudaContext>` fields are now `Send`/`Sync` (cudarc asserts the
+    // comm thread-safety invariant), so no manual `unsafe impl` is needed.
     /// Generate a fresh NCCL `Id` and return it hex-encoded. Used by
     /// the leader to mint the shared communicator id which is then