build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)

The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's thread-safety invariant upstream) and makes `Comm::abort` idempotent via an `aborted` flag, so abort-then-Drop can't double-free. This is strictly better than the previous Drop-no-panic workaround, and the `abort()` signature is unchanged, so the watchdog call site is unaffected.

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` / `NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119) with neuron's manual `unsafe impl`s. Remove the four now-redundant impls; the safety assertion lives upstream in cudarc, where it belongs.

The conflict is in cuda-gated code, so only the CUDA type-check catches it (the non-cuda build, clippy, and tests stay green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
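The idempotent-abort idea described in the message can be sketched roughly as follows. This is a minimal illustration, not the fork's actual code: `FakeComm`, its `releases` counter, and the choice of an `AtomicBool` for the `aborted` flag are all assumptions made for the example; the commit only states that an `aborted` flag exists.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a communicator handle; NOT cudarc's real `Comm`.
struct FakeComm {
    /// Set once the underlying resource has been released.
    aborted: AtomicBool,
    /// Counts releases so the example can observe a double-free attempt.
    releases: Arc<AtomicUsize>,
}

impl FakeComm {
    fn new(releases: Arc<AtomicUsize>) -> Self {
        Self { aborted: AtomicBool::new(false), releases }
    }

    /// Releases the resource at most once; later calls are no-ops.
    fn abort(&self) {
        // `swap` returns the previous value, so only the first caller sees `false`.
        if !self.aborted.swap(true, Ordering::AcqRel) {
            self.releases.fetch_add(1, Ordering::SeqCst);
        }
    }
}

impl Drop for FakeComm {
    fn drop(&mut self) {
        // Drop takes the same guarded path, so abort-then-drop releases once.
        self.abort();
    }
}

/// Aborts twice, then drops, and reports how many releases happened.
fn release_count_after_abort_then_drop() -> usize {
    let releases = Arc::new(AtomicUsize::new(0));
    let comm = FakeComm::new(Arc::clone(&releases));
    comm.abort();
    comm.abort();
    drop(comm);
    releases.load(Ordering::SeqCst)
}
```

Calling `release_count_after_abort_then_drop()` exercises the abort-then-Drop sequence the message is concerned with; in this sketch the guarded path runs the release exactly once regardless of how many times `abort` is called first.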
2026-06-08 16:33:14 +03:00 · 2026-06-08 15:06:04 +03:00 · 2026-06-08 14:45:16 +03:00
3 changed files with 12 additions and 27 deletions
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -905,7 +905,7 @@ dependencies = [
 [[package]]
 name = "cudarc"
 version = "0.19.7"
-source = "git+https://github.com/grenade/cudarc?rev=dbc425aa865c178f38a3ec838f1f7a4da3146358#dbc425aa865c178f38a3ec838f1f7a4da3146358"
+source = "git+https://github.com/grenade/cudarc?rev=63327a256059f8252641ae46c6bb9eefe707f382#63327a256059f8252641ae46c6bb9eefe707f382"
 dependencies = [
 "float8",
 "half",
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -69,4 +69,4 @@ cortex-gateway = { path = "crates/cortex-gateway" }
 # rebuild the comm). Pinned to a fork revision pending upstream review
 # (grenade/cudarc @ nccl-comm-abort).
 [patch.crates-io]
-cudarc = { git = "https://github.com/grenade/cudarc", rev = "dbc425aa865c178f38a3ec838f1f7a4da3146358" }
+cudarc = { git = "https://github.com/grenade/cudarc", rev = "63327a256059f8252641ae46c6bb9eefe707f382" }
--- a/crates/neuron/src/harness/tp/nccl_state.rs
+++ b/crates/neuron/src/harness/tp/nccl_state.rs
@@ -119,40 +119,25 @@ mod cuda_impl {
        }
    }

-    /// `Arc<Comm>` doesn't impl `Send` because `Comm` wraps a raw
-    /// `ncclComm_t` pointer. The NCCL contract is "operations against a
-    /// given comm must be serialised", not "the handle must stay on the
-    /// thread that created it" — so it's safe to move an `Arc<Comm>`
-    /// across threads as long as no concurrent ops are issued. The
-    /// pool's outer Mutex serialises us into `spawn_blocking`, so this
-    /// wrapper at the move boundary is the only thing missing.
+    /// Thin newtype over `Arc<Comm>`, kept for call-site clarity — it marks
+    /// the points where a comm handle is intentionally moved across threads
+    /// (e.g. cached async-side for the TP step watchdog's `ncclCommAbort`).
    ///
-    /// `Sync` is also marked safe because the `Arc<Comm>` clones held
-    /// by the row-parallel layers are only used from the
-    /// `spawn_blocking` thread driving the forward pass; concurrent
-    /// access from another thread would still be a bug.
+    /// `Send`/`Sync` are provided upstream by `cudarc`'s `Comm` (which
+    /// asserts the NCCL thread-safety invariant, including aborting from a
+    /// different thread than one inside a collective), so this type derives
+    /// them automatically — no manual `unsafe impl` here.
    pub struct SendComm(pub Arc<Comm>);

-    // SAFETY: see the doc-comment above; the invariant is enforced at
-    // the call site (pool Mutex + single spawn_blocking thread), not at
-    // the type level.
-    unsafe impl Send for SendComm {}
-    unsafe impl Sync for SendComm {}
-
    impl SendComm {
        pub fn into_inner(self) -> Arc<Comm> {
            self.0
        }
    }

-    // SAFETY: `cudarc::nccl::Comm` contains a raw `ncclComm_t` pointer
-    // (libnccl-allocated state). NCCL requires that operations against
-    // one Comm be issued one at a time; we serialise access by storing
-    // NcclState behind a Mutex in `WorkerPool`. The Comm itself is
-    // move-safe — NCCL doesn't track the calling OS thread, only the
-    // stream the operations are dispatched against.
-    unsafe impl Send for NcclState {}
-    unsafe impl Sync for NcclState {}
+    // `NcclState`'s `Send`/`Sync` are auto-derived: its `Arc<Comm>` and
+    // `Arc<CudaContext>` fields are now `Send`/`Sync` (cudarc asserts the
+    // comm thread-safety invariant), so no manual `unsafe impl` is needed.

    /// Generate a fresh NCCL `Id` and return it hex-encoded. Used by
    /// the leader to mint the shared communicator id which is then
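The diff above relies on `Send`/`Sync` being auto traits: once the innermost handle type implements them, wrappers built from it pick them up without any impl of their own. A small sketch of that propagation follows, under the assumption of stand-in types (`RawHandle`, `Wrapper`, and `State` are hypothetical and only mirror the shape of `Comm`, `SendComm`, and `NcclState`).

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an FFI handle; a raw pointer makes it
/// `!Send + !Sync` by default, like a type wrapping `ncclComm_t`.
#[allow(dead_code)]
struct RawHandle(*mut u8);

// SAFETY (illustrative only): the pointer in this sketch is never dereferenced.
// In the real setup this assertion would live in the library that owns the handle.
unsafe impl Send for RawHandle {}
unsafe impl Sync for RawHandle {}

/// Newtype with no manual impls; mirrors the shape of `SendComm`.
struct Wrapper(Arc<RawHandle>);

/// Struct with no manual impls; mirrors the shape of `NcclState`.
#[allow(dead_code)]
struct State {
    comm: Arc<RawHandle>,
    rank: usize,
}

/// Compiles only if `T` is both `Send` and `Sync`.
fn is_send_sync<T: Send + Sync>() -> bool {
    true
}

/// Compile-time check that both wrappers inherited the auto traits.
fn wrappers_are_send_sync() -> bool {
    is_send_sync::<Wrapper>() && is_send_sync::<State>()
}

/// Moves a wrapper into another thread, which `thread::spawn` only
/// permits for `Send` values, and returns the `Arc` strong count seen there.
fn strong_count_on_other_thread() -> usize {
    let w = Wrapper(Arc::new(RawHandle(std::ptr::null_mut())));
    thread::spawn(move || Arc::strong_count(&w.0)).join().unwrap()
}
```

If the two `unsafe impl` lines on `RawHandle` were deleted, `wrappers_are_send_sync` and `strong_count_on_other_thread` would stop compiling, which is the sense in which the safety assertion sits with the innermost type rather than with each wrapper.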
Author	SHA1	Message	Date
rob thijssen	60f5598542	build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)	2026-06-08 16:33:14 +03:00
rob thijssen	7945240646	chore: re-trigger deploy (#17 Stage 2, attempt 3) No code change. On each deploy run, the degraded CI runner kills a different single-arch build (blackwell, then ada) early on, so the all-arch-gated packaging is skipped → no publish. Every arch HAS built green across runs (blackwell ✅ in 342, ampere ✅, ada ✅ in 339), and the gate + CUDA type-check pass. Re-running to catch all three green in one run so the Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate infra work). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 15:06:04 +03:00
rob thijssen	0c74d89d15	chore: re-trigger deploy (#17 Stage 2) No code change. The `c94a2ae` deploy's neuron-blackwell build died ~12min into the Blackwell kernel compile on the degraded runner, while neuron-ampere + neuron-ada built the identical Rust + patched cudarc cleanly and the CUDA type-check passed. Transient infra; re-running to get a healthy blackwell build so the RPMs publish and beast (Blackwell) picks it up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:45:16 +03:00