build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 6m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE communicator from another thread, to unblock a collective wedged on a dead/hung peer so the ranks can resync. No cudarc release (incl. main) exposes this: the safe Comm only aborts in Drop, which can't fire while a stuck thread holds an Arc<Comm> clone.

Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort, rev 4dff0be) that adds three thin methods (Comm::abort, get_async_error, and a raw comm() accessor), to be submitted upstream. The patch targets 0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.

Foundation only; the watchdog + abort + comm-rebuild that consume these land in follow-up commits (cuda-gated → validated by the blackwell build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
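
To illustrate the hang-recovery pattern the message describes, here is a minimal sketch using only the standard library. `MockComm`, its `abort` and `all_reduce` methods, and `run_with_watchdog` are hypothetical stand-ins invented for this example; they are not cudarc's `Comm` or the fork's actual API, and no NCCL call is made. The sketch only shows why an abort that takes `&self` can unblock a worker that still holds an `Arc` clone, whereas an abort in `Drop` could not run while that clone is alive.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a communicator; NOT cudarc's `Comm`.
struct MockComm {
    aborted: AtomicBool,
}

impl MockComm {
    fn new() -> Self {
        Self { aborted: AtomicBool::new(false) }
    }

    /// Explicit abort through `&self`: any thread holding an `Arc` clone can call it.
    fn abort(&self) {
        self.aborted.store(true, Ordering::SeqCst);
    }

    /// Simulates a collective wedged on a dead peer: it returns only once aborted.
    fn all_reduce(&self) -> Result<(), &'static str> {
        while !self.aborted.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
        Err("aborted")
    }
}

/// Runs the collective on a worker thread and aborts it once `timeout` elapses.
fn run_with_watchdog(comm: Arc<MockComm>, timeout: Duration) -> Result<(), &'static str> {
    // The worker keeps its own Arc clone, so the strong count stays above zero
    // and a Drop-based abort could never run while the worker is stuck.
    let worker_comm = Arc::clone(&comm);
    let worker = thread::spawn(move || worker_comm.all_reduce());

    let deadline = Instant::now() + timeout;
    while !worker.is_finished() {
        if Instant::now() >= deadline {
            comm.abort(); // unblock the wedged collective from this thread
            break;
        }
        thread::sleep(Duration::from_millis(1));
    }
    worker.join().expect("worker thread panicked")
}

fn main() {
    let comm = Arc::new(MockComm::new());
    let result = run_with_watchdog(Arc::clone(&comm), Duration::from_millis(20));
    assert_eq!(result, Err("aborted"));
}
```

In the real design the watchdog would presumably call the fork's `Comm::abort` and then rebuild the communicator; that part lands in the follow-up commits and is not modelled here.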
Cargo.lock (generated, 3 lines changed)
@@ -905,8 +905,7 @@ dependencies = [
 [[package]]
 name = "cudarc"
 version = "0.19.7"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "1cea5f10a99e025c1b44ae2354c2d8326b25ddbd0baf76bde8e55cfd4018a2cc"
+source = "git+https://github.com/grenade/cudarc?rev=4dff0be72d8a685d6691a6a53d4c95e1fe932277#4dff0be72d8a685d6691a6a53d4c95e1fe932277"
 dependencies = [
  "float8",
  "half",
@@ -61,3 +61,12 @@ eventsource-stream = "0.2"
 # workspace crates
 cortex-core = { path = "crates/cortex-core" }
 cortex-gateway = { path = "crates/cortex-gateway" }
+
+# Patched cudarc (affects neuron's 0.19.x only; candle's 0.17.x is
+# untouched since the fork is 0.19.7 and doesn't satisfy a 0.17 req). Adds
+# Comm::abort / get_async_error / raw comm() — needed for #17 Stage 2 TP
+# hang-recovery (abort a wedged collective from another thread, then
+# rebuild the comm). Pinned to a fork revision pending upstream review
+# (grenade/cudarc @ nccl-comm-abort).
+[patch.crates-io]
+cudarc = { git = "https://github.com/grenade/cudarc", rev = "4dff0be72d8a685d6691a6a53d4c95e1fe932277" }
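
The comment in the patch hunk argues that the 0.19.7 fork cannot be selected for a 0.17 requirement. A rough sketch of that reasoning follows. It assumes both requirements are default (caret) requirements, and the exact requirement values `0.19.7` and `0.17.8` are taken from the versions named in the commit message rather than from the manifests. `caret_matches_0x` is a hypothetical helper that simplifies the caret rule for `0.y.z` versions; it is not Cargo's resolver.

```rust
/// A `major.minor.patch` triple.
type Version = (u64, u64, u64);

/// Simplified caret rule for `0.y.z` requirements with y > 0:
/// `^0.y.z` accepts `>=0.y.z, <0.(y+1).0`. Illustration only.
fn caret_matches_0x(req: Version, candidate: Version) -> bool {
    let (req_major, req_minor, req_patch) = req;
    let (major, minor, patch) = candidate;
    req_major == 0
        && req_minor > 0
        && major == 0
        && minor == req_minor
        && patch >= req_patch
}

fn main() {
    let fork: Version = (0, 19, 7); // version of the patched fork
    let neuron_req: Version = (0, 19, 7); // assumed requirement on neuron's side
    let candle_req: Version = (0, 17, 8); // assumed requirement on candle's side

    // The fork satisfies the 0.19 requirement, so the patch applies there.
    assert!(caret_matches_0x(neuron_req, fork));
    // The fork does not satisfy the 0.17 requirement, so that dependency
    // keeps resolving to the crates.io release.
    assert!(!caret_matches_0x(candle_req, fork));
}
```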