TP-rank failure should abort the leader's NCCL collective instead of hanging the daemon #17
Context
On 2026-06-04, an image-bearing request from agent-0 (~12,960-token
prompt) OOM'd rank 1 mid-prefill. Because the worker died before
issuing its row-parallel AllReduce, rank 0 (the leader) blocked forever
on the matching collective while holding the pool lock, and the whole
neuron daemon wedged until a manual restart. Journal:
The single-shot vision prefill made this easy to trigger; chunked
vision prefill (fa01350) and the pre-flight VRAM guard make it far
less likely. But the underlying fragility remains: any time one TP
rank fails inside a forward (OOM, CUBLAS error, illegal address), the
surviving ranks deadlock on the next collective rather than failing
the request. This predates vision: see the historical 2026-05-27
poison events in the journal, where GenerateStep failures on rank 1
took the model down.
This is the highest-value TP robustness gap: a single bad request can
hang the node indefinitely instead of failing cleanly and staying up.
Goal
A forward failure on any rank should surface as a clean request
error (HTTP 4xx/5xx) and leave the daemon serving subsequent requests
— never a silent hang on the pool lock.
Possible approaches (to evaluate)
- Wrap tp_forward_logits* in a timeout; if it doesn't return within a
  generous bound, classify as a device fault, mark the model poisoned,
  and (since NCCL state is unrecoverable) tear down + respawn the
  worker subprocesses. The pool lock releases, and new requests get a
  clean "model reloading" error.
- Call ncclCommAbort on the leader's comm when a worker reports
  failure, to unblock the stuck collective deterministically rather
  than relying on a timeout. Requires plumbing the abort into the
  device-worker thread that owns the Comm.
- Signal forward-start/forward-failed over the RPC stream so the leader
  can detect a dead rank before/while it enters the collective. Hard to
  make race-free against an in-flight AllReduce.
Likely a combination: detect worker failure fast (RPC), ncclCommAbort
to unblock, then respawn the rank-N subprocesses and re-init NCCL.
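The bounded-wait-then-abort idea can be sketched with the standard library alone. This is a minimal illustration, not the daemon's code: FakeComm, its aborted flag (standing in for ncclCommAbort), StepError, and bounded_step are all hypothetical names, and a spin loop stands in for the blocking collective.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum StepError {
    /// The collective was unblocked by an abort.
    Aborted,
    /// The step did not finish within the bound.
    Timeout,
}

/// Hypothetical stand-in for the leader's NCCL communicator. Setting
/// `aborted` plays the role of ncclCommAbort: it unblocks a stuck collective.
#[derive(Clone, Default)]
struct FakeComm {
    aborted: Arc<AtomicBool>,
    peer_arrived: Arc<AtomicBool>,
}

impl FakeComm {
    /// Blocks until the peer rank joins the collective or the comm is aborted.
    fn all_reduce(&self) -> Result<(), StepError> {
        loop {
            if self.aborted.load(Ordering::SeqCst) {
                return Err(StepError::Aborted);
            }
            if self.peer_arrived.load(Ordering::SeqCst) {
                return Ok(());
            }
            thread::sleep(Duration::from_millis(1));
        }
    }

    fn abort(&self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
}

/// Runs one leader step on a worker thread and waits at most `bound`.
/// On expiry it aborts the comm so the worker thread unblocks, then
/// reports a timeout to the caller.
fn bounded_step(comm: &FakeComm, bound: Duration) -> Result<(), StepError> {
    let (tx, rx) = mpsc::channel();
    let worker_comm = comm.clone();
    let handle = thread::spawn(move || {
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(worker_comm.all_reduce());
    });
    let outcome = match rx.recv_timeout(bound) {
        Ok(result) => result,
        Err(_) => {
            comm.abort();
            Err(StepError::Timeout)
        }
    };
    // After an abort the thread returns promptly, so this join cannot hang.
    handle.join().expect("worker thread panicked");
    outcome
}
```

With a peer that never arrives, bounded_step returns Err(StepError::Timeout) shortly after the bound instead of blocking forever, which is the property the issue asks for. Respawning workers and re-initializing NCCL are deliberately left out of the sketch.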
Touch points
- crates/neuron/src/harness/tp/mod.rs: WorkerPool::generate_step* (fan-out/leader-forward/drain), Worker subprocess lifecycle.
- crates/neuron/src/harness/device_worker/: leader Comm ownership, Job::TpForwardLogits*, NCCL init/abort.
- crates/neuron/src/harness/tp/nccl_state.rs: comm teardown/re-init.
- crates/neuron/src/harness/candle.rs: chat_completion_tp* / inference_tp_stream poison + error classification.
Verification
- Inject a bail! on rank 1's forward and assert: the request returns
  an error within the bound, the daemon stays alive, and a follow-up
  request succeeds (after worker respawn) rather than hanging on the
  pool lock.
- Send a request that would exceed VRAM and confirm a clean rejection, not a hang.
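The shape of the fault-injection check can be sketched against a toy pool. Everything here is hypothetical (ToyPool, Forward, RequestError, run_fault_injection_check); the toy runs ranks sequentially and cannot actually hang, so it only illustrates the three assertions, not the real WorkerPool behaviour.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-rank forward, boxed so a test can inject a failure.
type Forward = Box<dyn Fn(&str) -> Result<String, String>>;

#[derive(Debug, PartialEq)]
enum RequestError {
    /// A rank failed inside the forward; the pool is now poisoned.
    DeviceFault(String),
    /// The pool is poisoned and must be respawned before serving.
    ModelReloading,
}

/// Toy stand-in for the worker pool: one forward per TP rank plus a poison flag.
struct ToyPool {
    ranks: Vec<Forward>,
    poisoned: bool,
}

impl ToyPool {
    fn new(ranks: Vec<Forward>) -> Self {
        Self { ranks, poisoned: false }
    }

    /// Runs every rank; the first rank error poisons the pool and is
    /// returned to the caller instead of being waited on.
    fn generate_step(&mut self, prompt: &str) -> Result<Vec<String>, RequestError> {
        if self.poisoned {
            return Err(RequestError::ModelReloading);
        }
        let mut outputs = Vec::with_capacity(self.ranks.len());
        for (rank, forward) in self.ranks.iter().enumerate() {
            match forward(prompt) {
                Ok(out) => outputs.push(out),
                Err(msg) => {
                    self.poisoned = true;
                    return Err(RequestError::DeviceFault(format!("rank {rank}: {msg}")));
                }
            }
        }
        Ok(outputs)
    }

    /// Models the respawn: replace the rank forwards and clear the poison flag.
    fn respawn(&mut self, ranks: Vec<Forward>) {
        self.ranks = ranks;
        self.poisoned = false;
    }
}

fn healthy() -> Forward {
    Box::new(|prompt: &str| -> Result<String, String> { Ok(format!("logits({prompt})")) })
}

fn failing() -> Forward {
    Box::new(|_prompt: &str| -> Result<String, String> {
        Err("injected forward failure".to_string())
    })
}

/// Checks: error within the bound, clean rejection while poisoned,
/// and success after respawn. Returns how long the failing request took.
fn run_fault_injection_check(bound: Duration) -> Duration {
    let mut pool = ToyPool::new(vec![healthy(), failing()]);
    let start = Instant::now();
    let first = pool.generate_step("hello");
    let elapsed = start.elapsed();
    assert!(matches!(first, Err(RequestError::DeviceFault(_))));
    assert!(elapsed < bound, "request must fail within the bound");
    assert_eq!(pool.generate_step("hello"), Err(RequestError::ModelReloading));
    pool.respawn(vec![healthy(), healthy()]);
    assert_eq!(pool.generate_step("hello").unwrap().len(), 2);
    elapsed
}
```

A real test would drive the daemon end to end and measure the bound against wall-clock time on actual hardware; the toy only pins down what "fails cleanly and stays up" means as assertions.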
References
guard (fa01350), TP-vision umbrella #16 / #12.
2026-05-27.
🤖 Generated with Claude Code
(test comment removed — see the Stage 2 implementation + journal-search status below)
Implemented (branch feat/neuron-17-stage2): verify from real-world journals
Now addressed by the #17 fault-recovery work, in two stages.
Stage 1 (merged, verified): error-faults. A rank that returns an error (the common case) poisons the model and auto-recovers via unload_model → load_model (NCCL re-init + sanity inside the load). Verified end-to-end on beast 2026-06-08 (clean ~1.4s unload, reload, healthy, next request served; no human, no restart).
Stage 2 (this branch): the hang case from the original report (a rank dies/wedges and the leader blocks forever on the matching collective). Implements the "bounded wait + ncclCommAbort" approach listed in the issue:
- Caches the leader's Comm handle async-side at init.
- Bounds each TP step with a watchdog (tokio::time::timeout, default 120s, NEURON_TP_STEP_TIMEOUT_S). On expiry it calls Comm::abort() (ncclCommAbort) from the async thread to unblock the wedged collective, then fails the step → poison → Stage 1 reload (which now completes because the leader thread is responsive again). The reload's unload kills + respawns the wedged subprocess workers.
- Requires Comm::abort/get_async_error, which no cudarc release exposes; carried on a fork (grenade/cudarc @ nccl-comm-abort) pinned via [patch], pending upstream review. (Also a Drop-must-not-panic fix so the post-abort comm teardown doesn't double-abort-panic.)
Verification strategy: observe real hangs, no synthetic harness
We deliberately did not build a hang-injection harness (inducing a real collective hang is risky and low-value). Instead the recovery path logs a distinctive, greppable trail. When a real TP hang occurs in production, confirm the watchdog handled it from the neuron journal on the affected host:
A clean recovery from a real hang looks like this (in order):
That sequence = the daemon detected the hang, unblocked it, and healed itself with no operator action. If you see it, Stage 2 worked in the wild.
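Checking a journal excerpt for that ordered trail can be sketched as a small classifier. The marker substrings are taken from log strings quoted elsewhere in this comment; treating them as plain substrings that appear in this order is an assumption, and Verdict and classify_journal are hypothetical names, not part of neuron.

```rust
/// Success markers, assumed to appear in this order in the journal.
const CLEAN_SEQUENCE: [&str; 3] = [
    "tp watchdog",
    "ncclCommAbort succeeded",
    "auto-recovery: reloaded; model healthy",
];

/// Failure markers; any one of them means recovery did not complete cleanly.
const DEGRADED: [&str; 3] = [
    "tp watchdog: ncclCommAbort failed",
    "tp watchdog: no cached leader comm handle",
    "auto-recovery: reload failed; model left unloaded",
];

#[derive(Debug, PartialEq)]
enum Verdict {
    /// All success markers were seen, in order.
    CleanRecovery,
    /// A failure marker was seen; carries the matched marker.
    Degraded(String),
    /// No watchdog activity at all.
    NoHangSeen,
    /// The trail started but never reached the final marker.
    Incomplete,
}

fn classify_journal(lines: &[&str]) -> Verdict {
    let mut next = 0;
    for line in lines {
        // Failure markers are checked first because they share the
        // "tp watchdog" prefix with the first success marker.
        if let Some(sig) = DEGRADED.iter().find(|sig| line.contains(**sig)) {
            return Verdict::Degraded((*sig).to_string());
        }
        // One line may satisfy several consecutive markers.
        while next < CLEAN_SEQUENCE.len() && line.contains(CLEAN_SEQUENCE[next]) {
            next += 1;
        }
    }
    match next {
        0 => Verdict::NoHangSeen,
        n if n == CLEAN_SEQUENCE.len() => Verdict::CleanRecovery,
        _ => Verdict::Incomplete,
    }
}
```

Feeding it the lines of a journal export for the affected host would give a quick yes/no per incident; an Incomplete verdict is the case worth reading by hand.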
Failure / degraded signatures to watch for:
- tp watchdog: ncclCommAbort failed
- tp watchdog: no cached leader comm handle — cannot abort (could not cache leader NCCL comm handle warn at load) → process restart needed. Bug to fix.
- auto-recovery: reload failed; model left unloaded
- NEURON_TP_STEP_TIMEOUT_S. Shouldn't happen (healthy steps are sub-second).
To close this issue later
Revisit the journals across the fleet after some weeks of real traffic. If we find one or more clean tp watchdog → ncclCommAbort succeeded → auto-recovery: reloaded; model healthy sequences (and no stuck ...abort failed / no cached comm handle cases), we have real-world evidence the hang fragility is resolved and can close. If no hangs ever occur, the Stage 1 mitigations + chunked prefill have made them rare enough that the watchdog is pure insurance, which is also a fine outcome.
Out of scope (deferred): fast comm-only rebuild (skip the ~4min reload via ArcSwap<Comm>); the in-process rank-0 leader still can't recover a true context fault (vs collective wedge) without a process restart.
🤖 Generated with Claude Code