TP-rank failure should abort the leader's NCCL collective instead of hanging the daemon #17

New Issue

grenade · 2026-06-04T14:33:49Z

grenade commented

2026-06-04 14:33:49 +00:00

Context

On 2026-06-04, an image-bearing request from agent-0 (~12,960-token
prompt) OOM'd rank 1 mid-prefill. Because the worker died before
issuing its row-parallel AllReduce, rank 0 (the leader) blocked
forever on the matching collective, holding the pool lock — the
whole neuron daemon wedged until a manual restart. Journal:

worker GenerateStepWithImages: forward failed rank=1 … CUDA_ERROR_OUT_OF_MEMORY
… (no further lines — leader hung on the all_reduce)

The single-shot vision prefill made this easy to trigger; chunked
vision prefill (fa01350) and the pre-flight VRAM guard make it far
less likely. But the underlying fragility remains: any time one TP
rank fails inside a forward (OOM, CUBLAS error, illegal address), the
surviving ranks deadlock on the next collective rather than failing
the request. This predates vision — see the historical
2026-05-27 poison events in the journal where GenerateStep
failures on rank 1 took the model down.

This is the highest-value TP robustness gap: a single bad request can
hang the node indefinitely instead of failing cleanly and staying up.

Goal

A forward failure on any rank should surface as a clean request
error (HTTP 4xx/5xx) and leave the daemon serving subsequent requests
— never a silent hang on the pool lock.

Possible approaches (to evaluate)

Bounded wait + poison on the leader forward. Wrap the leader's
tp_forward_logits* in a timeout; if it doesn't return within a
generous bound, classify as a device fault, mark the model poisoned,
and (since NCCL state is unrecoverable) tear down + respawn the
worker subprocesses. The pool lock releases, new requests get a
clean "model reloading" error.
NCCL abort. ncclCommAbort on the leader's comm when a worker
reports failure, to unblock the stuck collective deterministically
rather than relying on a timeout. Requires plumbing the abort into
the device-worker thread that owns the Comm.
Pre-collective health handshake. Have workers report
forward-start/forward-failed over the RPC stream so the leader can
detect a dead rank before/while it enters the collective. Hard to
make race-free against an in-flight AllReduce.

Likely a combination: detect worker failure fast (RPC), ncclCommAbort
to unblock, then respawn the rank-N subprocesses and re-init NCCL.

Touch points

crates/neuron/src/harness/tp/mod.rs — WorkerPool::generate_step*
(fan-out/leader-forward/drain), Worker subprocess lifecycle.
crates/neuron/src/harness/device_worker/ — leader Comm ownership,
Job::TpForwardLogits*, NCCL init/abort.
crates/neuron/src/harness/tp/nccl_state.rs — comm teardown/re-init.
crates/neuron/src/harness/candle.rs — chat_completion_tp* /
inference_tp_stream poison + error classification.

Verification

Inject a forced OOM / bail! on rank 1's forward and assert: the
request returns an error within the bound, the daemon stays alive,
and a follow-up request succeeds (after worker respawn) rather than
hanging on the pool lock.
Re-run the agent-0 large-context vision request against constrained
VRAM and confirm a clean rejection, not a hang.

References

Hang incident + diagnosis: vision TP-prefill OOM, 2026-06-04.
Mitigations already landed: chunked TP-vision prefill + pre-flight
guard (fa01350), TP-vision umbrella #16 / #12.
Related historical poison cascades: journal 2026-05-27.

🤖 Generated with Claude Code

## Context On 2026-06-04, an image-bearing request from agent-0 (~12,960-token prompt) OOM'd **rank 1** mid-prefill. Because the worker died *before* issuing its row-parallel `AllReduce`, **rank 0 (the leader) blocked forever on the matching collective**, holding the pool lock — the whole neuron daemon wedged until a manual restart. Journal: ``` worker GenerateStepWithImages: forward failed rank=1 … CUDA_ERROR_OUT_OF_MEMORY … (no further lines — leader hung on the all_reduce) ``` The single-shot vision prefill made this easy to trigger; chunked vision prefill (`fa01350`) and the pre-flight VRAM guard make it far less likely. **But the underlying fragility remains**: any time one TP rank fails inside a forward (OOM, CUBLAS error, illegal address), the surviving ranks deadlock on the next collective rather than failing the request. This predates vision — see the historical `2026-05-27` poison events in the journal where `GenerateStep` failures on rank 1 took the model down. This is the highest-value TP robustness gap: a single bad request can hang the node indefinitely instead of failing cleanly and staying up. ## Goal A forward failure on **any** rank should surface as a clean request error (HTTP 4xx/5xx) and leave the daemon serving subsequent requests — never a silent hang on the pool lock. ## Possible approaches (to evaluate) - **Bounded wait + poison on the leader forward.** Wrap the leader's `tp_forward_logits*` in a timeout; if it doesn't return within a generous bound, classify as a device fault, mark the model poisoned, and (since NCCL state is unrecoverable) tear down + respawn the worker subprocesses. The pool lock releases, new requests get a clean "model reloading" error. - **NCCL abort.** `ncclCommAbort` on the leader's comm when a worker reports failure, to unblock the stuck collective deterministically rather than relying on a timeout. Requires plumbing the abort into the device-worker thread that owns the `Comm`. - **Pre-collective health handshake.** Have workers report forward-start/forward-failed over the RPC stream so the leader can detect a dead rank before/while it enters the collective. Hard to make race-free against an in-flight `AllReduce`. Likely a combination: detect worker failure fast (RPC), `ncclCommAbort` to unblock, then respawn the rank-N subprocesses and re-init NCCL. ## Touch points - `crates/neuron/src/harness/tp/mod.rs` — `WorkerPool::generate_step*` (fan-out/leader-forward/drain), `Worker` subprocess lifecycle. - `crates/neuron/src/harness/device_worker/` — leader `Comm` ownership, `Job::TpForwardLogits*`, NCCL init/abort. - `crates/neuron/src/harness/tp/nccl_state.rs` — comm teardown/re-init. - `crates/neuron/src/harness/candle.rs` — `chat_completion_tp*` / `inference_tp_stream` poison + error classification. ## Verification - Inject a forced OOM / `bail!` on rank 1's forward and assert: the request returns an error within the bound, the daemon stays alive, and a follow-up request succeeds (after worker respawn) rather than hanging on the pool lock. - Re-run the agent-0 large-context vision request against constrained VRAM and confirm a clean rejection, not a hang. ## References - Hang incident + diagnosis: vision TP-prefill OOM, 2026-06-04. - Mitigations already landed: chunked TP-vision prefill + pre-flight guard (`fa01350`), TP-vision umbrella #16 / #12. - Related historical poison cascades: journal `2026-05-27`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

grenade referenced this issue from a commit

2026-06-08 06:08:44 +00:00

feat(neuron): auto-recover poisoned models (#17 Stage 1c)

grenade referenced this issue from a commit

2026-06-08 06:08:44 +00:00

test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)

grenade referenced this issue from a commit

2026-06-08 06:34:25 +00:00

chore: re-trigger deploy (#17 Stage 1)

grenade referenced this issue

2026-06-08 10:31:19 +00:00

neuron: keep auto-recovering models listed as `recovering` in /models (so cortex holds the route) #20

grenade referenced this issue from a commit

2026-06-08 10:50:58 +00:00

build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)

grenade referenced this issue from a commit

2026-06-08 11:15:32 +00:00

feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2)

grenade commented

2026-06-08 11:17:55 +00:00

(test comment removed — see the Stage 2 implementation + journal-search status below)

_(test comment removed — see the Stage 2 implementation + journal-search status below)_

grenade commented

2026-06-08 11:18:21 +00:00

Implemented (branch `feat/neuron-17-stage2`) — verify from real-world journals

Now addressed by the #17 fault-recovery work, in two stages.

Stage 1 (merged, verified) — error-faults: a rank that returns an error (the common case) poisons the model and auto-recovers via unload_model → load_model (NCCL re-init + sanity inside the load). Verified end-to-end on beast 2026-06-08 (clean ~1.4s unload, reload, healthy, next request served — no human, no restart).

Stage 2 (this branch) — the hang case from the original report (a rank dies/wedges and the leader blocks forever on the matching collective). Implements the "bounded wait + ncclCommAbort" approach listed in the issue:

Cache the leader's NCCL Comm handle async-side at init.
A watchdog wraps the leader forward (tokio::time::timeout, default 120s, NEURON_TP_STEP_TIMEOUT_S). On expiry it calls Comm::abort() (ncclCommAbort) from the async thread to unblock the wedged collective, then fails the step → poison → Stage 1 reload (which now completes because the leader thread is responsive again). The reload's unload kills + respawns the wedged subprocess workers.
Needs Comm::abort/get_async_error, which no cudarc release exposes; carried on a fork (grenade/cudarc @ nccl-comm-abort) pinned via [patch], pending upstream review. (Also a Drop-must-not-panic fix so the post-abort comm teardown doesn't double-abort-panic.)

Verification strategy: observe real hangs, no synthetic harness

We deliberately did not build a hang-injection harness (inducing a real collective hang is risky and low-value). Instead the recovery path logs a distinctive, greppable trail. When a real TP hang occurs in production, confirm the watchdog handled it from the neuron journal on the affected host:

ssh <host> journalctl -u neuron.service --no-pager \
  | grep -iE "tp watchdog|auto-recovery|ncclCommAbort|reloaded; model healthy"

A clean recovery from a real hang looks like this (in order):

tp watchdog: leader forward exceeded deadline — NCCL collective wedged; aborting comm ...
tp watchdog: ncclCommAbort succeeded — wedged collective unblocked; failing the step ...
auto-recovery: poisoned, enqueueing rebuild   model=<id>
auto-recovery: unload+reload starting         model=<id>
auto-recovery: reloaded; model healthy        model=<id>

That sequence = the daemon detected the hang, unblocked it, and healed itself with no operator action. If you see it, Stage 2 worked in the wild.

Failure / degraded signatures to watch for:

Journal line	Meaning / action
`tp watchdog: ncclCommAbort failed`	abort itself errored — recovery may stall; host likely needs a neuron restart. Investigate.
`tp watchdog: no cached leader comm handle — cannot abort`	comm handle wasn't cached at init (see the `could not cache leader NCCL comm handle` warn at load) → process restart needed. Bug to fix.
`auto-recovery: reload failed; model left unloaded`	abort+unblock worked but reload failed (VRAM, driver) — model unloaded; a later request/activation should retry.
watchdog fires with no real hang (healthy step >120s)	false positive — raise `NEURON_TP_STEP_TIMEOUT_S`. Shouldn't happen (healthy steps are sub-second).

To close this issue later

Revisit the journals across the fleet after some weeks of real traffic. If we find one or more clean tp watchdog → ncclCommAbort succeeded → auto-recovery: reloaded; model healthy sequences (and no stuck ...abort failed / no cached comm handle cases), we have real-world evidence the hang fragility is resolved and can close. If no hangs ever occur, the Stage 1 mitigations + chunked prefill have made them rare enough that the watchdog is pure insurance — also a fine outcome.

Out of scope (deferred): fast comm-only rebuild (skip the ~4min reload via ArcSwap<Comm>); the in-process rank-0 leader still can't recover a true context fault (vs collective wedge) without a process restart.

🤖 Generated with Claude Code

## Implemented (branch `feat/neuron-17-stage2`) — verify from real-world journals Now addressed by the #17 fault-recovery work, in two stages. **Stage 1 (merged, verified)** — error-faults: a rank that *returns an error* (the common case) poisons the model and auto-recovers via `unload_model` → `load_model` (NCCL re-init + sanity inside the load). Verified end-to-end on beast 2026-06-08 (clean ~1.4s unload, reload, healthy, next request served — no human, no restart). **Stage 2 (this branch)** — the *hang* case from the original report (a rank dies/wedges and the leader blocks forever on the matching collective). Implements the "bounded wait + `ncclCommAbort`" approach listed in the issue: - Cache the leader's NCCL `Comm` handle async-side at init. - A watchdog wraps the leader forward (`tokio::time::timeout`, default 120s, `NEURON_TP_STEP_TIMEOUT_S`). On expiry it calls `Comm::abort()` (`ncclCommAbort`) from the async thread to unblock the wedged collective, then fails the step → poison → Stage 1 reload (which now completes because the leader thread is responsive again). The reload's `unload` kills + respawns the wedged subprocess workers. - Needs `Comm::abort`/`get_async_error`, which no cudarc release exposes; carried on a fork (`grenade/cudarc @ nccl-comm-abort`) pinned via `[patch]`, pending upstream review. (Also a Drop-must-not-panic fix so the post-abort comm teardown doesn't double-abort-panic.) ### Verification strategy: observe real hangs, no synthetic harness We deliberately did **not** build a hang-injection harness (inducing a real collective hang is risky and low-value). Instead the recovery path logs a distinctive, greppable trail. When a real TP hang occurs in production, confirm the watchdog handled it from the neuron journal on the affected host: ```sh ssh <host> journalctl -u neuron.service --no-pager \ | grep -iE "tp watchdog|auto-recovery|ncclCommAbort|reloaded; model healthy" ``` **A clean recovery from a real hang looks like this** (in order): ``` tp watchdog: leader forward exceeded deadline — NCCL collective wedged; aborting comm ... tp watchdog: ncclCommAbort succeeded — wedged collective unblocked; failing the step ... auto-recovery: poisoned, enqueueing rebuild model=<id> auto-recovery: unload+reload starting model=<id> auto-recovery: reloaded; model healthy model=<id> ``` That sequence = the daemon detected the hang, unblocked it, and healed itself with no operator action. If you see it, Stage 2 worked in the wild. **Failure / degraded signatures to watch for:** | Journal line | Meaning / action | |---|---| | `tp watchdog: ncclCommAbort failed` | abort itself errored — recovery may stall; host likely needs a neuron restart. Investigate. | | `tp watchdog: no cached leader comm handle — cannot abort` | comm handle wasn't cached at init (see the `could not cache leader NCCL comm handle` warn at load) → process restart needed. Bug to fix. | | `auto-recovery: reload failed; model left unloaded` | abort+unblock worked but reload failed (VRAM, driver) — model unloaded; a later request/activation should retry. | | watchdog fires with no real hang (healthy step >120s) | false positive — raise `NEURON_TP_STEP_TIMEOUT_S`. Shouldn't happen (healthy steps are sub-second). | ### To close this issue later Revisit the journals across the fleet after some weeks of real traffic. If we find one or more clean `tp watchdog → ncclCommAbort succeeded → auto-recovery: reloaded; model healthy` sequences (and no stuck `...abort failed` / `no cached comm handle` cases), we have real-world evidence the hang fragility is resolved and can close. If no hangs ever occur, the Stage 1 mitigations + chunked prefill have made them rare enough that the watchdog is pure insurance — also a fine outcome. Out of scope (deferred): fast comm-only rebuild (skip the ~4min reload via `ArcSwap<Comm>`); the in-process rank-0 leader still can't recover a true *context* fault (vs collective wedge) without a process restart. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

grenade referenced this issue from a commit

2026-06-08 11:21:47 +00:00

fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2)

grenade referenced this issue from a commit

2026-06-08 11:45:18 +00:00

chore: re-trigger deploy (#17 Stage 2)

grenade referenced this issue from a commit

2026-06-08 12:06:05 +00:00

chore: re-trigger deploy (#17 Stage 2, attempt 3)

Sign in to join this conversation.