TP-rank failure should abort the leader's NCCL collective instead of hanging the daemon #17

Open
opened 2026-06-04 14:33:49 +00:00 by grenade · 2 comments
Owner

Context

On 2026-06-04, an image-bearing request from agent-0 (~12,960-token
prompt) OOM'd rank 1 mid-prefill. Because the worker died before
issuing its row-parallel AllReduce, rank 0 (the leader) blocked
forever on the matching collective
, holding the pool lock — the
whole neuron daemon wedged until a manual restart. Journal:

worker GenerateStepWithImages: forward failed rank=1 … CUDA_ERROR_OUT_OF_MEMORY
… (no further lines — leader hung on the all_reduce)

The single-shot vision prefill made this easy to trigger; chunked
vision prefill (fa01350) and the pre-flight VRAM guard make it far
less likely. But the underlying fragility remains: any time one TP
rank fails inside a forward (OOM, CUBLAS error, illegal address), the
surviving ranks deadlock on the next collective rather than failing
the request. This predates vision — see the historical
2026-05-27 poison events in the journal where GenerateStep
failures on rank 1 took the model down.

This is the highest-value TP robustness gap: a single bad request can
hang the node indefinitely instead of failing cleanly and staying up.

Goal

A forward failure on any rank should surface as a clean request
error (HTTP 4xx/5xx) and leave the daemon serving subsequent requests
— never a silent hang on the pool lock.

Possible approaches (to evaluate)

  • Bounded wait + poison on the leader forward. Wrap the leader's
    tp_forward_logits* in a timeout; if it doesn't return within a
    generous bound, classify as a device fault, mark the model poisoned,
    and (since NCCL state is unrecoverable) tear down + respawn the
    worker subprocesses. The pool lock releases, new requests get a
    clean "model reloading" error.
  • NCCL abort. ncclCommAbort on the leader's comm when a worker
    reports failure, to unblock the stuck collective deterministically
    rather than relying on a timeout. Requires plumbing the abort into
    the device-worker thread that owns the Comm.
  • Pre-collective health handshake. Have workers report
    forward-start/forward-failed over the RPC stream so the leader can
    detect a dead rank before/while it enters the collective. Hard to
    make race-free against an in-flight AllReduce.

Likely a combination: detect worker failure fast (RPC), ncclCommAbort
to unblock, then respawn the rank-N subprocesses and re-init NCCL.

Touch points

  • crates/neuron/src/harness/tp/mod.rsWorkerPool::generate_step*
    (fan-out/leader-forward/drain), Worker subprocess lifecycle.
  • crates/neuron/src/harness/device_worker/ — leader Comm ownership,
    Job::TpForwardLogits*, NCCL init/abort.
  • crates/neuron/src/harness/tp/nccl_state.rs — comm teardown/re-init.
  • crates/neuron/src/harness/candle.rschat_completion_tp* /
    inference_tp_stream poison + error classification.

Verification

  • Inject a forced OOM / bail! on rank 1's forward and assert: the
    request returns an error within the bound, the daemon stays alive,
    and a follow-up request succeeds (after worker respawn) rather than
    hanging on the pool lock.
  • Re-run the agent-0 large-context vision request against constrained
    VRAM and confirm a clean rejection, not a hang.

References

  • Hang incident + diagnosis: vision TP-prefill OOM, 2026-06-04.
  • Mitigations already landed: chunked TP-vision prefill + pre-flight
    guard (fa01350), TP-vision umbrella #16 / #12.
  • Related historical poison cascades: journal 2026-05-27.

🤖 Generated with Claude Code

## Context On 2026-06-04, an image-bearing request from agent-0 (~12,960-token prompt) OOM'd **rank 1** mid-prefill. Because the worker died *before* issuing its row-parallel `AllReduce`, **rank 0 (the leader) blocked forever on the matching collective**, holding the pool lock — the whole neuron daemon wedged until a manual restart. Journal: ``` worker GenerateStepWithImages: forward failed rank=1 … CUDA_ERROR_OUT_OF_MEMORY … (no further lines — leader hung on the all_reduce) ``` The single-shot vision prefill made this easy to trigger; chunked vision prefill (`fa01350`) and the pre-flight VRAM guard make it far less likely. **But the underlying fragility remains**: any time one TP rank fails inside a forward (OOM, CUBLAS error, illegal address), the surviving ranks deadlock on the next collective rather than failing the request. This predates vision — see the historical `2026-05-27` poison events in the journal where `GenerateStep` failures on rank 1 took the model down. This is the highest-value TP robustness gap: a single bad request can hang the node indefinitely instead of failing cleanly and staying up. ## Goal A forward failure on **any** rank should surface as a clean request error (HTTP 4xx/5xx) and leave the daemon serving subsequent requests — never a silent hang on the pool lock. ## Possible approaches (to evaluate) - **Bounded wait + poison on the leader forward.** Wrap the leader's `tp_forward_logits*` in a timeout; if it doesn't return within a generous bound, classify as a device fault, mark the model poisoned, and (since NCCL state is unrecoverable) tear down + respawn the worker subprocesses. The pool lock releases, new requests get a clean "model reloading" error. - **NCCL abort.** `ncclCommAbort` on the leader's comm when a worker reports failure, to unblock the stuck collective deterministically rather than relying on a timeout. Requires plumbing the abort into the device-worker thread that owns the `Comm`. - **Pre-collective health handshake.** Have workers report forward-start/forward-failed over the RPC stream so the leader can detect a dead rank before/while it enters the collective. Hard to make race-free against an in-flight `AllReduce`. Likely a combination: detect worker failure fast (RPC), `ncclCommAbort` to unblock, then respawn the rank-N subprocesses and re-init NCCL. ## Touch points - `crates/neuron/src/harness/tp/mod.rs` — `WorkerPool::generate_step*` (fan-out/leader-forward/drain), `Worker` subprocess lifecycle. - `crates/neuron/src/harness/device_worker/` — leader `Comm` ownership, `Job::TpForwardLogits*`, NCCL init/abort. - `crates/neuron/src/harness/tp/nccl_state.rs` — comm teardown/re-init. - `crates/neuron/src/harness/candle.rs` — `chat_completion_tp*` / `inference_tp_stream` poison + error classification. ## Verification - Inject a forced OOM / `bail!` on rank 1's forward and assert: the request returns an error within the bound, the daemon stays alive, and a follow-up request succeeds (after worker respawn) rather than hanging on the pool lock. - Re-run the agent-0 large-context vision request against constrained VRAM and confirm a clean rejection, not a hang. ## References - Hang incident + diagnosis: vision TP-prefill OOM, 2026-06-04. - Mitigations already landed: chunked TP-vision prefill + pre-flight guard (`fa01350`), TP-vision umbrella #16 / #12. - Related historical poison cascades: journal `2026-05-27`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Owner

(test comment removed — see the Stage 2 implementation + journal-search status below)

_(test comment removed — see the Stage 2 implementation + journal-search status below)_
Author
Owner

Implemented (branch feat/neuron-17-stage2) — verify from real-world journals

Now addressed by the #17 fault-recovery work, in two stages.

Stage 1 (merged, verified) — error-faults: a rank that returns an error (the common case) poisons the model and auto-recovers via unload_modelload_model (NCCL re-init + sanity inside the load). Verified end-to-end on beast 2026-06-08 (clean ~1.4s unload, reload, healthy, next request served — no human, no restart).

Stage 2 (this branch) — the hang case from the original report (a rank dies/wedges and the leader blocks forever on the matching collective). Implements the "bounded wait + ncclCommAbort" approach listed in the issue:

  • Cache the leader's NCCL Comm handle async-side at init.
  • A watchdog wraps the leader forward (tokio::time::timeout, default 120s, NEURON_TP_STEP_TIMEOUT_S). On expiry it calls Comm::abort() (ncclCommAbort) from the async thread to unblock the wedged collective, then fails the step → poison → Stage 1 reload (which now completes because the leader thread is responsive again). The reload's unload kills + respawns the wedged subprocess workers.
  • Needs Comm::abort/get_async_error, which no cudarc release exposes; carried on a fork (grenade/cudarc @ nccl-comm-abort) pinned via [patch], pending upstream review. (Also a Drop-must-not-panic fix so the post-abort comm teardown doesn't double-abort-panic.)

Verification strategy: observe real hangs, no synthetic harness

We deliberately did not build a hang-injection harness (inducing a real collective hang is risky and low-value). Instead the recovery path logs a distinctive, greppable trail. When a real TP hang occurs in production, confirm the watchdog handled it from the neuron journal on the affected host:

ssh <host> journalctl -u neuron.service --no-pager \
  | grep -iE "tp watchdog|auto-recovery|ncclCommAbort|reloaded; model healthy"

A clean recovery from a real hang looks like this (in order):

tp watchdog: leader forward exceeded deadline — NCCL collective wedged; aborting comm ...
tp watchdog: ncclCommAbort succeeded — wedged collective unblocked; failing the step ...
auto-recovery: poisoned, enqueueing rebuild   model=<id>
auto-recovery: unload+reload starting         model=<id>
auto-recovery: reloaded; model healthy        model=<id>

That sequence = the daemon detected the hang, unblocked it, and healed itself with no operator action. If you see it, Stage 2 worked in the wild.

Failure / degraded signatures to watch for:

Journal line Meaning / action
tp watchdog: ncclCommAbort failed abort itself errored — recovery may stall; host likely needs a neuron restart. Investigate.
tp watchdog: no cached leader comm handle — cannot abort comm handle wasn't cached at init (see the could not cache leader NCCL comm handle warn at load) → process restart needed. Bug to fix.
auto-recovery: reload failed; model left unloaded abort+unblock worked but reload failed (VRAM, driver) — model unloaded; a later request/activation should retry.
watchdog fires with no real hang (healthy step >120s) false positive — raise NEURON_TP_STEP_TIMEOUT_S. Shouldn't happen (healthy steps are sub-second).

To close this issue later

Revisit the journals across the fleet after some weeks of real traffic. If we find one or more clean tp watchdog → ncclCommAbort succeeded → auto-recovery: reloaded; model healthy sequences (and no stuck ...abort failed / no cached comm handle cases), we have real-world evidence the hang fragility is resolved and can close. If no hangs ever occur, the Stage 1 mitigations + chunked prefill have made them rare enough that the watchdog is pure insurance — also a fine outcome.

Out of scope (deferred): fast comm-only rebuild (skip the ~4min reload via ArcSwap<Comm>); the in-process rank-0 leader still can't recover a true context fault (vs collective wedge) without a process restart.

🤖 Generated with Claude Code

## Implemented (branch `feat/neuron-17-stage2`) — verify from real-world journals Now addressed by the #17 fault-recovery work, in two stages. **Stage 1 (merged, verified)** — error-faults: a rank that *returns an error* (the common case) poisons the model and auto-recovers via `unload_model` → `load_model` (NCCL re-init + sanity inside the load). Verified end-to-end on beast 2026-06-08 (clean ~1.4s unload, reload, healthy, next request served — no human, no restart). **Stage 2 (this branch)** — the *hang* case from the original report (a rank dies/wedges and the leader blocks forever on the matching collective). Implements the "bounded wait + `ncclCommAbort`" approach listed in the issue: - Cache the leader's NCCL `Comm` handle async-side at init. - A watchdog wraps the leader forward (`tokio::time::timeout`, default 120s, `NEURON_TP_STEP_TIMEOUT_S`). On expiry it calls `Comm::abort()` (`ncclCommAbort`) from the async thread to unblock the wedged collective, then fails the step → poison → Stage 1 reload (which now completes because the leader thread is responsive again). The reload's `unload` kills + respawns the wedged subprocess workers. - Needs `Comm::abort`/`get_async_error`, which no cudarc release exposes; carried on a fork (`grenade/cudarc @ nccl-comm-abort`) pinned via `[patch]`, pending upstream review. (Also a Drop-must-not-panic fix so the post-abort comm teardown doesn't double-abort-panic.) ### Verification strategy: observe real hangs, no synthetic harness We deliberately did **not** build a hang-injection harness (inducing a real collective hang is risky and low-value). Instead the recovery path logs a distinctive, greppable trail. When a real TP hang occurs in production, confirm the watchdog handled it from the neuron journal on the affected host: ```sh ssh <host> journalctl -u neuron.service --no-pager \ | grep -iE "tp watchdog|auto-recovery|ncclCommAbort|reloaded; model healthy" ``` **A clean recovery from a real hang looks like this** (in order): ``` tp watchdog: leader forward exceeded deadline — NCCL collective wedged; aborting comm ... tp watchdog: ncclCommAbort succeeded — wedged collective unblocked; failing the step ... auto-recovery: poisoned, enqueueing rebuild model=<id> auto-recovery: unload+reload starting model=<id> auto-recovery: reloaded; model healthy model=<id> ``` That sequence = the daemon detected the hang, unblocked it, and healed itself with no operator action. If you see it, Stage 2 worked in the wild. **Failure / degraded signatures to watch for:** | Journal line | Meaning / action | |---|---| | `tp watchdog: ncclCommAbort failed` | abort itself errored — recovery may stall; host likely needs a neuron restart. Investigate. | | `tp watchdog: no cached leader comm handle — cannot abort` | comm handle wasn't cached at init (see the `could not cache leader NCCL comm handle` warn at load) → process restart needed. Bug to fix. | | `auto-recovery: reload failed; model left unloaded` | abort+unblock worked but reload failed (VRAM, driver) — model unloaded; a later request/activation should retry. | | watchdog fires with no real hang (healthy step >120s) | false positive — raise `NEURON_TP_STEP_TIMEOUT_S`. Shouldn't happen (healthy steps are sub-second). | ### To close this issue later Revisit the journals across the fleet after some weeks of real traffic. If we find one or more clean `tp watchdog → ncclCommAbort succeeded → auto-recovery: reloaded; model healthy` sequences (and no stuck `...abort failed` / `no cached comm handle` cases), we have real-world evidence the hang fragility is resolved and can close. If no hangs ever occur, the Stage 1 mitigations + chunked prefill have made them rare enough that the watchdog is pure insurance — also a fine outcome. Out of scope (deferred): fast comm-only rebuild (skip the ~4min reload via `ArcSwap<Comm>`); the in-process rank-0 leader still can't recover a true *context* fault (vs collective wedge) without a process restart. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: helexa/cortex#17