One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.
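
A minimal sketch of the one-shot latch described above, assuming an atomic flag guards the injection. `PoisonInjector`, `from_env`, and `should_fire` are hypothetical names for illustration; only the `NEURON_DEBUG_POISON` variable comes from this commit.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical one-shot latch; the real injector's type is not shown here.
pub struct PoisonInjector {
    /// Model id named by the env var, or None when the hook is disabled.
    target: Option<String>,
    /// Flips to true on the first matching request and never resets.
    fired: AtomicBool,
}

impl PoisonInjector {
    /// Reads NEURON_DEBUG_POISON once; unset means the injector is a no-op.
    pub fn from_env() -> Self {
        Self::new(std::env::var("NEURON_DEBUG_POISON").ok())
    }

    pub fn new(target: Option<String>) -> Self {
        Self { target, fired: AtomicBool::new(false) }
    }

    /// Returns true exactly once, for the first request naming the target model.
    pub fn should_fire(&self, model_id: &str) -> bool {
        match self.target.as_deref() {
            // swap returns the previous value, so only the first caller sees false.
            Some(t) if t == model_id => !self.fired.swap(true, Ordering::SeqCst),
            _ => false,
        }
    }
}
```

Because the latch never resets, the reloaded model serves later requests normally instead of being poisoned again.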
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request was rejected with "unload and reload the model to
recover" until a *human* did exactly that. Now the harness rebuilds the
context automatically.
- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
`LoadedHandle::spec()`) so a poisoned model can be reloaded without an
operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
`new()` when a runtime is present) drains poisoned model ids and runs
`unload_model` → `load_model(spec)`. Unload drops the model → cudarc
`Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
init + sanity inside the load path, so a successful reload yields a
fresh, healthy model. A failed reload leaves it unloaded (next load
retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
per model via a `recovering` set) and return a transient "recovering,
retry shortly" error instead of the manual-reload message. Requests
that arrive during the brief reload gap (model absent from the registry)
also get "recovering" rather than a misleading "not loaded".
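
The single-flight behaviour in the last bullet can be sketched roughly as follows. This is an illustration under assumptions, not the harness code: `RecoveryQueue` and `finish` are hypothetical names, and a std `mpsc` channel stands in for whatever queue the harness actually uses to hand poisoned model ids to the background task.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical holder for the `recovering` set plus the queue to the worker.
pub struct RecoveryQueue {
    recovering: Mutex<HashSet<String>>,
    tx: Sender<String>,
}

impl RecoveryQueue {
    pub fn new() -> (Self, Receiver<String>) {
        let (tx, rx) = channel();
        (Self { recovering: Mutex::new(HashSet::new()), tx }, rx)
    }

    /// Enqueues at most one recovery per model; true if this call enqueued it.
    pub fn trigger_recovery(&self, model_id: &str) -> bool {
        let mut set = self.recovering.lock().unwrap();
        if !set.insert(model_id.to_string()) {
            return false; // a recovery for this model is already in flight
        }
        // A send error means the worker is gone; clear the marker so a later
        // call can try again.
        if self.tx.send(model_id.to_string()).is_err() {
            set.remove(model_id);
            return false;
        }
        true
    }

    /// Called by the worker once unload → reload has finished, pass or fail.
    pub fn finish(&self, model_id: &str) {
        self.recovering.lock().unwrap().remove(model_id);
    }
}
```

Every caller, whether or not it enqueued the recovery, would then return the transient "recovering, retry shortly" error to the client.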
`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.
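
A rough sketch of why the task holds only a `Weak` reference: the worker upgrades it per job and exits once the harness is gone, so it never keeps the harness alive. This swaps in a plain `std::thread` and `mpsc` channel for the async runtime the commit mentions; `Harness`, `recover`, and `spawn_recovery` are hypothetical stand-ins for `CandleHarness` and its recovery path.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the harness; counts reloads instead of doing them.
pub struct Harness {
    pub reloads: AtomicUsize,
}

impl Harness {
    /// Placeholder for unload_model → load_model(spec).
    fn recover(&self, _model_id: &str) {
        self.reloads.fetch_add(1, Ordering::SeqCst);
    }
}

/// Spawns the worker off the request path; it drains poisoned model ids.
pub fn spawn_recovery(harness: &Arc<Harness>, rx: Receiver<String>) -> JoinHandle<()> {
    let weak: Weak<Harness> = Arc::downgrade(harness);
    thread::spawn(move || {
        // The loop ends when every sender has been dropped.
        for model_id in rx {
            match weak.upgrade() {
                Some(h) => h.recover(&model_id),
                None => break, // harness dropped: nothing left to recover
            }
        }
    })
}
```

Since the worker runs on its own thread, it does not hold the request path's inference lock while it takes the write lock on the model registry.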
Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-
forward and poisoned the device context, bricking the model until reload.
Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.
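
The arithmetic behind the cap can be checked with a small sketch. The constant and function names are hypothetical, the default is assumed to equal the cap, and the 4-patches-per-token ratio is inferred from the 2304 → 576 figures above rather than taken from the preprocessing code.

```rust
/// Patch edge in pixels (the 16² factor in the cap).
pub const PATCH: u32 = 16;
/// The vision tower's num_position_embeddings.
pub const POS_EMBED_BUDGET: u32 = 2304;
/// 2304 × 16² = 589_824 px.
pub const MAX_PIXELS_CAP: u32 = POS_EMBED_BUDGET * PATCH * PATCH;

/// Clamps an operator override to the hard cap (default assumed to be the cap).
pub fn effective_max_pixels(requested: Option<u32>) -> u32 {
    requested.unwrap_or(MAX_PIXELS_CAP).min(MAX_PIXELS_CAP)
}

/// Patch count for a resized image, matching the test's (h / 16) * (w / 16).
pub fn patch_count(h: u32, w: u32) -> u32 {
    (h / PATCH) * (w / PATCH)
}

/// LM tokens per image, assuming 4 patches merge into one token.
pub fn lm_tokens(patches: u32) -> u32 {
    patches / 4
}
```

For example, the regressing 90×44 grid corresponds to 1440×704 px = 1_013_760 px, which fits under the old 1024² limit but is well over the new 589_824 px cap, while a full 48×48 grid (768×768 px) lands exactly on 2304 patches and 576 tokens.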
Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 23:30:47 +03:00
3 changed files with 232 additions and 23 deletions
        // The pos-embed cap must hold for huge, tall, wide, and extreme
        // images: exceeding 2304 patches errors mid-tower and poisons
        // the device context, so this invariant is load-bearing.
        let p = PreprocessProfile::qwen3_6();
        for (sh, sw) in [
            (8000u32, 6000u32),
            (808, 1600),
            (4000, 400),
            (1, 199),
            (16, 16),
        ] {
            let (h, w) = p.resized_dims(sh, sw).unwrap();
            let patches = (h / 16) * (w / 16);
            assert!(
                patches <= 2304,
                "{sh}x{sw} → {h}x{w} = {patches} patches exceeds the 2304 budget"
            );
        }
    }
    #[test]
    fn qwen3_6_default_budget_bounds_lm_tokens() {
        // A huge source image caps at max_pixels → the per-image LM token