Previously, when an inference hit a device fault, the model was flagged
poisoned and every subsequent request was rejected with "unload and
reload the model to recover" until a *human* did exactly that. Now the
harness rebuilds the context automatically.
- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
`LoadedHandle::spec()`) so a poisoned model can be reloaded without an
operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
`new()` when a runtime is present) drains poisoned model ids and runs
`unload_model` → `load_model(spec)`. Unload drops the model → cudarc
`Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
init + sanity inside the load path, so a successful reload yields a
fresh, healthy model. A failed reload leaves it unloaded (next load
retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
per model via a `recovering` set) and return a transient "recovering,
retry shortly" error instead of the manual-reload message. Requests
that arrive during the brief reload gap (model absent from the registry)
also get "recovering" rather than a misleading "not loaded".
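
A minimal sketch of the single-flight gate described in the bullets above,
using only the standard library. `trigger_recovery` and the `recovering`
set are named in this change; `RecoveryGate`, `GateError`, `check`,
`finish_recovery`, and the channel-based queue are hypothetical stand-ins,
not the actual harness types.

```rust
use std::collections::HashSet;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

/// Hypothetical error type returned by the request-entry gate.
#[derive(Debug, PartialEq)]
pub enum GateError {
    /// Transient: recovery is queued or running; the caller should retry.
    Recovering,
}

/// Hypothetical stand-in for the harness state involved in the gate.
pub struct RecoveryGate {
    /// Model ids with a recovery already queued or in progress.
    recovering: Mutex<HashSet<String>>,
    /// Queue drained by the background recovery task (assumed design).
    queue: Sender<String>,
}

impl RecoveryGate {
    pub fn new(queue: Sender<String>) -> Self {
        Self { recovering: Mutex::new(HashSet::new()), queue }
    }

    /// Enqueue `model_id` unless a recovery is already in flight.
    /// Returns true only for the call that actually enqueued it.
    pub fn trigger_recovery(&self, model_id: &str) -> bool {
        let mut set = self.recovering.lock().unwrap();
        if !set.insert(model_id.to_string()) {
            return false; // single-flight: someone already asked
        }
        // If the receiver is gone, roll back so a later call can retry.
        if self.queue.send(model_id.to_string()).is_err() {
            set.remove(model_id);
            return false;
        }
        true
    }

    /// Called by the recovery task when a reload attempt ends,
    /// whether it succeeded or failed.
    pub fn finish_recovery(&self, model_id: &str) {
        self.recovering.lock().unwrap().remove(model_id);
    }

    /// Request-entry check: a poisoned model triggers recovery and
    /// the request gets a transient error instead of a manual-reload one.
    pub fn check(&self, model_id: &str, poisoned: bool) -> Result<(), GateError> {
        if poisoned {
            self.trigger_recovery(model_id);
            return Err(GateError::Recovering);
        }
        Ok(())
    }
}
```

However many requests hit the poisoned gate, only the first enqueues the
model id; the rest see the id already in the set and return the same
transient error.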
`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.
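
The background-task shape can be sketched as follows, assuming a plain
`std::thread` and an `mpsc` queue in place of the real runtime task.
`Harness`, the simplified `ModelSpec`/`LoadedModel`, `poison`,
`is_poisoned`, and `recovery_loop` are illustrative names only. The sketch
omits `inference_lock`; it shows only that the reload takes the `models`
write lock on the background thread, never on the request path.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex, RwLock, Weak};
use std::thread;

/// Hypothetical, simplified spec: just enough to reload a model.
#[derive(Clone, Debug, PartialEq)]
pub struct ModelSpec {
    pub path: String,
}

/// Hypothetical loaded model that retains the spec it was loaded from.
pub struct LoadedModel {
    pub spec: ModelSpec,
    pub poisoned: bool,
}

pub struct Harness {
    models: RwLock<HashMap<String, LoadedModel>>,
    queue: Mutex<Sender<String>>,
}

impl Harness {
    /// Returns `Arc<Self>` so the recovery thread can hold a `Weak`.
    pub fn new() -> Arc<Self> {
        let (tx, rx) = channel();
        let harness = Arc::new(Self {
            models: RwLock::new(HashMap::new()),
            queue: Mutex::new(tx),
        });
        let weak = Arc::downgrade(&harness);
        thread::spawn(move || recovery_loop(weak, rx));
        harness
    }

    pub fn load_model(&self, id: &str, spec: ModelSpec) {
        let model = LoadedModel { spec, poisoned: false };
        self.models.write().unwrap().insert(id.to_string(), model);
    }

    /// Drops the model and hands back its retained spec, if it was loaded.
    pub fn unload_model(&self, id: &str) -> Option<ModelSpec> {
        self.models.write().unwrap().remove(id).map(|m| m.spec)
    }

    /// Simulates a device fault: flag the model, then queue it for recovery.
    pub fn poison(&self, id: &str) {
        if let Some(m) = self.models.write().unwrap().get_mut(id) {
            m.poisoned = true;
        }
        let _ = self.queue.lock().unwrap().send(id.to_string());
    }

    /// `None` means the model is absent (for example, mid-reload).
    pub fn is_poisoned(&self, id: &str) -> Option<bool> {
        self.models.read().unwrap().get(id).map(|m| m.poisoned)
    }
}

/// Drains poisoned ids and runs unload -> load with the retained spec.
fn recovery_loop(weak: Weak<Harness>, rx: Receiver<String>) {
    // recv() fails once the harness and its Sender are dropped.
    while let Ok(id) = rx.recv() {
        let Some(h) = weak.upgrade() else { break };
        if let Some(spec) = h.unload_model(&id) {
            h.load_model(&id, spec);
        }
    }
}
```

Holding only a `Weak` keeps the thread from extending the harness's
lifetime: once the last `Arc<Harness>` is dropped, the sender goes with it
and the loop exits.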
This is Stage 1c of the #17 plan (verified-healthy auto-recovery). The
watchdog (1b) and a fault-injection hook for beast verification follow.
Comm-desync and worker faults recover here; a context fault in the
in-process rank-0 leader itself does not, because this reload cannot
rebind the leader's context (Stage 3).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>