Files
cortex/crates
rob thijssen 4f2957af9e feat(neuron): auto-recover poisoned models (#17 Stage 1c)
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.

- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
  `LoadedHandle::spec()`) so a poisoned model can be reloaded without an
  operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
  `new()` when a runtime is present) drains poisoned model ids and runs
  `unload_model` → `load_model(spec)`. Unload drops the model → cudarc
  `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
  init + sanity inside the load path, so a successful reload yields a
  fresh, healthy model. A failed reload leaves it unloaded (next load
  retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
  per model via a `recovering` set) and return a transient "recovering,
  retry shortly" error instead of the manual-reload message. Requests
  that arrive during the brief reload gap (model absent from the registry)
  also get "recovering" rather than a misleading "not loaded".

`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.

Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:05:02 +03:00
..