neuron: keep auto-recovering models listed as recovering in /models (so cortex holds the route)
#20
Context
Follow-up from the #17 Stage-1 auto-recovery work (poisoned models now
self-heal via unload→reload). Surfaced during end-to-end verification on
beast (2026-06-08).
When a poisoned model is auto-recovering, neuron does `unload_model` → `load_model(spec)`. During the reload window (~3-6 min for a 27B TP model) the model is absent from neuron's registry, so it disappears from `GET /models`.
neuron already handles a direct request landing in that window: the chat handler checks `is_recovering(model_id)` and returns a transient `recovering ... retry shortly` error (good). But when fronted by cortex, the request never reaches that handler: cortex routes by `/models` presence, the model is gone, so cortex returns its own not-found response.
So through the gateway, an auto-recovering model looks evicted/unknown
rather than temporarily recovering — a worse signal, and one that
could also race cortex into trying to (re)load the model itself.
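The gap described above can be reproduced with a small model of the two sides. This is only a sketch under assumptions: `Registry`, `begin_recovery`, and `gateway_sees` are hypothetical names made up for illustration, not neuron's or cortex's actual types; only `is_recovering` and the presence-based routing come from the description in this issue.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for neuron's registry plus its recovering set.
struct Registry {
    loaded: BTreeSet<String>,
    recovering: BTreeSet<String>,
}

impl Registry {
    /// Models the unload step: the id leaves the registry and is
    /// remembered as recovering until the reload finishes.
    fn begin_recovery(&mut self, id: &str) {
        self.loaded.remove(id);
        self.recovering.insert(id.to_string());
    }

    /// Behaviour as described in this issue: only registry entries are listed.
    fn list_models(&self) -> Vec<String> {
        self.loaded.iter().cloned().collect()
    }

    /// What the chat handler consults for direct requests.
    fn is_recovering(&self, id: &str) -> bool {
        self.recovering.contains(id)
    }
}

/// Presence-based routing, as cortex is described to do (assumed shape).
fn gateway_sees(listed: &[String], id: &str) -> &'static str {
    if listed.iter().any(|m| m == id) {
        "routable"
    } else {
        "not found"
    }
}
```

In this model, a direct request can still learn that the model is recovering via `is_recovering`, while anything that only looks at `list_models` sees the id vanish for the whole reload window.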
Observed (verification log)
Proposal
Keep recovering models visible in neuron's `GET /models` with a distinct status while the rebuild is in flight, so cortex holds the route
instead of dropping it:
- neuron: use the recovering set (`CandleHarness.recovering`) and have `list_models` emit an entry for each recovering id with `status: "recovering"` (or reuse the existing status enum + a flag).
- cortex: treat `recovering` like a transient/loading state, i.e. return `503 recovering, retry shortly` (or hold/retry per policy) rather than not found, and do not attempt its own load of a model neuron reports as recovering (avoids a double-load race during the reload gap).
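A minimal sketch of how the two halves of the proposal could fit together, assuming a simple status enum. `Status`, `ModelEntry`, `Decision`, and `route` are hypothetical names for illustration; the real `list_models` signature, the `/models` payload, and cortex's router are not shown in this issue and may look quite different.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-model status as it might appear in `GET /models`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Loaded,
    Recovering,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct ModelEntry {
    id: String,
    status: Status,
}

/// neuron side (assumed shape): list loaded models, then add one entry
/// per recovering id so it stays visible during the reload window.
fn list_models(loaded: &BTreeSet<String>, recovering: &BTreeSet<String>) -> Vec<ModelEntry> {
    let mut out: Vec<ModelEntry> = loaded
        .iter()
        .map(|id| ModelEntry { id: id.clone(), status: Status::Loaded })
        .collect();
    // Skip ids that are already listed as loaded to avoid duplicates.
    out.extend(
        recovering
            .iter()
            .filter(|id| !loaded.contains(*id))
            .map(|id| ModelEntry { id: id.clone(), status: Status::Recovering }),
    );
    out
}

/// cortex side (assumed shape): what to do with a request for `id`.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Forward the request to neuron.
    Forward,
    /// Answer with a transient 503-style "recovering, retry shortly";
    /// do not start a load of our own.
    RetryLater,
    /// The model is not listed at all.
    NotFound,
}

fn route(entries: &[ModelEntry], id: &str) -> Decision {
    match entries.iter().find(|e| e.id == id).map(|e| e.status) {
        Some(Status::Loaded) => Decision::Forward,
        Some(Status::Recovering) => Decision::RetryLater,
        None => Decision::NotFound,
    }
}
```

Keeping `Recovering` as its own variant, rather than folding it into `NotFound`, is what lets the gateway distinguish "temporarily unavailable" from "unknown" and skip the competing load.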
Acceptance
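- `GET /models` on neuron lists the model as `recovering`.
- A request through cortex gets a "recovering/retry" signal, not "not found on any node".
- cortex does not attempt its own load while neuron reports the model as `recovering`.
Notes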
verified). The core auto-heal works; this just makes the gateway-facing
signal accurate during the reload window.
`/models` semantics (catalogue × topology).
🤖 Generated with Claude Code