neuron: keep auto-recovering models listed as recovering in /models (so cortex holds the route) #20

Open
opened 2026-06-08 10:31:19 +00:00 by grenade · 0 comments
Owner

Context

Follow-up from the #17 Stage-1 auto-recovery work (poisoned models now
self-heal via unload→reload). Surfaced during end-to-end verification on
beast (2026-06-08).

When a poisoned model is auto-recovering, neuron runs unload_model →
load_model(spec). During the reload window (~3-6 min for a 27B TP
model) the model is absent from neuron's registry, so it disappears
from GET /models.

neuron already handles a direct request landing in that window: the
chat handler checks is_recovering(model_id) and returns a transient
"recovering ... retry shortly" error (good). But when fronted by
cortex, the request never reaches that handler: cortex routes by
/models presence, the model is gone, so cortex returns its own error:

{"message":"model 'Qwen/Qwen3.6-27B' not found on any node and not in catalogue","type":"gateway_error"}

So through the gateway, an auto-recovering model looks evicted/unknown
rather than temporarily recovering — a worse signal, and one that
could also race cortex into trying to (re)load the model itself.

Observed (verification log)

10:12:13 auto-recovery: unload+reload starting
10:12:14 model unloaded                       ← gone from /models here
  (curl via cortex) → "not found on any node and not in catalogue"
10:16:01 auto-recovery: reloaded; model healthy
  (curl via cortex) → normal completion

Proposal

Keep recovering models visible in neuron's GET /models with a
distinct status while the rebuild is in flight, so cortex holds the route
instead of dropping it:

  • Track the recovering set (already exists as
    CandleHarness.recovering) and have list_models emit an entry for
    each recovering id with status: "recovering" (or reuse the existing
    status enum + a flag).
  • cortex's router should treat recovering like a transient/loading state:
    return "503 recovering, retry shortly" (or hold/retry per policy)
    rather than "not found", and do not attempt its own load of a
    model neuron reports as recovering (avoids a double-load race during
    the reload gap).
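
A minimal sketch of the neuron side of this proposal, in Python for brevity. Everything here is hypothetical: HarnessSketch only stands in for the real harness, and the "loaded" status value is an assumption, since the existing status enum is not shown in this issue. The point is the ordering: the id joins the recovering set before it leaves the registry, so list_models never has a gap.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessSketch:
    """Hypothetical stand-in for neuron's harness; not the real CandleHarness."""

    registry: set[str] = field(default_factory=set)    # loaded, serving models
    recovering: set[str] = field(default_factory=set)  # ids mid unload -> reload

    def begin_recovery(self, model_id: str) -> None:
        # Mark as recovering first, then drop from the registry, so the id
        # is never missing from both sets at once.
        self.recovering.add(model_id)
        self.registry.discard(model_id)

    def finish_recovery(self, model_id: str) -> None:
        # Reload finished: restore the registry entry, then clear the flag.
        self.registry.add(model_id)
        self.recovering.discard(model_id)

    def list_models(self) -> list[dict[str, str]]:
        # Emit one entry per known id. If an id is somehow in both sets,
        # the registry entry wins ("loaded" is an assumed status name).
        entries = {mid: "recovering" for mid in self.recovering}
        entries.update({mid: "loaded" for mid in self.registry})
        return [{"id": mid, "status": status} for mid, status in sorted(entries.items())]
```

With this shape, a model in the reload window is listed with status "recovering" instead of disappearing, which is the signal cortex needs in order to hold the route.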

Acceptance

  • During an auto-recovery reload, GET /models on neuron lists the model
    as recovering.
  • A request through cortex during that window returns a transient
    "recovering/retry" signal, not "not found on any node".
  • cortex does not initiate its own load for a model neuron reports
    recovering.
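
The cortex-side criteria could be checked against a routing decision along these lines. This is a sketch under assumptions, not cortex's actual router: the Route enum, the decide function, the listing shape ({"id", "status"} dicts per node), and the "loaded" status name are all hypothetical, and the catalogue-driven load branch is inferred from the "not in catalogue" wording of the gateway error.

```python
from enum import Enum


class Route(Enum):
    FORWARD = "forward"          # some node serves the model: proxy the request
    RETRY_LATER = "retry_later"  # transient 503-style answer, no load attempted
    LOAD = "load"                # in catalogue, on no node: cortex may load it
    NOT_FOUND = "not_found"      # unknown to every node and to the catalogue


def decide(model_id, node_listings, catalogue):
    """Pick a route from each node's /models listing plus the catalogue."""
    # Collect every status any node reports for this model id.
    statuses = {
        entry["status"]
        for listing in node_listings
        for entry in listing
        if entry["id"] == model_id
    }
    if "loaded" in statuses:
        return Route.FORWARD
    if "recovering" in statuses:
        # Checked before the catalogue, so cortex never starts its own
        # load while a node is already reloading the model.
        return Route.RETRY_LATER
    if model_id in catalogue:
        return Route.LOAD
    return Route.NOT_FOUND
```

Checking "recovering" before the catalogue lookup is what rules out the double-load race: a model that a node reports as recovering can only produce a retry signal, never a load.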

Notes

  • Small, non-urgent polish on top of #17 Stage 1 (which is implemented +
    verified). The core auto-heal works; this just makes the gateway-facing
    signal accurate during the reload window.
  • Relates to the unified /models semantics (catalogue × topology).

🤖 Generated with Claude Code


Reference: helexa/cortex#20