During the #17 auto-recovery window (unload → reload, minutes for a large TP model) the model's registry slot is absent, so it vanished from neuron's /models. cortex, routing by /models presence, answered "model not found on any node", while a direct request to neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities snapshot taken at trigger time (while the registry slot still exists). list_models reports `recovering` for models in the set, both while the poisoned slot is still present and during the reload gap, where the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the wire); the router holds the route (new RouteError::ModelRecovering, mapped to 503 instead of 404) and deliberately does not fall through to the catalogue cold-load, which would race a second placement against the in-flight recovery. The evictor already ignores non-Loaded entries.

Tests: neuron unit test (recovering model stays listed with snapshot), gateway integration tests (poller parses `recovering`; request gets 503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
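
For readers without the diff open, the behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: only `ModelStatus`, `Recovering`, `RouteError::ModelRecovering`, `list_models`, and the 503/404 mapping come from the description. The `RecoverySnapshot` struct, the `from_wire` and `http_status` helpers, the `route` signature, the `"loaded"` wire string, and the use of plain `HashMap`/`HashSet` for the registry and node state are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Model status as seen by the gateway. `Recovering` is the new variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelStatus {
    Loaded,
    Recovering,
}

impl ModelStatus {
    /// Hypothetical parser for the status string in a node's /models reply.
    fn from_wire(s: &str) -> Option<Self> {
        match s {
            "loaded" => Some(Self::Loaded), // assumed wire string
            "recovering" => Some(Self::Recovering),
            _ => None,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
enum RouteError {
    ModelNotFound,
    ModelRecovering,
}

impl RouteError {
    /// Recovering maps to 503 (retry shortly) instead of 404.
    fn http_status(&self) -> u16 {
        match self {
            RouteError::ModelNotFound => 404,
            RouteError::ModelRecovering => 503,
        }
    }
}

/// Hypothetical shape of the snapshot taken when recovery is triggered,
/// while the registry slot still exists.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RecoverySnapshot {
    devices: Vec<u32>,
    capabilities: Vec<String>,
}

/// Node side: a model in the recovery map is listed as `Recovering`
/// whether or not its registry slot is currently present.
fn list_models(
    registry: &HashSet<String>,
    recovering: &HashMap<String, RecoverySnapshot>,
) -> Vec<(String, ModelStatus)> {
    let mut out: Vec<(String, ModelStatus)> = registry
        .iter()
        .map(|name| {
            let status = if recovering.contains_key(name) {
                ModelStatus::Recovering // poisoned slot still present
            } else {
                ModelStatus::Loaded
            };
            (name.clone(), status)
        })
        .collect();
    // Reload gap: the slot is gone, the snapshot keeps the model listed.
    for name in recovering.keys() {
        if !registry.contains(name) {
            out.push((name.clone(), ModelStatus::Recovering));
        }
    }
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

/// Gateway side: prefer a node with the model loaded; if the model is only
/// recovering, hold the route rather than treating it as missing (a missing
/// model is where a catalogue cold-load would otherwise be attempted).
fn route<'a>(
    nodes: &'a HashMap<String, Vec<(String, ModelStatus)>>,
    model: &str,
) -> Result<&'a str, RouteError> {
    let mut recovering = false;
    for (node, models) in nodes {
        for (name, status) in models {
            if name.as_str() == model {
                match status {
                    ModelStatus::Loaded => return Ok(node.as_str()),
                    ModelStatus::Recovering => recovering = true,
                }
            }
        }
    }
    if recovering {
        Err(RouteError::ModelRecovering)
    } else {
        Err(RouteError::ModelNotFound)
    }
}
```

In this sketch a request for a model that is mid-reload gets `RouteError::ModelRecovering` (503), while a model that no node lists at all still gets `ModelNotFound` (404).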