feat(neuron): bind listener before pre-warm, surface activation in /health

Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop. Concurrent on-demand
   /models/load and /v1/chat/completions traffic still flows.

2. /health gains an `activation` field carrying:
     state         pre_warming | ready
     pending       model ids queued but not started
     in_progress   model id currently loading (Option)
     completed     model ids loaded successfully this activation
     failed        [{model_id, error}] for failed entries
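   The field is `#[serde(default)]`, so a new cortex can still parse
   a pre-change neuron's /health response that lacks it. A pre-change
   cortex polling a new neuron also keeps working as long as it
   ignores unknown fields, which is serde's default.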
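
As a rough illustration of the ordering in change 1, here is a std-only
sketch: threads and `std::net` stand in for tokio and axum, and `start`,
`fetch_health`, `State` and the `load` callback are hypothetical names,
not the code in main.rs. It only shows that the socket is bound before
any model work begins and that health requests are answered while the
sequential load loop is still running.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};

// Hypothetical two-state stand-in for the real activation status.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    PreWarming,
    Ready,
}

/// Bind the listener first, then pre-warm in a background thread.
/// Returns the bound address and a handle to the pre-warm thread.
fn start<F>(models: Vec<String>, mut load: F) -> std::io::Result<(SocketAddr, JoinHandle<()>)>
where
    F: FnMut(&str) + Send + 'static,
{
    // 1. Bind before touching any model, so /health is reachable at once.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let initial = if models.is_empty() { State::Ready } else { State::PreWarming };
    let state = Arc::new(RwLock::new(initial));

    // 2. Sequential pre-warm loop, off the serving path.
    let warm_state = Arc::clone(&state);
    let pre_warm = thread::spawn(move || {
        for model in &models {
            load(model); // stand-in for a slow model materialisation
        }
        *warm_state.write().unwrap() = State::Ready;
    });

    // 3. Serve health requests regardless of pre-warm progress.
    thread::spawn(move || {
        for stream in listener.incoming() {
            if let Ok(stream) = stream {
                let current = *state.read().unwrap();
                let _ = respond(stream, current);
            }
        }
    });

    Ok((addr, pre_warm))
}

/// Answer any request with the current state as a plain-text body.
fn respond(mut stream: TcpStream, state: State) -> std::io::Result<()> {
    let mut buf = [0u8; 512];
    let _ = stream.read(&mut buf)?; // consume the (small) request
    let body = match state {
        State::PreWarming => "pre_warming",
        State::Ready => "ready",
    };
    write!(
        stream,
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}

/// Minimal client: send one request and read the whole response.
fn fetch_health(addr: SocketAddr) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

With a `load` callback that blocks, `fetch_health` returns a response
ending in `pre_warming` straight after `start`, and one ending in
`ready` once the pre-warm thread has been joined.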

`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
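
A minimal sketch of how such a tracker could look, assuming a
`begin` / `start` / `finish` / `snapshot` method set and a loader
callback that returns `Result<(), String>`. Those names and signatures
are illustrative guesses, not the actual `neuron::activation` API, and
the serde derives are left out to keep the example std-only.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ActivationState {
    PreWarming,
    #[default]
    Ready,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct PreWarmFailure {
    model_id: String,
    error: String,
}

#[derive(Debug, Clone, Default)]
struct ActivationStatus {
    state: ActivationState,
    pending: Vec<String>,
    in_progress: Option<String>,
    completed: Vec<String>,
    failed: Vec<PreWarmFailure>,
}

/// Owns the RwLock-wrapped status; share it via `Arc` with the handler.
#[derive(Debug, Default)]
struct ActivationTracker {
    inner: RwLock<ActivationStatus>,
}

impl ActivationTracker {
    /// Reset the status and queue every id as pending.
    fn begin(&self, ids: &[String]) {
        let mut s = self.inner.write().unwrap();
        *s = ActivationStatus {
            state: if ids.is_empty() { ActivationState::Ready } else { ActivationState::PreWarming },
            pending: ids.to_vec(),
            ..Default::default()
        };
    }

    /// Move one id from `pending` to `in_progress`.
    fn start(&self, id: &str) {
        let mut s = self.inner.write().unwrap();
        s.pending.retain(|p| p != id);
        s.in_progress = Some(id.to_string());
    }

    /// Record the outcome; flip to `Ready` once nothing is left to load.
    fn finish(&self, id: &str, result: Result<(), String>) {
        let mut s = self.inner.write().unwrap();
        s.in_progress = None;
        match result {
            Ok(()) => s.completed.push(id.to_string()),
            Err(error) => s.failed.push(PreWarmFailure { model_id: id.to_string(), error }),
        }
        if s.pending.is_empty() {
            s.state = ActivationState::Ready;
        }
    }

    /// Cheap clone for the /health handler; the lock is held only briefly.
    fn snapshot(&self) -> ActivationStatus {
        self.inner.read().unwrap().clone()
    }
}

/// Sequential load loop that updates the tracker per model.
/// The lock is never held while `load` runs.
fn load_default_models(
    ids: &[String],
    tracker: &ActivationTracker,
    mut load: impl FnMut(&str) -> Result<(), String>,
) {
    tracker.begin(ids);
    for id in ids {
        tracker.start(id);
        let result = load(id);
        tracker.finish(id, result);
    }
}
```

In this sketch an empty model list leaves the tracker in `Ready` with
all lists empty, and two failing loads end in `Ready` with both ids in
`failed`, mirroring the two cases the updated tests assert.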

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:18:04 +03:00
parent d3f2d50749
commit 800498f530
9 changed files with 267 additions and 25 deletions


@@ -36,8 +36,72 @@ pub struct DeviceHealth {
/// Runtime health response from a neuron endpoint.
/// Returned by `GET /health`.
///
/// `activation` was added on 2026-05-26 to distinguish "process is up
/// and reachable" from "process is ready to serve traffic". A `Type=simple`
/// systemd unit reports `active` the moment the binary starts, but a
/// neuron whose `default_models` list takes minutes to materialise
/// would not bind its listener (old flow) or will not have every
/// default model loaded (new flow) until pre-warm completes. The field
/// is `#[serde(default)]`, so a new gateway can still parse a
/// pre-2026-05-26 neuron's response that lacks it; an older gateway
/// polling a new neuron keeps working as long as it ignores unknown
/// fields, which is serde's default.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HealthResponse {
    pub uptime_secs: u64,
    pub devices: Vec<DeviceHealth>,
    #[serde(default)]
    pub activation: ActivationStatus,
}

/// High-level activation state of the neuron daemon. The HTTP listener
/// is bound during both states; what differs is whether the configured
/// `default_models` have finished loading.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, Default, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ActivationState {
    /// At least one `default_models` entry is still loading. The
    /// neuron's other endpoints work, but inference against
    /// not-yet-loaded models will 404.
    PreWarming,
    /// Every `default_models` entry has either loaded or failed; the
    /// neuron is steady-state. Subsequent on-demand loads via
    /// `/models/load` don't flip back to `PreWarming`: this state
    /// reflects the activation-time set only.
    #[default]
    Ready,
}

/// Per-model failure record surfaced in [`ActivationStatus::failed`].
/// The error string is the rendered anyhow chain at the time of the
/// failure; operators read it from `/health` to decide whether to
/// retry, edit the spec, or unload+reload.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PreWarmFailure {
    pub model_id: String,
    pub error: String,
}

/// Activation-time progress snapshot. The four progress fields are
/// populated by the neuron's pre-warm task and read by the `/health`
/// handler. The snapshot is consistent: each `default_models` id
/// appears in exactly one of `pending`, `in_progress` (an
/// `Option<String>`), `completed`, or `failed` at any point in time.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct ActivationStatus {
    pub state: ActivationState,
    /// Model ids queued but not yet started. Empty in `Ready` state.
    #[serde(default)]
    pub pending: Vec<String>,
    /// Model id currently materialising. None when between models or
    /// in `Ready` state.
    #[serde(default)]
    pub in_progress: Option<String>,
    /// Model ids that finished loading successfully during this
    /// activation. Cleared on process restart.
    #[serde(default)]
    pub completed: Vec<String>,
    /// Model ids that failed during this activation, with the rendered
    /// error chain. Cleared on process restart.
    #[serde(default)]
    pub failed: Vec<PreWarmFailure>,
}
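
The "exactly one bucket" rule in the `ActivationStatus` doc comment
lends itself to a small consumer-side check. The sketch below is not
part of this commit: `ModelProgress`, `progress_of` and `is_consistent`
are hypothetical helpers, and the structs are trimmed std-only mirrors
of the ones in the diff (no serde derives, no `state` field). It shows
how a poller might classify a model id against a snapshot and verify
that no id appears in two buckets.

```rust
use std::collections::HashSet;

// Trimmed mirrors of the diff's types, for illustration only.
struct PreWarmFailure {
    model_id: String,
    error: String,
}

struct ActivationStatus {
    pending: Vec<String>,
    in_progress: Option<String>,
    completed: Vec<String>,
    failed: Vec<PreWarmFailure>,
}

/// Where a given model id sits in the activation snapshot.
#[derive(Debug, PartialEq, Eq)]
enum ModelProgress<'a> {
    Pending,
    InProgress,
    Completed,
    /// Carries the rendered error string from the snapshot.
    Failed(&'a str),
    /// Not part of `default_models` (for example, loaded on demand).
    NotInActivationSet,
}

/// Classify one model id by checking the buckets in pipeline order.
fn progress_of<'a>(status: &'a ActivationStatus, model_id: &str) -> ModelProgress<'a> {
    if status.pending.iter().any(|id| id == model_id) {
        return ModelProgress::Pending;
    }
    if status.in_progress.as_deref() == Some(model_id) {
        return ModelProgress::InProgress;
    }
    if status.completed.iter().any(|id| id == model_id) {
        return ModelProgress::Completed;
    }
    if let Some(failure) = status.failed.iter().find(|f| f.model_id == model_id) {
        return ModelProgress::Failed(&failure.error);
    }
    ModelProgress::NotInActivationSet
}

/// True when no model id appears in more than one bucket.
fn is_consistent(status: &ActivationStatus) -> bool {
    let mut seen = HashSet::new();
    status
        .pending
        .iter()
        .chain(status.in_progress.iter())
        .chain(status.completed.iter())
        .chain(status.failed.iter().map(|f| &f.model_id))
        .all(|id| seen.insert(id.as_str()))
}
```

A gateway could use something like `progress_of` to tell "still
loading, retry later" apart from "failed, needs operator attention"
instead of treating every not-yet-servable model the same way.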