feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop. Concurrent on-demand
   /models/load and /v1/chat/completions traffic still flows.

2. /health gains an `activation` field carrying:

     state        pre_warming | ready
     pending      model ids queued but not started
     in_progress  model id currently loading (Option)
     completed    model ids loaded successfully this activation
     failed       [{model_id, error}] for failed entries

   The field is `#[serde(default)]` so a pre-change cortex polling a
   new neuron (or vice versa) keeps working.

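The ordering in item 1 can be sketched with the standard library alone. This is an illustrative stand-in, not the commit's code: the real daemon uses axum and a spawned async task, whereas `start`, `load_model`, `Started`, and the `ready` flag below are hypothetical names, and a thread plus an `AtomicBool` replace the tracker.

```rust
use std::net::{SocketAddr, TcpListener};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a slow model load.
fn load_model(_model_id: &str) {
    thread::sleep(Duration::from_millis(20));
}

/// Everything the caller needs after start-up (hypothetical shape).
pub struct Started {
    pub listener: TcpListener,
    pub addr: SocketAddr,
    pub ready: Arc<AtomicBool>,
    pub pre_warm: JoinHandle<()>,
}

pub fn start(default_models: Vec<String>) -> std::io::Result<Started> {
    // Bind first: the port accepts connections before any model is loaded.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Pre-warm runs in the background and loads models sequentially.
    let ready = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&ready);
    let pre_warm = thread::spawn(move || {
        for model_id in &default_models {
            load_model(model_id);
        }
        // Flip to "ready" only after every configured model was attempted.
        flag.store(true, Ordering::SeqCst);
    });

    Ok(Started { listener, addr, ready, pre_warm })
}
```

Because the bind happens before the background task starts, a health probe can connect immediately and read whatever progress the task has published so far, rather than timing out until the last model has loaded.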
`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
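A minimal sketch of what such a tracker might look like, assuming a simple begin/finish protocol per model. The commit message does not show the real `ActivationTracker` API, so `Tracker`, `begin`, `finish`, and `snapshot` are hypothetical names, and failures are stored as plain `(model_id, error)` tuples rather than `PreWarmFailure`.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    PreWarming,
    Ready,
}

#[derive(Debug, Clone)]
pub struct Status {
    pub state: State,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, rendered error)
}

/// Cheap to clone: every clone shares the same RwLock-wrapped state.
#[derive(Clone)]
pub struct Tracker {
    inner: Arc<RwLock<Status>>,
}

impl Tracker {
    /// An empty model list is a no-op: the tracker starts out `Ready`.
    pub fn new(model_ids: &[&str]) -> Self {
        let pending: Vec<String> = model_ids.iter().map(|s| s.to_string()).collect();
        let state = if pending.is_empty() { State::Ready } else { State::PreWarming };
        let status = Status {
            state,
            pending,
            in_progress: None,
            completed: Vec::new(),
            failed: Vec::new(),
        };
        Tracker { inner: Arc::new(RwLock::new(status)) }
    }

    /// Move a model from `pending` to `in_progress`.
    pub fn begin(&self, model_id: &str) {
        let mut s = self.inner.write().unwrap();
        s.pending.retain(|id| id != model_id);
        s.in_progress = Some(model_id.to_string());
    }

    /// Record the outcome; flip to `Ready` once nothing is left pending.
    pub fn finish(&self, model_id: &str, result: Result<(), String>) {
        let mut s = self.inner.write().unwrap();
        s.in_progress = None;
        match result {
            Ok(()) => s.completed.push(model_id.to_string()),
            Err(e) => s.failed.push((model_id.to_string(), e)),
        }
        if s.pending.is_empty() {
            s.state = State::Ready;
        }
    }

    /// Clone the current state for a reader such as a health handler.
    pub fn snapshot(&self) -> Status {
        self.inner.read().unwrap().clone()
    }
}
```

In this sketch the loader would hold one clone and call `begin`/`finish` around each load, while the `/health` handler holds another and only ever calls `snapshot`, so readers never block for longer than a clone of the status.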

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -36,8 +36,72 @@ pub struct DeviceHealth {
/// Runtime health response from a neuron endpoint.
/// Returned by `GET /health`.
///
/// `activation` was added on 2026-05-26 to distinguish "process is up
/// and reachable" from "process is ready to serve traffic". A `Type=simple`
/// systemd unit reports `active` the moment the binary starts, but a
/// neuron whose `default_models` list takes minutes to materialise
/// won't bind its listener (or, in the new flow, won't have any models
/// loaded) until pre-warm completes. The new field is `#[serde(default)]`
/// so a pre-2026-05-26 gateway polling a new neuron (or vice versa)
/// keeps working.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HealthResponse {
    pub uptime_secs: u64,
    pub devices: Vec<DeviceHealth>,
    #[serde(default)]
    pub activation: ActivationStatus,
}

/// High-level activation state of the neuron daemon. The HTTP listener
/// is bound during both states; what differs is whether the configured
/// `default_models` have finished loading.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, Default, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ActivationState {
    /// At least one `default_models` entry is still loading. The
    /// neuron's other endpoints work, but inference against
    /// not-yet-loaded models will 404.
    PreWarming,
    /// Every `default_models` entry has either loaded or failed; the
    /// neuron is steady-state. Subsequent on-demand loads via
    /// `/models/load` don't flip back to `PreWarming`; the state
    /// reflects the activation-time set only.
    #[default]
    Ready,
}

/// Per-model failure record surfaced in [`ActivationStatus::failed`].
/// The error string is the rendered anyhow chain at the time of the
/// failure; operators read it from `/health` to decide whether to
/// retry, edit the spec, or unload+reload.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PreWarmFailure {
    pub model_id: String,
    pub error: String,
}

/// Activation-time progress snapshot. All four progress fields are
/// populated by the neuron's pre-warm task and read by the `/health`
/// handler. The snapshot is consistent: a model id appears in exactly
/// one of `pending`, `in_progress` (as `Option<String>`), `completed`,
/// or `failed` at any point in time.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct ActivationStatus {
    pub state: ActivationState,
    /// Model ids queued but not yet started. Empty in `Ready` state.
    #[serde(default)]
    pub pending: Vec<String>,
    /// Model id currently materialising. None when between models or
    /// in `Ready` state.
    #[serde(default)]
    pub in_progress: Option<String>,
    /// Model ids that finished loading successfully during this
    /// activation. Cleared on process restart.
    #[serde(default)]
    pub completed: Vec<String>,
    /// Model ids that failed during this activation, with the rendered
    /// error chain. Cleared on process restart.
    #[serde(default)]
    pub failed: Vec<PreWarmFailure>,
}
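
The consistency rule in the `ActivationStatus` doc comment (each model id sits in exactly one bucket) lends itself to a small check. The sketch below is an assumption-laden illustration rather than code from this commit: `Snapshot` is a std-only mirror of the four progress fields, `is_consistent` is a hypothetical helper, and it assumes the configured `default_models` ids are unique.

```rust
use std::collections::HashMap;

/// Std-only mirror of the progress fields (hypothetical; no serde here).
pub struct Snapshot {
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, rendered error)
}

/// True when every configured model id appears in exactly one bucket
/// and no unknown id appears anywhere.
pub fn is_consistent(s: &Snapshot, default_models: &[&str]) -> bool {
    let mut seen: HashMap<&str, usize> = HashMap::new();

    // Walk all four buckets as one stream of ids.
    let ids = s
        .pending
        .iter()
        .map(String::as_str)
        .chain(s.in_progress.iter().map(String::as_str))
        .chain(s.completed.iter().map(String::as_str))
        .chain(s.failed.iter().map(|(id, _)| id.as_str()));

    for id in ids {
        *seen.entry(id).or_insert(0) += 1;
    }

    seen.len() == default_models.len()
        && default_models.iter().all(|id| seen.get(id) == Some(&1))
}
```

A check of this shape could back a test assertion after each tracker transition; it would report `false` if, say, a model were pushed to `completed` without first being removed from `pending`.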