feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop. Concurrent on-demand
   /models/load and /v1/chat/completions traffic still flows.

2. /health gains an `activation` field carrying:

     state        pre_warming | ready
     pending      model ids queued but not started
     in_progress  model id currently loading (Option)
     completed    model ids loaded successfully this activation
     failed       [{model_id, error}] for failed entries

   The field is `#[serde(default)]` so a pre-change cortex polling a
   new neuron (or vice versa) keeps working.

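The ordering in item 1 and the per-model bookkeeping in item 2 can be sketched with the standard library alone. This is a minimal illustration, not the code in main.rs: it swaps tokio and axum for `std::thread` and `std::net::TcpListener`, and `Activation`, `Shared`, `spawn_pre_warm`, `start`, and the `load` callback are hypothetical names introduced here. The harness registry and its read lock are not modelled.

```rust
use std::io;
use std::net::TcpListener;
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};

/// Simplified stand-in for the `activation` payload (assumed shape).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Activation {
    pub ready: bool, // false ~ pre_warming, true ~ ready
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<(String, String)>, // (model_id, error)
}

pub type Shared = Arc<RwLock<Activation>>;

/// Load `models` one at a time on a background thread, recording progress.
/// `load` is a hypothetical stand-in for the real model loader.
pub fn spawn_pre_warm<F>(models: Vec<String>, load: F) -> (Shared, JoinHandle<()>)
where
    F: Fn(&str) -> Result<(), String> + Send + 'static,
{
    let status: Shared = Arc::new(RwLock::new(Activation {
        ready: models.is_empty(), // nothing queued means ready immediately
        pending: models.clone(),
        ..Activation::default()
    }));
    let shared = Arc::clone(&status);
    let worker = thread::spawn(move || {
        for id in models {
            {
                let mut s = shared.write().unwrap();
                s.pending.retain(|m| m != &id);
                s.in_progress = Some(id.clone());
            } // write guard dropped: readers are not blocked during the load
            let outcome = load(id.as_str());
            let mut s = shared.write().unwrap();
            s.in_progress = None;
            match outcome {
                Ok(()) => s.completed.push(id),
                Err(e) => s.failed.push((id, e)),
            }
        }
        shared.write().unwrap().ready = true;
    });
    (status, worker)
}

/// Bind first, then start pre-warm: the socket exists before any load runs.
pub fn start<F>(models: Vec<String>, load: F) -> io::Result<(TcpListener, Shared, JoinHandle<()>)>
where
    F: Fn(&str) -> Result<(), String> + Send + 'static,
{
    let listener = TcpListener::bind("127.0.0.1:0")?; // port 0: OS picks a free port
    let (status, worker) = spawn_pre_warm(models, load);
    Ok((listener, status, worker))
}
```

Because the status lock is released while `load` runs, a reader on another thread (the /health handler in the real daemon) can observe `in_progress` during a slow load instead of waiting for it to finish.
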
`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
93
crates/neuron/src/activation.rs
Normal file
@@ -0,0 +1,93 @@
//! Activation-time pre-warm progress tracking.
//!
//! Wraps the [`ActivationStatus`] snapshot in an async RwLock so the
//! background pre-warm task can update it per-model while the
//! `/health` handler reads coherent snapshots. The tracker exists
//! because `default_models` loading moved from synchronous-before-bind
//! to background-after-bind on 2026-05-26: the listener is up
//! immediately, but `/health` now needs to tell callers which of the
//! configured defaults are still warming.

use cortex_core::discovery::{ActivationState, ActivationStatus, PreWarmFailure};
use cortex_core::harness::ModelSpec;
use tokio::sync::RwLock;

/// Shared, async-safe handle to the daemon's activation progress.
///
/// Construct once in `main` with the configured `default_models` so
/// the initial `pending` list matches the spec; clone the `Arc` into
/// the `NeuronState` for HTTP handlers and into the spawned pre-warm
/// task for updates.
pub struct ActivationTracker {
    inner: RwLock<ActivationStatus>,
}

impl ActivationTracker {
    /// Build a tracker primed with one entry per spec. An empty spec
    /// list yields a `Ready` tracker — no point reporting PreWarming
    /// when there's nothing queued.
    pub fn new(default_models: &[ModelSpec]) -> Self {
        let pending: Vec<String> = default_models.iter().map(|s| s.model_id.clone()).collect();
        let state = if pending.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        Self {
            inner: RwLock::new(ActivationStatus {
                state,
                pending,
                in_progress: None,
                completed: vec![],
                failed: vec![],
            }),
        }
    }

    /// Mark a model as in-progress: remove it from `pending`, set as
    /// `in_progress`. Called immediately before `registry.load_model`.
    pub async fn start_loading(&self, model_id: &str) {
        let mut s = self.inner.write().await;
        s.pending.retain(|m| m != model_id);
        s.in_progress = Some(model_id.to_string());
    }

    /// Mark a model as completed: clear `in_progress` (if it matches),
    /// append to `completed`.
    pub async fn complete_loading(&self, model_id: &str) {
        let mut s = self.inner.write().await;
        if s.in_progress.as_deref() == Some(model_id) {
            s.in_progress = None;
        }
        s.completed.push(model_id.to_string());
    }

    /// Mark a model as failed: clear `in_progress` (if it matches),
    /// append a `PreWarmFailure` carrying the rendered error chain.
    pub async fn fail_loading(&self, model_id: &str, error: &str) {
        let mut s = self.inner.write().await;
        if s.in_progress.as_deref() == Some(model_id) {
            s.in_progress = None;
        }
        s.failed.push(PreWarmFailure {
            model_id: model_id.to_string(),
            error: error.to_string(),
        });
    }

    /// Flip the high-level `state` to `Ready` once the pre-warm task
    /// is done iterating. Pending should be empty by this point; if a
    /// caller bails early it's a stuck activation and the operator
    /// will see entries in `pending` even with `state=ready` — that's
    /// a useful diagnostic, not an inconsistency to scrub.
    pub async fn mark_ready(&self) {
        let mut s = self.inner.write().await;
        s.state = ActivationState::Ready;
        s.in_progress = None;
    }

    /// Cheap clone of the current state for the `/health` handler.
    pub async fn snapshot(&self) -> ActivationStatus {
        self.inner.read().await.clone()
    }
}
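
The doc comment on `mark_ready` treats leftover `pending` entries under `state=ready` as a diagnostic. One way a `/health` consumer might act on a snapshot is sketched below. The `Verdict` enum and `classify` function are hypothetical and not part of this commit. The snapshot types are local mirrors whose fields follow the constructor in `new`; the real definitions live in `cortex_core::discovery`, are not shown in this diff, and the derives here are assumptions.

```rust
/// Local mirror of the snapshot state (assumed to have exactly these variants).
#[derive(Clone, Debug, PartialEq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

/// Local mirror of a failed pre-warm entry.
#[derive(Clone, Debug, PartialEq)]
pub struct PreWarmFailure {
    pub model_id: String,
    pub error: String,
}

/// Local mirror of the `/health` activation snapshot.
#[derive(Clone, Debug, PartialEq)]
pub struct ActivationStatus {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<PreWarmFailure>,
}

/// Hypothetical operator-side reading of a snapshot.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Warming,  // pre-warm still running
    Healthy,  // ready, nothing pending, nothing failed
    Degraded, // ready, but at least one default model failed to load
    Stuck,    // ready, yet models were never started (pre-warm bailed early)
}

pub fn classify(s: &ActivationStatus) -> Verdict {
    match s.state {
        ActivationState::PreWarming => Verdict::Warming,
        // Checked before `failed`: a stuck activation is the louder signal.
        ActivationState::Ready if !s.pending.is_empty() => Verdict::Stuck,
        ActivationState::Ready if !s.failed.is_empty() => Verdict::Degraded,
        ActivationState::Ready => Verdict::Healthy,
    }
}
```

Checking `pending` ahead of `failed` is a design choice of this sketch: a snapshot that is both stuck and has failures reports `Stuck`.
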