feat(neuron): bind listener before pre-warm, surface activation in /health

Two coupled changes addressing the 2026-05-26 validate-neuron failure where a fresh deploy of beast had /health unreachable for ~5 minutes while Qwen3.6-27B q5k materialised, even though systemd reported the unit as active. 1. main.rs no longer awaits load_default_models before binding axum. The listener binds first; pre-warm runs in a spawned background task that holds a read lock on the harness registry for the duration of its sequential load loop. Concurrent on-demand /models/load and /v1/chat/completions traffic still flow. 2. /health gains an `activation` field carrying: state pre_warming | ready pending model ids queued but not started in_progress model id currently loading (Option) completed model ids loaded successfully this activation failed [{model_id, error}] for failed entries The field is `#[serde(default)]` so a pre-change cortex polling a new neuron — or vice versa — keeps working. `ActivationTracker` (new module `neuron::activation`) owns the RwLock-wrapped state; load_default_models takes a tracker reference and updates it per-model. NeuronState holds an Arc clone for the /health handler. Tests updated to construct trackers and assert state transitions (empty noop, two failures → ready with both in `failed`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:18:04 +03:00
parent d3f2d50749
commit 800498f530
9 changed files with 267 additions and 25 deletions
--- a/crates/cortex-core/src/discovery.rs
+++ b/crates/cortex-core/src/discovery.rs
@@ -36,8 +36,72 @@ pub struct DeviceHealth {

 /// Runtime health response from a neuron endpoint.
 /// Returned by `GET /health`.
+///
+/// `activation` was added in 2026-05-26 to distinguish "process is up
+/// and reachable" from "process is ready to serve traffic". A `Type=simple`
+/// systemd unit reports `active` the moment the binary starts — but a
+/// neuron whose `default_models` list takes minutes to materialise
+/// won't bind its listener (or, in the new flow, won't have any models
+/// loaded) until pre-warm completes. The new field is `#[serde(default)]`
+/// so a pre-2026-05-26 gateway polling a new neuron — or vice versa —
+/// keeps working.
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct HealthResponse {
    pub uptime_secs: u64,
    pub devices: Vec<DeviceHealth>,
+    #[serde(default)]
+    pub activation: ActivationStatus,
+}
+
+/// High-level activation state of the neuron daemon. The HTTP listener
+/// is bound during both states; what differs is whether the configured
+/// `default_models` have finished loading.
+#[derive(Debug, Clone, Copy, Serialize, Deserialize, Default, PartialEq, Eq)]
+#[serde(rename_all = "snake_case")]
+pub enum ActivationState {
+    /// At least one `default_models` entry is still loading. The
+    /// neuron's other endpoints work, but inference against
+    /// not-yet-loaded models will 404.
+    PreWarming,
+    /// Every `default_models` entry has either loaded or failed; the
+    /// neuron is steady-state. Subsequent on-demand loads via
+    /// `/models/load` don't flip back to PreWarming — that field
+    /// reflects the activation-time set only.
+    #[default]
+    Ready,
+}
+
+/// Per-model failure record surfaced in [`ActivationStatus::failed`].
+/// The error string is the rendered anyhow chain at the time of the
+/// failure; operators read it from `/health` to decide whether to
+/// retry, edit the spec, or unload+reload.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct PreWarmFailure {
+    pub model_id: String,
+    pub error: String,
+}
+
+/// Activation-time progress snapshot. All four lists are populated by
+/// the neuron's pre-warm task and read by the `/health` handler. The
+/// snapshot is consistent: a model id appears in exactly one of
+/// `pending`, `in_progress` (as `Option<String>`), `completed`, or
+/// `failed` at any point in time.
+#[derive(Debug, Clone, Serialize, Deserialize, Default)]
+pub struct ActivationStatus {
+    pub state: ActivationState,
+    /// Model ids queued but not yet started. Empty in `Ready` state.
+    #[serde(default)]
+    pub pending: Vec<String>,
+    /// Model id currently materialising. None when between models or
+    /// in `Ready` state.
+    #[serde(default)]
+    pub in_progress: Option<String>,
+    /// Model ids that finished loading successfully during this
+    /// activation. Cleared on process restart.
+    #[serde(default)]
+    pub completed: Vec<String>,
+    /// Model ids that failed during this activation, with the rendered
+    /// error chain. Cleared on process restart.
+    #[serde(default)]
+    pub failed: Vec<PreWarmFailure>,
 }