feat(neuron): bind listener before pre-warm, surface activation in /health

Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.

1. main.rs no longer awaits load_default_models before binding axum.
   The listener binds first; pre-warm runs in a spawned background
   task that holds a read lock on the harness registry for the
   duration of its sequential load loop, so concurrent on-demand
   /models/load and /v1/chat/completions traffic still flows.

2. /health gains an `activation` field carrying:
     state         pre_warming | ready
     pending       model ids queued but not started
     in_progress   model id currently loading (Option)
     completed     model ids loaded successfully this activation
     failed        [{model_id, error}] for failed entries
   The field is `#[serde(default)]`, so a pre-change cortex polling a
   new neuron (or vice versa) keeps working.
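
A minimal sketch of the ordering in change 1, using only the standard
library rather than the tokio/axum stack the commit touches. `Registry`,
`Server` and `start` are hypothetical names for illustration, and the
sleep stands in for a slow model load. It shows the socket being bound
before any model is loaded, with the pre-warm loop holding a shared read
lock so other readers are not blocked.

```rust
use std::net::TcpListener;
use std::sync::{Arc, Mutex, RwLock};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the harness registry. Interior mutability
/// lets a loader record models while holding only a read lock.
#[derive(Default)]
pub struct Registry {
    pub loaded: Mutex<Vec<String>>,
}

/// Hypothetical handle returned once the listener is bound.
pub struct Server {
    pub listener: TcpListener,
    pub registry: Arc<RwLock<Registry>>,
    pub prewarm: JoinHandle<()>,
}

pub fn start(bind_addr: &str, default_models: Vec<String>) -> std::io::Result<Server> {
    // 1. Bind first, so health probes can connect straight away.
    let listener = TcpListener::bind(bind_addr)?;
    let registry = Arc::new(RwLock::new(Registry::default()));

    // 2. Pre-warm in the background. The read lock is shared, so
    //    request handlers that also take read locks are not blocked.
    let background = Arc::clone(&registry);
    let prewarm = thread::spawn(move || {
        let guard = background.read().unwrap();
        for model in default_models {
            // Stand-in for a slow model load.
            thread::sleep(Duration::from_millis(20));
            guard.loaded.lock().unwrap().push(model);
        }
    });

    Ok(Server { listener, registry, prewarm })
}
```

The caller gets a bound listener back immediately and can start serving
/health while `prewarm` is still running.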

`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
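
A rough sketch of how such a tracker could behave, assuming a
synchronous `std::sync::RwLock` instead of the async lock implied by
`snapshot().await` in the tests. The method names `begin` and `finish`,
the `FailedModel` type, and the choice to start an empty tracker in
`Ready` are assumptions, not taken from the commit.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailedModel {
    pub model_id: String,
    pub error: String,
}

#[derive(Debug, Clone)]
pub struct ActivationSnapshot {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<FailedModel>,
}

pub struct ActivationTracker {
    inner: RwLock<ActivationSnapshot>,
}

impl ActivationTracker {
    /// Assumption: with nothing queued, the tracker starts out ready.
    pub fn new(model_ids: &[&str]) -> Self {
        let state = if model_ids.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        Self {
            inner: RwLock::new(ActivationSnapshot {
                state,
                pending: model_ids.iter().map(|id| id.to_string()).collect(),
                in_progress: None,
                completed: Vec::new(),
                failed: Vec::new(),
            }),
        }
    }

    /// Move a model from `pending` to `in_progress`.
    pub fn begin(&self, model_id: &str) {
        let mut s = self.inner.write().unwrap();
        s.pending.retain(|id| id.as_str() != model_id);
        s.in_progress = Some(model_id.to_string());
    }

    /// Record the outcome; flip to `Ready` once nothing is pending.
    pub fn finish(&self, model_id: &str, result: Result<(), String>) {
        let mut s = self.inner.write().unwrap();
        s.in_progress = None;
        match result {
            Ok(()) => s.completed.push(model_id.to_string()),
            Err(error) => s.failed.push(FailedModel {
                model_id: model_id.to_string(),
                error,
            }),
        }
        if s.pending.is_empty() {
            s.state = ActivationState::Ready;
        }
    }

    /// Cheap copy of the current state for the /health handler.
    pub fn snapshot(&self) -> ActivationSnapshot {
        self.inner.read().unwrap().clone()
    }
}
```

A failed load still counts as finished in this sketch, which matches the
test expectation that two failures end in `Ready` with both entries in
`failed`.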

Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 15:18:04 +03:00
parent d3f2d50749
commit 800498f530
9 changed files with 267 additions and 25 deletions


@@ -2,7 +2,9 @@
//! individual failures so a single broken catalogue entry doesn't
//! prevent the rest of the fleet from starting.
+use cortex_core::discovery::ActivationState;
use cortex_core::harness::{HarnessConfig, ModelSpec};
+use neuron::activation::ActivationTracker;
use neuron::config::HarnessSettings;
use neuron::harness::HarnessRegistry;
use neuron::startup;
@@ -37,7 +39,8 @@ async fn test_load_default_models_skips_unknown_harness() {
},
];
-startup::load_default_models(&registry, &specs).await;
+let activation = ActivationTracker::new(&specs);
+startup::load_default_models(&registry, &specs, &activation).await;
let listed = registry
.list_all_models()
@@ -47,10 +50,28 @@ async fn test_load_default_models_skips_unknown_harness() {
listed.is_empty(),
"no models should be loaded after failed entries"
);
+// Both specs should land in `failed`; tracker should flip to ready.
+let snapshot = activation.snapshot().await;
+assert_eq!(snapshot.state, ActivationState::Ready);
+assert!(snapshot.pending.is_empty());
+assert!(snapshot.in_progress.is_none());
+assert!(snapshot.completed.is_empty());
+assert_eq!(snapshot.failed.len(), 2);
+let failed_ids: Vec<&str> = snapshot
+.failed
+.iter()
+.map(|f| f.model_id.as_str())
+.collect();
+assert!(failed_ids.contains(&"model-a"));
+assert!(failed_ids.contains(&"model-b"));
}
#[tokio::test]
async fn test_load_default_models_empty_is_noop() {
let registry = HarnessRegistry::new();
-startup::load_default_models(&registry, &[]).await;
+let activation = ActivationTracker::new(&[]);
+startup::load_default_models(&registry, &[], &activation).await;
+let snapshot = activation.snapshot().await;
+assert_eq!(snapshot.state, ActivationState::Ready);
}


@@ -1,4 +1,5 @@
use cortex_core::discovery::{DeviceInfo, DiscoveryResponse};
+use neuron::activation::ActivationTracker;
use neuron::api::{self, NeuronState};
use neuron::harness::HarnessRegistry;
use neuron::health::HealthCache;
@@ -15,6 +16,7 @@ async fn spawn_neuron(discovery: DiscoveryResponse) -> String {
health_cache,
registry: RwLock::new(registry),
candle: None,
+activation: Arc::new(ActivationTracker::new(&[])),
});
let app = api::neuron_routes().with_state(state);
@@ -160,6 +162,7 @@ async fn test_candle_harness_registers_and_rejects_bogus_model() {
health_cache,
registry: RwLock::new(registry),
candle,
+activation: Arc::new(ActivationTracker::new(&[])),
});
let app = api::neuron_routes().with_state(state);
@@ -211,6 +214,7 @@ async fn test_chat_completions_no_candle_harness() {
health_cache,
registry: RwLock::new(registry),
candle: None,
+activation: Arc::new(ActivationTracker::new(&[])),
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -252,6 +256,7 @@ async fn test_chat_completions_model_not_loaded() {
health_cache,
registry: RwLock::new(registry),
candle,
+activation: Arc::new(ActivationTracker::new(&[])),
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
@@ -295,6 +300,7 @@ async fn test_chat_completions_streaming_model_not_loaded() {
health_cache,
registry: RwLock::new(registry),
candle,
+activation: Arc::new(ActivationTracker::new(&[])),
});
let app = api::neuron_routes().with_state(state);
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();