test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when NEURON_DEBUG_POISON names a model, the first request for it triggers the auto-recovery path as if a device fault had occurred, exercising unload→reload→healthy without corrupting the GPU. Latched so it fires exactly once (no recovery loop). No-op unless the env var is set; wired into both the single-GPU and TP chat poison gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
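
The latch semantics described in the message can be checked in isolation. The following is a minimal sketch, not the code from this commit: `PoisonLatch`, its `new` constructor, and the model names used with it are hypothetical, and the target model is passed in as a value instead of being read from NEURON_DEBUG_POISON so the behaviour can be tested without touching the process environment.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the hook. `target` plays the role of the
/// NEURON_DEBUG_POISON value; `None` models "env var not set".
pub struct PoisonLatch {
    target: Option<String>,
    fired: AtomicBool,
}

impl PoisonLatch {
    pub fn new(target: Option<&str>) -> Self {
        Self {
            target: target.map(str::to_owned),
            fired: AtomicBool::new(false),
        }
    }

    /// Returns true at most once, and only for the targeted model.
    pub fn armed(&self, model_id: &str) -> bool {
        let matches = self.target.as_deref() == Some(model_id);
        // `&&` short-circuits, so a request for a different model never
        // touches the latch and cannot use up the single shot.
        matches && !self.fired.swap(true, Ordering::Relaxed)
    }
}
```

With a latch built for one model, the first `armed` call for that model returns true and every later call returns false, while calls for any other model return false without consuming the shot.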
@@ -830,6 +830,18 @@ fn recovering_error(model_id: &str) -> InferenceError {
     ))
 }
 
+/// Verification hook for #17 auto-recovery. When `NEURON_DEBUG_POISON`
+/// names a model, the **first** request for it (process-wide) returns
+/// true, so the request path can trigger recovery as if a device fault
+/// had occurred — exercising the unload→reload→healthy cycle without
+/// corrupting the GPU. One-shot (a `swap` latch) so it can't loop the
+/// model through endless recoveries. No-op unless the env var is set.
+fn debug_poison_armed(model_id: &str) -> bool {
+    static FIRED: std::sync::atomic::AtomicBool = std::sync::atomic::AtomicBool::new(false);
+    let armed = std::env::var("NEURON_DEBUG_POISON").ok().as_deref() == Some(model_id);
+    armed && !FIRED.swap(true, Ordering::Relaxed)
+}
+
 /// Background auto-recovery task (#17). Drains poisoned model ids and
 /// rebuilds each via [`CandleHarness::recover_one`]. Holds a `Weak` so a
 /// shutting-down harness lets the task exit; processes one id at a time,
@@ -1736,6 +1748,11 @@ impl CandleHarness {
             tracing::warn!("chat_completion: refusing request, model poisoned");
             return Err(self.trigger_recovery(&model_id).await);
         }
+        if debug_poison_armed(&model_id) {
+            let _g = span.enter();
+            tracing::warn!("NEURON_DEBUG_POISON: forcing auto-recovery (#17 verification)");
+            return Err(self.trigger_recovery(&model_id).await);
+        }
 
         // Serialise concurrent requests against this model. Holds for
         // the duration of clear_kv_cache → prefill → decode so two
@@ -2988,6 +3005,11 @@ impl CandleHarness {
             tracing::warn!("TP chat_completion: refusing request, model poisoned");
             return Err(self.trigger_recovery(&model_id).await);
         }
+        if debug_poison_armed(&model_id) {
+            let _g = span.enter();
+            tracing::warn!("NEURON_DEBUG_POISON: forcing auto-recovery (#17 verification)");
+            return Err(self.trigger_recovery(&model_id).await);
+        }
 
         // Reject image-bearing requests against a TP model with no
         // vision tower, cleanly (`vision_unsupported`) rather than
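
Because the same hook sits in both the single-GPU and TP request paths, several requests may reach it at roughly the same time. The sketch below illustrates why a single atomic `swap` is enough to keep the hook one-shot under that kind of race. It is an illustration only: `winners` is a hypothetical helper, and the threads stand in for concurrent requests, not for the harness's actual async tasks.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

/// Races `threads` callers against one swap latch and returns how many
/// of them observed the latch as not yet fired.
pub fn winners(threads: usize) -> usize {
    let fired = AtomicBool::new(false);
    let wins = AtomicUsize::new(0);

    // Scoped threads may borrow `fired` and `wins`; all of them are
    // joined before `scope` returns.
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                // `swap` is one atomic read-modify-write: exactly one
                // caller can read `false` back, whatever the interleaving.
                if !fired.swap(true, Ordering::Relaxed) {
                    wins.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    wins.load(Ordering::Relaxed)
}
```

A separate load followed by a store would leave a window in which two callers both see `false`; the read-modify-write closes it. `Relaxed` ordering is sufficient here because only the atomicity of the one flag matters and no other memory is published through it.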