test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when NEURON_DEBUG_POISON names a model, the first request for it triggers the auto-recovery path as if a device fault had occurred — exercising unload→reload→healthy without corrupting the GPU. Latched so it fires exactly once (no recovery loop). No-op unless the env var is set; wired into both the single-GPU and TP chat poison gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -830,6 +830,18 @@ fn recovering_error(model_id: &str) -> InferenceError {
     ))
 }
 
+/// Verification hook for #17 auto-recovery. When `NEURON_DEBUG_POISON`
+/// names a model, the **first** request for it (process-wide) returns
+/// true, so the request path can trigger recovery as if a device fault
+/// had occurred — exercising the unload→reload→healthy cycle without
+/// corrupting the GPU. One-shot (a `swap` latch) so it can't loop the
+/// model through endless recoveries. No-op unless the env var is set.
+fn debug_poison_armed(model_id: &str) -> bool {
+    static FIRED: std::sync::atomic::AtomicBool = std::sync::atomic::AtomicBool::new(false);
+    let armed = std::env::var("NEURON_DEBUG_POISON").ok().as_deref() == Some(model_id);
+    armed && !FIRED.swap(true, Ordering::Relaxed)
+}
+
 /// Background auto-recovery task (#17). Drains poisoned model ids and
 /// rebuilds each via [`CandleHarness::recover_one`]. Holds a `Weak` so a
 /// shutting-down harness lets the task exit; processes one id at a time,
@@ -1736,6 +1748,11 @@ impl CandleHarness {
             tracing::warn!("chat_completion: refusing request, model poisoned");
             return Err(self.trigger_recovery(&model_id).await);
         }
+        if debug_poison_armed(&model_id) {
+            let _g = span.enter();
+            tracing::warn!("NEURON_DEBUG_POISON: forcing auto-recovery (#17 verification)");
+            return Err(self.trigger_recovery(&model_id).await);
+        }
 
         // Serialise concurrent requests against this model. Holds for
         // the duration of clear_kv_cache → prefill → decode so two
@@ -2988,6 +3005,11 @@ impl CandleHarness {
             tracing::warn!("TP chat_completion: refusing request, model poisoned");
             return Err(self.trigger_recovery(&model_id).await);
         }
+        if debug_poison_armed(&model_id) {
+            let _g = span.enter();
+            tracing::warn!("NEURON_DEBUG_POISON: forcing auto-recovery (#17 verification)");
+            return Err(self.trigger_recovery(&model_id).await);
+        }
 
         // Reject image-bearing requests against a TP model with no
         // vision tower, cleanly (`vision_unsupported`) rather than
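The one-shot behaviour of `debug_poison_armed` can be checked in isolation with a minimal sketch like the one below. It is not the code from this commit: `poison_armed` is a hypothetical stand-in that takes the latch and the environment value as parameters (instead of a function-local `static` and `std::env::var`) so the logic can be exercised deterministically, and the model names are made up.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical, parameterised variant of the hook. `env_value` stands in
/// for the result of reading `NEURON_DEBUG_POISON`; `fired` stands in for
/// the process-wide latch.
fn poison_armed(fired: &AtomicBool, env_value: Option<&str>, model_id: &str) -> bool {
    let armed = env_value == Some(model_id);
    // `&&` short-circuits, so the latch is only consumed by a matching
    // model; `swap` returns the previous value, so only the first
    // matching call sees `false` and reports true.
    armed && !fired.swap(true, Ordering::Relaxed)
}

fn main() {
    let fired = AtomicBool::new(false);
    let env = Some("model-a"); // assumed value of the env var

    // A request for a different model neither fires nor consumes the latch.
    assert!(!poison_armed(&fired, env, "model-b"));
    // The first matching request fires.
    assert!(poison_armed(&fired, env, "model-a"));
    // Every later matching request is a no-op, so recovery cannot loop.
    assert!(!poison_armed(&fired, env, "model-a"));

    // With the variable unset the hook never fires, even on a fresh latch.
    let fresh = AtomicBool::new(false);
    assert!(!poison_armed(&fresh, None, "model-a"));
}
```

Because the latch in the commit is a single `static`, the single-GPU and TP gates share it: whichever path sees the named model first consumes the one shot for the lifetime of the process.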