Files
helexa/script
rob thijssen b3dc835375
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m23s
build-prerelease / Build cortex binary (push) Successful in 2m29s
build-prerelease / Build helexa-bench binary (push) Successful in 2m34s
build-prerelease / Test (push) Successful in 4m33s
build-prerelease / Build neuron-blackwell (push) Successful in 1m31s
build-prerelease / Build neuron-ada (push) Successful in 2m13s
build-prerelease / Build neuron-ampere (push) Successful in 2m50s
build-prerelease / Package helexa-bench RPM (push) Successful in 1m17s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 1m38s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 1m42s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 1m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 55s
ci: bound job runtime + stop dropping sccache on rustc signal-death
A neuron-blackwell build hung ~90 min (siblings finished in 2) and there
was no job timeout to kill it, so it sat burning a runner. Root cause of
the hang: the inline retry loop treated every failure identically and, on
its final attempt, rebuilt with sccache disabled. When the real failure
is a rustc SIGSEGV or an OOM-kill, an uncached rebuild does *more* work
under the same memory pressure — turning one transient compiler crash
into a wedged job.

Two fixes:

1. timeout-minutes on every job in build-prerelease.yml and ci.yml
   (builds 25, neuron CUDA build/cuda-check 35, packaging 20, COPR 60,
   fast jobs 10-15). A hang now dies in minutes, not hours.

2. New script/ci-cargo-escalate.sh replaces the five (prerelease) + three
   (ci) inline escalation loops. It classifies the failure:
     - signal death (exit >=128, or cargo reporting `signal: N`/SIGSEGV/
       SIGKILL) → compiler crash, NOT an sccache fault: keep the cache,
       one warm retry, then fail fast. Never escalate to uncached.
     - sccache fault (recognisable sccache error) → restart the server,
       retry, then one final uncached attempt.
     - deterministic compile/test error → fail fast (no wasteful retry).
   It also folds in the CUDA-image sccache probe the neuron/cuda-check
   jobs did inline. Classification verified locally against success,
   plain failure, exit-139, and the cargo-wrapped `signal: 11` form.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:02:50 +03:00
..