The TP-2 inference probe against Qwen3.6-27B surfaced:

    worker rank 1 ClearKvCache: expected KvCacheCleared, got GenerateStepOk

The cause is pipe poisoning. The previous shape of `generate_step`:

    for w in workers { w.send_only(GenerateStep) }  // 1. fan-out
    let logits = spawn_blocking(leader.forward)??;  // 2. early return on err
    for w in workers { w.recv_only() }              // 3. drain (skipped on 2's err)

When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen
before, an OOM, a downstream squeeze that didn't match the shape),
the function bailed before step 3. By then the workers had already
written `GenerateStepOk` to their stdout pipes, since their forwards
(and the NCCL collectives inside) completed independently of the
leader's post-collective Rust-side work.
The next call (typically `ClearKvCache` at the start of the *next*
inference request) would then send a fresh request and read those
stale replies as if they were the new operation's. Once a pipe is
poisoned, every subsequent call surfaces the same shape of error
even though nothing's actually broken.
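
The desync can be reproduced in miniature. The sketch below is not the code in `tp/mod.rs`: `WorkerPipe`, `generate_step_old` and the `leader_ok` flag are hypothetical stand-ins, and the pipe is modelled as an in-memory FIFO rather than a real stdout pipe. Only the reply names come from the error above.

```rust
use std::collections::VecDeque;

/// Replies a worker can write to its pipe (names taken from the error above).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reply {
    GenerateStepOk,
    KvCacheCleared,
}

/// Hypothetical stand-in for one worker: a FIFO of replies the leader has not read yet.
#[derive(Default)]
struct WorkerPipe {
    unread: VecDeque<Reply>,
}

impl WorkerPipe {
    /// The worker handles the request and writes its reply, whatever the leader does next.
    fn send_only(&mut self, reply_for_request: Reply) {
        self.unread.push_back(reply_for_request);
    }

    /// Reads the oldest unread reply, which is not necessarily the reply to the latest request.
    fn recv_only(&mut self) -> Option<Reply> {
        self.unread.pop_front()
    }
}

/// Old shape: fan out, then return early (skipping the drain) if the leader fails.
fn generate_step_old(worker: &mut WorkerPipe, leader_ok: bool) -> Result<(), String> {
    worker.send_only(Reply::GenerateStepOk); // 1. fan-out
    if !leader_ok {
        return Err("leader forward failed".to_string()); // 2. early return
    }
    worker.recv_only(); // 3. drain, never reached on a leader error
    Ok(())
}

/// The next operation reads whatever sits at the head of the pipe.
fn clear_kv_cache(worker: &mut WorkerPipe) -> Result<(), String> {
    worker.send_only(Reply::KvCacheCleared);
    match worker.recv_only() {
        Some(Reply::KvCacheCleared) => Ok(()),
        other => Err(format!("expected KvCacheCleared, got {:?}", other)),
    }
}
```

After one failed `generate_step_old`, `clear_kv_cache` reads the stale `GenerateStepOk` and its own reply stays queued, so the pipe remains one reply behind for every later call.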
Fix: introduce two helpers in `tp/mod.rs`:

- `drain_workers(workers, check)`: reads exactly one response from
  every worker regardless of individual outcomes. Returns
  `Vec<String>` of `rank N: detail` strings for any non-OK reply.
- `combine_leader_workers(leader, worker_errs, op)`: folds the
  leader's `Result<Result<T>>` (the spawn_blocking shape) with the
  worker drain into a single `Result<T>`. Leader failure takes
  precedence but worker errors get appended so both halves surface.
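
A minimal sketch of what the two helpers could look like, assuming plain `String` errors and a hypothetical `Worker` trait. The real signatures, error types and worker handle in `tp/mod.rs` are not shown in this message, so everything beyond the two function names and the `rank N: detail` format is an assumption; `FakeWorker` exists only to exercise the sketch.

```rust
/// Hypothetical worker handle; the real type is not shown in this message.
trait Worker {
    fn rank(&self) -> usize;
    fn recv_only(&mut self) -> Result<String, String>;
}

/// Reads exactly one reply from every worker and collects `rank N: detail`
/// strings for replies that fail to arrive or fail `check`.
fn drain_workers<W: Worker>(
    workers: &mut [W],
    check: impl Fn(&str) -> Result<(), String>,
) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers.iter_mut() {
        // Keep going after a failure so no pipe is left with an unread reply.
        let outcome = w.recv_only().and_then(|reply| check(&reply));
        if let Err(detail) = outcome {
            errs.push(format!("rank {}: {}", w.rank(), detail));
        }
    }
    errs
}

/// Folds the leader's nested result with the worker errors into one result.
/// The leader's failure comes first; worker errors are appended.
fn combine_leader_workers<T>(
    leader: Result<Result<T, String>, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    // Flatten the outer (task join) and inner (forward) failures.
    let leader = leader.and_then(|inner| inner);
    let workers = worker_errs.join("; ");
    match (leader, worker_errs.is_empty()) {
        (Ok(value), true) => Ok(value),
        (Ok(_), false) => Err(format!("{}: worker errors: {}", op, workers)),
        (Err(e), true) => Err(format!("{}: leader: {}", op, e)),
        (Err(e), false) => Err(format!("{}: leader: {}; worker errors: {}", op, e, workers)),
    }
}

/// Test double used only for this sketch: always returns a canned reply.
struct FakeWorker {
    rank: usize,
    reply: Result<String, String>,
}

impl Worker for FakeWorker {
    fn rank(&self) -> usize {
        self.rank
    }
    fn recv_only(&mut self) -> Result<String, String> {
        self.reply.clone()
    }
}
```

The point of the split is that `drain_workers` never short-circuits, so the decision about which error to report is deferred until every pipe has been read.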
`generate_step` and `clear_kv_cache` now use this pattern. Worst case:
both halves fail and the operator sees a combined error message;
either way the pipes are always drained so the next call's recv
matches the request it sent.
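
The resulting control flow can be sketched as fan-out, hold the leader result, drain unconditionally, then combine. This is an illustration under simplifying assumptions, not the actual `generate_step`: `Pipe` and `generate_step_fixed` are hypothetical, the pipes are in-memory queues, and the leader's result is passed in rather than computed.

```rust
use std::collections::VecDeque;

/// Hypothetical model of a worker's pipe: replies the leader has not read yet.
type Pipe = VecDeque<&'static str>;

fn generate_step_fixed(pipes: &mut [Pipe], leader: Result<u32, String>) -> Result<u32, String> {
    // 1. Fan-out: every worker will write exactly one reply.
    for pipe in pipes.iter_mut() {
        pipe.push_back("GenerateStepOk");
    }

    // 2. The leader's result is held in `leader` instead of being propagated with `?`.

    // 3. Drain: read one reply per worker, whatever the leader did.
    let mut worker_errs = Vec::new();
    for (idx, pipe) in pipes.iter_mut().enumerate() {
        match pipe.pop_front() {
            Some("GenerateStepOk") => {}
            other => worker_errs.push(format!("rank {}: unexpected {:?}", idx + 1, other)),
        }
    }

    // 4. Combine: leader failure first, worker errors appended.
    match (leader, worker_errs.is_empty()) {
        (Ok(value), true) => Ok(value),
        (Ok(_), false) => Err(worker_errs.join("; ")),
        (Err(e), true) => Err(e),
        (Err(e), false) => Err(format!("{}; {}", e, worker_errs.join("; "))),
    }
}
```

Even when the leader result is an `Err`, every pipe is empty on return, so the next operation reads the reply to its own request.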
Note: a model that is already loaded with poisoned pipes stays
poisoned. To recover, the operator needs to either
`POST /models/unload` + reload, or `systemctl restart neuron`. The
fix prevents *future* desync; it doesn't repair existing stale pipe
state.
Stage 7c-ii crash detection was tracked as the canonical solution to
this class of issue; this is the minimum-viable subset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>