rob thijssen d1a4aad91d
fix(tp): always drain worker responses on leader failure
The TP-2 inference probe against Qwen3.6-27B surfaced:
    worker rank 1 ClearKvCache: expected KvCacheCleared, got
    GenerateStepOk

The cause is pipe poisoning. `generate_step` previously had this shape:

  for w in workers { w.send_only(GenerateStep) }   // 1. fan-out
  let logits = spawn_blocking(leader.forward)??;   // 2. early return on err
  for w in workers { w.recv_only() }               // 3. drain (skipped on 2's err)

When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen
before, an OOM, a downstream squeeze that didn't match the shape),
the function bailed before step 3 — but workers had already written
`GenerateStepOk` to their stdout pipes, since their forwards (and
the NCCL collectives inside) completed independently of the leader's
post-collective Rust-side work.

The next call (typically `ClearKvCache` at the start of the *next*
inference request) would then send a fresh request and read those
stale replies as if they were the new operation's. Once a pipe is
poisoned, every subsequent call surfaces the same shape of error
even though nothing's actually broken.
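
A minimal sketch of the desync described above, using an in-process
`std::sync::mpsc` channel as a stand-in for a worker's stdout pipe. The
`FakeWorker` type, the `Request`/`Response` enums and both sequence
functions are hypothetical illustrations, not the real code in
`tp/mod.rs`; only the names `send_only`, `recv_only`, `GenerateStep`,
`GenerateStepOk`, `ClearKvCache` and `KvCacheCleared` come from this
message.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Request {
    GenerateStep,
    ClearKvCache,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Response {
    GenerateStepOk,
    KvCacheCleared,
}

/// Hypothetical stand-in for a worker process: every request immediately
/// queues its reply on a FIFO channel, the way a real worker writes its
/// reply to the stdout pipe whether or not the leader ever reads it.
struct FakeWorker {
    tx: Sender<Response>,
    rx: Receiver<Response>,
}

impl FakeWorker {
    fn new() -> Self {
        let (tx, rx) = channel();
        Self { tx, rx }
    }

    fn send_only(&self, req: Request) {
        let reply = match req {
            Request::GenerateStep => Response::GenerateStepOk,
            Request::ClearKvCache => Response::KvCacheCleared,
        };
        self.tx.send(reply).expect("receiver is alive");
    }

    fn recv_only(&self) -> Response {
        self.rx.recv().expect("sender is alive")
    }
}

/// Old shape: the leader fails after fan-out, so the drain is skipped.
/// Returns what the following ClearKvCache call reads from the pipe.
fn poisoned_sequence() -> Response {
    let w = FakeWorker::new();
    w.send_only(Request::GenerateStep);
    // Leader forward returned Err here; the function bailed without recv.
    w.send_only(Request::ClearKvCache);
    w.recv_only() // reads the stale reply left over from GenerateStep
}

/// Fixed shape: the reply is drained even though the leader failed.
fn drained_sequence() -> Response {
    let w = FakeWorker::new();
    w.send_only(Request::GenerateStep);
    let _stale = w.recv_only(); // drain regardless of the leader's outcome
    w.send_only(Request::ClearKvCache);
    w.recv_only()
}
```

In this model `poisoned_sequence()` reads `GenerateStepOk` where
`KvCacheCleared` was expected, which is the mismatch in the probe
output, while `drained_sequence()` reads the reply that belongs to the
request it sent.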

Fix: introduce two helpers in `tp/mod.rs`:

- `drain_workers(workers, check)` — reads exactly one response from
  every worker regardless of individual outcomes. Returns
  `Vec<String>` of `rank N: detail` strings for any non-OK reply.

- `combine_leader_workers(leader, worker_errs, op)` — folds the
  leader's `Result<Result<T>>` (the spawn_blocking shape) with the
  worker drain into a single `Result<T>`. Leader failure takes
  precedence but worker errors get appended so both halves surface.

`generate_step` and `clear_kv_cache` now use this pattern. Worst case:
both halves fail and the operator sees a combined error message;
either way the pipes are always drained so the next call's recv
matches the request it sent.
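
A hedged sketch of how the two helpers could fit together. The helper
names and the `rank N: detail` format come from this message;
everything else is an assumption made for illustration: the `Worker`
trait, the `ScriptedWorker` test double, `String` as the error type
(standing in for whatever error type the crate uses) and the exact
wording of the combined error.

```rust
/// Hypothetical worker handle; the real type in `tp/mod.rs` may differ.
trait Worker {
    fn rank(&self) -> usize;
    /// Reads exactly one reply from the worker's pipe.
    fn recv_only(&mut self) -> Result<String, String>;
}

/// Reads exactly one response from every worker, never returning early,
/// and collects `rank N: detail` strings for any non-OK reply.
fn drain_workers<W: Worker>(
    workers: &mut [W],
    check: impl Fn(&str) -> Result<(), String>,
) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers.iter_mut() {
        // A read error and an unexpected reply are both recorded, but
        // neither stops the loop: the remaining pipes still get drained.
        let outcome = w.recv_only().and_then(|reply| check(&reply));
        if let Err(detail) = outcome {
            errs.push(format!("rank {}: {}", w.rank(), detail));
        }
    }
    errs
}

/// Folds the leader's nested result (the spawn_blocking shape) with the
/// worker drain. Leader failure takes precedence; worker errors are
/// appended so both halves surface.
fn combine_leader_workers<T>(
    leader: Result<Result<T, String>, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    let leader = leader.and_then(|inner| inner); // flatten join + forward
    let workers = worker_errs.join("; ");
    match (leader, worker_errs.is_empty()) {
        (Ok(value), true) => Ok(value),
        (Ok(_), false) => Err(format!("{op}: workers failed: {workers}")),
        (Err(e), true) => Err(format!("{op}: leader failed: {e}")),
        (Err(e), false) => {
            Err(format!("{op}: leader failed: {e}; workers: {workers}"))
        }
    }
}

/// Test double that replays one scripted reply and counts reads.
struct ScriptedWorker {
    rank: usize,
    reply: Result<String, String>,
    reads: usize,
}

impl Worker for ScriptedWorker {
    fn rank(&self) -> usize {
        self.rank
    }

    fn recv_only(&mut self) -> Result<String, String> {
        self.reads += 1;
        self.reply.clone()
    }
}
```

The property that matters is that `drain_workers` is called before the
leader's result is inspected, so a leader `Err` can no longer skip the
reads.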

Note: a model that already hit this bug on a running instance still
has poisoned pipes. To recover, the operator needs to either
`POST /models/unload` + reload, or `systemctl restart neuron`. The fix
prevents *future* desync; it doesn't repair existing stale pipe state.

Stage 7c-ii crash detection is tracked as the canonical solution to
this class of issue; this change is the minimum-viable subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 07:39:36 +03:00