Two concurrent chat_completion requests against the same single-GPU
model could interleave their `clear_kv_cache → forward(chunk0) →
forward(chunk1) → ...` sequences. The device-worker channel serialises
individual jobs but not the sequence boundary, so the cache could end
up holding tokens from one request while another's mask was sized for
its own prompt — producing a shape mismatch mid-prefill.
Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories`
and `memorize solutions` extensions fired 4ms apart against
Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV
cache, and request a08b4a's chunk 0 forward produced scores of shape
[1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add
failed, both requests bubbled the error up, both flipped the model to
poisoned.
Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring
the TpLoadedModel.pool lock that the TP path already held. Acquire
it at the start of `chat_completion` and inside the spawned task of
`chat_completion_stream` (so the role chunk goes out immediately and
only the inference work queues behind the lock).
The CPU branch uses `blocking_lock` from inside spawn_blocking; the
CUDA branch uses async `.lock().await` inside tokio::spawn.
Throughput impact: zero. The GPU was already serialised at the
device-worker channel — multiple requests just produced corrupt KV
cache state instead of clean serial throughput. The lock makes the
existing serialisation honest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>