cortex

fix(neuron): serialise single-GPU inference per loaded model

Two concurrent chat_completion requests against the same single-GPU
model could interleave their `clear_kv_cache → forward(chunk0) →
forward(chunk1) → ...` sequences. The device-worker channel serialises
individual jobs but not the sequence boundary, so the cache could end
up holding tokens from one request while another's mask was sized for
its own prompt — producing a shape mismatch mid-prefill.

Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories`
and `memorize solutions` extensions fired 4ms apart against
Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV
cache, and request a08b4a's chunk 0 forward produced scores of shape
[1, 32, 512, 1024] against a mask of [1, 1, 512, 512]. `broadcast_add`
failed, both requests propagated the error, and both flipped the model
to poisoned.
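
The shapes above can be checked by hand. The sketch below is a
dependency-free illustration, not the tensor library's own code:
`broadcast_shape` and `score_shape` are hypothetical helpers, and the
figure of 512 foreign tokens already in the cache is an assumption
inferred from the reported 1024, not something the logs state directly.

```rust
/// Broadcast two shapes under the usual right-aligned rule: each pair of
/// dimensions must be equal, or one of them must be 1.
/// Returns `None` when the shapes are incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Read dimensions from the right; missing leading dims count as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[rank - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}

/// Attention score shape for one prefill chunk:
/// [batch, heads, q_len, kv_len], where kv_len is the number of tokens
/// already in the cache plus the tokens in this chunk.
fn score_shape(heads: usize, cached: usize, chunk: usize) -> [usize; 4] {
    [1, heads, chunk, cached + chunk]
}
```

With an empty cache, `score_shape(32, 0, 512)` is `[1, 32, 512, 512]` and
broadcasts against the `[1, 1, 512, 512]` mask. With 512 tokens from
another request already cached, `score_shape(32, 512, 512)` is
`[1, 32, 512, 1024]`; the last dimension (1024 vs 512) is neither equal
nor 1, so `broadcast_shape` returns `None`.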

Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring
the `TpLoadedModel.pool` lock that the TP path already holds. Acquire
it at the start of `chat_completion` and inside the spawned task of
`chat_completion_stream` (so the role chunk goes out immediately and
only the inference work queues behind the lock).

The CPU branch uses `blocking_lock` from inside `spawn_blocking`; the
CUDA branch uses async `.lock().await` inside `tokio::spawn`.
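
A minimal sketch of the locking pattern, under stated assumptions: it
swaps in `std::sync::Mutex` and OS threads for `tokio::sync::Mutex` and
tokio tasks so that it builds without dependencies. The struct fields
other than `inference_lock`, the method bodies, and `run_concurrent` are
hypothetical stand-ins, not the actual neuron code. The inner `kv_cache`
mutex plays the role of the device-worker channel: it serialises
individual jobs, while `inference_lock` covers the whole sequence.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a loaded single-GPU model.
struct LoadedModel {
    /// Held for the whole clear -> forward(chunk0) -> forward(chunk1) run.
    inference_lock: Mutex<()>,
    /// Stand-in KV cache: one entry per cached token, tagged by request id.
    kv_cache: Mutex<Vec<u32>>,
}

impl LoadedModel {
    fn new() -> Self {
        LoadedModel {
            inference_lock: Mutex::new(()),
            kv_cache: Mutex::new(Vec::new()),
        }
    }

    fn clear_kv_cache(&self) {
        self.kv_cache.lock().unwrap().clear();
    }

    /// Appends `len` tokens for `request`; returns the cache length after.
    fn forward(&self, request: u32, len: usize) -> usize {
        let mut cache = self.kv_cache.lock().unwrap();
        cache.extend(std::iter::repeat(request).take(len));
        cache.len()
    }

    /// Runs one prefill under the sequence lock. Returns true if the cache
    /// length matched this request's own tokens after every chunk.
    fn chat_completion(&self, request: u32, chunks: usize, chunk_len: usize) -> bool {
        let _guard = self.inference_lock.lock().unwrap();
        self.clear_kv_cache();
        let mut clean = true;
        for i in 0..chunks {
            let cached = self.forward(request, chunk_len);
            clean &= cached == (i + 1) * chunk_len;
            thread::yield_now(); // invite interleaving between chunks
        }
        clean
    }
}

/// Streaming variant: the role chunk is sent before any lock is taken,
/// and only the inference work waits behind `inference_lock`.
fn chat_completion_stream(model: Arc<LoadedModel>, request: u32) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    tx.send("role".to_string()).unwrap();
    thread::spawn(move || {
        let ok = model.chat_completion(request, 4, 512);
        let _ = tx.send(format!("done clean={ok}"));
    });
    rx
}

/// Fires `requests` concurrent completions at one model and reports
/// whether every one of them saw a clean cache.
fn run_concurrent(requests: u32) -> bool {
    let model = Arc::new(LoadedModel::new());
    let handles: Vec<_> = (0..requests)
        .map(|id| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.chat_completion(id, 4, 512))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(true, |all, ok| all && ok)
}
```

Removing the `_guard` line from `chat_completion` leaves each `forward`
individually serialised but lets two requests interleave their chunks,
which is the failure described above.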

Throughput impact: zero. The GPU was already serialised at the
device-worker channel — multiple requests just produced corrupt KV
cache state instead of clean serial throughput. The lock makes the
existing serialisation honest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 18:54:04 +03:00

cortex-cli

feat(neuron): OpenAI-compatible non-streaming chat completion

2026-05-18 16:47:58 +03:00

cortex-core

feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)

2026-05-26 16:10:41 +03:00

cortex-gateway

feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large)

2026-05-26 16:10:41 +03:00

neuron

fix(neuron): serialise single-GPU inference per loaded model

2026-05-27 18:54:04 +03:00