All checks were successful
Stop discarding cache state between requests. When an incoming prompt's token sequence starts with the exact tokens of a stored snapshot, restore that snapshot and prefill only the divergent suffix.

For the hybrid qwen3_5 arch, a snapshot is the attention ConcatKvCache k/v, the GatedDeltaNet (GDN) conv/recurrent state, and the rope_delta counter, all captured at one token boundary. The recurrent state cannot rewind, so matching is exact-prefix only. GDN states are deep-copied in both directions because the CUDA delta-rule kernels mutate the state buffer in place; attention k/v snapshots share storage safely because append-by-cat never mutates.

Snapshots live in the device worker's state next to the model slab (Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds only an opaque id, the token sequence, and the byte size. DropArch drops a model's snapshots along with it, so unload and auto-recovery invalidate for free. CPU loads hold snapshots inline on the legacy path.

The per-model LRU registry (harness/prefix_cache.rs) is bounded by [harness.candle.prefix_cache] budget_mb / max_entries and is enabled by default; inserting a snapshot drops the entries it strictly extends. Vision requests and candle-transformers archs bypass the cache entirely (clear-every-request, unchanged).

This covers the single-GPU worker path (streaming + non-streaming) and the CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
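
For reviewers, the registry rules described above (exact-prefix lookup, LRU eviction under a byte budget and an entry cap, and dropping entries that a new snapshot strictly extends) can be sketched roughly as follows. This is an illustrative sketch only, not the code in harness/prefix_cache.rs: the names `PrefixCache`, `Entry`, `lookup`, and `insert` and their signatures are assumptions made for the example, and the budget is expressed in bytes rather than budget_mb.

```rust
use std::collections::VecDeque;

/// Async-side record of one snapshot: opaque id, token sequence, byte size.
/// (Hypothetical type, named for this sketch only.)
#[derive(Debug, Clone)]
pub struct Entry {
    pub id: u64,
    pub tokens: Vec<u32>,
    pub bytes: usize,
}

/// Hypothetical per-model registry. The front of the deque is the least
/// recently used entry, the back is the most recently used one.
pub struct PrefixCache {
    entries: VecDeque<Entry>,
    budget_bytes: usize,
    max_entries: usize,
    next_id: u64,
}

impl PrefixCache {
    pub fn new(budget_bytes: usize, max_entries: usize) -> Self {
        Self { entries: VecDeque::new(), budget_bytes, max_entries, next_id: 0 }
    }

    /// Finds the longest stored snapshot whose tokens are an exact prefix of
    /// `prompt`. Returns (snapshot id, number of tokens already covered), so
    /// the caller would prefill only `prompt[covered..]`.
    pub fn lookup(&mut self, prompt: &[u32]) -> Option<(u64, usize)> {
        let idx = self
            .entries
            .iter()
            .enumerate()
            .filter(|(_, e)| prompt.starts_with(&e.tokens))
            .max_by_key(|(_, e)| e.tokens.len())
            .map(|(i, _)| i)?;
        let hit = self.entries.remove(idx)?;
        let result = (hit.id, hit.tokens.len());
        self.entries.push_back(hit); // mark as most recently used
        Some(result)
    }

    /// Registers a snapshot. Returns its id plus the ids of every entry that
    /// was removed, so the caller can release the device-side state.
    /// If the new snapshot alone exceeds the budget, its own id is returned
    /// in the dropped list as well.
    pub fn insert(&mut self, tokens: Vec<u32>, bytes: usize) -> (u64, Vec<u64>) {
        let mut dropped = Vec::new();
        // Drop entries that the new snapshot strictly extends.
        self.entries.retain(|e| {
            let superseded =
                e.tokens.len() < tokens.len() && tokens.starts_with(&e.tokens);
            if superseded {
                dropped.push(e.id);
            }
            !superseded
        });
        let id = self.next_id;
        self.next_id += 1;
        self.entries.push_back(Entry { id, tokens, bytes });
        // Evict from the least recently used end until both bounds hold.
        while self.entries.len() > self.max_entries
            || self.total_bytes() > self.budget_bytes
        {
            match self.entries.pop_front() {
                Some(e) => dropped.push(e.id),
                None => break,
            }
        }
        (id, dropped)
    }

    fn total_bytes(&self) -> usize {
        self.entries.iter().map(|e| e.bytes).sum()
    }
}
```

In this sketch, a prompt `[1, 2, 3, 4, 5]` looked up against a stored `[1, 2, 3]` yields a covered length of 3, while `[1, 2, 9]` misses entirely, which mirrors the exact-prefix-only constraint imposed by the non-rewindable recurrent state.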