feat(neuron): prefix KV caching across requests #11
Motivation
neuron currently calls clear_kv_cache() before every inference (crates/neuron/src/harness/candle.rs, around line 1393). For chat workloads, where request N+1 = request N's prompt + new turn, this discards cache that could be reused and recomputes the entire prefix on every request.
Reusing the cached prefix saves ~90% of prefill compute on chat workloads, with a proportional TTFT drop. We own the KV cache; the win is "stop deleting it."
This is the highest-ROI lever for token-economy in self-hosted inference and is independent of (and complementary to) gateway-side compression in #10.
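The reuse described above comes down to a longest-common-prefix comparison between the tokens already in the cache and the tokens of the incoming request. The following is a minimal sketch of that comparison, not neuron's implementation: `common_prefix_len`, `PrefillPlan`, and `plan_prefill` are hypothetical names, and token ids are assumed to be `u32`.

```rust
/// Length of the longest shared prefix between the cached token sequence
/// and the tokens of a new request.
pub fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Hypothetical plan: how many tokens are served from the cache and how many
/// still have to go through prefill.
pub struct PrefillPlan {
    pub reused: usize,
    pub to_prefill: usize,
}

pub fn plan_prefill(cached: &[u32], incoming: &[u32]) -> PrefillPlan {
    let shared = common_prefix_len(cached, incoming);
    // Assumption: at least one token is always run through the model so that
    // logits for the next token exist, even when the whole prompt is cached.
    let reused = shared.min(incoming.len().saturating_sub(1));
    PrefillPlan {
        reused,
        to_prefill: incoming.len() - reused,
    }
}
```

With a cached sequence of four tokens and an incoming request that extends it by two, the plan reuses four tokens and prefills two; anything past the first mismatch is recomputed.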
Scope
Per-conversation prefix-cache state in neuron:
Job::Forward dispatch on the device worker; no separate locks needed.
Failure modes / open questions
Phased delivery
Non-goals
References
crates/neuron/src/harness/candle.rs (clear_kv_cache)
crates/neuron/src/harness/device_worker/mod.rs
crates/neuron/src/harness/arch/qwen3_5/full_attn.rs
Elevated by the 2026-06-12 reframing: adding this to the 7 → 8 milestone. For the agent workloads helexa actually serves (conversation N+1 = conversation N + a new turn), prefix reuse is the single biggest TTFT lever. It is bigger than chunked prefill (#23), which helps novel long prompts; the two compose. Also noting: #10 (gateway-side prompt compression) was closed as out of scope in favor of this issue. Prefix caching attacks the same prefill/VRAM cost without semantically altering what the model sees, which keeps the stability contract intact. Measurement lands via #21/#22; the per-device worker thread (CLAUDE.md) remains the concurrency story, as the body already anticipates.
Closing numbers, per the working agreement. Landed across #34 (single-GPU + CPU machinery), #35 (TP fan-out), #36 + #37 (snapshot-boundary fixes from live fleet validation).
Multi-turn TTFT — the agent workload this issue targets
Controlled three-turn conversation on beast (Qwen3.6-27B Q6K TP=2, ~5k-token context, max_tokens=60, temperature=0, journal-verified):
~37× prefill reduction on warm turns; request total 10.13 s → 2.27 s (the remainder is the 60 decode tokens).
script/bench.py (#22 harness, via the gateway)
8f6f1d3)
The bench repeats one prompt per cell, and the snapshot boundary sits before the prompt's volatile tail, so repeat runs hit the cache; the 27B rows are therefore warm-TTFT. The Qwen3 models (candle-transformers archs, no snapshot support) are clean controls: unchanged TTFT, no regression anywhere. Decode tok/s drifted up across all models including the controls, so that variance is environmental, not attributed here.
What shipped vs the issue's phasing
[harness.candle.prefix_cache], on by default, 1024 MiB / 8 entries): the system prompt is just the first cached prefix. On by default rather than header opt-in: vision requests and non-qwen3_5 archs bypass automatically, and bench/probe behaviour stays honest because identical full prompts only reuse up to the snapshot boundary.
<think>) and tool-call tokens are stripped/restructured when clients echo the assistant message, so the sequence never re-matches.
…assistant\n + reply merges differently). Snapshots cut one past the last <|im_start|>: special tokens are hard segmentation points, so that prefix is provably stable across renders.
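The snapshot rule above ("one past the last <|im_start|>") can be sketched as a search from the end of the token sequence. This is an illustration under stated assumptions rather than the shipped code: `snapshot_boundary` is a hypothetical name, token ids are assumed to be `u32`, and the id of <|im_start|> is passed in by the caller instead of being hard-coded.

```rust
/// Index one past the last occurrence of the `<|im_start|>` token, or `None`
/// if the sequence does not contain it. Tokens in `..boundary` form the
/// stable prefix; everything after it is the volatile tail.
pub fn snapshot_boundary(tokens: &[u32], im_start_id: u32) -> Option<usize> {
    tokens
        .iter()
        .rposition(|&t| t == im_start_id)
        .map(|i| i + 1)
}

/// Hypothetical helper: the slice that a snapshot would cover.
pub fn stable_prefix(tokens: &[u32], im_start_id: u32) -> Option<&[u32]> {
    snapshot_boundary(tokens, im_start_id).map(|end| &tokens[..end])
}
```

Two renders of the same conversation that differ only after the final <|im_start|> produce the same `stable_prefix`, which is the property that lets a later request re-match the snapshot even when the tail tokenizes differently.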