feat(neuron): prefix KV caching across requests #11

grenade · 2026-06-01T05:55:57Z

grenade commented

2026-06-01 05:55:57 +00:00

Motivation

neuron currently calls clear_kv_cache() before every inference (crates/neuron/src/harness/candle.rs around line 1393). For chat workloads — where request N+1 = request N's prompt + new turn — this discards cache that could be reused, recomputing the entire prefix on every request.

Reusing the cached prefix saves ~90% of prefill compute on chat workloads, with a proportional TTFT drop. We own the KV cache; the win is "stop deleting it."

This is the highest-ROI lever for token-economy in self-hosted inference and is independent of (and complementary to) gateway-side compression in #10.

Scope

Per-conversation prefix-cache state in neuron:

1. Match the longest cached prefix for an incoming tokenized prompt.
2. Skip prefill on the matched prefix; prefill only the divergent suffix.
3. Eviction: LRU bounded by an explicit per-device VRAM budget. Cache slots live on the device bound to the worker (see CLAUDE.md "Per-device worker thread").
4. Concurrency: reads happen during a Job::Forward dispatch on the device worker; no separate locks needed.
5. System prompt as a special case: cache the system prompt's KV once per model load. It is the universal prefix and gives a measurable win even without per-conversation tracking.
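A minimal sketch of the match-and-evict steps above, under stated assumptions: `PrefixCache`, `Snapshot`, and `kv_bytes` are hypothetical names, the KV tensors are replaced by a byte count, and the linear scan stands in for whatever lookup the real harness would use. It is not the neuron implementation.

```rust
use std::collections::VecDeque;

/// One cached prefix: the token ids it covers and the size of its KV.
/// `kv_bytes` stands in for the real device tensors (hypothetical).
struct Snapshot {
    tokens: Vec<u32>,
    kv_bytes: usize,
}

/// LRU-ordered prefix cache: front = least recently used, back = most recent.
struct PrefixCache {
    budget_bytes: usize,
    used_bytes: usize,
    entries: VecDeque<Snapshot>,
}

impl PrefixCache {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, entries: VecDeque::new() }
    }

    /// Returns how many prompt tokens are covered by the longest cached
    /// snapshot that is a prefix of `prompt`, and marks it most recently used.
    fn match_longest(&mut self, prompt: &[u32]) -> Option<usize> {
        let best = self
            .entries
            .iter()
            .enumerate()
            .filter(|(_, s)| prompt.starts_with(&s.tokens))
            .max_by_key(|(_, s)| s.tokens.len())
            .map(|(i, _)| i)?;
        let hit = self.entries.remove(best)?;
        let len = hit.tokens.len();
        self.entries.push_back(hit);
        Some(len)
    }

    /// Inserts a snapshot, evicting least recently used entries until it fits.
    /// A snapshot larger than the whole budget is not cached at all.
    fn insert(&mut self, tokens: Vec<u32>, kv_bytes: usize) {
        if kv_bytes > self.budget_bytes {
            return;
        }
        while self.used_bytes + kv_bytes > self.budget_bytes {
            match self.entries.pop_front() {
                Some(old) => self.used_bytes -= old.kv_bytes,
                None => break,
            }
        }
        self.used_bytes += kv_bytes;
        self.entries.push_back(Snapshot { tokens, kv_bytes });
    }
}
```

On a hit of `Some(n)`, the caller would restore that snapshot's KV and prefill only `prompt[n..]`; on `None` it falls back to a full prefill. Because the cache is only touched from the device worker, the sketch takes `&mut self` and carries no lock.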

Failure modes / open questions

KV footprint per token is model-specific. Qwen3.5-32B numbers needed for a sensible VRAM budget default.
Cache key must be stable across renders of the same conversation. The tokenizer is deterministic, so canonicalize the prompt (whitespace, role markers, tool-call serialization) before hashing.
Tool-call interleaving in OpenAI-style chats produces non-trivial token sequences. Need a stable serialization.
Eviction granularity: whole-conversation vs token-tree (RadixAttention). Tree is more flexible but much more complex.
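For the VRAM budget question, the per-token footprint can be estimated from the attention shape. The sketch below assumes a plain K+V cache per attention layer; `KvShape` and its fields are hypothetical names, and the numbers in the usage note are placeholders, not Qwen measurements.

```rust
/// Attention shape parameters that determine KV size.
/// All names here are illustrative, not taken from the neuron codebase.
struct KvShape {
    /// Layers that keep a per-token KV cache.
    attn_layers: usize,
    /// Key/value heads per layer (after grouped-query sharing).
    kv_heads: usize,
    head_dim: usize,
    /// 2 for f16/bf16 cache entries.
    bytes_per_elem: usize,
}

impl KvShape {
    /// Bytes of KV cache per token: one K and one V vector per layer per KV head.
    fn bytes_per_token(&self) -> usize {
        2 * self.attn_layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// Number of cached tokens that fit in a given VRAM budget.
    fn tokens_in_budget(&self, budget_bytes: usize) -> usize {
        budget_bytes / self.bytes_per_token()
    }
}
```

With placeholder values of 16 layers, 4 KV heads, a head dimension of 256 and 2-byte elements, this gives 64 KiB per token, so a 1 GiB budget would hold 16384 cached tokens. The real default needs the actual model shape plugged in.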

Phased delivery

Phase 1: cache only the system prompt KV (universal prefix). Trivial win, validates the plumbing.
Phase 2: per-conversation full-prefix cache, LRU eviction, opt-in via header initially.
Phase 3: token-tree (RadixAttention) for branching conversations.

Non-goals

Cross-node KV sharing (different neurons hold different VRAM).
KV compression (orthogonal — can layer on later).

References

Current cache clear site: crates/neuron/src/harness/candle.rs (clear_kv_cache)
Per-device worker discipline: crates/neuron/src/harness/device_worker/mod.rs
ConcatKvCache: crates/neuron/src/harness/arch/qwen3_5/full_attn.rs
SGLang RadixAttention (for Phase 3 reference): https://blog.lmsys.org/2024-01-17-sglang/

grenade referenced this issue

2026-06-01 05:56:16 +00:00

feat(cortex-gateway): Rust-native context compressor for prompt token reduction #10

grenade referenced this issue

2026-06-12 08:57:47 +00:00

feat(cortex-gateway): Rust-native context compressor for prompt token reduction #10

grenade commented

2026-06-12 08:58:14 +00:00

Elevated by the 2026-06-12 reframing: adding this to the 7 → 8 milestone. For the agent workloads helexa actually serves (conversation N+1 = conversation N + a new turn), prefix reuse is the single biggest TTFT lever — bigger than chunked prefill (#23), which helps novel long prompts; the two compose. Also noting: #10 (gateway-side prompt compression) was closed out-of-scope in favor of this issue — prefix caching attacks the same prefill/VRAM cost without semantically altering what the model sees, which keeps the stability contract intact. Measurement lands via #21/#22; the per-device worker thread (CLAUDE.md) remains the concurrency story, as the body already anticipates.

grenade added this to the 7 → 8 milestone 2026-06-12 08:58:14 +00:00

grenade added the p2-next label 2026-06-12 09:01:45 +00:00

grenade referenced this issue

2026-06-12 09:02:07 +00:00

tracking: prioritised path to closing every open issue #27

grenade referenced this issue from a commit

2026-06-12 14:14:12 +00:00

feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)

grenade referenced this issue

2026-06-12 14:14:50 +00:00

feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11) #34

grenade closed this issue

2026-06-12 14:20:25 +00:00

grenade referenced this issue from a commit

2026-06-12 14:20:25 +00:00

Merge pull request 'feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)' (#34) from feat/11-prefix-kv-cache into main

grenade referenced this issue from a commit

2026-06-12 14:34:52 +00:00

feat(neuron): prefix KV caching for the TP path (#11)

grenade referenced a pull request that will close this issue

2026-06-12 14:35:10 +00:00

feat(neuron): prefix KV caching for the TP path (#11) #35

grenade referenced this issue from a commit

2026-06-12 14:49:21 +00:00

Merge pull request 'feat(neuron): prefix KV caching for the TP path (#11)' (#35) from feat/11-prefix-kv-cache-tp into main

grenade referenced this issue from a commit

2026-06-12 15:29:05 +00:00

fix(neuron): snapshot prefix cache at the prefill boundary (#11)

grenade referenced this issue

2026-06-12 15:29:17 +00:00

fix(neuron): snapshot prefix cache at the prefill boundary (#11) #36

grenade referenced this issue from a commit

2026-06-12 15:35:00 +00:00

Merge pull request 'fix(neuron): snapshot prefix cache at the prefill boundary (#11)' (#36) from fix/11-prefix-snapshot-at-prefill into main

grenade referenced this issue from a commit

2026-06-12 16:16:48 +00:00

fix(neuron): snapshot at the last special-token boundary (#11)

grenade referenced this issue

2026-06-12 16:17:01 +00:00

fix(neuron): snapshot at the last special-token boundary (#11) #37

grenade referenced this issue from a commit

2026-06-12 16:24:17 +00:00

Merge pull request 'fix(neuron): snapshot at the last special-token boundary (#11)' (#37) from fix/11-snapshot-cut-retokenization into main

grenade commented

2026-06-12 17:05:53 +00:00

Closing numbers, per the working agreement. Landed across #34 (single-GPU + CPU machinery), #35 (TP fan-out), #36 + #37 (snapshot-boundary fixes from live fleet validation).

Multi-turn TTFT — the agent workload this issue targets

Controlled three-turn conversation on beast (Qwen3.6-27B Q6K TP=2, ~5k-token context, max_tokens=60, temperature=0, journal-verified):

| turn | prompt tok | reused | prefill (ms) | request total (s) |
|---|---:|---:|---:|---:|
| 1 (cold) | 4993 | 0 | 8074 | 10.13 |
| 2 | 5077 | 4989 | 216 | 2.27 |
| 3 | 5159 | 5073 | 215 | 2.27 |

~37× prefill reduction on warm turns; request total 10.13 s → 2.27 s (the remainder is the 60 decode tokens).
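The arithmetic behind those figures, as a small sketch: `Turn`, `suffix_tokens`, and `speedup` are hypothetical helpers, and the inputs are the cold row and the turn 2 row from the table.

```rust
/// One row of the three-turn measurement (fields mirror the table columns).
struct Turn {
    prompt_tok: u32,
    reused: u32,
    prefill_ms: u32,
}

/// Tokens that still had to be prefilled after reusing the cached prefix.
fn suffix_tokens(t: &Turn) -> u32 {
    t.prompt_tok - t.reused
}

/// Ratio of cold prefill time to warm prefill time.
fn speedup(cold: &Turn, warm: &Turn) -> f64 {
    f64::from(cold.prefill_ms) / f64::from(warm.prefill_ms)
}
```

For turn 2 this gives 5077 - 4989 = 88 freshly prefilled tokens and 8074 / 216 ≈ 37.4× less prefill time than the cold turn.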

`script/bench.py` (#22 harness, via the gateway)

| model | prompt tok | TTFT before (`8f6f1d3`) | TTFT after |
|---|---:|---:|---:|
| Qwen/Qwen3.6-27B | ~4096 | 7.067 s | 1.431 s |
| Qwen/Qwen3.6-27B | ~128 | 1.658 s | 1.355 s |
| Qwen/Qwen3-8B (no-cache control) | ~4096 | 1.818 s | 1.824 s |
| Qwen/Qwen3-1.7B (no-cache control) | ~4096 | 2.743 s | 2.749 s |

The bench repeats one prompt per cell, and the snapshot boundary sits before the prompt's volatile tail, so repeat runs hit the cache — the 27B rows are warm-TTFT. The Qwen3 models (candle-transformers archs, no snapshot support) are clean controls: unchanged TTFT, no regression anywhere. Decode tok/s drifted up across all models including the controls, so that variance is environmental, not attributed here.

What shipped vs the issue's phasing

Phase 1+2 merged into one mechanism: longest-strict-prefix matching with LRU over a per-model VRAM budget ([harness.candle.prefix_cache], on by default, 1024 MiB / 8 entries) — the system prompt is just the first cached prefix. On-by-default rather than header-opt-in: vision requests and non-qwen3_5 archs bypass automatically, and bench/probe behaviour stays honest because identical full prompts only reuse up to the snapshot boundary.
Two correctness findings worth recording for Phase 3 (RadixAttention) or any future cache work:
1. Post-generation state is unusable as a key — reasoning (<think>) and tool-call tokens are stripped/restructured when clients echo the assistant message, so the sequence never re-matches.
2. Full-prompt snapshots break on BPE retokenization at the append boundary (…assistant\n + reply merges differently). Snapshots cut one past the last <|im_start|> — special tokens are hard segmentation points, so that prefix is provably stable across renders.
Eviction is whole-snapshot LRU; exact-boundary matching is forced by the GDN recurrent state (no partial rewind), as anticipated in the issue body.
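The snapshot cut from finding 2 can be sketched as follows. This is an illustration only: `snapshot_cut` is a hypothetical name, and the id of `<|im_start|>` is passed in rather than assumed, since it depends on the tokenizer.

```rust
/// Returns the length of the stable prefix: every token up to and including
/// the last occurrence of `im_start_id`. Returns `None` when the marker never
/// appears, in which case there is no safe cut and nothing should be cached.
fn snapshot_cut(tokens: &[u32], im_start_id: u32) -> Option<usize> {
    tokens
        .iter()
        .rposition(|&t| t == im_start_id)
        .map(|i| i + 1)
}
```

The snapshot would then cover `tokens[..cut]`. Everything after the cut (the role name, the newline, and whatever the next render appends) is re-prefilled, so BPE merges across the append boundary cannot invalidate the cached prefix.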

grenade referenced this issue from a commit

2026-06-12 17:51:56 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)

grenade referenced this issue

2026-06-12 17:52:16 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23) #39

grenade referenced this issue

2026-06-12 20:12:12 +00:00

perf(neuron): chunked delta-rule prefill for Gated DeltaNet #23

grenade referenced this issue

2026-06-13 05:26:20 +00:00

perf(neuron): speculative decoding with a small same-family drafter #25