feat(neuron): prefix KV caching across requests #11

Closed
opened 2026-06-01 05:55:57 +00:00 by grenade · 2 comments
Owner

Motivation

neuron currently calls clear_kv_cache() before every inference (crates/neuron/src/harness/candle.rs around line 1393). For chat workloads — where request N+1 = request N's prompt + new turn — this discards cache that could be reused, recomputing the entire prefix on every request.

Reusing the cached prefix saves ~90% of prefill compute on chat workloads, with a proportional TTFT drop. We own the KV cache; the win is "stop deleting it."

This is the highest-ROI lever for token-economy in self-hosted inference and is independent of (and complementary to) gateway-side compression in #10.

Scope

Per-conversation prefix-cache state in neuron:

  1. Match the longest cached prefix for an incoming tokenized prompt.
  2. Skip prefill on the matched prefix; prefill only the divergent suffix.
  3. Eviction: LRU bounded by an explicit per-device VRAM budget. Cache slots live on the device bound to the worker (see CLAUDE.md "Per-device worker thread").
  4. Concurrency: reads happen during a Job::Forward dispatch on the device worker; no separate locks needed.
  5. System prompt as a special case: cache the system prompt's KV once per model load — it's the universal prefix and gives a measurable win even without per-conversation tracking.
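A minimal sketch of items 1 to 3 above, assuming whole-snapshot matching (a cached entry counts only if all of its tokens are a strict prefix of the incoming prompt) and a byte budget standing in for VRAM. `PrefixCache`, `Snapshot` and `kv_bytes` are hypothetical names for illustration, not types from the neuron crate; a real entry would hold device tensors rather than a byte count.

```rust
use std::collections::VecDeque;

/// One cached prefix: the token ids it covers and the bytes its KV state occupies.
/// `kv_bytes` is a stand-in for the real device tensors (hypothetical).
struct Snapshot {
    tokens: Vec<u32>,
    kv_bytes: usize,
}

/// LRU prefix cache bounded by a byte budget. Front = least recently used.
struct PrefixCache {
    budget_bytes: usize,
    used_bytes: usize,
    entries: VecDeque<Snapshot>,
}

impl PrefixCache {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, entries: VecDeque::new() }
    }

    /// Returns the length of the longest snapshot that is a strict prefix of
    /// `prompt` (0 on a miss) and marks that snapshot most recently used.
    /// Strict, so at least one token is always left to prefill.
    fn match_prefix(&mut self, prompt: &[u32]) -> usize {
        let best = self
            .entries
            .iter()
            .enumerate()
            .filter(|(_, s)| s.tokens.len() < prompt.len() && prompt.starts_with(&s.tokens))
            .max_by_key(|(_, s)| s.tokens.len())
            .map(|(i, _)| i);
        match best {
            Some(i) => {
                let hit = self.entries.remove(i).expect("index came from enumerate");
                let len = hit.tokens.len();
                self.entries.push_back(hit);
                len
            }
            None => 0,
        }
    }

    /// Inserts a snapshot, evicting least recently used entries until it fits.
    /// A snapshot larger than the whole budget is dropped.
    fn insert(&mut self, tokens: Vec<u32>, kv_bytes: usize) {
        if kv_bytes > self.budget_bytes {
            return;
        }
        while self.used_bytes + kv_bytes > self.budget_bytes {
            match self.entries.pop_front() {
                Some(old) => self.used_bytes -= old.kv_bytes,
                None => break,
            }
        }
        self.used_bytes += kv_bytes;
        self.entries.push_back(Snapshot { tokens, kv_bytes });
    }
}
```

The caller would prefill only `prompt[matched..]`, where `matched` is the value returned by `match_prefix`. Because the structure is only touched from the device worker, the sketch takes `&mut self` and carries no lock.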

Failure modes / open questions

  • KV footprint per token is model-specific. Qwen3.5-32B numbers needed for a sensible VRAM budget default.
  • The cache key must be stable across renders of the same conversation: canonicalize the prompt (whitespace, role markers, tool-call serialization) before hashing, so that identical content always produces the same key.
  • Tool-call interleaving in OpenAI-style chats produces non-trivial token sequences. Need a stable serialization.
  • Eviction granularity: whole-conversation vs token-tree (RadixAttention). Tree is more flexible but much more complex.
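For the VRAM budget question, the usual per-token estimate is 2 (K and V) × layers that keep a KV cache × KV heads × head dimension × bytes per element. The sketch below only encodes that arithmetic; `KvShape` and its fields are hypothetical names, and the real shape values must come from the model config rather than from this example.

```rust
/// Shape parameters that determine KV size. All values are supplied by the
/// caller; nothing here is read from a real model config (hypothetical type).
struct KvShape {
    /// Number of layers that keep a growing KV cache.
    kv_layers: usize,
    /// Number of key/value heads per layer.
    kv_heads: usize,
    /// Dimension of each head.
    head_dim: usize,
    /// Bytes per stored element, e.g. 2 for f16/bf16.
    bytes_per_elem: usize,
}

impl KvShape {
    /// Bytes of KV state added per token: one K and one V vector per head per layer.
    fn bytes_per_token(&self) -> usize {
        2 * self.kv_layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// How many tokens of KV state fit in `budget_bytes` (0 if the shape is degenerate).
    fn tokens_in_budget(&self, budget_bytes: usize) -> usize {
        match self.bytes_per_token() {
            0 => 0,
            per_token => budget_bytes / per_token,
        }
    }
}
```

With purely illustrative values (64 KV layers, 8 KV heads, head dimension 128, 2-byte elements) this gives 256 KiB per token, so a 1024 MiB budget would hold 4096 tokens. These are not Qwen3.5-32B's numbers, which still need to be measured.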

Phased delivery

  • Phase 1: cache only the system prompt KV (universal prefix). Trivial win, validates the plumbing.
  • Phase 2: per-conversation full-prefix cache, LRU eviction, opt-in via header initially.
  • Phase 3: token-tree (RadixAttention) for branching conversations.

Non-goals

  • Cross-node KV sharing (different neurons hold different VRAM).
  • KV compression (orthogonal — can layer on later).

References

  • Current cache clear site: crates/neuron/src/harness/candle.rs (clear_kv_cache)
  • Per-device worker discipline: crates/neuron/src/harness/device_worker/mod.rs
  • ConcatKvCache: crates/neuron/src/harness/arch/qwen3_5/full_attn.rs
  • SGLang RadixAttention (for Phase 3 reference): https://blog.lmsys.org/2024-01-17-sglang/
Author
Owner

Elevated by the 2026-06-12 reframing: adding this to the 7 → 8 milestone. For the agent workloads helexa actually serves (conversation N+1 = conversation N + a new turn), prefix reuse is the single biggest TTFT lever — bigger than chunked prefill (#23), which helps novel long prompts; the two compose. Also noting: #10 (gateway-side prompt compression) was closed out-of-scope in favor of this issue — prefix caching attacks the same prefill/VRAM cost without semantically altering what the model sees, which keeps the stability contract intact. Measurement lands via #21/#22; the per-device worker thread (CLAUDE.md) remains the concurrency story, as the body already anticipates.

grenade added this to the 7 → 8 milestone 2026-06-12 08:58:14 +00:00
grenade added the p2-next label 2026-06-12 09:01:45 +00:00
Author
Owner

Closing numbers, per the working agreement. Landed across #34 (single-GPU + CPU machinery), #35 (TP fan-out), #36 + #37 (snapshot-boundary fixes from live fleet validation).

Multi-turn TTFT — the agent workload this issue targets

Controlled three-turn conversation on beast (Qwen3.6-27B Q6K TP=2, ~5k-token context, max_tokens=60, temperature=0, journal-verified):

| turn | prompt tok | reused | prefill (ms) | request total (s) |
|---|---:|---:|---:|---:|
| 1 (cold) | 4993 | 0 | 8074 | 10.13 |
| 2 | 5077 | 4989 | 216 | 2.27 |
| 3 | 5159 | 5073 | 215 | 2.27 |
~37× prefill reduction on warm turns; request total 10.13 s → 2.27 s (the remainder is the 60 decode tokens).

script/bench.py (#22 harness, via the gateway)

| model | prompt tok | TTFT before (8f6f1d3) | TTFT after |
|---|---:|---:|---:|
| Qwen/Qwen3.6-27B | ~4096 | 7.067 s | 1.431 s |
| Qwen/Qwen3.6-27B | ~128 | 1.658 s | 1.355 s |
| Qwen/Qwen3-8B (no-cache control) | ~4096 | 1.818 s | 1.824 s |
| Qwen/Qwen3-1.7B (no-cache control) | ~4096 | 2.743 s | 2.749 s |

The bench repeats one prompt per cell, and the snapshot boundary sits before the prompt's volatile tail, so repeat runs hit the cache — the 27B rows are warm-TTFT. The Qwen3 models (candle-transformers archs, no snapshot support) are clean controls: unchanged TTFT, no regression anywhere. Decode tok/s drifted up across all models including the controls, so that variance is environmental, not attributed here.

What shipped vs the issue's phasing

  • Phase 1+2 merged into one mechanism: longest-strict-prefix matching with LRU over a per-model VRAM budget ([harness.candle.prefix_cache], on by default, 1024 MiB / 8 entries) — the system prompt is just the first cached prefix. On-by-default rather than header-opt-in: vision requests and non-qwen3_5 archs bypass automatically, and bench/probe behaviour stays honest because identical full prompts only reuse up to the snapshot boundary.
  • Two correctness findings worth recording for Phase 3 (RadixAttention) or any future cache work:
    1. Post-generation state is unusable as a key — reasoning (<think>) and tool-call tokens are stripped/restructured when clients echo the assistant message, so the sequence never re-matches.
    2. Full-prompt snapshots break on BPE retokenization at the append boundary (…assistant\n + reply merges differently). Snapshots cut one past the last <|im_start|> — special tokens are hard segmentation points, so that prefix is provably stable across renders.
  • Eviction is whole-snapshot LRU; exact-boundary matching is forced by the GDN recurrent state (no partial rewind), as anticipated in the issue body.
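The snapshot-boundary rule from finding 2 (cut one past the last `<|im_start|>`) can be sketched as follows. This is an illustration only: `snapshot_boundary` is a hypothetical helper, and the token id of `<|im_start|>` is passed in by the caller because it depends on the tokenizer.

```rust
/// Returns the number of leading tokens to snapshot: everything up to and
/// including the last occurrence of `im_start_id`, or `None` if the marker
/// never appears (in which case there is no stable boundary to cut at).
fn snapshot_boundary(tokens: &[u32], im_start_id: u32) -> Option<usize> {
    tokens
        .iter()
        .rposition(|&t| t == im_start_id)
        .map(|last| last + 1)
}

/// Splits a prompt into the stable prefix (worth caching) and the volatile
/// tail (always prefilled). With no marker, the whole prompt is the tail.
fn split_at_boundary(tokens: &[u32], im_start_id: u32) -> (&[u32], &[u32]) {
    let cut = snapshot_boundary(tokens, im_start_id).unwrap_or(0);
    tokens.split_at(cut)
}
```

Cutting at a special token means the cached prefix never ends inside a run of ordinary text, so appending the reply on the next render cannot change how the prefix tokenizes.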
Reference: helexa/helexa#11