helexa/helexa

rob thijssen c5378d532d

feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)

Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.
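The matching rule above can be sketched roughly as follows. This is an
illustration only, not the code in this commit: `SnapshotMeta`,
`longest_prefix_match` and `plan_prefill` are hypothetical names, and
choosing the longest matching snapshot is an assumption.

```rust
/// Hypothetical async-side record: an opaque id plus the exact token
/// sequence the snapshot was taken at.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotMeta {
    pub id: u64,
    pub tokens: Vec<u32>,
}

/// Returns the longest stored snapshot whose tokens are an exact prefix
/// of `prompt`, if any. Partial overlaps never match.
pub fn longest_prefix_match<'a>(
    snapshots: &'a [SnapshotMeta],
    prompt: &[u32],
) -> Option<&'a SnapshotMeta> {
    snapshots
        .iter()
        .filter(|s| !s.tokens.is_empty() && prompt.starts_with(&s.tokens))
        .max_by_key(|s| s.tokens.len())
}

/// Splits a prompt into (snapshot to restore, suffix still to prefill).
/// With no match the whole prompt is prefilled from a cleared cache.
pub fn plan_prefill<'p>(
    snapshots: &[SnapshotMeta],
    prompt: &'p [u32],
) -> (Option<u64>, &'p [u32]) {
    match longest_prefix_match(snapshots, prompt) {
        Some(s) => (Some(s.id), &prompt[s.tokens.len()..]),
        None => (None, prompt),
    }
}
```

If a snapshot covers the entire prompt, the suffix comes back empty; a
real implementation would have to decide how to obtain the next-token
logits in that case, which this sketch leaves open.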

For the hybrid qwen3_5 arch, a snapshot is the attention ConcatKvCache
k/v + GatedDeltaNet (GDN) conv/recurrent state + the rope_delta
counter, all captured at one token boundary. The recurrent state
cannot rewind, so matching is exact-prefix only. GDN states are
deep-copied in both directions, on snapshot and on restore (the CUDA
delta-rule kernels mutate the state buffer in place); attention k/v
snapshots share storage safely (append-by-cat never mutates).
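A toy model of that copy discipline, using plain vectors instead of
device tensors. All types here (`KvBuf`, `HybridCache`, `Snapshot`) are
hypothetical stand-ins, and the recurrent update is an arbitrary
placeholder, not the delta rule.

```rust
use std::sync::Arc;

/// Stand-in for an attention k/v tensor. Appending builds a new buffer
/// (like a concat), so older handles never see a mutation.
#[derive(Clone)]
pub struct KvBuf(Arc<Vec<f32>>);

impl KvBuf {
    pub fn new(v: Vec<f32>) -> Self {
        KvBuf(Arc::new(v))
    }
    pub fn append(&self, extra: &[f32]) -> Self {
        let mut v = Vec::with_capacity(self.0.len() + extra.len());
        v.extend_from_slice(&self.0);
        v.extend_from_slice(extra);
        KvBuf(Arc::new(v))
    }
    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// Toy hybrid cache: shared k/v, in-place recurrent state, rope counter.
pub struct HybridCache {
    pub kv: KvBuf,
    pub recurrent: Vec<f32>,
    pub rope_delta: i64,
}

/// Everything captured at one token boundary.
pub struct Snapshot {
    pub kv: KvBuf,
    pub recurrent: Vec<f32>,
    pub rope_delta: i64,
}

impl HybridCache {
    /// Processes one "token": k/v grows by concat, recurrent state is
    /// overwritten in place (placeholder arithmetic).
    pub fn step(&mut self, x: f32) {
        self.kv = self.kv.append(&[x]);
        for s in self.recurrent.iter_mut() {
            *s = 0.5 * *s + x;
        }
        self.rope_delta += 1;
    }

    /// k/v is shared (cheap handle clone); recurrent state is deep-copied.
    pub fn snapshot(&self) -> Snapshot {
        Snapshot {
            kv: self.kv.clone(),
            recurrent: self.recurrent.clone(),
            rope_delta: self.rope_delta,
        }
    }

    /// Deep-copies the recurrent state again so that later in-place
    /// updates cannot corrupt the stored snapshot.
    pub fn restore(&mut self, snap: &Snapshot) {
        self.kv = snap.kv.clone();
        self.recurrent = snap.recurrent.clone();
        self.rope_delta = snap.rope_delta;
    }
}
```

Sharing the recurrent buffer instead of copying it would let the next
decode step overwrite the snapshot, which is why the copy happens on
both snapshot and restore.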

Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.
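One way to picture the ownership split described above. The variant
names come from this message; the fields, `WorkerState` and its
methods are assumptions made for illustration, and byte vectors stand
in for device-resident cache state.

```rust
use std::collections::HashMap;

pub type SnapshotId = u64;

/// Jobs sent from the async side to the device worker. Variant names
/// follow the commit message; the fields are hypothetical.
pub enum Job {
    SnapshotKv { model: String, id: SnapshotId },
    RestoreKv { model: String, id: SnapshotId },
    DropKvSnapshot { model: String, id: SnapshotId },
    DropArch { model: String },
}

/// Worker-side state: live cache and snapshots, both keyed by model.
#[derive(Default)]
pub struct WorkerState {
    live: HashMap<String, Vec<u8>>,
    snapshots: HashMap<String, HashMap<SnapshotId, Vec<u8>>>,
}

impl WorkerState {
    pub fn load(&mut self, model: &str) {
        self.live.insert(model.to_string(), Vec::new());
    }

    pub fn live_mut(&mut self, model: &str) -> Option<&mut Vec<u8>> {
        self.live.get_mut(model)
    }

    pub fn snapshot_count(&self, model: &str) -> usize {
        self.snapshots.get(model).map_or(0, |m| m.len())
    }

    pub fn handle(&mut self, job: Job) -> Result<(), String> {
        match job {
            Job::SnapshotKv { model, id } => {
                let state = self.live.get(&model).ok_or("model not loaded")?.clone();
                self.snapshots.entry(model).or_default().insert(id, state);
            }
            Job::RestoreKv { model, id } => {
                let snap = self
                    .snapshots
                    .get(&model)
                    .and_then(|m| m.get(&id))
                    .ok_or("unknown snapshot")?
                    .clone();
                self.live.insert(model, snap);
            }
            Job::DropKvSnapshot { model, id } => {
                if let Some(m) = self.snapshots.get_mut(&model) {
                    m.remove(&id);
                }
            }
            // Dropping the model takes its snapshots with it, so a stale
            // id held by the async side simply fails to restore.
            Job::DropArch { model } => {
                self.live.remove(&model);
                self.snapshots.remove(&model);
            }
        }
        Ok(())
    }
}
```

Because the async side only ever holds ids, it never needs to touch
device memory to invalidate anything.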

Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).
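A minimal sketch of such a registry, assuming a simple
vector-as-LRU-list. `Entry`, `PrefixCache` and the eviction order are
hypothetical; only the two bounds and the "drop what the new snapshot
strictly extends" rule are taken from this message.

```rust
/// Async-side view of one snapshot (hypothetical shape).
pub struct Entry {
    pub id: u64,
    pub tokens: Vec<u32>,
    pub bytes: usize,
}

/// Per-model registry; `entries` is ordered least recently used first.
pub struct PrefixCache {
    budget_bytes: usize,
    max_entries: usize,
    used_bytes: usize,
    entries: Vec<Entry>,
}

impl PrefixCache {
    pub fn new(budget_mb: usize, max_entries: usize) -> Self {
        PrefixCache {
            budget_bytes: budget_mb * 1024 * 1024,
            max_entries,
            used_bytes: 0,
            entries: Vec::new(),
        }
    }

    /// Inserts `e` and returns the ids that were dropped, so the caller
    /// can ask the worker to free the matching device-side snapshots.
    pub fn insert(&mut self, e: Entry) -> Vec<u64> {
        let mut dropped = Vec::new();
        let mut kept = Vec::new();
        // 1. Drop stored entries that the new snapshot strictly extends.
        for old in std::mem::take(&mut self.entries) {
            let superseded =
                old.tokens.len() < e.tokens.len() && e.tokens.starts_with(&old.tokens);
            if superseded {
                self.used_bytes -= old.bytes;
                dropped.push(old.id);
            } else {
                kept.push(old);
            }
        }
        self.entries = kept;
        self.used_bytes += e.bytes;
        self.entries.push(e);
        // 2. Evict from the LRU end until both bounds hold. Keeping a
        //    lone oversized entry is a choice made for this sketch.
        while self.entries.len() > 1
            && (self.entries.len() > self.max_entries || self.used_bytes > self.budget_bytes)
        {
            let victim = self.entries.remove(0);
            self.used_bytes -= victim.bytes;
            dropped.push(victim.id);
        }
        dropped
    }

    /// Finds the longest exact-prefix hit and marks it most recently used.
    pub fn lookup(&mut self, prompt: &[u32]) -> Option<u64> {
        let idx = self
            .entries
            .iter()
            .enumerate()
            .filter(|(_, e)| prompt.starts_with(&e.tokens))
            .max_by_key(|(_, e)| e.tokens.len())
            .map(|(i, _)| i)?;
        let hit = self.entries.remove(idx);
        let id = hit.id;
        self.entries.push(hit);
        Some(id)
    }
}
```

Dropping superseded entries keeps a growing conversation from filling
the cache with one snapshot per turn: each new turn's snapshot replaces
the shorter one it was built from.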

Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 17:14:07 +03:00

cortex-cli

feat(neuron): OpenAI-compatible non-streaming chat completion

2026-05-18 16:47:58 +03:00

cortex-core

feat(gateway): Anthropic streaming SSE translation (#24)

2026-06-12 15:47:30 +03:00

cortex-gateway

feat(gateway): Anthropic streaming SSE translation (#24)

2026-06-12 15:47:30 +03:00

helexa-acp

chore: rename repo cortex -> helexa

2026-06-12 10:54:01 +03:00

neuron

feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)

2026-06-12 17:14:07 +03:00