Commit Graph

267 Commits

Author SHA1 Message Date
49a8dbcd28 Merge pull request 'perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)' (#40) from perf/1-parallel-isq into main
Some checks are pending
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m21s
build-prerelease / Test (push) Successful in 3m56s
build-prerelease / Build neuron-blackwell (push) Successful in 9m38s
build-prerelease / Build neuron-ada (push) Successful in 14m21s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m29s
2026-06-12 20:12:44 +00:00
90e971dcf5 perf(neuron): parallel in-situ quantization + cold-load phase timing (#1)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 35s
CI / CUDA type-check (push) Successful in 1m50s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (pull_request) Successful in 2m18s
CI / Clippy (push) Successful in 2m46s
CI / Test (push) Successful in 5m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 5m33s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
QTensor::quantize runs its per-block math strictly sequentially on
one core (CUDA storage round-trips through the same CPU path), which
made Q6K ISQ the dominant phase of the 27B TP cold load. Blocks are
independent, so quantize_parallel re-implements the same encoding
through candle's public per-block API (k_quants::GgmlType::from_float)
with rayon fanning blocks across the CPU pool — byte-identical output,
pinned by parity tests against QTensor::quantize for Q6K/Q5K/Q4K/Q8_0.

Threading discipline holds: the device-to-host read and the
QStorage::from_data upload stay on the calling thread (device worker /
subprocess main); rayon workers touch host memory only.

Also adds the per-phase timing the issue asked for first: per-layer
debug + layer-loop total + lm_head info lines, so the next cold load
shows where the time actually goes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:47:57 +03:00
92273eb936 chore(ci): retrigger build-prerelease — ampere/blackwell packaging skipped after transient build failure on 128b381
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m10s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m54s
build-prerelease / Build neuron-blackwell (push) Successful in 9m36s
build-prerelease / Build neuron-ada (push) Successful in 14m6s
build-prerelease / Build neuron-ampere (push) Successful in 19m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m8s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 22:38:31 +03:00
128b3818cb Merge pull request 'perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)' (#39) from perf/23-chunked-gdn-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m57s
build-prerelease / Build neuron-blackwell (push) Successful in 10m3s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 14m3s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
2026-06-12 18:44:22 +00:00
812d191e50 fix(neuron): UT transform by forward substitution, not nilpotent squaring
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 53s
CI / CUDA type-check (push) Successful in 1m52s
CI / CUDA type-check (pull_request) Successful in 2m12s
CI / Clippy (push) Successful in 2m18s
CI / Clippy (pull_request) Successful in 2m36s
CI / Test (push) Successful in 4m18s
CI / Test (pull_request) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Live A/B on beast produced NaN logits ("!!!" replies) on real prompts:
the nilpotent-squaring form of (I - T)^-1 computes raw powers of T,
whose entries grow combinatorially (path counts ~ C(62,31)) before
nilpotency collapses them — fine on uncorrelated test data, f32
precision death on real prompts whose repetitive text makes keys
highly correlated. The reference's forward-substitution loop never
forms raw powers; its intermediates are the convergent M entries.

Port the reference loop faithfully (rows accumulate into a fresh
tensor). New adversarial parity test with near-identical keys and
beta ~= 1 diverges to 8e30 under the squaring form and passes under
forward substitution.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:18:32 +03:00
2a9def6d2d perf(neuron): chunked delta-rule prefill for Gated DeltaNet (#23)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 24s
CI / CUDA type-check (push) Successful in 1m38s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m29s
CI / Test (pull_request) Successful in 4m21s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Prefill (seq_len >= 64) now runs the chunk-parallel gated delta rule
ported from the HF reference torch_chunk_gated_delta_rule
(chunk_size=64): identical math reorganised into per-chunk batched
matmuls (cuBLAS/tensor cores on CUDA, gemm on CPU) instead of the
O(L)-sequential per-token recurrence. Decode steps and short prompts
keep the recurrent paths (CUDA kernel / Rust loop) unchanged.

One deliberate deviation from the reference: its in-place row-by-row
UT-transform computes (I - T)^-1 - I by forward substitution; T is
strictly lower triangular and therefore nilpotent at chunk size 64,
so the same inverse is the product of six squarings
prod_{j=0..5}(I + T^(2^j)) — batched matmuls instead of 63 sequential
row updates, which suits candle's immutable tensors. Chunk-local math
runs rank-3 over a flattened B*H*N batch dim (candle matmul supports
at most two batch dims).

Initial-state continuation is supported, so chunked prefill composes
with #11's restored prefix snapshots. Both single-GPU and TP paths
pick this up through the shared run_delta_rule dispatch.
NEURON_GDN_CHUNKED=0 forces the recurrent paths for A/B measurement.

Parity tests pin chunked against recurrent (2e-4 abs) across padding
(L=130), exact multiples with non-zero initial state (L=128 after a
50-token prefix), and a single exact chunk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:51:51 +03:00
ddb331e1a3 Merge pull request 'docs(bench): record post-#11 fleet numbers' (#38) from docs/benchmarks-post-11 into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Build neuron-blackwell (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Has been skipped
build-prerelease / Build neuron-ada (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m47s
build-prerelease / Test (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
2026-06-12 17:14:00 +00:00
df0bf4c518 docs(bench): record post-#11 fleet numbers
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 1m31s
CI / CUDA type-check (pull_request) Successful in 2m7s
CI / Clippy (push) Successful in 2m24s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m24s
CI / Test (pull_request) Successful in 3m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Appends the 2026-06-12 post-prefix-cache run: 27B @4k warm TTFT
7.07 s -> 1.43 s, no-cache control models unchanged, with a
methodology note that repeated-prompt cells now measure warm TTFT on
qwen3_5-arch models.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:06:53 +03:00
a1952a4522 Merge pull request 'fix(neuron): snapshot at the last special-token boundary (#11)' (#37) from fix/11-snapshot-cut-retokenization into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m37s
build-prerelease / Test (push) Successful in 4m21s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 10m22s
build-prerelease / Build neuron-ampere (push) Successful in 13m8s
build-prerelease / Build neuron-ada (push) Successful in 21m31s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 16:24:15 +00:00
4f266dbd82 fix(neuron): snapshot at the last special-token boundary (#11)
All checks were successful
CI / Format (push) Successful in 42s
CI / Format (pull_request) Successful in 34s
CI / CUDA type-check (push) Successful in 1m31s
CI / Clippy (push) Successful in 2m19s
CI / CUDA type-check (pull_request) Successful in 2m10s
CI / Test (push) Successful in 4m13s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m9s
CI / Test (pull_request) Successful in 4m5s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Second finding from live 27B validation: prompt-covering snapshots
still never matched. The rendered prompt ends with
`<|im_start|>assistant\n`, and when the next turn re-tokenizes that
text followed by the assistant's reply, BPE merges the trailing
newline with the reply's first characters — the final token(s) of the
cached sequence differ from the next prompt's, so the exact-prefix
match never fires. (A reply starting with an atomic special token
like <think> masks this, which is why the 0.8B check passed.)

Snapshot one past the last <|im_start|> instead: special tokens are
hard segmentation points, so ids up to and including it are provably
identical across renders. Prefill pauses at that boundary to capture
the snapshot, then finishes the ~2-token `assistant\n` tail. Applied
to all six request paths; unit tests for the cut helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:16:45 +03:00
43a6d96d5f Merge pull request 'fix(neuron): snapshot prefix cache at the prefill boundary (#11)' (#36) from fix/11-prefix-snapshot-at-prefill into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 36s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m16s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 4m2s
build-prerelease / Build neuron-ampere (push) Successful in 13m22s
build-prerelease / Build neuron-blackwell (push) Successful in 13m31s
build-prerelease / Build neuron-ada (push) Successful in 14m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m12s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m50s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
2026-06-12 15:34:59 +00:00
3fd1989b2b fix(neuron): snapshot prefix cache at the prefill boundary (#11)
All checks were successful
CI / Format (push) Successful in 41s
CI / Format (pull_request) Successful in 42s
CI / CUDA type-check (push) Successful in 1m39s
CI / CUDA type-check (pull_request) Successful in 2m6s
CI / Clippy (push) Successful in 3m10s
CI / Clippy (pull_request) Successful in 3m3s
CI / Test (pull_request) Successful in 4m2s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Live validation on beast's Qwen3.6-27B showed reused=0 on every turn:
the post-generation snapshot includes reasoning tokens (<think>...)
that get stripped when the client echoes the assistant message back,
so the cached sequence is never a token-prefix of the next prompt.
quadbrat's 0.8B only matched because its think block round-tripped as
literal text.

Snapshot after prefill instead (covering exactly the prompt tokens) —
that is the state the next turn provably extends under a stable chat
template, regardless of how reasoning or tool-call content is
transformed on echo. Taken after the first healthy sample so
NaN-poisoned prefills never cache their state; this also retires the
forwarded-token bookkeeping and the consumer-hangup store sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 18:29:00 +03:00
f7952547e7 Merge pull request 'feat(neuron): prefix KV caching for the TP path (#11)' (#35) from feat/11-prefix-kv-cache-tp into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m12s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Test (push) Successful in 3m58s
build-prerelease / Build neuron-blackwell (push) Successful in 9m5s
build-prerelease / Build neuron-ada (push) Successful in 14m22s
build-prerelease / Build neuron-ampere (push) Successful in 19m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m51s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
2026-06-12 14:49:19 +00:00
7e66f77851 fix(neuron): CUDA type-check fixes for TP prefix cache
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 39s
CI / CUDA type-check (pull_request) Successful in 1m26s
CI / CUDA type-check (push) Successful in 1m34s
CI / Clippy (push) Successful in 3m14s
CI / Clippy (pull_request) Successful in 3m18s
CI / Test (push) Successful in 5m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Two errors only the cuda config surfaces: the TpSnapshotKv dispatch
arms mixed candle and anyhow error types, and restore_or_clear_tp held
the registry MutexGuard across the cleanup await inside a let-chain
(making the TP request futures non-Send). Bind the removed ref before
awaiting, same discipline as the other lock sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:39:32 +03:00
e629e1872c feat(neuron): prefix KV caching for the TP path (#11)
Some checks failed
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 31s
CI / CUDA type-check (push) Failing after 1m55s
CI / CUDA type-check (pull_request) Failing after 1m47s
CI / Clippy (push) Successful in 2m11s
CI / Test (push) Successful in 4m15s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m0s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Extends the prefix cache to tensor-parallel models — Qwen3.6-27B on
beast, where the TTFT win is largest. Closes #11.

Every rank holds its shard's snapshot under one pool-minted id: the
leader's lives in the device worker beside the TP slab
(Job::TpSnapshotKv / TpRestoreKv / TpDropKvSnapshot), each subprocess
rank stores its own in-process via new WorkerRequest variants
(SnapshotKvCache / RestoreKvCache / DropKvSnapshot). Shard state has
the same shape as single-GPU (attention ConcatKvCache + GDN
conv/recurrent state + rope_delta), so the snapshot types are reused;
all ranks sit at the same token boundary because step fan-out is
synchronous.

Consistency on partial failure: a failed restore falls back to
clear-all-ranks + full prefill (and drops the entry); a failed
snapshot drops the id on every rank so nothing half-stored leaks.
DropTp / UnloadModel invalidate a model's snapshots with it, covering
auto-recovery. Vision requests bypass as on single-GPU. Budget
accounting uses leader bytes x world_size (shards are symmetric).

Wired into both TP request paths (non-streaming inner + streaming
orchestration task); chunked_prefill_tp gains the restored-offset
start.

Closes #11

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:34:49 +03:00
bb558451db Merge pull request 'feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)' (#34) from feat/11-prefix-kv-cache into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Build cortex binary (push) Has been skipped
build-prerelease / Package cortex RPM (push) Has been skipped
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m15s
build-prerelease / Test (push) Successful in 4m0s
build-prerelease / Build neuron-blackwell (push) Successful in 9m44s
build-prerelease / Build neuron-ampere (push) Successful in 12m47s
build-prerelease / Build neuron-ada (push) Successful in 19m6s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m2s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m8s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
2026-06-12 14:20:24 +00:00
c5378d532d feat(neuron): prefix KV caching across requests — single-GPU + CPU paths (#11)
All checks were successful
CI / Format (push) Successful in 32s
CI / Format (pull_request) Successful in 34s
CI / Clippy (push) Successful in 2m29s
CI / CUDA type-check (pull_request) Successful in 1m31s
CI / CUDA type-check (push) Successful in 1m37s
CI / Clippy (pull_request) Successful in 2m32s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
Stop discarding cache state between requests. When an incoming
prompt's token sequence starts with the exact tokens of a stored
snapshot, restore it and prefill only the divergent suffix.

For the hybrid qwen3_5 arch a snapshot is attention ConcatKvCache k/v
+ GatedDeltaNet conv/recurrent state + the rope_delta counter, all at
one token boundary; the recurrent state cannot rewind, so matching is
exact-prefix only. GDN states are deep-copied both directions (the
CUDA delta-rule kernels mutate the state buffer in place); attention
k/v snapshots share storage safely (append-by-cat never mutates).

Snapshots live in the device worker's state next to the model slab
(Job::SnapshotKv / RestoreKv / DropKvSnapshot); the async side holds
only an opaque id + token sequence + byte size. DropArch drops a
model's snapshots with it, so unload and auto-recovery invalidate for
free. CPU loads hold snapshots inline on the legacy path.

Per-model LRU registry (harness/prefix_cache.rs) bounded by
[harness.candle.prefix_cache] budget_mb / max_entries, enabled by
default; inserting a snapshot drops entries it strictly extends.
Vision requests and candle-transformers archs bypass the cache
entirely (clear-every-request, unchanged).

Covers the single-GPU worker path (streaming + non-streaming) and the
CPU-local path. The TP path (Qwen3.6-27B on beast) is a follow-up PR
that closes #11 with before/after bench numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:14:07 +03:00
9f383e7bc7 Merge pull request 'feat(gateway): Anthropic streaming SSE translation (#24)' (#33) from feat/gateway-24-anthropic-sse into main
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m13s
build-prerelease / Test (push) Successful in 3m59s
build-prerelease / Build cortex binary (push) Successful in 2m15s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m24s
build-prerelease / Build neuron-ampere (push) Successful in 19m3s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m59s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
2026-06-12 12:57:09 +00:00
569c528c4b feat(gateway): Anthropic streaming SSE translation (#24)
All checks were successful
CI / Format (push) Successful in 36s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m25s
CI / Format (pull_request) Successful in 41s
CI / CUDA type-check (pull_request) Successful in 2m9s
CI / Clippy (pull_request) Successful in 2m45s
CI / Test (push) Successful in 5m3s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m29s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The /v1/messages handler translated request envelopes but proxied raw
OpenAI SSE frames back to streaming Anthropic clients — the gap
between the README's "point your tooling at it once" contract and
what Claude Code actually received.

cortex-core gains AnthropicStreamTranslator, a pure per-stream state
machine: OpenAI chunks in, ordered (event, payload) pairs out —
message_start → content_block_start/delta/stop (text and tool_use
blocks, indexed; tool_calls map to input_json_delta) → message_delta
(stop_reason mapped via the now-shared map_stop_reason, which also
teaches the non-streaming path tool_calls→tool_use) → message_stop.
Without an upstream usage frame the output count falls back to the
delta count (engine-exact for neuron's one-chunk-per-token streams,
#31); with one, input/output tokens ride message_delta.

cortex-gateway gains anthropic_sse: the wire pump that splits the
upstream byte stream into SSE events, parses data: payloads
(leniently — engines omit fields on special frames), feeds the
translator, and frames results as `event:`/`data:` pairs through a
bounded channel (slow client back-pressures the upstream read).
Upstream truncation without [DONE] still closes the Anthropic event
sequence. Nothing is buffered beyond the current event's bytes.

Tests: 5 state-machine unit tests (text flow, stop-reason mapping +
defaults, tool_use blocks, usage propagation, idempotent finish) and
2 gateway integration tests (full event sequence + text reassembly,
usage propagation into message_delta). Validated end-to-end by
running this branch's gateway against a production neuron and
streaming a live Anthropic request.

Closes #24

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:47:30 +03:00
06e4ffc25c Merge pull request 'feat(bench): reproducible benchmark harness + first fleet numbers (#22)' (#32) from feat/22-benchmark-harness into main
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 2m41s
build-prerelease / Package cortex RPM (push) Successful in 1m29s
build-prerelease / Test (push) Successful in 4m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:46:33 +00:00
a2e73a8907 feat(bench): reproducible batch-1 benchmark harness + first fleet numbers (#22)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m8s
CI / CUDA type-check (pull_request) Successful in 2m8s
CI / Clippy (push) Successful in 2m23s
CI / Test (pull_request) Successful in 3m54s
CI / Test (push) Successful in 6m23s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 4m23s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
script/bench.py: stdlib-only, works against any OpenAI-compatible /v1
endpoint (helexa, llama.cpp, Ollama, vLLM) so cross-engine tables are
a concatenation via the --label column. Measures the operator-felt
trio per (model, prompt-size) cell: TTFT (first SSE content chunk),
decode tok/s (visible tokens over the first→last chunk window,
chunk-per-token engine invariant since streaming usage frames aren't
emitted yet — #31), total wall-clock. Medians over N runs after one
warmup; append-only JSONL for longitudinal tracking.

Measurement traps found against the live fleet and handled:
- thinking models burn the budget invisibly (reasoning deltas are
  off-wire by default) — the prompt appends Qwen's /no_think soft
  switch
- short coalesced replies collapse the decode window to one TCP read
  — rates require a ≥200 ms window and the prompt demands ~300 words

doc/benchmarks.md: method, fleet table, and the first published
numbers (2026-06-12, 8f6f1d3): 1.7B@3060 81 tok/s, 8B@4090 62 tok/s,
27B@2×5090 Q6K TP=2 35 tok/s with flat decode from 128→4k context —
and the 7.1 s 4k-prefill TTFT recorded as #23's before-number.

Refs #22 (competitor baselines still pending — the harness is ready
for them)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:39:13 +03:00
8f6f1d3205 feat(deploy): validate neuron capability after every deploy
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 10m36s
build-prerelease / Build cortex binary (push) Successful in 2m35s
build-prerelease / Test (push) Successful in 6m35s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
A deploy previously went green the moment systemd reported the
service started — a merge that broke model loading or inference
itself would deploy "successfully" and only surface when a human
noticed. Each neuron deploy now earns its green:

1. Wait for default models: poll /health until activation.state is
   ready, with per-host timeouts in the matrix (beast 900s for the
   27B Q6K TP=2 cold-load, benjy/quadbrat 300s). Any entry in
   activation.failed fails the deploy with the per-model error —
   the structured equivalent of watching the journal for
   "loaded default model", plus failure detail the journal line
   can't carry.
2. LLM smoke probe: ask the first loaded model to reply with one
   specific word (max_tokens 512 so thinking models have room,
   temperature 0) and grep the response for it. Not a quality bar —
   just proof the deploy didn't lobotomize inference.

Hosts whose package is already current still skip everything — the
validation cost is only paid when a restart actually happened. The
probe was dry-run against benjy's production neuron before landing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:28:20 +03:00
b0d0b939af Merge pull request 'feat(gateway): per-request token metrics — TTFT and tok/s (#21)' (#30) from feat/gateway-21-token-metrics into main
Some checks failed
build-prerelease / Lint (fmt + clippy) (push) Blocked by required conditions
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build cortex binary (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 33s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:25:32 +00:00
6a36d15ef1 feat(gateway): per-request token metrics — TTFT and tok/s (#21)
All checks were successful
CI / Format (push) Successful in 45s
CI / Format (pull_request) Successful in 37s
CI / CUDA type-check (push) Successful in 2m25s
CI / Clippy (push) Successful in 2m37s
CI / Test (push) Successful in 4m22s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Clippy (pull_request) Successful in 2m23s
CI / Test (pull_request) Successful in 4m19s
CI / CUDA type-check (pull_request) Successful in 1m57s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The deferred Phase 6b, and the unblock for the 7→8 milestone's
benchmark work (#22): until cortex measures itself per request,
nothing downstream can be benchmarked or graphed.

The proxy wraps the upstream byte stream in a pass-through inspector
(TokenMetricsStream): chunks are forwarded verbatim — never buffered
or re-serialised — while the inspector records arrival times and
keeps a bounded (64 KiB) tail of the body text. At stream end (or
client disconnect, via Drop) it extracts the final OpenAI usage
object — present on the last SSE chunk and non-streaming JSON bodies
alike — for engine-truth token counts.

Per request, labelled {model, node}:
- cortex_time_to_first_token_seconds (histogram) — first body chunk
- cortex_tokens_per_second (histogram) — completion tokens over the
  decode window (first→last chunk); falls back to total request
  duration for single-chunk non-streaming bodies
- cortex_prompt_tokens_total / cortex_completion_tokens_total
  (counters)

The extractor is pure and chunk-boundary-safe; quoted-needle matching
keeps completion_tokens_details from shadowing completion_tokens,
and the last usage object wins. Covers chat completions, completions,
the Responses API, and the Anthropic streaming path (which currently
proxies OpenAI SSE).

Tests: 4 extractor unit tests; integration test with a streaming
mock emitting a stream_options-style final usage chunk, asserting
both histograms and exact-or-greater counter values (the test
recorder is process-global and shared across the binary's tests).

Closes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:11:52 +03:00
b463439416 Merge pull request 'feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)' (#29) from feat/neuron-19-driver-preflight into main
Some checks failed
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 29s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m11s
build-prerelease / Build cortex binary (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m24s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-blackwell (push) Successful in 10m18s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 12:08:20 +00:00
716558c8ff feat(neuron): startup preflight for NVIDIA driver/library mismatch (#19)
All checks were successful
CI / Format (push) Successful in 38s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 2m11s
CI / Clippy (push) Successful in 2m13s
CI / Clippy (pull_request) Successful in 2m37s
CI / Test (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 3m56s
CI / CUDA type-check (pull_request) Successful in 1m44s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
The un-rebooted driver update (userspace libs bumped, kernel module
still old) kills every CUDA call on the host including nvidia-smi,
and neuron surfaced it only as `Comm::from_rank ... NcclError` deep
inside the first model load — 30 minutes of forensics on beast
(2026-06-08) to diagnose. Make it instantly legible instead:

- discovery distinguishes nvidia-smi absent (CPU-only, fine) from
  present-but-failing, classifies the "Driver/library version
  mismatch" signature, and pairs the userspace NVML version with the
  loaded kernel-module version from /proc/driver/nvidia/version.
- DiscoveryResponse gains `cuda_unavailable_reason` (omitted when
  None — wire-compatible) so cortex can see why the node has no
  devices and route around it.
- startup logs one loud ERROR line with the actionable reason
  ("reboot the host to reload the kernel module") and skips default
  model loads entirely, marking each failed with that reason so
  /health activation shows the real cause.
- POST /models/load fast-rejects with 503 + code=cuda_unavailable on
  a mismatch host instead of dying minutes later in cuInit/NCCL.

No false positives: other nvidia-smi failures (no devices, perms)
keep their existing behaviour, CPU-only hosts stay silent.

Closes #19

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:00:00 +03:00
112e4e124a fix(ci): export RUSTC_WRAPPER in the build step itself — GITHUB_ENV doesn't propagate
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 32s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m22s
build-prerelease / Build cortex binary (push) Successful in 2m20s
build-prerelease / Test (push) Successful in 3m50s
build-prerelease / Build neuron-blackwell (push) Successful in 10m10s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ada (push) Successful in 14m29s
build-prerelease / Build neuron-ampere (push) Successful in 14m31s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Run 375 proved the CUDA image ships sccache (probe step printed
"sccache enabled") but the wrapper never reached cargo: the runner
does not propagate GITHUB_ENV across steps, so the builds ran
unwrapped (server stats: 4 compile requests for a ~600-crate build,
durations unchanged). Probe and export inside the build step's own
shell instead, in both build-neuron and ci.yml's cuda-check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:50:25 +03:00
dc6feec6dc fix(deploy): gate on the publish manifest, not unprivileged dnf check-update
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 31s
build-prerelease / Build cortex binary (push) Successful in 2m18s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m33s
build-prerelease / Test (push) Successful in 4m20s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 9m46s
build-prerelease / Build neuron-ampere (push) Successful in 13m57s
build-prerelease / Build neuron-ada (push) Successful in 15m29s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m8s
The f5fa840 deploy exposed both failure modes of gating with
`dnf check-update` as the gitea_ci user in one run: it hung
indefinitely on quadbrat (blocked process, 0 CPU, killed manually),
and on benjy/beast it silently reported "no updates" two minutes
after new RPMs were published — both hosts skipped a real (luckily
binary-identical) update.

Gate with data we own instead: fetch packages.json from
rpm.lair.cafe (plain curl, no privileges, no dnf locks), take the
newest release per package by buildTime, and skip the
stop/upgrade/start cycle only when it exactly equals
`rpm -q %{VERSION}-%{RELEASE}`. Unreachable or unparsable manifest
fails open to a full deploy. The dnf transaction itself still runs
under the scoped sudoers rules, unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:20:21 +03:00
02f20bc9e1 Merge pull request 'feat: keep auto-recovering models visible as recovering (#20)' (#28) from feat/neuron-20-recovering-status into main
Some checks failed
build-prerelease / Test (push) Blocked by required conditions
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Build neuron-ampere (push) Blocked by required conditions
build-prerelease / Build neuron-ada (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m39s
build-prerelease / Build cortex binary (push) Successful in 3m46s
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
2026-06-12 11:15:38 +00:00
2a231e49de merge main (sccache enablement supersedes branch cuda-check pin)
All checks were successful
CI / Format (push) Successful in 40s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m17s
CI / CUDA type-check (push) Successful in 2m39s
CI / CUDA type-check (pull_request) Successful in 2m30s
CI / Test (push) Successful in 4m51s
CI / Clippy (pull_request) Successful in 2m12s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m49s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
# Conflicts:
#	.gitea/workflows/ci.yml
2026-06-12 14:05:55 +03:00
2dadea5d8d ci: enable sccache on the build jobs (conditional on the CUDA image)
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 34s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m57s
build-prerelease / Test (push) Has been cancelled
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The 3 CUDA flavour builds (10-14 min each, the critical path of every
full run) and build-cortex compiled entirely uncached. With the
gongfoo-side sccache hardening in place, wire them up:

- build-cortex: full sccache env (rust image ships it) + the standard
  escalation loop (retry -> server restart -> uncached final attempt).
- build-neuron: probe for sccache before enabling the wrapper — the
  CUDA image may not ship it, and a missing binary must degrade to an
  uncached build, not fail cargo at `sccache rustc -vV` (the original
  reason the wrapper was cleared here). rustc compilations are shared
  across all three flavours; candle-kernels' nvcc output stays
  uncached (build-script artifact).
- ci.yml cuda-check: same probe pattern replaces the blanket env
  clear; also pins CUDA_COMPUTE_CAP=86 since the image no longer
  ships nvidia-smi for candle-kernels' fallback detection (mirrors
  9bb9678 on the #20 branch).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:05:26 +03:00
9bb9678f93 fix(ci): pin CUDA_COMPUTE_CAP in cuda-check — builder image has no nvidia-smi
All checks were successful
CI / Format (push) Successful in 37s
CI / Format (pull_request) Successful in 38s
CI / CUDA type-check (push) Successful in 1m45s
CI / Clippy (push) Successful in 2m24s
CI / Clippy (pull_request) Successful in 2m19s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
CI / Test (pull_request) Successful in 4m35s
CI / CUDA type-check (pull_request) Successful in 1m50s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
candle-kernels' build script shells out to nvidia-smi for compute-cap
detection when CUDA_COMPUTE_CAP is unset; the current GPU-less builder
image doesn't ship it, so the type-check died in the build script
before borrow-checking anything. Pin an arbitrary valid cap — the
check is feature-gate compilation only; real caps live in
build-prerelease.yml's flavour matrix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:55:23 +03:00
df9c490614 feat(neuron+gateway): keep auto-recovering models visible as recovering (#20)
Some checks failed
CI / Format (push) Successful in 37s
CI / CUDA type-check (pull_request) Failing after 28s
CI / Format (pull_request) Successful in 37s
CI / Clippy (push) Successful in 2m54s
CI / Clippy (pull_request) Successful in 3m36s
CI / Test (push) Successful in 4m37s
CI / Test (pull_request) Successful in 5m20s
CI / Build cortex SRPM (pull_request) Has been skipped
CI / Build neuron SRPM (pull_request) Has been skipped
CI / Publish cortex to COPR (pull_request) Has been skipped
CI / Publish neuron to COPR (pull_request) Has been skipped
CI / Bump version in source (pull_request) Has been skipped
CI / CUDA type-check (push) Failing after 31s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
During the #17 auto-recovery window (unload → reload, minutes for a
large TP model) the model's registry slot is absent, so it vanished
from neuron's /models — and cortex, routing by /models presence,
answered "model not found on any node" while a direct request to
neuron would have correctly said "recovering, retry shortly".

neuron: the recovery set becomes a map carrying a devices/capabilities
snapshot taken at trigger time (while the registry slot still exists).
list_models reports `recovering` for models in the set — both while
the poisoned slot is still present and during the reload gap, where
the snapshot keeps the model listed.

gateway: ModelStatus grows a Recovering variant (parsed from the
wire); the router holds the route — new RouteError::ModelRecovering
mapped to 503 instead of 404 — and deliberately does not fall through
to the catalogue cold-load, which would race a second placement
against the in-flight recovery. The evictor already ignores
non-Loaded entries.

Tests: neuron unit test (recovering model stays listed with snapshot),
gateway integration tests (poller parses `recovering`; request gets
503 retry-shortly and the model stays on /v1/models).

Closes #20

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:42:03 +03:00
f5fa840dfb ci: escalate sccache retries — restart server, then fall back uncached
All checks were successful
build-prerelease / Resolve version stamps + change detection (push) Successful in 30s
build-prerelease / Lint (fmt + clippy) (push) Successful in 2m6s
build-prerelease / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 3m45s
build-prerelease / Build neuron-blackwell (push) Successful in 9m59s
build-prerelease / Build neuron-ada (push) Successful in 14m11s
build-prerelease / Build neuron-ampere (push) Successful in 14m13s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m28s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m54s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Run 361's Test job failed all 3 attempts with the sccache
dead-server signature (sccache fatal error, ENOENT on its own tmp
files under target/debug/deps). Retrying the same invocation only
helps for transient races; against a wedged server every same-VM
retry fails identically — and under the new pipeline that blocks
publish and the deploy behind it.

Escalate instead: attempt 1 plain, attempt 2 after an sccache server
restart, attempt 3 with RUSTC_WRAPPER unset (uncached). A sick cache
now costs build minutes, never the deploy. Applied to the lint/test
jobs in build-prerelease.yml and ci.yml alike.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:24:02 +03:00
7557c5e877 ci: cut iteration latency — change-aware builds, gated deploys, dev fast path
Some checks failed
build-prerelease / Build neuron-blackwell (push) Blocked by required conditions
build-prerelease / Resolve version stamps + change detection (push) Successful in 28s
build-prerelease / Test (push) Failing after 1m16s
build-prerelease / Lint (fmt + clippy) (push) Successful in 3m7s
build-prerelease / Build cortex binary (push) Successful in 3m57s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Push-to-testable was ~20.5 min for every commit (measured on the
2026-06-08 green chain) plus a ~5 min 27B cold-load, regardless of
what changed. Three structural fixes:

- build-prerelease: a change-detection step in `prepare` diffs HEAD
  against the git sha embedded in the last *published* unstable RPM
  (per package, from packages.json) and skips builds whose inputs
  didn't change. Docs-only commits build nothing; gateway-only
  commits skip the 3 CUDA flavour builds. Detection failures fall
  open to a full build.
- ci.yml no longer runs on pushes to main; fmt/clippy/test live in
  build-prerelease as parallel jobs gating publish. The two workflows
  previously queued against each other on the same runner labels,
  delaying the cortex build ~12 min. Branches, PRs, and tags keep the
  full ci.yml gate.
- deploy: each host self-gates with `dnf check-update` and leaves the
  service untouched when the installed package is already current —
  no more neuron restarts (and 27B cold-loads) for commits that
  didn't change neuron.
- deploy-dev (new): manual single-host fast path — build one CUDA
  flavour, scp the binary, restart the service. Skips packaging,
  signing, publish, and dnf entirely. Backed by a new exact-form
  sudoers rule in asset/sudoers.d/neuron-host.conf (already applied
  to all three hosts).

Expected loop times when runners behave: docs ≈ 1 min (nothing
deploys), gateway-only ≈ 6-8 min, single-neuron dev ≈ 8-10 min,
full fleet ≈ 13-15 min.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:17:22 +03:00
91e95ca979 docs: rewrite README around project positioning
Some checks failed
CI / CUDA type-check (push) Failing after 46s
CI / Format (push) Successful in 47s
CI / Clippy (push) Successful in 2m53s
CI / Test (push) Successful in 4m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 39s
build-prerelease / Build cortex binary (push) Successful in 3m52s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-blackwell (push) Successful in 11m34s
build-prerelease / Build neuron-ampere (push) Successful in 15m31s
build-prerelease / Build neuron-ada (push) Successful in 15m37s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Lead with what helexa is for — near-frontier open-weight models on
consumer hardware you own — instead of a feature list. Adds the scope
section (intentional divergence from vLLM/SGLang; CUDA-only today as a
test-coverage constraint, not a principle), an engine section covering
the per-device worker threads and consumer-GPU tensor parallelism, the
previously-missing helexa-acp crate, and a status section pointing at
git.lair.cafe as the source of truth with GitHub as read-only mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:37:00 +03:00
1a74cb0c56 chore: rename repo cortex -> helexa
Some checks failed
CI / CUDA type-check (push) Failing after 30s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Format (push) Successful in 32s
build-prerelease / Build neuron-blackwell (push) Failing after 31s
build-prerelease / Build neuron-ada (push) Failing after 34s
build-prerelease / Build neuron-ampere (push) Failing after 38s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Clippy (push) Failing after 1m11s
build-prerelease / Build cortex binary (push) Successful in 3m47s
CI / Test (push) Successful in 5m32s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m22s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
helexa is the project; cortex (per-operator control plane / LLM proxy)
and neuron (per-host LLM harness) are its components. The Gitea repo
is now helexa/helexa. Update repository URLs in Cargo metadata, RPM
specs, and docs; make the CI changelog push URL rename-proof via the
github.repository context; reframe README.md and CLAUDE.md around the
project name. Binary, package, service, and config-path names are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 10:54:01 +03:00
60f5598542 build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync)
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / CUDA type-check (push) Successful in 31s
CI / Format (push) Successful in 35s
CI / Test (push) Failing after 1m9s
CI / Clippy (push) Successful in 2m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Build neuron-ampere (push) Successful in 7m35s
build-prerelease / Build neuron-ada (push) Successful in 5m7s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m14s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Build cortex binary (push) Successful in 4m33s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's
thread-safety invariant upstream) and makes `Comm::abort` idempotent via
an `aborted` flag (so abort-then-Drop can't double-free) — strictly
better than the previous Drop-no-panic workaround, and the `abort()`
signature is unchanged so the watchdog call site is unaffected.

Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` /
`NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119)
with neuron's manual `unsafe impl`s. Remove the four now-redundant impls
— the safety assertion lives upstream in cudarc where it belongs. The
conflict is in cuda-gated code, so only the CUDA type-check catches it
(non-cuda build + clippy + tests stay green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:14 +03:00
7945240646 chore: re-trigger deploy (#17 Stage 2, attempt 3)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m45s
build-prerelease / Build neuron-blackwell (push) Successful in 5m50s
CI / Test (push) Successful in 6m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m38s
build-prerelease / Build neuron-ada (push) Successful in 5m36s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
No code change. Each deploy run, the degraded CI runner kills a different
single arch build (blackwell, then ada) ~fast, and the all-arch-gated
packaging skips → no publish. Every arch HAS built green across runs
(blackwell  in 342, ampere , ada  in 339) and the gate + CUDA
type-check pass. Re-running to catch all three green in one run so the
Stage-2 RPMs publish. Runner FS/cache health is the real fix (separate
infra work).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:06:04 +03:00
0c74d89d15 chore: re-trigger deploy (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
build-prerelease / Build neuron-ada (push) Failing after 51s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Build neuron-blackwell (push) Successful in 6m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m42s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
CI / Test (push) Successful in 6m6s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
No code change. The c94a2ae deploy's neuron-blackwell build died ~12min
into the Blackwell kernel compile on the degraded runner, while
neuron-ampere + neuron-ada built the identical Rust + patched cudarc
cleanly and the CUDA type-check passed. Transient infra; re-running to
get a healthy blackwell build so the RPMs publish and beast (Blackwell)
picks it up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:45:16 +03:00
c94a2ae755 fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 44s
build-prerelease / Build cortex binary (push) Successful in 4m57s
build-prerelease / Package cortex RPM (push) Successful in 1m36s
CI / Test (push) Successful in 7m10s
CI / Clippy (push) Failing after 1m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m40s
build-prerelease / Build neuron-ada (push) Successful in 9m5s
build-prerelease / Build neuron-blackwell (push) Failing after 12m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
`super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state`
(nonexistent); the module is the child `nccl_state` (cf. the existing
`nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the
non-cuda build couldn't catch it; the branch CUDA type-check flaked on the
runner before compiling. Self-audited fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:21:43 +03:00
99920dd322 feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2)
Some checks failed
CI / CUDA type-check (push) Failing after 47s
CI / Format (push) Successful in 31s
CI / Test (push) Failing after 1m3s
CI / Clippy (push) Successful in 2m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Make a hung NCCL collective recoverable instead of a permanent brick.
Today a wedged collective hangs the in-process leader thread forever, and
even Stage 1's recovery can't help — its unload's DropTp queues behind the
stuck thread and hangs too.

- Cache the leader's NCCL Comm handle async-side at init (new cuda-gated
  Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on
  WorkerPool.leader_comm). Fetched while the thread is responsive — a
  wedged thread can't service the fetch, which is why it's cached up front.
- Wrap the leader forward in both generate_step and
  generate_step_with_images in tokio::time::timeout (default 120s,
  NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls
  Comm::abort() (ncclCommAbort) on the cached handle from the async
  thread — the one NCCL op sanctioned concurrently with an in-flight
  collective — which unblocks the leader thread, then fails the step
  WITHOUT draining (workers are wedged too; recovery's unload kills them).
  The error is a device fault → poison → Stage 1 auto-recovery, which now
  completes because the leader thread is responsive again.
- Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so
  the post-abort comm teardown during recovery doesn't double-abort-panic).

Logs the whole sequence at ERROR with greppable `tp watchdog:` /
`ncclCommAbort` markers so a real-world hang leaves a forensic trail —
verification is by inspecting journals after real hangs, not a synthetic
harness. cuda-gated → validated by the blackwell build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:15:29 +03:00
c4f239ceb9 build(neuron): patch cudarc to expose Comm::abort/get_async_error (#17 Stage 2)
All checks were successful
CI / CUDA type-check (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 6m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
#17 Stage 2 (TP hang-recovery) needs to call ncclCommAbort on a LIVE
communicator from another thread — to unblock a collective wedged on a
dead/hung peer so the ranks can resync. No cudarc release (incl. main)
exposes this: the safe Comm only aborts in Drop, which can't fire while a
stuck thread holds an Arc<Comm> clone.

Pin neuron's cudarc 0.19.7 to a fork (grenade/cudarc @ nccl-comm-abort,
rev 4dff0be) adding three thin methods — Comm::abort, get_async_error,
and a raw comm() accessor — to be submitted upstream. The patch targets
0.19.x only; candle's transitive cudarc 0.17.8 stays on crates.io.

Foundation only; the watchdog + abort + comm-rebuild that consume these
land in follow-up commits (cuda-gated → validated by the blackwell build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 13:49:59 +03:00
ac445c1569 chore: re-trigger deploy (#17 Stage 1)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
CI / Format (push) Successful in 37s
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Clippy (push) Successful in 3m54s
build-prerelease / Build cortex binary (push) Successful in 4m43s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-blackwell (push) Successful in 5m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 8m10s
build-prerelease / Build neuron-ada (push) Successful in 5m21s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m46s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
No code change. The abc6e60 deploy's neuron-ada build died on the
degraded CI runner (container dropped mid-checkout), skipping the
gated publish — even though neuron-blackwell + neuron-ampere compiled
the Stage-1 fault-recovery code cleanly. Re-running to get a healthy
ada build so the RPMs publish and beast picks up the build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:34:20 +03:00
abc6e605b8 test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17)
Some checks failed
CI / CUDA type-check (push) Failing after 19s
build-prerelease / Resolve version stamps (push) Successful in 43s
CI / Format (push) Successful in 50s
CI / Clippy (push) Failing after 57s
build-prerelease / Build neuron-ada (push) Failing after 48s
build-prerelease / Build cortex binary (push) Successful in 5m5s
build-prerelease / Build neuron-blackwell (push) Successful in 6m38s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 7m27s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
CI / Test (push) Successful in 10m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
One-shot, env-gated fault injector for beast verification: when
NEURON_DEBUG_POISON names a model, the first request for it triggers the
auto-recovery path as if a device fault had occurred — exercising
unload→reload→healthy without corrupting the GPU. Latched so it fires
exactly once (no recovery loop). No-op unless the env var is set; wired
into both the single-GPU and TP chat poison gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:08:40 +03:00
4f2957af9e feat(neuron): auto-recover poisoned models (#17 Stage 1c)
When an inference hit a device fault, the model was flagged poisoned and
every subsequent request rejected with "unload and reload the model to
recover" — until a *human* did exactly that. Now the harness rebuilds the
context automatically.

- Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+
  `LoadedHandle::spec()`) so a poisoned model can be reloaded without an
  operator reconstructing the spec.
- A background recovery task (held via `Weak<CandleHarness>`, spawned in
  `new()` when a runtime is present) drains poisoned model ids and runs
  `unload_model` → `load_model(spec)`. Unload drops the model → cudarc
  `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL
  init + sanity inside the load path, so a successful reload yields a
  fresh, healthy model. A failed reload leaves it unloaded (next load
  retries) — never poisoned forever.
- The request-entry poison gates now `trigger_recovery` (single-flight
  per model via a `recovering` set) and return a transient "recovering,
  retry shortly" error instead of the manual-reload message. Requests
  that arrive during the brief reload gap (model absent from the registry)
  also get "recovering" rather than a misleading "not loaded".

`new()` now returns `Arc<Self>`. Recovery runs only on the background
task — never inline on the request path, which holds `inference_lock`
and would deadlock on the `models` write lock.

Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog
(1b) + a fault-injection hook for beast verification follow. The
in-process rank-0 leader's own context fault still needs a reload that
can't rebind it (Stage 3); comm-desync + worker faults recover here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:05:02 +03:00
75cd088b61 fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14)
All checks were successful
CI / CUDA type-check (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m32s
build-prerelease / Build neuron-blackwell (push) Successful in 6m5s
CI / Test (push) Successful in 5m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 8m11s
build-prerelease / Build neuron-ada (push) Successful in 5m40s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m16s
Beast testing surfaced a real regression in the dynamic-resolution
default: a tall 808×1600 image resized (within the 1024² max_pixels) to a
90×44 patch grid = 3960 patches, exceeding the vision tower's hard
`num_position_embeddings = 2304` pos-embed budget. The per-rank
`patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP-
forward and poisoned the device context, bricking the model until reload.

Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches →
≤ 576 LM tokens), clamping even the operator env override. `smart_resize`
floors the pixel count under the cap, so no resized image can ever exceed
the budget — the tower check never fires, no poison. The pos-embed grid
(48×48) is the resolution Qwen3.6 was trained at, so the cap is
principled, not just defensive. Still ~3× the old fixed 196 tokens, and
the book-cover OCR test (1176 patches) already reads full title+subtitle.

Test: a huge/tall/wide/extreme image battery stays within the 2304 patch
budget. (Per-rank-error poison robustness itself remains issue #17.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 23:30:47 +03:00
d311c8ca7a feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5)
Some checks failed
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Format (push) Successful in 45s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m41s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-blackwell (push) Successful in 6m20s
build-prerelease / Build neuron-ampere (push) Successful in 7m18s
build-prerelease / Build neuron-ada (push) Successful in 5m10s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
- PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS /
  NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the
  NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain
  256²…1024² (64…1024 LM tokens/image).
- Test: a max-resolution source caps within the token budget (can't blow
  NEURON_MAX_PROMPT_TOKENS).
- Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from
  the preprocess, mod, and rope doc-comments now that resolution is
  dynamic and M-RoPE is implemented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:50:03 +03:00
c97a8654f5 feat(neuron): dynamic-resolution images via Qwen smart_resize (#14)
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
CI / CUDA type-check (push) Successful in 32s
CI / Format (push) Successful in 34s
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replace the fixed 448×448-square preprocess with native-aspect
`smart_resize`, and thread the resulting per-image grid through the LM
so spatial structure survives non-square images (documents, screenshots,
charts, panoramas, OCR) instead of being squished into a square.

- preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32;
  pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens).
  `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/
  `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized
  `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for
  the TP leader's token count.
- rope.rs: `compute_mrope_index`/`get_rope_index` take per-image
  `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`.
  Walk image runs in order, validate `run == gh*gw`, emit row-major
  positions, resume the shared counter at `base + max(gh,gw)`. Correct
  for multiple images of differing grids interleaved with text.
- candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the
  `image_grid_factor` (patch×merge) instead of the constant 196; all four
  prompt-build sites compute per-image counts from each image's resized
  grid (single-GPU from the extracted `ImageInput.h/w`, TP from
  `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`.
- single-GPU (`mod.rs`, `dispatch.rs`) and TP
  (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`,
  `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank
  recomputes grids from its own deterministic preprocess — no rpc.rs
  change, single source of truth.

The vision tower itself was already grid-general (recent pos-embed
interpolation + 2D rotary fix). No patch-count cap: pos-embed is
interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT
attention + prefill) instead.

Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index`
non-square + two-image + mismatch cases, square-grid regression guard.
Non-cuda build + clippy + full workspace tests green; TP load/dispatch
paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget
config + remaining doc cleanup follow in C5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:47:27 +03:00
dc048ffcc9 fix(neuron): vision-tower 2D positions + M-RoPE default on
All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m36s
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 5m59s
CI / Test (push) Successful in 6m35s
build-prerelease / Build neuron-ampere (push) Successful in 7m51s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s
Two fixes to the spatial handling of images, validated against the HF
transformers 4.57.1 qwen3_vl reference on beast.

**Vision tower (the real cause of poor spatial vision).** The Stage-A
tower encoded position two ways wrong, so the model saw image *content*
but not *layout* (a row of 5 people read as "a line of 23", sky
inverted), regardless of the LM-side rope:

- Learned pos-embed was a naive sequential lookup of the first
  `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid —
  wrong stride for a 28×28 patch grid. Now bilinearly interpolates the
  grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major.
- The 2D vision rotary was absent entirely. Added
  `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch
  `(row, col)` rotary to q/k in every ViT block via rope_slow, matching
  HF `apply_rotary_pos_emb_vision`.

Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1`
revert each for A/B (no rebuild). New unit tests: interpolation reduces
to the sequential lookup at the native grid; rotary row/col structure.

**M-RoPE default on.** The interleaved M-RoPE matches HF
apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥
plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in
— defaults should encode the model's trained behaviour, not freeze the
broken state.

Vision tower is plain candle (CPU-testable): built, clippy-clean, full
workspace tests green locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 20:53:07 +03:00