cortex

Author	SHA1	Message	Date
rob thijssen	60f5598542	build(neuron): bump cudarc fork to 63327a2 (idempotent abort + Comm Send+Sync) Some checks failed build-prerelease / Resolve version stamps (push) Successful in 29s Details CI / CUDA type-check (push) Successful in 31s Details CI / Format (push) Successful in 35s Details CI / Test (push) Failing after 1m9s Details CI / Clippy (push) Successful in 2m36s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 6m10s Details build-prerelease / Build neuron-ampere (push) Successful in 7m35s Details build-prerelease / Build neuron-ada (push) Successful in 5m7s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m14s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s Details build-prerelease / Build cortex binary (push) Successful in 4m33s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s Details The fork's new commit makes `Comm: Send + Sync` (asserting NCCL's thread-safety invariant upstream) and makes `Comm::abort` idempotent via an `aborted` flag (so abort-then-Drop can't double-free) — strictly better than the previous Drop-no-panic workaround, and the `abort()` signature is unchanged so the watchdog call site is unaffected. Because `Comm` is now `Send + Sync`, `Arc<Comm>` and the `SendComm` / `NcclState` wrappers auto-derive `Send`/`Sync`, which conflicts (E0119) with neuron's manual `unsafe impl`s. Remove the four now-redundant impls — the safety assertion lives upstream in cudarc where it belongs. The conflict is in cuda-gated code, so only the CUDA type-check catches it (non-cuda build + clippy + tests stay green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:33:14 +03:00
rob thijssen	c94a2ae755	fix(neuron): correct nccl_state path on WorkerPool.leader_comm (#17 S2) Some checks failed CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 44s Details build-prerelease / Build cortex binary (push) Successful in 4m57s Details build-prerelease / Package cortex RPM (push) Successful in 1m36s Details CI / Test (push) Successful in 7m10s Details CI / Clippy (push) Failing after 1m21s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 8m40s Details build-prerelease / Build neuron-ada (push) Successful in 9m5s Details build-prerelease / Build neuron-blackwell (push) Failing after 12m2s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details `super::nccl_state` from tp/mod.rs resolves to `crate::harness::nccl_state` (nonexistent); the module is the child `nccl_state` (cf. the existing `nccl_state::generate_comm_id_hex` call). The field is cuda-gated so the non-cuda build couldn't catch it; the branch CUDA type-check flaked on the runner before compiling. Self-audited fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:21:43 +03:00
rob thijssen	99920dd322	feat(neuron): TP step watchdog aborts wedged collectives (#17 Stage 2) Some checks failed CI / CUDA type-check (push) Failing after 47s Details CI / Format (push) Successful in 31s Details CI / Test (push) Failing after 1m3s Details CI / Clippy (push) Successful in 2m44s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Make a hung NCCL collective recoverable instead of a permanent brick. Today a wedged collective hangs the in-process leader thread forever, and even Stage 1's recovery can't help — its unload's DropTp queues behind the stuck thread and hangs too. - Cache the leader's NCCL Comm handle async-side at init (new cuda-gated Job::GetLeaderComm → DeviceWorkerHandle::get_leader_comm → stored on WorkerPool.leader_comm). Fetched while the thread is responsive — a wedged thread can't service the fetch, which is why it's cached up front. - Wrap the leader forward in both generate_step and generate_step_with_images in tokio::time::timeout (default 120s, NEURON_TP_STEP_TIMEOUT_S). On expiry the watchdog calls Comm::abort() (ncclCommAbort) on the cached handle from the async thread — the one NCCL op sanctioned concurrently with an in-flight collective — which unblocks the leader thread, then fails the step WITHOUT draining (workers are wedged too; recovery's unload kills them). The error is a device fault → poison → Stage 1 auto-recovery, which now completes because the leader thread is responsive again. - Bumps the cudarc patch to dbc425a (adds the Drop-must-not-panic fix so the post-abort comm teardown during recovery doesn't double-abort-panic). Logs the whole sequence at ERROR with greppable `tp watchdog:` / `ncclCommAbort` markers so a real-world hang leaves a forensic trail — verification is by inspecting journals after real hangs, not a synthetic harness. cuda-gated → validated by the blackwell build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:15:29 +03:00
rob thijssen	abc6e605b8	test(neuron): NEURON_DEBUG_POISON hook to verify auto-recovery (#17 ) Some checks failed CI / CUDA type-check (push) Failing after 19s Details build-prerelease / Resolve version stamps (push) Successful in 43s Details CI / Format (push) Successful in 50s Details CI / Clippy (push) Failing after 57s Details build-prerelease / Build neuron-ada (push) Failing after 48s Details build-prerelease / Build cortex binary (push) Successful in 5m5s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m38s Details build-prerelease / Package cortex RPM (push) Successful in 1m27s Details build-prerelease / Build neuron-ampere (push) Successful in 7m27s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details CI / Test (push) Successful in 10m27s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details One-shot, env-gated fault injector for beast verification: when NEURON_DEBUG_POISON names a model, the first request for it triggers the auto-recovery path as if a device fault had occurred — exercising unload→reload→healthy without corrupting the GPU. Latched so it fires exactly once (no recovery loop). No-op unless the env var is set; wired into both the single-GPU and TP chat poison gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:08:40 +03:00
rob thijssen	4f2957af9e	feat(neuron): auto-recover poisoned models (#17 Stage 1c) When an inference hit a device fault, the model was flagged poisoned and every subsequent request rejected with "unload and reload the model to recover" — until a human did exactly that. Now the harness rebuilds the context automatically. - Retain the loading `ModelSpec` on `LoadedModel`/`TpLoadedModel` (+ `LoadedHandle::spec()`) so a poisoned model can be reloaded without an operator reconstructing the spec. - A background recovery task (held via `Weak<CandleHarness>`, spawned in `new()` when a runtime is present) drains poisoned model ids and runs `unload_model` → `load_model(spec)`. Unload drops the model → cudarc `Comm::drop` aborts NCCL + releases the context; reload re-runs NCCL init + sanity inside the load path, so a successful reload yields a fresh, healthy model. A failed reload leaves it unloaded (next load retries) — never poisoned forever. - The request-entry poison gates now `trigger_recovery` (single-flight per model via a `recovering` set) and return a transient "recovering, retry shortly" error instead of the manual-reload message. Requests that arrive during the brief reload gap (model absent from the registry) also get "recovering" rather than a misleading "not loaded". `new()` now returns `Arc<Self>`. Recovery runs only on the background task — never inline on the request path, which holds `inference_lock` and would deadlock on the `models` write lock. Stage 1c of the #17 plan (verified-healthy auto-recovery). Watchdog (1b) + a fault-injection hook for beast verification follow. The in-process rank-0 leader's own context fault still needs a reload that can't rebind it (Stage 3); comm-desync + worker faults recover here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:05:02 +03:00
rob thijssen	75cd088b61	fix(neuron): cap vision max_pixels to the pos_embed patch budget (#14 ) All checks were successful CI / CUDA type-check (push) Successful in 31s Details build-prerelease / Resolve version stamps (push) Successful in 29s Details CI / Format (push) Successful in 30s Details CI / Clippy (push) Successful in 2m32s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m5s Details CI / Test (push) Successful in 5m49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 8m11s Details build-prerelease / Build neuron-ada (push) Successful in 5m40s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m4s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m16s Details Beast testing surfaced a real regression in the dynamic-resolution default: a tall 808×1600 image resized (within the 1024² max_pixels) to a 90×44 patch grid = 3960 patches, exceeding the vision tower's hard `num_position_embeddings = 2304` pos-embed budget. The per-rank `patch count 3960 exceeds pos_embed budget 2304` error fired mid-TP- forward and poisoned the device context, bricking the model until reload. Hard-cap `max_pixels` to `2304 × 16² = 589_824` px (≤ 2304 patches → ≤ 576 LM tokens), clamping even the operator env override. `smart_resize` floors the pixel count under the cap, so no resized image can ever exceed the budget — the tower check never fires, no poison. The pos-embed grid (48×48) is the resolution Qwen3.6 was trained at, so the cap is principled, not just defensive. Still ~3× the old fixed 196 tokens, and the book-cover OCR test (1176 patches) already reads full title+subtitle. Test: a huge/tall/wide/extreme image battery stays within the 2304 patch budget. (Per-rank-error poison robustness itself remains issue #17.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 23:30:47 +03:00
rob thijssen	d311c8ca7a	feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5) Some checks failed CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Format (push) Successful in 45s Details CI / Test (push) Failing after 58s Details CI / Clippy (push) Successful in 2m41s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m14s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m20s Details build-prerelease / Build neuron-ampere (push) Successful in 7m18s Details build-prerelease / Build neuron-ada (push) Successful in 5m10s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m7s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s Details - PreprocessProfile::qwen3_6() reads NEURON_VISION_MIN_PIXELS / NEURON_VISION_MAX_PIXELS (clamped to factor² ≤ min ≤ max), matching the NEURON_VISION_LEGACY_* / NEURON_MROPE knob convention. Defaults remain 256²…1024² (64…1024 LM tokens/image). - Test: a max-resolution source caps within the token budget (can't blow NEURON_MAX_PROMPT_TOKENS). - Strip stale fixed-resolution / "MRoPE gap (#15)" / 14×14 language from the preprocess, mod, and rope doc-comments now that resolution is dynamic and M-RoPE is implemented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:50:03 +03:00
rob thijssen	c97a8654f5	feat(neuron): dynamic-resolution images via Qwen smart_resize (#14 ) Some checks failed CI / Clippy (push) Waiting to run Details CI / Test (push) Waiting to run Details CI / CUDA type-check (push) Successful in 32s Details CI / Format (push) Successful in 34s Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details Replace the fixed 448×448-square preprocess with native-aspect `smart_resize`, and thread the resulting per-image grid through the LM so spatial structure survives non-square images (documents, screenshots, charts, panoramas, OCR) instead of being squished into a square. - preprocess.rs: port Qwen `smart_resize` (factor = patch×merge = 32; pixel budget [min,max], default 256²–1024² → 64–1024 LM tokens). `PreprocessProfile` drops the fixed target dims for `factor`/`min_pixels`/ `max_pixels`; `preprocess`/`preprocess_data_uri` now return the resized `(h, w)`; add `resized_dims_for_uri` (decode + resize, no normalize) for the TP leader's token count. - rope.rs: `compute_mrope_index`/`get_rope_index` take per-image `grids: &[(lm_gh, lm_gw)]` instead of assuming a square `isqrt(run)`. Walk image runs in order, validate `run == gh*gw`, emit row-major positions, resume the shared counter at `base + max(gh,gw)`. Correct for multiple images of differing grids interleaved with text. - candle.rs: `VisionMeta`/`LoadedModel`/`TpLoadedModel` carry the `image_grid_factor` (patch×merge) instead of the constant 196; all four prompt-build sites compute per-image counts from each image's resized grid (single-GPU from the extracted `ImageInput.h/w`, TP from `resized_dims_for_uri`). `ModelArch` gains `vision_grid_factor`. - single-GPU (`mod.rs`, `dispatch.rs`) and TP (`tp_qwen3_5.rs::prefill_with_images_chunked`, `dispatch.rs`, `tp/worker.rs`) thread the grids into `get_rope_index`. Each TP rank recomputes grids from its own deterministic preprocess — no rpc.rs change, single source of truth. The vision tower itself was already grid-general (recent pos-embed interpolation + 2D rotary fix). No patch-count cap: pos-embed is interpolated to any grid; `max_pixels` bounds cost (O(patches²) ViT attention + prefill) instead. Tests: smart_resize (aspect/cap/floor/reject), `compute_mrope_index` non-square + two-image + mismatch cases, square-grid regression guard. Non-cuda build + clippy + full workspace tests green; TP load/dispatch paths are cuda-gated → Gitea CUDA type-check. Operator pixel-budget config + remaining doc cleanup follow in C5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:47:27 +03:00
rob thijssen	dc048ffcc9	fix(neuron): vision-tower 2D positions + M-RoPE default on All checks were successful CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 33s Details CI / Clippy (push) Successful in 2m36s Details build-prerelease / Build cortex binary (push) Successful in 4m48s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m59s Details CI / Test (push) Successful in 6m35s Details build-prerelease / Build neuron-ampere (push) Successful in 7m51s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ada (push) Successful in 5m13s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m6s Details Two fixes to the spatial handling of images, validated against the HF transformers 4.57.1 qwen3_vl reference on beast. Vision tower (the real cause of poor spatial vision). The Stage-A tower encoded position two ways wrong, so the model saw image content but not layout (a row of 5 people read as "a line of 23", sky inverted), regardless of the LM-side rope: - Learned pos-embed was a naive sequential lookup of the first `n_patches` rows of the 48×48 (`num_position_embeddings=2304`) grid — wrong stride for a 28×28 patch grid. Now bilinearly interpolates the grid to `gh×gw` (port of HF `fast_pos_embed_interpolate`), row-major. - The 2D vision rotary was absent entirely. Added `VisionRotaryEmbedding` (θ=10000, dim=head_dim/2) applying per-patch `(row, col)` rotary to q/k in every ViT block via rope_slow, matching HF `apply_rotary_pos_emb_vision`. Both default on; `NEURON_VISION_LEGACY_POS=1` / `NEURON_VISION_LEGACY_ROPE=1` revert each for A/B (no rebuild). New unit tests: interpolation reduces to the sequential lookup at the native grid; rotary row/col structure. M-RoPE default on. The interleaved M-RoPE matches HF apply_interleaved_mrope / get_rope_index exactly and A/B'd strictly ≥ plain. `NEURON_MROPE` is now a kill switch (`=0` for plain), not opt-in — defaults should encode the model's trained behaviour, not freeze the broken state. Vision tower is plain candle (CPU-testable): built, clippy-clean, full workspace tests green locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 20:53:07 +03:00
rob thijssen	7ebcfba5ca	fix(neuron): gate M-RoPE behind NEURON_MROPE (default off) All checks were successful CI / CUDA type-check (push) Successful in 33s Details build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 33s Details CI / Clippy (push) Successful in 2m34s Details build-prerelease / Build cortex binary (push) Successful in 4m33s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m14s Details CI / Test (push) Successful in 6m50s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 8m12s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ada (push) Successful in 5m9s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m3s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s Details On beast the interleaved M-RoPE degraded image understanding rather than fixing it: the model misread spatial layout (a horizontal row of people described as a "diagonal receding line"), got attributes wrong, and rambled — a "how many people" follow-up generated 4459 tokens over 3.5 minutes, past agent-0's HTTP timeout (the "fails to respond without an error"). The interleave is evidently not numerically correct, and it can't be validated remotely without a transformers reference. Gate it: `get_rope_index` now returns plain sequential identity positions unless NEURON_MROPE is truthy, so mrope_cos_sin reduces to plain RoPE and image tokens behave exactly as pre-M-RoPE (content recognition works; spatial layout approximate; no rambling). The real computation moves to `compute_mrope_index` (still unit-tested). Default off restores the working vision and unblocks agent-0; the M-RoPE code stays in place to debug + validate before flipping the default on. Pure non-cuda change (rope.rs); both single-GPU and TP forwards call the gated get_rope_index unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 19:32:59 +03:00
rob thijssen	825bf4e905	feat(neuron): M-RoPE Stage 4 — wire interleaved M-RoPE into the TP path All checks were successful build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / CUDA type-check (push) Successful in 31s Details CI / Format (push) Successful in 42s Details build-prerelease / Build cortex binary (push) Successful in 5m9s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m4s Details build-prerelease / Package cortex RPM (push) Successful in 1m32s Details CI / Test (push) Successful in 7m19s Details build-prerelease / Build neuron-ampere (push) Successful in 8m40s Details build-prerelease / Build neuron-ada (push) Successful in 5m17s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m53s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s Details CI / Clippy (push) Successful in 2m29s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Mirror Stage 3 into the tensor-parallel Qwen3.6 model: - TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar offset and apply via apply_cos_sin. - TpQwen3_5Model gains the replicated rotary + rope_delta (reset in clear_kv_cache, settable). forward_inner builds the cos/sin once — interleaved M-RoPE from explicit position_ids (vision) or plain at offset+rope_delta (text/decode). forward() and forward_with_positions() delegate; the old single-shot forward_with_vision is gone. - prefill_with_images_chunked now computes get_rope_index over the whole prompt once, stores rope_delta on the base model, and slices the (3, prompt_len) position tensor per chunk — so every rank assigns image tokens their 14×14 grid coordinates and steps in lockstep (every chunk, text or image, carries the M-RoPE slice because the image shifts the surrounding text positions). Also build the position-id tensor as f32 directly (positions are small integers, exact in f32) to avoid an i64→f32 cast on the GPU. The TP forward is cuda-gated — CI CUDA type-check is the compile gate. Non-cuda build + clippy + full workspace tests green; rope math + the plain-RoPE-reduction invariant covered by unit tests. Completes the interleaved-M-RoPE work for the vision spatial misread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:46:27 +03:00
rob thijssen	4c12c7e2f0	feat(neuron): M-RoPE Stage 3 — wire interleaved M-RoPE into single-GPU Qwen3_5Model now builds the rotary cos/sin once per forward and threads (cos, sin) through the decoder → full-attention → rope, replacing the scalar offset that reached RotaryEmbedding: - vision forward computes get_rope_index over the (single-shot) prompt, sets rope_delta, and builds interleaved-M-RoPE cos/sin so image tokens carry their 14×14 grid (height/width) positions; - text / decode take plain_cos_sin at offset + rope_delta — with rope_delta == 0 (no image) this is bit-for-bit the old plain RoPE, and the device→host id copy is skipped on the text decode hot path. rope_delta is stored on the model and reset in clear_kv_cache, so decode after a vision prefill resumes text positions from the image-compressed counter. decoder.rs / full_attn.rs take (cos, sin) instead of offset; linear-attention layers are unchanged (no RoPE). The TP path still uses the retained apply(offset) — wired in Stage 4. Full workspace tests green; the load-bearing invariant (M-RoPE == plain for equal axes) keeps text unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:39:52 +03:00
rob thijssen	ba1b5ba408	feat(neuron): M-RoPE Stage 2 — get_rope_index position-id helper Pure function computing the interleaved-M-RoPE 3D position ids for a prompt with image-placeholder runs, plus the decode rope_delta: text tokens advance a single counter (all axes equal); each image run gets [base+t, base+h, base+w] row-major over a square grid_t=1, grid_h=grid_w=isqrt(run) (196 → 14×14); the counter resumes from base + max(grid). rope_delta = final_counter - seq_len lets decode resume text positions after the position-compressed image blocks. Plus mrope_position_tensor to build the (3, seq) tensor. Unit tests: text-only is sequential (delta 0); text+image+text matches hand-computed grid ids + resume + delta; 196 → 14×14; non-square run rejected; end-to-end through mrope_cos_sin tracks the height axis. #[allow(dead_code)] until Stage 3/4 wire it into the forward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:34:28 +03:00
rob thijssen	5731f4c318	feat(neuron): M-RoPE Stage 1 — interleaved rope machinery + config Parse + store mrope_section / mrope_interleaved in RopeParameters (previously accepted-but-ignored). RotaryEmbedding gains: - inv_freq + per-axis column masks (mask_t/h/w) built from mrope_section; - plain_cos_sin(pos, seq_len): narrow the precomputed tables (text/decode); - mrope_cos_sin(position_ids (3,seq)): per-axis freqs blended at the interleave columns (vision); - apply_cos_sin(q,k,cos,sin): the rope_slow application, factored out. The existing apply(q,k,offset) is retained (delegates to plain_cos_sin + apply_cos_sin) so current callers are unchanged; Stages 3–4 move cos/sin construction into the model forward and thread the 3D position ids for image tokens. Tests: masks partition the half-dim; interleave drives the right axis per column; and the load-bearing invariant — mrope_cos_sin reduces bit-for-bit to plain_cos_sin when the three axes are equal (so text inference is unchanged). Refs the MRoPE-gap diagnosis (vision spatial misread). Pure non-cuda; no behaviour change until wired. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 18:31:15 +03:00
rob thijssen	fa013505d1	fix(neuron): chunked TP-vision prefill + pre-flight VRAM guard All checks were successful build-prerelease / Resolve version stamps (push) Successful in 29s Details build-prerelease / Build cortex binary (push) Successful in 4m26s Details build-prerelease / Package cortex RPM (push) Successful in 1m18s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m6s Details build-prerelease / Build neuron-ampere (push) Successful in 8m30s Details CI / Format (push) Successful in 38s Details CI / CUDA type-check (push) Successful in 47s Details CI / Clippy (push) Successful in 2m36s Details build-prerelease / Build neuron-ada (push) Successful in 5m19s Details CI / Test (push) Successful in 6m3s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m32s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s Details agent-0 sent a ~13k-token prompt + image; the TP vision prefill was single-shot, so it tried to materialise activations for all 12,960 positions at once and OOM'd rank 1 mid-forward. Rank 1 died before issuing its row-parallel AllReduce, stranding rank 0 on the collective (it hung holding the pool lock). The text path survives the same size because it chunks the prefill. Chunk the vision prefill the same way: - TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the image(s) once, then walks the pre-expanded prompt in prefill_chunk_tokens() windows, splicing the patch-embedding rows into whichever chunk(s) carry <\|image_pad\|> positions (pure-text chunks take the plain forward). Activation is bounded by the chunk, not the prompt. - Every rank runs the identical chunk sequence (chunk_size threaded through GenerateStepWithImages / TpForwardLogitsWithImages / generate_step_with_images), so the per-chunk AllReduces stay paired across ranks with no extra sync — the KV cache accumulates via the growing offset, only the last chunk's logits are kept. Pre-flight guard (validate_vision_prefill): even chunked, a long prompt's KV cache can exhaust VRAM mid-forward, and on TP that hangs the collective. Reject up front with a clean InsufficientVram when the estimated footprint exceeds free VRAM, so a doomed request fails fast instead of hanging the daemon. Heuristic + tunable (NEURON_VISION_PREFILL_MB_PER_1K_TOKENS / _BASE_MB); default permissive so the now-working 12,960-token case still passes. Applied to every vision path (single-GPU + TP); single-GPU vision stays single-shot for now, so the guard is its protection until it's chunked too. Tests: pre-flight guard behaviour; RPC round-trip carries chunk_size. The chunked forward is cuda-gated — CI CUDA type-check validates it. Refs #16 / TP-vision. Operational note: a TP rank OOM still hangs the daemon (needs restart); making a worker failure abort the leader's collective is separate, broader TP hardening. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 17:21:36 +03:00
rob thijssen	c8bcaabc38	fix(neuron): render HF chat templates via minijinja pycompat All checks were successful build-prerelease / Resolve version stamps (push) Successful in 29s Details CI / Format (push) Successful in 34s Details CI / CUDA type-check (push) Successful in 39s Details CI / Clippy (push) Successful in 2m35s Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m4s Details CI / Test (push) Successful in 6m47s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 7m43s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ada (push) Successful in 5m41s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m5s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details The Qwen3.6 chat_template.jinja (now loaded after the precedence fix) failed to render in minijinja: it uses Python str methods (content.startswith/endswith/split/rstrip/lstrip) and the raise_exception global that HF transformers patches into its Jinja env but minijinja doesn't provide. The render error tripped the text-only fallback, so image requests still produced zero <\|image_pad\|> tokens. Wire the standard bridge into render_chat_template: - minijinja-contrib `pycompat::unknown_method_callback` supplies the Python string/list/dict methods; - a `raise_exception` global maps to a render error (so malformed inputs — e.g. an image in a system message — surface cleanly). Add the real Qwen3.6-27B chat_template.jinja (verbatim from beast's HF cache) as a test fixture and assert it renders one <\|image_pad\|> for a text+image turn — the end-to-end check that would have caught this before deploy. Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:32:23 +03:00
rob thijssen	7ad56c6a86	fix(neuron): load chat_template.jinja (transformers precedence) The chat-template loader only read the `chat_template` field from tokenizer_config.json. Qwen3.6-27B ships its vision-aware template only in a standalone `chat_template.jinja` (and has no tokenizer_config.json at all), so the loader returned None and image requests fell back to the text-only format_qwen3_prompt — rendering zero `<\|image_pad\|>` tokens and tripping "expand_image_pad_tokens: prompt has 0 image_token_id occurrences". load_chat_template_alongside now follows HF transformers precedence: standalone chat_template.jinja → chat_template.json → the chat_template field in tokenizer_config.json. Tests cover the precedence, the text-only fallback, and that an OpenAI image_url content part renders `<\|image_pad\|>` through the real template condition (`'image_url' in item`). Refs #16 / TP-vision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 16:25:30 +03:00
rob thijssen	1b0e36c119	fix(neuron): cover TpForwardLogitsWithImages in drain_poisoned match All checks were successful CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 37s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m41s Details build-prerelease / Build cortex binary (push) Successful in 4m18s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m48s Details build-prerelease / Package cortex RPM (push) Successful in 1m32s Details CI / Test (push) Successful in 6m20s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 8m26s Details build-prerelease / Build neuron-ada (push) Successful in 5m21s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s Details The CUDA type-check caught a non-exhaustive match: drain_poisoned() must reply an error to every Job variant's reply channel, including the new cuda-gated TpForwardLogitsWithImages. The non-cuda build couldn't see it — the variant is #[cfg(feature = "cuda")], so the match is exhaustive without it on CPU. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:26:46 +03:00
rob thijssen	ed2d09864e	feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill Some checks failed CI / Format (push) Successful in 30s Details CI / Clippy (push) Successful in 2m51s Details CI / Test (push) Successful in 5m52s Details CI / CUDA type-check (push) Failing after 50s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details End-to-end TP-vision: an image request to a TP-loaded Qwen3.6-27B now conditions on the image across both ranks. - TpLoadedModel carries has_vision / image_token_id / lm_tokens_per_image, populated at load via the shared VisionMeta::from_config_path (same config.json the shards loaded from; Stage 1 materialises the replicated tower on every rank). - LoadedHandle::capabilities() now advertises "vision" for TP loads with a tower (cortex-gateway already unions this into /v1/models via C3). - The TP rejection guards (chat_completion_tp + inference_tp_stream) are now conditional on !has_vision — text-only TP models still 400 cleanly, vision-capable ones fall through. - chat_completion_tp_inner and the streaming orchestration task detect images (request_has_images), expand <\|image_pad\|> to the per-image patch count, and run a single-shot generate_step_with_images prefill (every rank encodes + splices its replicated tower) before the unchanged decode loop. Text requests keep chunked_prefill_tp. - extract_image_data_uris ships the source data URIs to every rank for identical per-rank preprocessing. prompt_tokens now reflects the patch expansion, so usage accounting and KV offsets match the single-GPU baseline. TP entry points are cuda-gated (validated by CI's CUDA type-check); capabilities() + extract_image_data_uris + VisionMeta reuse compile on the non-cuda build. Full workspace test green. Refs TP-vision plan Stage 3. Implements #12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:14:44 +03:00
rob thijssen	4994b94c84	feat(neuron): TP-vision Stage 2 — per-rank image RPC + worker plumbing Carry image content through the TP forward path so every rank encodes and splices locally (replicated tower, no embedding broadcast). - rpc.rs: new WorkerRequest::GenerateStepWithImages carrying the source image data URIs + image_token_id for the single-shot vision prefill; worker still replies GenerateStepOk. Round-trip test added. - tp_qwen3_5.rs: TpQwen3_5ForCausalLM::forward_with_images — encode each preprocessed image through the rank's replicated tower, cat, splice, forward. Shared by leader and worker so every rank runs identical work. - tp/mod.rs: TpLeaderModel::forward_with_images and WorkerPool::generate_step_with_images (mirrors generate_step: fan out GenerateStepWithImages to subprocess ranks, run the leader's image forward on its device worker thread, drain, combine). - worker.rs: WorkerModel::forward_with_images + handle_generate_step_with_images — each subprocess rank preprocesses the same data URIs via the shared deterministic preprocess_data_uri, encodes, splices, forwards. - device_worker: Job::TpForwardLogitsWithImages + tp_forward_logits_with_images dispatch handler + DeviceWorkerHandle::tp_forward_logits_with_images. Determinism: every rank runs the same preprocess on the same source URIs through the same replicated tower, so the spliced hidden state matches across ranks — preserving the replicated-hidden-state invariant the row-parallel AllReduce relies on, with no NCCL broadcast. No caller yet — Stage 3 wires the TP chat/stream entry points to invoke generate_step_with_images for image prefill. cuda-gated plumbing covered by CI's CUDA type-check; rpc/route/forward_with_images compile on the non-cuda build. Refs TP-vision plan Stage 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:08:08 +03:00
rob thijssen	9a24b05866	feat(neuron): TP-vision Stage 1 — replicated vision tower on the TP model Load the full, unsharded model.visual.* vision tower on every TP rank (leader + each subprocess worker mmaps the same local safetensors) when config.vision_config is present. VisionTower::load already takes a ShardedVarBuilder whose plain .get() returns the full replicated tensor, so the tower loads identically regardless of world_size — no sharding, no NCCL broadcast. - TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id, plus has_vision/image_token_id/encode_image/forward_with_vision, mirroring the single-GPU Qwen3_5ForCausalLM wrapper. - TpQwen3_5Model::forward_with_vision mirrors the single-GPU forward_inner splice: embed locally, replace rows at image_token_id positions, run the sharded decoder stack. Because every rank encodes the same pixels through its replicated tower, the spliced input embeddings are identical across ranks — preserving the TP replicated-hidden-state invariant the row-parallel AllReduce relies on. - splice_runs is now pub(crate) and shared with the TP model. No caller yet — Stage 2 wires the RPC/worker path that invokes encode_image + forward_with_vision per rank. Most of this compiles on the non-cuda build (only the cuda load variant's tower line is gated); CI's CUDA type-check covers the rest. Refs TP-vision plan Stage 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 15:00:05 +03:00
rob thijssen	f8c0da0ebf	fix(neuron): TP-vision Stage 0 — reject image requests on the TP path Some checks failed build-prerelease / Resolve version stamps (push) Waiting to run Details CI / Format (push) Waiting to run Details CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Build cortex binary (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details CI / Clippy (push) Has been cancelled Details CI / Test (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details The TP inference path has no vision tower, and the TP dispatch in chat_completion / inference_stream returns before the VisionUnsupported guard runs — so an image request to a TP-loaded model (e.g. beast's tp=2 Qwen3.6-27B) was silently dropped and answered from text alone, the exact issue-#3 confident-hallucination pattern Stage C killed for single-GPU. Add the request_has_images → VisionUnsupported guard to both chat_completion_tp and inference_tp_stream, before prefill / before the SSE stream opens, so beast returns a clean 400 vision_unsupported. The guard is unconditional for now (TP has no tower); Stage 3 makes it conditional on the TP model's has_vision once real TP-vision lands. Detection is covered by the existing request_has_images unit test; the guard itself is cuda-gated (validated by CI's CUDA type-check). Refs TP-vision plan Stage 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 14:53:56 +03:00
rob thijssen	dd592d918d	test(neuron): C2 — guard Responses→chat image translation contract All checks were successful CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Format (push) Successful in 44s Details CI / Clippy (push) Successful in 2m51s Details build-prerelease / Build cortex binary (push) Successful in 4m42s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m52s Details CI / Test (push) Successful in 6m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Successful in 8m12s Details build-prerelease / Package cortex RPM (push) Successful in 1m26s Details build-prerelease / Build neuron-ada (push) Successful in 5m34s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s Details The Responses request translator already emits the chat `image_url` Parts array Stage B5's vision path consumes, and the non-streaming (`chat_completion`) and streaming (`responses_stream` → `inference_stream`, Stage C1) Responses paths both route image content to the vision-aware prefill — so vision works end-to-end through `/v1/responses` with no translator change required. Add a multi-image test asserting order preservation and that the `detail` hint is tolerated (and dropped, since chat image_url has no analogue), locking the translator's output to the exact `image_url.url` shape `extract_images_from_request` walks. Closes part of #16 (Stage C2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:43 +03:00
rob thijssen	766c20ba47	feat(neuron): C1 — streaming SSE chat completion with vision The streaming worker path now splices image embeddings on prefill, closing the silent text-only degrade for `stream=true` image requests. `inference_stream` gains the same vision-routing block as the non-streaming `chat_completion`: detect `image_url` content, reject it against text-only models with `VisionUnsupported` (before any SSE frame is sent), preprocess each image and expand its `<\|image_pad\|>` sentinel to the per-image patch count, then carry the payload through dispatch. Rather than duplicate the 75-line `route_token!` reasoning/tool-call state machine into a sibling streamer, `stream_inference_via_worker` takes an `Option<(Vec<ImageInput>, u32)>`: when `Some`, prefill is a single-shot `forward_logits_with_images` splice; when `None`, the original chunked text-only prefill. Image embeddings are prefill-only, so every decode step stays on the plain `forward_logits` path and the shared decode loop is untouched. This keeps exactly one copy of the tool-call/reasoning logic to maintain. The Responses API streaming path (`responses_stream`) inherits vision for free since it drives the same `inference_stream`. Unit test covers `request_has_images` (the shared routing gate); the real-weights SSE smoke is the manual curl on beast (cuda-integration). Closes part of #16 (Stage C1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 13:57:02 +03:00
rob thijssen	577781de8d	fix(neuron): derive Clone on ImageInput for the CUDA vision dispatch All checks were successful CI / CUDA type-check (push) Successful in 32s Details CI / Format (push) Successful in 34s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Clippy (push) Successful in 2m47s Details build-prerelease / Build cortex binary (push) Successful in 4m34s Details CI / Test (push) Successful in 6m14s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 5m58s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details build-prerelease / Build neuron-ampere (push) Successful in 8m5s Details build-prerelease / Build neuron-ada (push) Successful in 8m9s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s Details CUDA type-check in CI failed on commit `24968e9` with E0308: error[E0308]: mismatched types --> crates/neuron/src/harness/candle.rs:1707:33 1707 \| images.clone(), \| ^^^^^^^^^^^^^^ expected `Vec<ImageInput>`, found `&Vec<ImageInput>` In Stage B5 the cuda branch of `chat_completion` matches `&vision_route` to keep the `vision_route: Option<...>` alive for both arms, which makes `images` bind as `&Vec<ImageInput>`. The subsequent `images.clone()` call doesn't deep-clone because `ImageInput` doesn't derive `Clone` — rustc falls back to cloning the `&Vec` reference, which has the wrong type for the worker job. The CPU build (non-cuda) compiled fine because that branch is behind `#[cfg(feature = "cuda")]`; the cuda-check job is what catches the regression. Fix: derive `Clone` on `ImageInput`. The clone cost is one pixel-buffer memcpy per image (~2.4 MiB at fixed 448×448), which is fine on the chat-completion hot path — vision requests are rare per second relative to text-only decode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 15:51:57 +03:00
rob thijssen	24968e9233	feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6 Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 33s Details CI / CUDA type-check (push) Failing after 46s Details CI / Clippy (push) Successful in 2m37s Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Build neuron-blackwell (push) Failing after 5m35s Details CI / Test (push) Successful in 6m40s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ampere (push) Failing after 7m46s Details build-prerelease / Package cortex RPM (push) Successful in 1m22s Details build-prerelease / Build neuron-ada (push) Failing after 4m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Stage B of the vision plan (doc/vision-qwen3_6-spec.md). Wires the vision tower from Stage A through to a complete non-streaming chat completion: extract images from the request, preprocess, encode on the worker thread, splice embeddings into the LM input at `<\|image_pad\|>` positions, return coherent text response with `prompt_tokens` reflecting patch tokens. Closes the silent-drop class of failures from issue #3 — vision requests against Qwen3.6 now condition the model on the image instead of producing confident text-only hallucinations. Streaming for vision is Stage C. Deferred items tracked under #12 (TP-vision), #13 (27B production), #14 (dynamic resolution), #15 (numerical validation). What landed: - B1 — `Qwen3_5Model::forward_with_vision`: text-only `forward` unchanged; new method takes `(input_ids, offset, image_embeds, image_token_id)`, embeds tokens, locates `image_token_id` positions, splices via the new `splice_runs` helper. MRoPE applies text-positions to image tokens for Stage B (spatial MRoPE is the issue #15 numerical-validation follow-up). 2 unit tests for `splice_runs` covering contiguous + non-contiguous runs. - B2 — `ModelArch::forward_with_vision` dispatch: routes Qwen3_5Dense to the new method; other arches return an error. Defence-in-depth — the HTTP layer (B6) already rejects image content for non-vision models. - B3 — `Job::ForwardLogitsWithImages`: new worker variant carrying tokens + per-image `(pixels, c, h, w)` payloads. The dispatcher encodes each image (device-resident), concatenates the resulting embeddings, calls `arch.forward_with_vision`, and returns CPU logits. Image embeddings never copy back to CPU — the "tensors don't escape the worker" invariant from the per-device worker refactor still holds. Poisoned-worker drain path handles the new variant. - B4 — Prompt builder: - `request_has_images` detects image content cheaply. - `extract_images_from_request(request, profile)` walks `MessageContent::Parts`, decodes data URIs, runs `harness::preprocess::preprocess` per image, returns `Vec<ImageInput>` in request order. - `expand_image_pad_tokens(input_ids, image_token_id, patches_per_image)` walks the tokenized prompt and replaces each `<\|image_pad\|>` (id 248056 for Qwen3.6) with N copies matching the per-image patch count. 4 unit tests. - `VisionMeta::from_config_path` peeks `config.json` at load time for `image_token_id`, vision_config patch/merge sizes, and derives `lm_tokens_per_image` for the Stage B fixed resolution. - B5 — `chat_completion` vision routing: detects image content, validates the loaded model has vision, expands the prompt, and calls a new `run_inference_with_images_via_worker` helper that does single-shot prefill + standard decode loop (KV cache holds the post-splice hidden states from prefill, so decode steps don't re-splice). Stage B skips chunked prefill for vision — at 448×448 fixed resolution the budget stays well under the activation-memory threshold. Long-vision chunking is Stage D follow-up. - B6 — `InferenceError::VisionUnsupported`: structured 400 with `code=vision_unsupported, model_id, suggestion` when an image request hits a non-vision model. Closes the agent0 failure mode where vision requests degraded silently. - B7 — `ModelInfo.capabilities`: per-model array (`["text"]` vs `["text", "vision"]`) in `/v1/models` and forwarded verbatim by cortex-gateway. Lets clients (litellm, agent0) gate image_url submission on the declared capability set. Optional in the wire format; defaults to empty for older clients. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, 124 lib tests). New unit-test counts: +2 splice_runs, +4 expand_image_pad. Manual verification (after RPMs deploy on beast): curl http://hanzalova.internal:31313/v1/chat/completions \ -H 'Content-Type: application/json' \ -d "{\"model\":\"Qwen/Qwen3.6-27B\", \"messages\":[{\"role\":\"user\",\"content\":[ {\"type\":\"text\",\"text\":\"What's in this image?\"}, {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,...\"}} ]}], \"max_tokens\":120}" \| jq Expect prompt_tokens > 196 (text + 196 patch tokens) and a response that references actual image content. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 15:33:00 +03:00
rob thijssen	7df84fed8f	feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6 All checks were successful CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 28s Details CI / Clippy (push) Successful in 2m35s Details build-prerelease / Build cortex binary (push) Successful in 5m13s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m23s Details build-prerelease / Build neuron-ampere (push) Successful in 7m56s Details CI / Test (push) Successful in 7m11s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m19s Details build-prerelease / Build neuron-ada (push) Successful in 5m30s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m25s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s Details Stage A of the vision implementation plan (doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding that today's silent-drop failure mode (issue #3) needs — the Qwen3.6 ViT loads from `model.visual.`, runs forward producing post-merger LM-side image embeddings, and routes through the device worker via a new `Job::EncodeImage`. No LM splice yet — that's Stage B. Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision), #13 (27B production deploy), #14 (dynamic resolution), #15 (numerical validation). What landed: - A0 — investigation: pulled config.json, preprocessor_config.json, chat_template.jinja, and safetensors index from beast's local Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with exact tensor shapes for every `model.visual.` weight. Confirms 27-block ViT with `hidden_size=1152`, `patch_size=16`, `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives in 2 of the 15 safetensors shards. - A1 — deps + scaffolding: added `image = "0.25"` (default- features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to crates/neuron/Cargo.toml. Created `harness::preprocess` and `harness::arch::qwen3_5::vision` modules. - A2 — preprocess.rs: `decode_data_uri` strips `data:image/...;base64,...` → image bytes → `image::DynamicImage` (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess` resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448), normalises to `[-1, 1]` per the model's mean/std=0.5, emits row-major `(3, H, W)` f32. 9 unit tests covering data URI parse, decode failure paths, grayscale-to-RGB promotion, and the exact-value normalisation contract. - A3 — vision.rs: `VisionTower` struct with `patch_embed: Conv2d`, learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN + multi-head self-attention with fused QKV + GELU-tanh MLP + residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat → linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size). Includes the Conv3d→Conv2d fold trick documented at the top of the file — the published patch_embed.proj.weight is 5D `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static images we sum-collapse the temporal axis. Video would need real Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh` reference values from PyTorch. - A4 — wire vision into Qwen3_5ForCausalLM: extended `Config` with optional `vision_config: Option<VisionConfig>` and `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision tower when present, exposes `has_vision()` and `vision()` so the HTTP layer can advertise capability and so the encode path can reach it. - A5 — device worker `Job::EncodeImage`: new job variant carrying CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the tensor on the worker's device, calls `arch.encode_image(image)`, copies the result back to CPU as flat `Vec<f32>`. Keeps the "tensors don't escape the worker" invariant. Poisoned-worker drain path handles the new variant. - A6 — dispatch round-trip test: `encode_image_routes_to_dispatch_ and_errors_on_unknown_handle` proves the channel/dispatch wiring works end-to-end via the CPU device worker (errors on unknown ArchHandle, which is the expected behaviour without a loaded model — real-weights validation happens in Stage B when the LM splice path exists). CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, zero failures). New test counts: +9 in preprocess, +5 in vision, +1 in device_worker. Out of scope (deferred): - LM-side splice of image embeddings at `<\|image_pad\|>` positions → Stage B. - Streaming SSE for vision-bearing chat completions → Stage C. - Reject `image_url` with HTTP 400 for non-vision models / advertise `capabilities` in /v1/models → Stage C. - TP-vision (#12), 27B production deploy (#13), dynamic resolution (#14), numerical validation (#15). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 11:40:47 +03:00
rob thijssen	d4e1b05956	feat(neuron,cortex-core): source-aware loader (scheme:org/name) All checks were successful CI / CUDA type-check (push) Successful in 46s Details CI / Format (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 42s Details CI / Clippy (push) Successful in 2m40s Details build-prerelease / Build cortex binary (push) Successful in 4m23s Details CI / Test (push) Successful in 5m28s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 5m39s Details build-prerelease / Package cortex RPM (push) Successful in 1m19s Details build-prerelease / Build neuron-ampere (push) Successful in 7m53s Details build-prerelease / Build neuron-ada (push) Successful in 5m18s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m59s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s Details Phase 1 of plan-source-aware-loader-preflight. Makes neuron's loader treat `huggingface:org/name` and `helexa:org/name` as first-class distinct sources with per-source endpoint + cache, while staying backwards-compatible with bare `org/name` ids. Zero behavior change for existing operator configs. Motivation: helexa is adding an EU-hosted registry (`registry.helexa.ai`) alongside HF. Both speak HF-compatible wire format, but the bytes, jurisdiction, trust root, and cache namespace are distinct. The loader needs to disambiguate which registry serves a given model id, and to keep their caches from colliding on disk when both happen to host the same `org/name`. What lands: - `cortex-core::source` — new module. `ModelSourceId { scheme, org, name }` with `FromStr` accepting both `scheme:org/name` and bare `org/name`. `Display` round-trips. `repo_path()` emits the `org/name` half for the hf-hub `Api::model(...)` call regardless of which scheme/endpoint we're hitting. Rejects malformed input with typed `ParseError` variants (empty scheme, missing slash, scheme with `/`, name with `:`, etc.). - `neuron::config::CandleHarnessConfig` gains `default_source: Option<String>` and `sources: HashMap<String, SourceConfig>`. `SourceConfig` mirrors what `hf_hub::ApiBuilder` consumes: endpoint URL, optional `auth_env` (env var name read at startup so secrets stay out of TOML), and optional cache_dir. Defaults synthesise a `huggingface` entry pointing at `https://huggingface.co` with the legacy `hf_cache` field as its cache_dir — so existing configs that only set `hf_cache` keep working unchanged. - `CandleHarness::new(bind_url, &CandleHarnessConfig)` replaces `CandleHarness::new(bind_url, hf_cache)`. Resolves every configured source's auth env var and cache dir up front so `hf_api_for(scheme)` is a pure HashMap lookup on the hot load path. Only the `huggingface` scheme gets the legacy `HF_HUB_CACHE`/`HF_HOME` env-var fallback chain; other schemes resolve to whatever the operator typed. - `hf_api()` -> `hf_api_for(scheme)`. Builds an `hf_hub::Api` with the source's endpoint, cache_dir, and auth token. Errors with a useful message naming the configured schemes when an unknown scheme is requested. - `CandleHarness::load_model` parses `spec.model_id` into a `ModelSourceId`, substitutes `default_source` for bare ids, and threads the parsed source through `preflight`, `resolve_files`, `resolve_dense_files`, `load_arch_gguf`, `load_arch_dense`, and `load_tp`. The hf-hub `Api::model()` call now uses `source_id.repo_path()` so registry calls hit the right URL shape regardless of scheme. - `preflight()` signature gains a `&ModelSourceId` parameter (it's the canonical id for log lines and error display); `RepoFetchFailed.model_id` etc. now carry the scheme-qualified form so operator-visible errors echo exactly what was configured. - `neuron.example.toml` documents the new `[harness.candle.sources.*]` table with commented-out examples for `huggingface` (explicit override) and `helexa`. Tests: - 13 new unit tests in `cortex-core::source` covering parse / display round-trip, default-scheme substitution semantics, and every `ParseError` variant. - 6 new unit tests in `neuron::config` covering the `effective_sources` synth (legacy `hf_cache` carry-through, explicit override preservation, helexa-alongside-huggingface) and `effective_default_source` fallback. - 2 new unit tests in `harness::candle::tests` covering multi-scheme `hf_api_for` routing, including the "unknown scheme" error path naming configured schemes. - Preflight integration tests updated to construct `ModelSourceId` and assert against the scheme-qualified error form. CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 24 test groups ok, zero failures). Out of scope (Phase 3): - Cortex catalogue `source` field — independent of Phase 1+2, ships when the registry comes online. - `helexa` source endpoint itself — separate project; this PR adds the client-side rails only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:42:11 +03:00
rob thijssen	61adff347a	feat(neuron): preflight placement check with structured errors Some checks failed CI / CUDA type-check (push) Successful in 31s Details CI / Format (push) Successful in 30s Details build-prerelease / Resolve version stamps (push) Successful in 48s Details CI / Test (push) Failing after 1m10s Details CI / Clippy (push) Successful in 2m49s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m25s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m53s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 8m0s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details Phase 2 of plan-source-aware-loader-preflight. Adds a one-RTT placement feasibility check that runs before any device allocation, NCCL handshake, or weight fetch. Replaces today's opaque "fetch config.json … 404" failure mode (when an operator points `tensor_parallel = 2` at a GGUF-only repo) with a structured error that names the failure class and points at the fix. What lands: - `crates/neuron/src/harness/preflight.rs` — new module. Classifies a repo's siblings listing into `SourceFormat` (Gguf \| DenseSafetensors \| Mixed \| Empty), applies the tp/quant feasibility table, returns a `PlacementPlan` on success or a typed `PreflightError` on rejection. `PreflightError` is `serde::Serialize` so the HTTP layer can emit the structured shape verbatim; it's `thiserror::Error` so log lines get a single-line Display when downcasting from anyhow. Includes best-effort Levenshtein-nearest suggestion for malformed quant names (the second sharp edge the HauhauCS scenario surfaced — operator writes `q6k` against filenames containing `Q6_K_P`, and today's matcher just says "no GGUF file matching quant"). - `CandleHarness::load_model` — calls `preflight(...)` first thing after the "already loaded" guard, before any `ensure_device_worker` or `resolve_*`. Failure wraps the typed error in `anyhow::Error` so the existing trait surface is unchanged; the HTTP handler and the startup logger downcast to recover the structured form. - `crates/neuron/src/api.rs::load_model` handler — maps `PreflightError` to 422 Unprocessable Entity with `{"error": {"kind": "...", "model_id": "...", "suggestion": "..." }}`. Other failures keep the existing 400 + free-form `format!("{e:#}")` shape. - `crates/neuron/src/startup.rs::load_default_models` — when the failure is a preflight rejection, log as `reason=<kind> detail=<msg>` instead of the opaque `error=<chain>`, so journalctl on beast will now show `reason=tp_requires_safetensors detail="repo is GGUF-only (8 .gguf files); TP requires dense safetensors..."` instead of `error=fetch config.json from HauhauCS/...: 404 Not Found`. Tests: - 18 unit tests in `harness/preflight.rs` covering classifier, quant matching, Levenshtein, error serialization, and the full feasibility table (gguf+tp rejected, gguf+bad-quant suggests nearest, gguf+good-quant ok, dense+tp ok, empty rejected, mixed prefers safetensors). - 7 integration tests in `tests/preflight.rs` exercising the network path through an axum mock that serves hf-hub-compatible `/api/models/{org}/{name}/revision/main` payloads. Adds `tempfile` as a dev-dependency for per-test cache dirs. Out of scope (deferred to subsequent phases): - Phase 1 (source-aware loader plumbing — `scheme:org/name` parsing, per-scheme `SourceConfig`, cache disambiguation). Preflight runs against the single configured HuggingFace source today; the scheme threading lands cleanly when Phase 1 ships. - Phase 3 (cortex catalogue source field). - GGUF tensor-parallel loading. Preflight rejects this combination with `TpRequiresSafetensors`; the underlying loader gap is the separate `Helexa` curated-registry / heretic-rs conversation. Refs #4-#9 architectural follow-up; no specific issue closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 13:24:30 +03:00
rob thijssen	435fd10902	fix(neuron): macro-ify CUDA single-GPU route_token so DecodeStream type stays inferred All checks were successful CI / CUDA type-check (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 29s Details CI / Format (push) Successful in 29s Details CI / Clippy (push) Successful in 2m47s Details build-prerelease / Build cortex binary (push) Successful in 4m27s Details CI / Test (push) Successful in 5m40s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m47s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 8m30s Details build-prerelease / Build neuron-ada (push) Successful in 5m39s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m11s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 4m1s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s Details Prerelease build (run 270) failed on commit `cb30383` with: error[E0107]: struct takes 5 generic arguments but 0 generic arguments were supplied --> crates/neuron/src/harness/candle.rs:3554:41 \| 3554 \| decode_stream: &mut tokenizers::DecodeStream<'_>, \| ^^^^^^^^^^^^ The Step-2-era refactor for #6's tool-call extraction added a nested `async fn route_token` inside `stream_inference_via_worker` that named `tokenizers::DecodeStream<'_>` as a parameter type. `DecodeStream` actually has five generic parameters (`'tok, M, N, PT, PP, D`) which makes naming it explicitly painful — the working approach the CPU path uses is a macro, where the body expands inline at the call site and the decoder type stays inferred. This commit replicates the CPU-side macro for the CUDA worker path. Same shape, just with `.await` calls inside (macros tolerate that since they expand inline into the enclosing async context). Control flow uses a labelled-block + `consumer_alive` flag rather than `return` so the macro stays generic over the surrounding return type. The CPU build (default-feature workspace, what `clippy` and `test` jobs exercise) doesn't compile this `#[cfg(feature = "cuda")]` branch, which is why local CI green-lit it. The cuda-check job should catch this category of breakage now that #cb30383+CI-fix landed; this commit just resolves the actual breakage on the prerelease workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 08:59:56 +03:00
rob thijssen	cb303832bc	feat(neuron): render the model's chat_template with chat_template_kwargs Some checks failed CI / CUDA type-check (push) Failing after 58s Details build-prerelease / Resolve version stamps (push) Successful in 39s Details CI / Format (push) Successful in 40s Details build-prerelease / Build neuron-ampere (push) Failing after 1s Details CI / Clippy (push) Successful in 2m37s Details build-prerelease / Build cortex binary (push) Successful in 4m47s Details CI / Test (push) Successful in 6m13s Details build-prerelease / Build neuron-blackwell (push) Failing after 5m34s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m27s Details build-prerelease / Build neuron-ada (push) Failing after 7m20s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Closes #9. Replaces the hardcoded `format_qwen3_prompt` ChatML glue with `minijinja`-driven rendering of the model's own `chat_template` from `tokenizer_config.json`. The request's `chat_template_kwargs` flow into the Jinja context so model-specific levers (Qwen3's `enable_thinking: false`, etc.) actually take effect. ## Implementation - New `harness::chat_template` module with three entry points: - `load_chat_template_alongside(tokenizer_json_path)` — probes `tokenizer_config.json` in the same hf-hub snapshot directory. Supports both the canonical string-form `chat_template` and the array-form some tokenizers ship (multi-template models). - `render_chat_template(template, messages, tools, kwargs)` — renders via `minijinja`. Messages flatten into the `[{role, content}]` shape HF templates iterate, with per-message extras (`tool_calls`, `tool_call_id`) preserved. `tools` and `kwargs` add into the Jinja context so templates that reference them work without us interpreting their shape. - `chat_templates_enabled()` reads `NEURON_USE_CHAT_TEMPLATE` (default true). Falsy values force the fallback path everywhere — a kill switch for emergency rollback without a rebuild. - `LoadedModel.chat_template: Option<String>` and the TP equivalent are populated once at load time. `None` (no tokenizer_config.json, parse error, missing field) routes the fallback path silently; logs go through `tracing::debug`/`warn` per condition. - New `build_prompt_for_request(chat_template, request)` wraps the decision: when both the template is present AND the kill switch is off, render with kwargs from `request.extra` (looks up `chat_template_kwargs` and `tools` lazily). On render error → warn + fallback to `format_qwen3_prompt`. Wired into all four current prompt-build sites (single-GPU stream + non-stream, TP stream + non-stream). ## Dependency `minijinja = "2"` with the `builtins`, `json`, and `serde` features. Pure-Rust Jinja2 implementation, ~80KB compiled. Used internally by HF's `tokenizers-rs` for its own chat templating; the API surface we touch (`Environment::add_template` + `Template::render(serde_value)`) is stable. ## Validation strategy I can't byte-compare the new path's output against `format_qwen3_prompt` for live models without GPU (CI doesn't have one). The fallback path and kill switch are the mitigations — a deploy can flip `NEURON_USE_CHAT_TEMPLATE=false` in the neuron service env if the chat template renders surprisingly on Qwen3-8B in production. The legacy formatter stays the fail-closed default. ## Scope cuts (documented in module header) - Tool-definition lifting from helexa-acp's system-prompt injection into the chat_template's native tools block is deferred. Today the request's `tools` array threads into the Jinja context, but helexa-acp continues to inject Hermes-format tool descriptions into the system prompt for backwards-compat with non-cortex endpoints. ## Tests 9 unit tests in `chat_template`: kill-switch matrix (truthy / falsy / unset), template loading (string form, array form, missing file, unparseable JSON, missing field), rendering (basic conversation threading, kwargs forwarding, message-extras threading for tool_calls). 215 workspace tests pass; clippy + fmt clean across all workspace features (default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:43:11 +03:00
rob thijssen	44008358c5	feat(neuron): emit response.in_progress between created and output_item.added Some checks failed build-prerelease / Resolve version stamps (push) Successful in 40s Details CI / Format (push) Successful in 44s Details CI / Test (push) Failing after 1m5s Details CI / Clippy (push) Successful in 2m36s Details CI / CUDA type-check (push) Failing after 52s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-blackwell (push) Failing after 5m42s Details build-prerelease / Build neuron-ampere (push) Failing after 7m14s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details Refs #7. OpenAI's Responses API spec emits `response.in_progress` between `response.created` and the first output-item event to mark "request validated, model is generating". Some Responses-API clients distinguish loading-spinner vs streaming-spinner UI based on which event arrived last; emitting both keeps the wire shape matched. Carries the same shell as `response.created` (status=in_progress, empty output, no usage yet) — both events are payload-light bookkeeping, distinguished only by the event name. The hosted-tool event families remaining in #7 (web_search_call, code_interpreter_call, file_search_call, image_generation_call) stay deferred until the underlying tools exist in neuron. Updated `full_stream_emits_expected_event_sequence` to assert the new event lands in position 1; downstream indexing shifted by one across the existing test assertions. CI green, fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:30:34 +03:00
rob thijssen	fc9a8c42a3	feat(neuron): extract `<tool_call>` blocks to structured tool_calls deltas Some checks failed build-prerelease / Build cortex binary (push) Blocked by required conditions Details CI / Clippy (push) Waiting to run Details CI / Test (push) Waiting to run Details CI / CUDA type-check (push) Failing after 17s Details build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 32s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package cortex RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-blackwell (push) Has been cancelled Details CI / Build cortex SRPM (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details CI / Build neuron SRPM (push) Has been cancelled Details CI / Publish cortex to COPR (push) Has been cancelled Details CI / Publish neuron to COPR (push) Has been cancelled Details CI / Bump version in source (push) Has been cancelled Details Closes #6. Same model-agnostic seam as #8 but for tool-call markers (`<tool_call>` / `</tool_call>` on Qwen3-Coder, Hermes-format, DeepSeek-Coder, gpt-oss, …). Lets Zed's tool-use feature and any other vanilla OpenAI chat client get structured `tool_calls` deltas out of cortex without having to parse markers themselves. ## Implementation 1. Tokenizer probe at load time (`detect_tool_call_token_pair` in `wire::event`) — same shape as the reasoning-marker probe from #8. Both open AND close must resolve to single token ids; non-tool-use models get `None` and pass through unchanged. Stored on `LoadedModel.tool_call_tokens` and the TP analogue. 2. New `InferenceEvent::ToolCall` variant — carries `index` (call slot, per-turn counter), generated `id` (`call_<hex>_<idx>`), `name`, and the complete `arguments` JSON string. One event per parsed call. 3. Token-level state machine in all three streaming paths (CPU `run_inference_streaming`, CUDA single-GPU `stream_inference_via_worker`, CUDA TP `chat_completion_tp_stream`) layered on top of #8's reasoning routing: - `<tool_call>` token → enter buffering state, clear buffer. - Tokens while buffering → accumulate into `tool_call_buf` via the decoder (so multi-byte UTF-8 still buffers correctly) without emitting anything visible. - `</tool_call>` token → take the buffer, parse with `parse_tool_call_body` (extract `name` + `arguments`), emit a structured `ToolCall` event with a fresh `call_<hex>` id and the parsed fields. - On parse failure → fall back to re-emitting the original `<tool_call>{buf}</tool_call>` block as plain text content so helexa-acp's existing `ToolCallParser` repair passes still have a chance to recover the call. 4. OpenAI chat projector emits the OpenAI streaming `tool_calls` delta shape on `InferenceEvent::ToolCall` — `{tool_calls: [{index, id, type:"function", function:{name, arguments}}]}`. One chunk per call slot. 5. OpenAI Responses projector drops `ToolCall` events for now (Responses-side function_call event family routing tracked under #7); the chat path is what unblocks Zed's tool use today. ## Acceptance - Vanilla OpenAI chat clients (Zed's tool-use feature, any other OpenAI-compatible tool-call consumer) get structured tool_calls deltas against cortex+neuron without having to parse `<tool_call>` markers in content. - helexa-acp continues to work — when neuron parses cleanly, it consumes the structured deltas through its existing decoder. When the model emits malformed JSON, neuron falls back to text pass-through and helexa-acp's `ToolCallParser` recovers via the same path it always did. - Models without tool-call markers in their tokenizer pass through unchanged. - No hardcoded model knowledge — entirely driven by tokenizer metadata. ## Tests 2 new detection tests in `wire::event` (Qwen3-style marker detection, no-marker case). The streaming paths themselves stay covered by the existing chat-completions integration tests; full end-to-end exercise of the new path requires GPU-loaded models and lives outside the CI test surface. 215 workspace tests pass; clippy + fmt clean across the workspace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 23:26:31 +03:00
rob thijssen	7733eecba5	feat(neuron): strip reasoning from chat completions by default Some checks failed CI / CUDA type-check (push) Failing after 18s Details build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Format (push) Successful in 32s Details CI / Clippy (push) Successful in 2m36s Details build-prerelease / Build cortex binary (push) Successful in 4m29s Details CI / Test (push) Successful in 5m19s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 5m56s Details build-prerelease / Package cortex RPM (push) Successful in 1m21s Details build-prerelease / Build neuron-ampere (push) Successful in 7m45s Details build-prerelease / Build neuron-ada (push) Successful in 5m24s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m53s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s Details Closes #8. Reasoning-capable models (Qwen3, DeepSeek-R1, gpt-oss, Mistral Magistral, …) emit `<think>...</think>` blocks inline in their content stream. The chat-completions wire format has no slot for reasoning, so until this change every consumer either parsed the markers themselves (helexa-acp) or wrote the raw scratchpad content into their UI (Zed's commit-message generator — visible as the leaked reasoning block on every generated commit message against benjy's Qwen3-8B). ## Implementation, model-agnostic by design The neuron side now does token-level routing without any hardcoded model knowledge: 1. At load time (`detect_reasoning_token_pair` in `wire::event`), probe the tokenizer's vocabulary for a known reasoning-marker pair: `<think>` / `</think>` (Qwen3, DeepSeek-R1, gpt-oss), `[THINK]` / `[/THINK]` (Mistral Magistral), and a couple of derivatives. Each marker must resolve to a single token id; if both open and close resolve, stash on `LoadedModel.reasoning_tokens` (similarly `TpLoadedModel`). Non-reasoning models get `None` and pass through unchanged. 2. At inference time, the three streaming paths (`run_inference_streaming` CPU, `stream_inference_via_worker` CUDA single-GPU, `chat_completion_tp_stream` CUDA TP) now check each sampled token against the pair via the new `handle_reasoning_marker` helper before feeding it to the detokeniser. Open marker → set `in_reasoning = true`, drop the marker. Close marker → unset, drop. Other tokens go through `emit_delta(_blocking)` which now picks `ReasoningDelta` or `TextDelta` based on state. Markers never appear in the streamed output. 3. In `wire::openai_chat`, the projector splits into: - `project_chat_stream` (unchanged signature; default behaviour — drops `ReasoningDelta`) - `project_chat_stream_with(rx, …, ChatProjectionConfig)` — when `include_thinking: true` and `reasoning_markers: Some(_)`, re-wraps reasoning content with the literal open/close marker text and emits as content deltas. Preserves the on-the-wire shape that helexa-acp's `ThinkParser` expects. 4. HTTP handler reads `x-include-thinking: true` (case- insensitive `1`/`true`/`yes`) from the request headers and threads it into the projection config. cortex-gateway already forwards arbitrary headers verbatim, so the opt-in works end-to-end without gateway changes. 5. helexa-acp's `openai_chat` provider sets `x-include-thinking: true` on every request so its existing `ThinkParser` keeps receiving the marked content stream. `ThinkParser` itself is unchanged — needed for endpoints that aren't reasoning-aware (OpenRouter, OpenAI directly, etc.). ## Acceptance - Zed's commit-message generator (vanilla chat-completions client, no `x-include-thinking`) gets clean commit messages with no `<think>` block. - helexa-acp sessions continue to render thinking in Zed's thought UI via the opt-in path. - Models without reasoning tokens declared in their tokenizer pass through unchanged. - Implementation contains zero references to "qwen3" or any specific model — entirely driven by tokenizer metadata. ## Tests 9 new tests in `wire::event` (token-pair detection across 4 marker conventions, edge cases) and `wire::openai_chat` (default drop, opt-in re-wrap with multi-chunk reasoning, close-marker on Finish, fallback when markers absent, off-switch with markers present). All 213 workspace tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 17:55:04 +03:00
rob thijssen	957f704efa	feat(neuron): OpenAI Responses API + ci cuda-check runner label Some checks failed build-prerelease / Package cortex RPM (push) Blocked by required conditions Details CI / CUDA type-check (push) Failing after 11s Details build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 32s Details CI / Clippy (push) Successful in 2m31s Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details CI / Test (push) Successful in 5m42s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 6m8s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text \| items), `ResponsesInputItem` (message \| function_call \| function_call_output \| reasoning), `ResponsesContentPart` (input_text \| input_image \| output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit `1859777`: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:13:44 +03:00
rob thijssen	6927286cab	fix(neuron): clone id/model_id before TP spawn so wire projector can use them Some checks failed build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions Details build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions Details CI / Format (push) Successful in 39s Details build-prerelease / Resolve version stamps (push) Successful in 40s Details CI / Clippy (push) Successful in 2m34s Details CI / Test (push) Successful in 5m40s Details build-prerelease / Build cortex binary (push) Successful in 5m16s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 5m49s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Build neuron-ampere (push) Successful in 7m38s Details build-prerelease / Build neuron-ada (push) Successful in 5m34s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details The Step 1 refactor moved the InferenceEvent receiver wrap to after the orchestration spawn in chat_completion_tp_stream, but the spawn moves both `id` and `model_id` into its async closure (used heavily by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck error E0382 use-of-moved-value on the wire_chat::project_chat_stream call. The non-CUDA build doesn't exercise this branch (it lives behind `#[cfg(feature = "cuda")]`) which is why the workspace clippy/test gate passed locally and on the regular CI workflow. The RPM build workflow, which compiles with --features cuda, caught it (run 244 jobs 2/3/4 against beast / ampere / ada respectively, all the same error). Fix: snapshot `id` and `model_id` into `projector_id` / `projector_model_id` before the spawn, use those at the projector call site. The originals stay free to be moved into the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 09:37:10 +03:00
rob thijssen	302ccfb982	refactor(neuron): introduce InferenceEvent + wire projection layer Some checks failed build-prerelease / Resolve version stamps (push) Successful in 31s Details CI / Format (push) Successful in 38s Details CI / Clippy (push) Successful in 3m28s Details build-prerelease / Build neuron-blackwell (push) Failing after 6m4s Details build-prerelease / Build neuron-ampere (push) Failing after 7m20s Details CI / Test (push) Successful in 7m29s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-ada (push) Failing after 4m57s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m19s Details build-prerelease / Package cortex RPM (push) Successful in 1m24s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped Details Step 1 of the OpenAI Responses API rollout. Pure refactor — no new endpoints, no behaviour change on the wire. Lays the seam for emitting Responses-shaped streaming events from the same harness output as chat completions in Step 2. - New `neuron::wire` module tree: - `wire::event::InferenceEvent` — format-agnostic enum (Start, TextDelta, ReasoningDelta, Finish) the candle harness now emits as its native streaming currency. - `wire::event::FinishReason` — typed reason that maps cleanly onto OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic `stop_reason` strings. - `wire::openai_chat::project_chat_stream` — async task that consumes an InferenceEvent receiver and produces a ChatCompletionChunk receiver, stamping per-request metadata (id, created, model_id) onto every chunk. Output matches the pre-refactor wire shape bit-for-bit. - candle.rs refactored to emit InferenceEvent on its internal channel through all three streaming paths (CPU run_inference_streaming, CUDA single-GPU stream_inference_via_worker, CUDA TP chat_completion_tp_stream). The streaming functions lost their id/created/model_id parameters since wire-format metadata now lives in the projector. - emit_delta + emit_delta_blocking simplified to single-purpose TextDelta emitters with no wire-format coupling. - chat_completion_stream wraps the InferenceEvent receiver in wire_chat::project_chat_stream before returning so the /v1/chat/completions HTTP handler keeps consuming ChatCompletionChunks unchanged. External signature preserved. Also fixes a pre-existing helexa-acp test race (three modules each declared their own static LOCK for HOME mutation, so cross-module parallelism flaked tests that read HOME at runtime). Consolidated onto a single crate-wide path_util::ENV_LOCK. 122 helexa-acp tests + 44 neuron tests pass (5 new wire projection tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp suite 3x to confirm the env race is closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:30:17 +03:00
rob thijssen	abbedf8d8a	chore(neuron): bump default max_tokens from 512 to 8192 All checks were successful build-prerelease / Resolve version stamps (push) Successful in 44s Details CI / Format (push) Successful in 45s Details CI / Clippy (push) Successful in 2m41s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m35s Details build-prerelease / Build cortex binary (push) Successful in 4m32s Details CI / Test (push) Successful in 5m29s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 8m6s Details build-prerelease / Build neuron-ada (push) Successful in 5m19s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details 512 is too low for any modern coding model — clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect. Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:38:28 +03:00
rob thijssen	e267f583e1	chore(neuron): rustfmt drift in is_device_fault test Some checks failed CI / Format (push) Successful in 32s Details build-prerelease / Resolve version stamps (push) Successful in 58s Details CI / Clippy (push) Failing after 3m43s Details CI / Test (push) Successful in 5m29s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m48s Details build-prerelease / Build neuron-blackwell (push) Successful in 6m10s Details build-prerelease / Package cortex RPM (push) Successful in 1m32s Details build-prerelease / Build neuron-ampere (push) Successful in 7m41s Details build-prerelease / Build neuron-ada (push) Successful in 5m17s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 9m18s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s Details One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:55 +03:00
rob thijssen	249b2e5c98	fix(neuron): only poison the model on actual device faults Some checks failed build-prerelease / Resolve version stamps (push) Successful in 38s Details CI / Clippy (push) Successful in 2m22s Details CI / Test (push) Successful in 4m55s Details build-prerelease / Build cortex binary (push) Successful in 4m24s Details build-prerelease / Build neuron-blackwell (push) Successful in 5m49s Details build-prerelease / Package cortex RPM (push) Successful in 1m23s Details build-prerelease / Build neuron-ampere (push) Successful in 8m7s Details build-prerelease / Build neuron-ada (push) Successful in 5m0s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s Details CI / Format (push) Failing after 33s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details Previously every inference Err — shape mismatch, NaN logits, tokenizer error, missing handle — marked the model poisoned and rejected every subsequent request until an operator unload+reloaded. The benjy incident on 2026-05-27 showed how this misfires: a concurrency bug produced a `broadcast_add: shape mismatch` error that had nothing to do with CUDA, but the model was taken down anyway. Add `is_device_fault(err_chain: &str)` — a conservative classifier that returns false only for errors we know are pre-kernel / CPU-side (shape mismatches, NaN logits, tokenize/detokenize, missing handle, DecodeStream, empty prompt). Everything else defaults to true so a genuine driver fault still poisons. Applied at all six poisoning sites: - chat_completion CUDA worker path - chat_completion CPU spawn_blocking path - chat_completion_stream CUDA worker path - chat_completion_stream CPU spawn_blocking path - chat_completion_tp non-streaming wrapper - chat_completion_tp_stream spawned task Each site now logs either "model marked poisoned" (device fault) or "model NOT marked poisoned" (non-device) so the journal makes the classification visible. Tests cover the known non-device patterns and a couple of real CUDA driver messages. Pairs with the inference_lock commit (`c59da83`): together they eliminate both the cause of the spurious-poisoning we just observed (the shape mismatch) AND the over-reaction to it (the unconditional poison). Each fix is independently useful but the combination is what makes the system actually robust to concurrent agent workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:57:48 +03:00
rob thijssen	c59da83636	fix(neuron): serialise single-GPU inference per loaded model Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:54:04 +03:00
rob thijssen	f05882369d	fix(neuron): don't poison the model on tokio JoinError panics All checks were successful build-prerelease / Resolve version stamps (push) Successful in 33s Details CI / Format (push) Successful in 34s Details CI / Clippy (push) Successful in 2m18s Details build-prerelease / Build cortex binary (push) Successful in 4m28s Details build-prerelease / Package cortex RPM (push) Successful in 1m28s Details build-prerelease / Build neuron-ampere (push) Successful in 8m25s Details build-prerelease / Build neuron-ada (push) Successful in 8m54s Details CI / Test (push) Successful in 4m43s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m51s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s Details CUDA driver failures propagate as Err through `?` and become `Ok(Err(InferenceError::Other(_)))` from the spawned task — those are real device faults and still poison the model. Tokio JoinError is different: it fires on Rust-level panic (tokenizer bug, sampler bug, serialisation, the UTF-8 slice that landed in commit `bd04d7f` before the fix) or task cancellation. Those don't touch the device context, so failing the one request without tearing down the model is correct. Two sites changed: - chat_completion's CPU spawn_blocking handler — JoinError no longer sets loaded.poisoned. - chat_completion_tp's tokio::spawn wrapper — JoinError no longer sets tp_for_marker.poisoned. The inner-Err case still does. Each path logs the cause (panicked / was cancelled / ended abnormally) explicitly so the journal makes the new behaviour obvious — search for "model NOT marked poisoned" to find these events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:02:52 +03:00
rob thijssen	bd04d7f580	fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:01:24 +03:00
rob thijssen	1e13889392	feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks All checks were successful build-prerelease / Resolve version stamps (push) Successful in 34s Details CI / Format (push) Successful in 36s Details CI / Clippy (push) Successful in 2m15s Details CI / Test (push) Successful in 5m9s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 5m1s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-blackwell (push) Successful in 11m7s Details build-prerelease / Build neuron-ampere (push) Successful in 12m16s Details build-prerelease / Build neuron-ada (push) Successful in 12m30s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s Details Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload cycle observed on beast under agent-zero workloads. Three changes, all keyed off env-driven knobs so an operator can tune without a rebuild: 1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The initial forward is split into N-token windows, each with a monotonically growing offset. KV cache accumulates across chunks exactly as it would under one big prefill; only the final chunk's logits are kept for sampling. Activation memory now scales with chunk size instead of prompt length, so a 13 k-token prompt stops holding tens of GB of intermediate activations live at once. Wired into all six prefill call sites: - run_inference / run_inference_streaming (CPU path) - run_inference_via_worker / stream_inference_via_worker (CUDA single-GPU through device worker) - chat_completion_tp_inner / chat_completion_tp_stream (TP via WorkerPool) Three helpers — chunked_prefill_local, chunked_prefill_via_worker, chunked_prefill_tp — own the loop shape so the chunking semantics stay identical across paths. Per-chunk debug log shows progress. 2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384). Requests above the cap return a structured 400 with `code: prompt_too_long` rather than going through the prefill and discovering the limit by OOMing partway through. New InferenceError::PromptTooLong variant. 3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500). If `vram_free_mb` is below the threshold at request start (e.g. another concurrent request is mid-prefill), reject with a clean 503 + `code: insufficient_vram` rather than starting work that will OOM. New InferenceError::InsufficientVram variant. CPU loads (vram=0 sentinel) skip this check. All three gates fire BEFORE any device work, so a rejected request costs ~one tokenisation pass and never touches the worker thread — poison cascades from rejected work are now impossible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:46:54 +03:00
rob thijssen	35876954cd	chore(neuron): default tracing filter to info (was info,neuron=debug) All checks were successful build-prerelease / Resolve version stamps (push) Successful in 30s Details CI / Format (push) Successful in 33s Details CI / Clippy (push) Successful in 2m17s Details CI / Test (push) Successful in 4m43s Details build-prerelease / Build cortex binary (push) Successful in 4m19s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m43s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 5m12s Details build-prerelease / Build neuron-ada (push) Successful in 5m25s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s Details Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:47:30 +03:00
rob thijssen	cdf0f4e66d	fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:13 +03:00
rob thijssen	b4f3576d82	refactor(neuron): phase 4 — model loads move onto the device worker All checks were successful build-prerelease / Resolve version stamps (push) Successful in 35s Details CI / Format (push) Successful in 37s Details CI / Clippy (push) Successful in 2m25s Details CI / Test (push) Successful in 4m40s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m51s Details build-prerelease / Build cortex binary (push) Successful in 4m21s Details build-prerelease / Package cortex RPM (push) Successful in 1m20s Details build-prerelease / Build neuron-ampere (push) Successful in 5m7s Details build-prerelease / Build neuron-ada (push) Successful in 5m19s Details build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s Details Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:24:38 +03:00
rob thijssen	76ab24d98c	refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker Some checks failed CI / Format (push) Successful in 29s Details build-prerelease / Resolve version stamps (push) Successful in 32s Details CI / Test (push) Failing after 58s Details CI / Clippy (push) Successful in 2m31s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m13s Details build-prerelease / Build neuron-blackwell (push) Successful in 4m1s Details build-prerelease / Package cortex RPM (push) Successful in 1m30s Details build-prerelease / Build neuron-ada (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ampere (push) Has been cancelled Details Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key. - `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl` / `Job::Tp` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. - `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:16:02 +03:00
rob thijssen	b179204fd3	refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker Some checks failed build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions Details CI / Format (push) Successful in 34s Details CI / Clippy (push) Successful in 2m12s Details build-prerelease / Resolve version stamps (push) Successful in 3m41s Details CI / Test (push) Successful in 5m1s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build neuron-blackwell (push) Successful in 3m32s Details build-prerelease / Build neuron-ampere (push) Successful in 5m20s Details build-prerelease / Build cortex binary (push) Successful in 12m20s Details build-prerelease / Build neuron-ada (push) Successful in 5m17s Details build-prerelease / Package cortex RPM (push) Successful in 1m25s Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details Second slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two spawn_blocking sites in `chat_completion` and `chat_completion_stream` now route through the device worker thread on CUDA loads. CPU loads keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's no context to own and the channel hop would only add latency. What this phase changes: - `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>` slab and a `next_handle` counter for minting opaque handles. - `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`, plus a new `arch_handle: Option<ArchHandle>` field. The two are mutually exclusive: CUDA loads set `arch_handle = Some(_)` after transferring the boxed arch into the worker's slab; CPU loads keep `arch = Some(_)` for the legacy spawn_blocking path. - New `run_inference_via_worker` and `stream_inference_via_worker` drive the prefill + decode loop by sending `Job::ForwardLogits` per step; the worker copies the resulting `[vocab]` logits to a CPU-side `Vec<f32>` before reply, so the async caller never holds a device-resident tensor. `apply_repeat_penalty` and `LogitsProcessor::sample` run on a CPU candle tensor; no context binding side-effects on tokio worker threads. - `logits_health_slice(&[f32])` complements the existing `logits_health(&Tensor)` so the new worker paths can compute health stats directly from the CPU vec. - `unload_model` for the single-GPU CUDA path now sends `Job::DropArch { handle }` to the worker so the `Box<ModelArch>` drops on the thread that allocated its CUDA tensors. The `Drop` runs with the bound context, freeing memory on the right context. What this phase doesn't touch (yet): - TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3. - Single-GPU model load — still spawn_blocking, followed by a `Job::TransferIn` to move the freshly-built `ModelArch` into the worker slab. Phase 4 moves the load itself onto the worker thread and eliminates the bootstrap TransferIn. - The `device_vram_mb` / `cuda_mem_mb` helpers — still present and used by the construction-time logs running inside spawn_blocking loads. Phase 4 cleanup folds them into `dispatch.rs`. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:55:08 +03:00
rob thijssen	081b532387	refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it Some checks failed CI / Format (push) Successful in 31s Details build-prerelease / Resolve version stamps (push) Successful in 36s Details CI / Clippy (push) Failing after 59s Details build-prerelease / Build neuron-blackwell (push) Successful in 3m30s Details CI / Test (push) Successful in 4m47s Details CI / Build cortex SRPM (push) Has been skipped Details CI / Publish cortex to COPR (push) Has been skipped Details CI / Build neuron SRPM (push) Has been skipped Details CI / Publish neuron to COPR (push) Has been skipped Details CI / Bump version in source (push) Has been skipped Details build-prerelease / Build cortex binary (push) Successful in 4m17s Details build-prerelease / Package cortex RPM (push) Successful in 1m32s Details build-prerelease / Build neuron-ampere (push) Successful in 5m16s Details build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled Details build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled Details build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled Details build-prerelease / Build neuron-ada (push) Has been cancelled Details First slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the infrastructure for a dedicated OS thread per CUDA device that owns the device's `CudaContext` for the daemon's lifetime, and routes the 8 async-context `device_vram_mb()` call sites in candle.rs through it. What this phase changes: - New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs). `DeviceWorkerHandle::spawn(idx)` creates a named OS thread (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel. Replies cross back via `tokio::sync::oneshot::Sender` so async callers await without parking a tokio worker. - Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward, ClearKv, NCCL init/sanity, and load variants. - `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at load time by a new `CandleHarness::ensure_device_worker(idx)` method that lazily spawns + caches one worker per device index. - Per-model `query_vram()` convenience method on both struct types so the 8 call sites in chat_completion / chat_completion_stream / chat_completion_tp_inner / chat_completion_tp_stream become `loaded.query_vram().await` (or `tp.query_vram().await`) — same field values logged, just sourced from the owner thread instead of the caller thread. What this phase doesn't touch (yet): - Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`. Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP forward + NCCL bring-up; Phase 4 moves the loads and deletes the now- unused `device_vram_mb` / `cuda_mem_mb` helpers. - Public API — unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical shapes. Tests: - 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query → shutdown round-trip, thread naming, post-shutdown submit returns `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a Shutdown. CPU build (the only one CI runs) is enough to exercise channel mechanics. - All 37 lib tests + all integration tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:40:34 +03:00

1 2 3

113 Commits