cortex

Author	SHA1	Message	Date
rob thijssen	957f704efa	feat(neuron): OpenAI Responses API + ci cuda-check runner label Step 2 of the Responses rollout: native `/v1/responses` endpoint on neuron that consumes the same InferenceEvent stream as `/v1/chat/completions` but emits it as the Responses API's named SSE event family. No gateway-side translation. ## Surface - `cortex-core::responses` envelope types: `ResponsesRequest`, `ResponsesInput` (text | items), `ResponsesInputItem` (message | function_call | function_call_output | reasoning), `ResponsesContentPart` (input_text | input_image | output_text), `ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. 
Plus a `events::*` constant module so the projector and the wire shape stay in sync without string-typos. - `neuron::wire::openai_responses`: - `request_to_chat(req)` flattens Responses input + instructions into a `ChatCompletionRequest` the candle harness already understands. Text-only Parts collapse to a string; mixed text+image Parts go to chat's content-array shape; reasoning items drop; function_call / function_call_output round-trip via tool_calls / tool_call_id metadata so the surface is consistent for the day the harness emits tool calls. - `project_responses_stream(rx, meta)` reads InferenceEvents and emits the eight named events that compose a Responses stream: response.created → output_item.added → content_part.added → output_text.delta×N → output_text.done → content_part.done → output_item.done → response.completed. Synthesises start frames if the producer skips Start (poisoned model, early disconnect) so the stream stays coherent. - `build_response(meta, text, reason, usage)` for the non-streaming path. - `CandleHarness::inference_stream(req)` extracted from `chat_completion_stream`, returning a typed `InferenceStream` (event receiver + id/created/model_id metadata). Both `chat_completion_stream` and the new `responses_stream` are now thin wrappers that pick their wire projection. TP path got the same treatment (`chat_completion_tp_stream` → `inference_tp_stream`). - `POST /v1/responses` route on neuron. Non-streaming returns one buffered `ResponsesResponse`; streaming returns axum SSE with both event names and JSON data per frame (Responses, unlike chat completions, uses named `event:` lines). Reused `inference_error_response` helper hoisted out so the chat and responses handlers share the InferenceError → HTTP mapping. ## CI Also bundles the `cuda-check` runner-label fix from feedback on commit `1859777`: `runs-on: rpm` doesn't ship the CUDA toolkit so cudarc's nvcc-version build script blew up. 
Switched to `runs-on: cuda-13.0` per the existing labels. ## Scope cuts (documented in the modules) - `previous_response_id` rejected at translate time with 400 (`code: chained_conversation_not_supported`) — stateful chained conversations need a persistence layer we haven't built. - Reasoning items dropped (no Qwen3 `<think>` routing yet). - Single output item per response (one `"message"` carrying text); `function_call` items reserved but not synthesised. - Streaming events cover the core set; `response.in_progress` and the web_search / image_generation event families are out-of-scope. 22 new tests: 5 in cortex-core (envelope round-trips), 13 in neuron::wire (request translator + projector + non-streaming builder), 4 in neuron's tests/api.rs (route surface — 503 when no candle, 400 on previous_response_id, 404 on missing model for both stream and non-stream). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 11:13:44 +03:00
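The frame ordering described for `project_responses_stream` can be sketched as a pure function. This is a minimal illustration, not the code from the commit: the `InferenceEvent` enum here is a simplified stand-in, `Frame` and `project` are hypothetical names, the payloads are plain strings rather than JSON, and the event names are written with the `response.` prefix the Responses API uses on the wire. It also assumes the closing frames are still emitted when the input ends without a `Finish`.

```rust
/// Simplified, hypothetical stand-in for the harness event type.
#[derive(Debug, Clone, PartialEq)]
pub enum InferenceEvent {
    Start,
    TextDelta(String),
    Finish,
}

/// One named SSE frame: (event name, data payload).
pub type Frame = (&'static str, String);

/// Opening frames, emitted once per stream.
fn start_frames(out: &mut Vec<Frame>) {
    out.push(("response.created", String::new()));
    out.push(("response.output_item.added", String::new()));
    out.push(("response.content_part.added", String::new()));
}

/// Project harness events onto the eight-event Responses sequence.
pub fn project(events: impl IntoIterator<Item = InferenceEvent>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut started = false;
    let mut text = String::new();

    for ev in events {
        // Whatever arrives first opens the stream, so a producer that
        // skips Start still yields a coherent sequence.
        if !started {
            start_frames(&mut out);
            started = true;
        }
        match ev {
            InferenceEvent::Start => {}
            InferenceEvent::TextDelta(delta) => {
                text.push_str(&delta);
                out.push(("response.output_text.delta", delta));
            }
            InferenceEvent::Finish => break,
        }
    }

    // An empty input still gets its opening frames.
    if !started {
        start_frames(&mut out);
    }
    out.push(("response.output_text.done", text.clone()));
    out.push(("response.content_part.done", text.clone()));
    out.push(("response.output_item.done", text.clone()));
    out.push(("response.completed", text));
    out
}
```

Feeding it a single `TextDelta` followed by `Finish`, with no `Start`, still yields all eight frames in order, which is the coherence property the commit message describes.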
rob thijssen	6927286cab	fix(neuron): clone id/model_id before TP spawn so wire projector can use them The Step 1 refactor moved the InferenceEvent receiver wrap to after the orchestration spawn in chat_completion_tp_stream, but the spawn moves both `id` and `model_id` into its async closure (used heavily by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck error E0382 use-of-moved-value on the wire_chat::project_chat_stream call. The non-CUDA build doesn't exercise this branch (it lives behind `#[cfg(feature = "cuda")]`) which is why the workspace clippy/test gate passed locally and on the regular CI workflow. 
The RPM build workflow, which compiles with --features cuda, caught it (run 244 jobs 2/3/4 against beast / ampere / ada respectively, all the same error). Fix: snapshot `id` and `model_id` into `projector_id` / `projector_model_id` before the spawn, use those at the projector call site. The originals stay free to be moved into the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 09:37:10 +03:00
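The snapshot-before-spawn pattern can be shown in isolation. The sketch below uses `std::thread::spawn` in place of the tokio spawn from the commit, and `spawn_and_project` is a hypothetical function; only the `projector_id` / `projector_model_id` names come from the message above.

```rust
use std::thread;

/// Hypothetical reduction of the fix: clone the metadata before the
/// `move` closure takes ownership of the originals.
pub fn spawn_and_project(id: String, model_id: String) -> (String, String) {
    // Snapshot first. After the spawn below, `id` and `model_id`
    // belong to the closure and cannot be named here again.
    let projector_id = id.clone();
    let projector_model_id = model_id.clone();

    let worker = thread::spawn(move || {
        // The originals are moved in and used freely by the worker.
        format!("{id}:{model_id}")
    });
    let worker_tag = worker.join().expect("worker panicked");

    // Writing `id` here instead of `projector_id` is the E0382
    // use-of-moved-value error the commit describes.
    let projector_tag = format!("{projector_id}/{projector_model_id}");
    (worker_tag, projector_tag)
}
```

Cloning two short strings per request is cheap next to the inference work, so the snapshot costs little compared with restructuring the closure to borrow.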
rob thijssen	302ccfb982	refactor(neuron): introduce InferenceEvent + wire projection layer Step 1 of the OpenAI Responses API rollout. Pure refactor — no new endpoints, no behaviour change on the wire. Lays the seam for emitting Responses-shaped streaming events from the same harness output as chat completions in Step 2. - New `neuron::wire` module tree: - `wire::event::InferenceEvent` — format-agnostic enum (Start, TextDelta, ReasoningDelta, Finish) the candle harness now emits as its native streaming currency. - `wire::event::FinishReason` — typed reason that maps cleanly onto OpenAI `finish_reason`, OpenAI Responses `status`, and Anthropic `stop_reason` strings. 
- `wire::openai_chat::project_chat_stream` — async task that consumes an InferenceEvent receiver and produces a ChatCompletionChunk receiver, stamping per-request metadata (id, created, model_id) onto every chunk. Output matches the pre-refactor wire shape bit-for-bit. - candle.rs refactored to emit InferenceEvent on its internal channel through all three streaming paths (CPU run_inference_streaming, CUDA single-GPU stream_inference_via_worker, CUDA TP chat_completion_tp_stream). The streaming functions lost their id/created/model_id parameters since wire-format metadata now lives in the projector. - emit_delta + emit_delta_blocking simplified to single-purpose TextDelta emitters with no wire-format coupling. - chat_completion_stream wraps the InferenceEvent receiver in wire_chat::project_chat_stream before returning so the /v1/chat/completions HTTP handler keeps consuming ChatCompletionChunks unchanged. External signature preserved. Also fixes a pre-existing helexa-acp test race (three modules each declared their own static LOCK for HOME mutation, so cross-module parallelism flaked tests that read HOME at runtime). Consolidated onto a single crate-wide path_util::ENV_LOCK. 122 helexa-acp tests + 44 neuron tests pass (5 new wire projection tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp suite 3x to confirm the env race is closed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:30:17 +03:00
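To illustrate how one typed `FinishReason` can feed three wire formats, here is a minimal sketch. The two variants and the method names are assumptions, since the commit message does not list them; the string values are the ones each public API uses for a natural stop and for a token-limit stop.

```rust
/// Hypothetical two-variant version of the typed finish reason.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinishReason {
    /// The model ended its turn on its own.
    Stop,
    /// Generation hit the max_tokens limit.
    Length,
}

impl FinishReason {
    /// OpenAI chat completions `finish_reason`.
    pub fn openai_finish_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "stop",
            FinishReason::Length => "length",
        }
    }

    /// OpenAI Responses `status`.
    pub fn responses_status(self) -> &'static str {
        match self {
            FinishReason::Stop => "completed",
            FinishReason::Length => "incomplete",
        }
    }

    /// Anthropic Messages `stop_reason`.
    pub fn anthropic_stop_reason(self) -> &'static str {
        match self {
            FinishReason::Stop => "end_turn",
            FinishReason::Length => "max_tokens",
        }
    }
}
```

Keeping the mapping on the enum means a new variant fails to compile until every projector has a string for it.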
rob thijssen	abbedf8d8a	chore(neuron): bump default max_tokens from 512 to 8192 512 is too low for any modern coding model — clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect. Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 12:38:28 +03:00
rob thijssen	e267f583e1	chore(neuron): rustfmt drift in is_device_fault test One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:55 +03:00
rob thijssen	249b2e5c98	fix(neuron): only poison the model on actual device faults Previously every inference Err — shape mismatch, NaN logits, tokenizer error, missing handle — marked the model poisoned and rejected every subsequent request until an operator unload+reloaded. The benjy incident on 2026-05-27 showed how this misfires: a concurrency bug produced a `broadcast_add: shape mismatch` error that had nothing to do with CUDA, but the model was taken down anyway. Add `is_device_fault(err_chain: &str)` — a conservative classifier that returns false only for errors we know are pre-kernel / CPU-side (shape mismatches, NaN logits, tokenize/detokenize, missing handle, DecodeStream, empty prompt). 
Everything else defaults to true so a genuine driver fault still poisons. Applied at all six poisoning sites: - chat_completion CUDA worker path - chat_completion CPU spawn_blocking path - chat_completion_stream CUDA worker path - chat_completion_stream CPU spawn_blocking path - chat_completion_tp non-streaming wrapper - chat_completion_tp_stream spawned task Each site now logs either "model marked poisoned" (device fault) or "model NOT marked poisoned" (non-device) so the journal makes the classification visible. Tests cover the known non-device patterns and a couple of real CUDA driver messages. Pairs with the inference_lock commit (`c59da83`): together they eliminate both the cause of the spurious-poisoning we just observed (the shape mismatch) AND the over-reaction to it (the unconditional poison). Each fix is independently useful but the combination is what makes the system actually robust to concurrent agent workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:57:48 +03:00
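A conservative classifier of this shape might look like the sketch below. Only the `is_device_fault(err_chain: &str)` signature and the categories of non-device errors come from the commit message; the exact pattern strings and the case-insensitive substring match are assumptions made for illustration.

```rust
/// Substrings assumed to mark CPU-side or pre-kernel failures.
/// The real list in the commit may differ.
const NON_DEVICE_PATTERNS: &[&str] = &[
    "shape mismatch",
    "nan logits",
    "tokeniz", // matches tokenize, tokenizer, detokenize
    "missing handle",
    "decodestream",
    "empty prompt",
];

/// Returns false only for errors known to be harmless to the device
/// context. Anything unrecognised is treated as a device fault.
pub fn is_device_fault(err_chain: &str) -> bool {
    let lower = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS.iter().any(|p| lower.contains(p))
}
```

The default-to-true direction is the point of the design: a missing pattern costs an unnecessary reload, while a wrongly matched one would keep serving from a context that may be corrupt.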
rob thijssen	c59da83636	fix(neuron): serialise single-GPU inference per loaded model Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:54:04 +03:00
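The difference between serialising individual jobs and serialising a whole request can be sketched with standard-library primitives. This is an analogy under stated assumptions, not the commit's code: `std::sync::Mutex` and OS threads stand in for `tokio::sync::Mutex` and async tasks, `kv_cache` is a plain `Vec<u32>`, and `run_request` / `run_concurrently` are hypothetical names. Only the `LoadedModel` and `inference_lock` names come from the message above.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Default)]
pub struct LoadedModel {
    /// Locked per operation, like the device-worker channel that
    /// serialises individual jobs but not sequences of them.
    kv_cache: Mutex<Vec<u32>>,
    /// Held for a whole request, so the clear and every chunk that
    /// follows form one uninterrupted sequence.
    inference_lock: Mutex<()>,
}

impl LoadedModel {
    /// Simulates clear_kv_cache followed by `chunks` forward steps.
    /// Returns true if the cache ends up holding only this request's tokens.
    pub fn run_request(&self, tag: u32, chunks: usize) -> bool {
        let _sequence = self.inference_lock.lock().unwrap();
        self.kv_cache.lock().unwrap().clear();
        for _ in 0..chunks {
            self.kv_cache.lock().unwrap().push(tag);
            thread::yield_now(); // invite interleaving between steps
        }
        let cache = self.kv_cache.lock().unwrap();
        cache.len() == chunks && cache.iter().all(|&t| t == tag)
    }
}

/// Fires `requests` concurrent requests at one model.
pub fn run_concurrently(requests: u32, chunks: usize) -> bool {
    let model = Arc::new(LoadedModel::default());
    let handles: Vec<_> = (0..requests)
        .map(|tag| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.run_request(tag, chunks))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(true, |all_ok, ok| all_ok && ok)
}
```

Without the `_sequence` guard, two requests can interleave their pushes and leave a cache of mixed tags, which is the analogue of the mask and score shapes disagreeing mid-prefill.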
rob thijssen	f05882369d	fix(neuron): don't poison the model on tokio JoinError panics CUDA driver failures propagate as Err through `?` and become `Ok(Err(InferenceError::Other(_)))` from the spawned task — those are real device faults and still poison the model. Tokio JoinError is different: it fires on Rust-level panic (tokenizer bug, sampler bug, serialisation, the UTF-8 slice that landed in commit `bd04d7f` before the fix) or task cancellation. Those don't touch the device context, so failing the one request without tearing down the model is correct. Two sites changed: - chat_completion's CPU spawn_blocking handler — JoinError no longer sets loaded.poisoned. 
- chat_completion_tp's tokio::spawn wrapper — JoinError no longer sets tp_for_marker.poisoned. The inner-Err case still does. Each path logs the cause (panicked / was cancelled / ended abnormally) explicitly so the journal makes the new behaviour obvious — search for "model NOT marked poisoned" to find these events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:02:52 +03:00
rob thijssen	bd04d7f580	fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:01:24 +03:00
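The idea behind buffering bytes until a codepoint completes can be sketched with the standard library alone. The `Utf8Emitter` below is a hypothetical illustration of that idea, not the tokenizers crate's `DecodeStream` implementation: it works on raw token bytes rather than token ids and does nothing beyond holding back an incomplete tail.

```rust
/// Accumulates token bytes and releases only complete UTF-8 text.
#[derive(Default)]
pub struct Utf8Emitter {
    pending: Vec<u8>,
}

impl Utf8Emitter {
    /// Feed the bytes of one token. Returns the text that became
    /// emittable, or None while a codepoint is still incomplete.
    pub fn step(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(token_bytes);

        // Length of the longest prefix that is valid UTF-8.
        let valid_len = match std::str::from_utf8(&self.pending) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        if valid_len == 0 {
            return None;
        }

        // Emit the valid prefix and keep the incomplete tail buffered.
        // Bytes that are invalid, rather than merely incomplete, would
        // stall this sketch; a real decoder needs a policy for them.
        let ready: Vec<u8> = self.pending.drain(..valid_len).collect();
        Some(String::from_utf8(ready).expect("prefix was validated above"))
    }
}
```

Because no string is ever sliced at a stored byte offset, there is no index that can land inside a codepoint, which removes the panic described above rather than guarding against it.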
rob thijssen	1e13889392	feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload cycle observed on beast under agent-zero workloads. Three changes, all keyed off env-driven knobs so an operator can tune without a rebuild: 1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The initial forward is split into N-token windows, each with a monotonically growing offset. KV cache accumulates across chunks exactly as it would under one big prefill; only the final chunk's logits are kept for sampling. Activation memory now scales with chunk size instead of prompt length, so a 13 k-token prompt stops holding tens of GB of intermediate activations live at once. 
Wired into all six prefill call sites: - run_inference / run_inference_streaming (CPU path) - run_inference_via_worker / stream_inference_via_worker (CUDA single-GPU through device worker) - chat_completion_tp_inner / chat_completion_tp_stream (TP via WorkerPool) Three helpers — chunked_prefill_local, chunked_prefill_via_worker, chunked_prefill_tp — own the loop shape so the chunking semantics stay identical across paths. Per-chunk debug log shows progress. 2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384). Requests above the cap return a structured 400 with `code: prompt_too_long` rather than going through the prefill and discovering the limit by OOMing partway through. New InferenceError::PromptTooLong variant. 3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500). If `vram_free_mb` is below the threshold at request start (e.g. another concurrent request is mid-prefill), reject with a clean 503 + `code: insufficient_vram` rather than starting work that will OOM. New InferenceError::InsufficientVram variant. CPU loads (vram=0 sentinel) skip this check. All three gates fire BEFORE any device work, so a rejected request costs ~one tokenisation pass and never touches the worker thread — poison cascades from rejected work are now impossible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:46:54 +03:00
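The loop shape and the two request gates can be sketched as follows. The function names, the `Rejection` enum, and the closure-based `forward` are hypothetical; the commit's own helpers (`chunked_prefill_local` and friends) and its `InferenceError` variants are not shown in the message, so this only mirrors the behaviour it describes.

```rust
/// Hypothetical stand-in for the two new InferenceError variants.
#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    PromptTooLong,
    InsufficientVram,
}

/// Gates that run before any device work.
/// `vram_free_mb == 0` is treated as the CPU sentinel and skips the VRAM gate.
pub fn preflight(
    prompt_tokens: usize,
    max_prompt_tokens: usize,
    vram_free_mb: u64,
    min_free_vram_mb: u64,
) -> Result<(), Rejection> {
    if prompt_tokens > max_prompt_tokens {
        return Err(Rejection::PromptTooLong);
    }
    if vram_free_mb != 0 && vram_free_mb < min_free_vram_mb {
        return Err(Rejection::InsufficientVram);
    }
    Ok(())
}

/// Runs `forward(chunk, offset)` over fixed-size windows of the prompt
/// and keeps only the last result, as only the final chunk's logits matter.
pub fn chunked_prefill<L>(
    prompt: &[u32],
    chunk_tokens: usize,
    mut forward: impl FnMut(&[u32], usize) -> L,
) -> Option<L> {
    assert!(chunk_tokens > 0, "chunk size must be non-zero");
    let mut last = None;
    let mut offset = 0;
    for chunk in prompt.chunks(chunk_tokens) {
        // Overwriting `last` drops the previous chunk's output.
        last = Some(forward(chunk, offset));
        offset += chunk.len();
    }
    last
}
```

With the default 512-token window, a 1300-token prompt produces three forward calls at offsets 0, 512, and 1024, the last one covering the remaining 276 tokens.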
rob thijssen	35876954cd	chore(neuron): default tracing filter to info (was info,neuron=debug) Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:47:30 +03:00
rob thijssen	cdf0f4e66d	fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:13 +03:00
rob thijssen	b4f3576d82	refactor(neuron): phase 4 — model loads move onto the device worker Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. 
The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:24:38 +03:00
rob thijssen	76ab24d98c	refactor(neuron): phase 3 - TP forward + NCCL state move onto device worker (CI: some checks failed) Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread, the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge; Phase 4 removes it), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl` / `Job::Tp` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. 
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:16:02 +03:00
rob thijssen	b179204fd3	refactor(neuron): phase 2 - single-GPU forward + clear_kv route through device worker (CI: some checks failed) Second slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two spawn_blocking sites in `chat_completion` and `chat_completion_stream` now route through the device worker thread on CUDA loads. CPU loads keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's no context to own and the channel hop would only add latency. What this phase changes: - `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`.
The worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>` slab and a `next_handle` counter for minting opaque handles. - `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`, plus a new `arch_handle: Option<ArchHandle>` field. The two are mutually exclusive: CUDA loads set `arch_handle = Some(_)` after transferring the boxed arch into the worker's slab; CPU loads keep `arch = Some(_)` for the legacy spawn_blocking path. - New `run_inference_via_worker` and `stream_inference_via_worker` drive the prefill + decode loop by sending `Job::ForwardLogits` per step; the worker copies the resulting `[vocab]` logits to a CPU-side `Vec<f32>` before reply, so the async caller never holds a device-resident tensor. `apply_repeat_penalty` and `LogitsProcessor::sample` run on a CPU candle tensor; no context binding side-effects on tokio worker threads. - `logits_health_slice(&[f32])` complements the existing `logits_health(&Tensor)` so the new worker paths can compute health stats directly from the CPU vec. - `unload_model` for the single-GPU CUDA path now sends `Job::DropArch { handle }` to the worker so the `Box<ModelArch>` drops on the thread that allocated its CUDA tensors. The `Drop` runs with the bound context, freeing memory on the right context. What this phase doesn't touch (yet): - TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3. - Single-GPU model load — still spawn_blocking, followed by a `Job::TransferIn` to move the freshly-built `ModelArch` into the worker slab. Phase 4 moves the load itself onto the worker thread and eliminates the bootstrap TransferIn. - The `device_vram_mb` / `cuda_mem_mb` helpers — still present and used by the construction-time logs running inside spawn_blocking loads. Phase 4 cleanup folds them into `dispatch.rs`. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:55:08 +03:00
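The handle-and-slab arrangement described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the daemon's code: `ModelArch` here is a hypothetical stand-in with a single field, and the `Slab` type and its method names are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Opaque key handed to async callers; the model itself stays on the worker.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ArchHandle(u64);

/// Hypothetical stand-in for the real `ModelArch`, which holds CUDA tensors.
pub struct ModelArch {
    pub name: String,
}

/// Worker-local state: boxed models plus a counter for minting handles.
#[derive(Default)]
pub struct Slab {
    models: HashMap<ArchHandle, Box<ModelArch>>,
    next_handle: u64,
}

impl Slab {
    /// Store a model and return the handle the caller keeps in its place.
    pub fn insert(&mut self, arch: ModelArch) -> ArchHandle {
        let handle = ArchHandle(self.next_handle);
        self.next_handle += 1;
        self.models.insert(handle, Box::new(arch));
        handle
    }

    /// Look a model up, for example to run one forward step.
    pub fn get_mut(&mut self, handle: ArchHandle) -> Option<&mut ModelArch> {
        self.models.get_mut(&handle).map(|boxed| &mut **boxed)
    }

    /// Drop the model on the calling thread; false if the handle is unknown.
    pub fn drop_arch(&mut self, handle: ArchHandle) -> bool {
        self.models.remove(&handle).is_some()
    }
}
```

Because only the worker thread owns the `Slab`, a model's `Drop` necessarily runs on the thread that allocated it, which is the property the commit message relies on.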
rob thijssen	081b532387	refactor(neuron): phase 1 - per-device worker thread, VRAM queries route through it (CI: some checks failed) First slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the infrastructure for a dedicated OS thread per CUDA device that owns the device's `CudaContext` for the daemon's lifetime, and routes the 8 async-context `device_vram_mb()` call sites in candle.rs through it. What this phase changes: - New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs). `DeviceWorkerHandle::spawn(idx)` creates a named OS thread (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel.
Replies cross back via `tokio::sync::oneshot::Sender` so async callers await without parking a tokio worker. - Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward, ClearKv, NCCL init/sanity, and load variants. - `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at load time by a new `CandleHarness::ensure_device_worker(idx)` method that lazily spawns + caches one worker per device index. - Per-model `query_vram()` convenience method on both struct types so the 8 call sites in chat_completion / chat_completion_stream / chat_completion_tp_inner / chat_completion_tp_stream become `loaded.query_vram().await` (or `tp.query_vram().await`): same field values logged, just sourced from the owner thread instead of the caller thread. What this phase doesn't touch (yet): - Forward, kv-cache clear, model load, NCCL: still on `spawn_blocking`. Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP forward + NCCL bring-up; Phase 4 moves the loads and deletes the now-unused `device_vram_mb` / `cuda_mem_mb` helpers. - Public API: unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical shapes. Tests: - 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query → shutdown round-trip, thread naming, post-shutdown submit returns `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a Shutdown. CPU build (the only one CI runs) is enough to exercise channel mechanics. - All 37 lib tests + all integration tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:40:34 +03:00
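The worker-thread pattern from this commit (a named OS thread draining a `std::sync::mpsc` channel of jobs) can be sketched as follows. Several things here are assumptions for illustration: the reply travels over a second std channel rather than `tokio::sync::oneshot`, no CUDA context is bound, the VRAM numbers are placeholders, and the method bodies are not taken from the repository.

```rust
use std::sync::mpsc;
use std::thread;

/// Jobs the worker understands. The real code replies over a tokio oneshot;
/// a std channel stands in for it here.
pub enum Job {
    QueryVram { reply: mpsc::Sender<(u64, u64)> },
    Shutdown,
}

/// Returned when the worker thread has already exited.
#[derive(Debug, PartialEq, Eq)]
pub struct Gone;

pub struct DeviceWorkerHandle {
    tx: mpsc::Sender<Job>,
    join: Option<thread::JoinHandle<()>>,
}

impl DeviceWorkerHandle {
    /// Spawn a named OS thread (`cuda-dev-N`) that runs the dispatch loop.
    pub fn spawn(idx: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // A real worker would bind the device context here, once.
                let (free_mb, total_mb) = (1024_u64, 24576_u64); // placeholders
                for job in rx {
                    match job {
                        Job::QueryVram { reply } => {
                            // The caller may have given up; ignore send errors.
                            let _ = reply.send((free_mb, total_mb));
                        }
                        Job::Shutdown => break,
                    }
                }
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    /// Ask the owner thread for (free_mb, total_mb).
    pub fn query_vram(&self) -> Result<(u64, u64), Gone> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Job::QueryVram { reply }).map_err(|_| Gone)?;
        rx.recv().map_err(|_| Gone)
    }

    /// Stop the loop and wait for the thread to exit.
    pub fn shutdown(&mut self) {
        let _ = self.tx.send(Job::Shutdown);
        if let Some(join) = self.join.take() {
            let _ = join.join();
        }
    }
}
```

Once the thread has exited its receiver is dropped, so a later `query_vram` fails at `send` and maps to `Gone`, mirroring the "post-shutdown submit returns `Gone`" test named in the message.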
rob thijssen	7c19da9361	feat(neuron): construction-complete vram/config dump + logits health + per-step vram (CI: all checks were successful) Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B incident from "guess at KV cache / quant sizes" into "read the journal": 1. Construction-complete summary in TpQwen3_5ForCausalLM::load and TpQwen3ForCausalLM::load. After the last "after layer N" log fires, each rank emits a single info line with: free_mb/total_mb (the number that drops by ~9 GB between per-layer and first-request on beast, with no inference traffic), every resolved config knob (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads, max_position_embeddings), and a per-token KV-cache byte estimate.
For Qwen3-Next also includes the linear/full-attention layer split so the hybrid architecture's cache cost is unambiguous. 2. Logits health snapshot on sample failure. Today the failure logs "A weight is negative, too large or not a valid number" with no context — was it a NaN cascade, an Inf, a negative weight? `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts plus finite_min/max/mean on the failure path (zero cost on the success path) and emits a warn line just before the wrapper's terminal "failed, model marked poisoned" log. Wired into both the prefill and decode sample sites of the non-streaming AND streaming TP chat paths. 3. VRAM snapshot at prefill complete + every decode step. The "prefill complete" info line now carries vram_free_mb so the activations + KV growth from the prefill itself is visible. The per-step trace line gets vram_free_mb too, so an operator running with RUST_LOG=trace can watch headroom shrink token by token. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:04:55 +03:00
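The logits health snapshot in item 2 amounts to one pass over the logits. The sketch below works on a plain `&[f32]` rather than a candle `Tensor`; the struct layout and the choice to count only finite negatives under `neg` are assumptions, since the message lists the fields but not their exact definitions.

```rust
/// Summary of a logits vector, computed only on the failure path.
#[derive(Debug, Default, PartialEq)]
pub struct LogitsHealth {
    pub nan: usize,
    pub pos_inf: usize,
    pub neg_inf: usize,
    /// Finite values below zero (assumed meaning of "neg").
    pub neg: usize,
    pub finite_min: Option<f32>,
    pub finite_max: Option<f32>,
    pub finite_mean: Option<f32>,
}

pub fn logits_health_slice(logits: &[f32]) -> LogitsHealth {
    let mut h = LogitsHealth::default();
    let mut sum = 0.0_f64;
    let mut finite = 0_usize;
    for &x in logits {
        if x.is_nan() {
            h.nan += 1;
        } else if x == f32::INFINITY {
            h.pos_inf += 1;
        } else if x == f32::NEG_INFINITY {
            h.neg_inf += 1;
        } else {
            if x < 0.0 {
                h.neg += 1;
            }
            finite += 1;
            sum += f64::from(x);
            h.finite_min = Some(h.finite_min.map_or(x, |m| m.min(x)));
            h.finite_max = Some(h.finite_max.map_or(x, |m| m.max(x)));
        }
    }
    if finite > 0 {
        // Accumulate in f64 so large vocabularies do not lose precision.
        h.finite_mean = Some((sum / finite as f64) as f32);
    }
    h
}
```

A caller would invoke this only after sampling has already failed, so the success path pays nothing for it.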
rob thijssen	800498f530	feat(neuron): bind listener before pre-warm, surface activation in /health (CI: some checks failed) Two coupled changes addressing the 2026-05-26 validate-neuron failure where a fresh deploy of beast had /health unreachable for ~5 minutes while Qwen3.6-27B q5k materialised, even though systemd reported the unit as active. 1. main.rs no longer awaits load_default_models before binding axum. The listener binds first; pre-warm runs in a spawned background task that holds a read lock on the harness registry for the duration of its sequential load loop. Concurrent on-demand /models/load and /v1/chat/completions traffic still flows. 2.
/health gains an `activation` field carrying: `state` (pre_warming | ready), `pending` (model ids queued but not started), `in_progress` (model id currently loading, an Option), `completed` (model ids loaded successfully this activation), `failed` ([{model_id, error}] for failed entries). The field is `#[serde(default)]` so a pre-change cortex polling a new neuron (or vice versa) keeps working. `ActivationTracker` (new module `neuron::activation`) owns the RwLock-wrapped state; load_default_models takes a tracker reference and updates it per-model. NeuronState holds an Arc clone for the /health handler. Tests updated to construct trackers and assert state transitions (empty noop, two failures → ready with both in `failed`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:18:04 +03:00
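The activation state machine can be sketched as a small RwLock-wrapped tracker. Field names follow the commit message; the method names (`new`, `start`, `finish`, `snapshot`), the use of `std::sync::RwLock`, and the omission of serde derives are assumptions made to keep the example self-contained.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActivationState {
    PreWarming,
    Ready,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailedLoad {
    pub model_id: String,
    pub error: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Activation {
    pub state: ActivationState,
    pub pending: Vec<String>,
    pub in_progress: Option<String>,
    pub completed: Vec<String>,
    pub failed: Vec<FailedLoad>,
}

pub struct ActivationTracker {
    inner: RwLock<Activation>,
}

impl ActivationTracker {
    /// With nothing queued the tracker starts out ready (the "empty noop" case).
    pub fn new(model_ids: Vec<String>) -> Self {
        let state = if model_ids.is_empty() {
            ActivationState::Ready
        } else {
            ActivationState::PreWarming
        };
        let activation = Activation {
            state,
            pending: model_ids,
            in_progress: None,
            completed: Vec::new(),
            failed: Vec::new(),
        };
        Self { inner: RwLock::new(activation) }
    }

    /// Move a model from `pending` to `in_progress`.
    pub fn start(&self, model_id: &str) {
        let mut a = self.inner.write().unwrap();
        a.pending.retain(|id| id.as_str() != model_id);
        a.in_progress = Some(model_id.to_string());
    }

    /// Record the outcome; a failed load still counts towards reaching `Ready`.
    pub fn finish(&self, model_id: &str, result: Result<(), String>) {
        let mut a = self.inner.write().unwrap();
        if a.in_progress.as_deref() == Some(model_id) {
            a.in_progress = None;
        }
        match result {
            Ok(()) => a.completed.push(model_id.to_string()),
            Err(error) => a.failed.push(FailedLoad { model_id: model_id.to_string(), error }),
        }
        if a.pending.is_empty() && a.in_progress.is_none() {
            a.state = ActivationState::Ready;
        }
    }

    /// Copy of the current state, as a /health handler might serialise it.
    pub fn snapshot(&self) -> Activation {
        self.inner.read().unwrap().clone()
    }
}
```

Treating failures as terminal for the pre-warm loop means /health can report `ready` with a populated `failed` list, which matches the two-failures test described above.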
rob thijssen	2740e61a23	fix(neuron,candle): name lifetime on acquire_pool_lock (CI: all checks were successful) Lifetime elision fails when a function has two reference parameters and returns a borrow: rustc can't infer whether the MutexGuard's lifetime ties to `pool` or `model_id`. The non-CUDA build skipped this code path (cfg-gated), so the error only surfaced on the GPU build at https://git.lair.cafe/helexa/cortex/actions/runs/162. The guard borrows the pool, so name the lifetime on `pool` and the return type. `model_id` keeps its independent (elided) lifetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:37:32 +03:00
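The shape of the fix can be shown with `std::sync::Mutex`. This is a sketch under assumptions: the repository's helper is presumably async and built on a different mutex type, and the `Pool` struct and function body here are hypothetical; only the lifetime annotation is the point.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical pool state, just enough to have something to guard.
pub struct Pool {
    pub in_flight: u32,
}

// With two reference parameters and a borrowing return type, elision has no
// single input lifetime to pick, so the guard's lifetime is named explicitly
// and tied to `pool`. `model_id` keeps its own elided lifetime.
pub fn acquire_pool_lock<'a>(pool: &'a Mutex<Pool>, model_id: &str) -> MutexGuard<'a, Pool> {
    pool.lock().unwrap_or_else(|poisoned| {
        // `model_id` is used only for diagnostics and is not borrowed by the guard.
        eprintln!("pool lock for {model_id} was poisoned; continuing");
        poisoned.into_inner()
    })
}
```

Because the guard borrows only from `pool`, it may outlive the string passed as `model_id`, which would not compile had both parameters been given the same lifetime.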
rob thijssen	67f79c868f	fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain (CI: some checks failed) Two failure modes from the 2026-05-26 beast incident: 1. `unload_all_models` looped through models calling `unload_model`, logging individual failures at warn. The cumulative effect was a single warn line for the failed unload then "shutdown complete", with no signal that the model was actually still loaded. Now each unload is bounded by a 20s timeout, failures escalate to error, and a summary "leaving N model(s) loaded" line fires when anything is stuck so the operator knows the OS will reclaim VRAM after exit. 2.
Returning `Ok(())` from `main` after the unload sweep dropped the tokio runtime, which then waited indefinitely on a CUDA-stuck spawn_blocking thread (the journal's "Stack trace of thread 2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec fired 2 minutes later, SIGABRT, core dump. Replacing the return with `std::process::exit(0)` skips the runtime drain and hands the OS a clean exit code; stuck threads get reaped with the process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:30:06 +03:00
rob thijssen	fc6ef0ee0f	feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups Once a CUDA driver error has hit a forward or kv-cache call, the device's context is unrecoverable in-process — subsequent kernels can hang (the failure mode seen on beast on 2026-05-26), return garbage, or trip another illegal-address. The harness now marks the model poisoned on any forward / spawn_blocking / TP-task failure, refuses further inference against it with a clear "unload and reload" error, and surfaces `status: "poisoned"` on `/models` so an operator running `curl beast:13131/models` (or cortex polling) can see the bad state. Without this, a single OOM on a too-large prefill quietly turned every subsequent request into a stuck wait on the pool lock; with it, the first request fails fast with the driver error in the journal and the client gets a usable 5xx instead of a hung connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:28:42 +03:00
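The poison-and-refuse behaviour can be sketched with an atomic flag. Everything apart from the `"poisoned"` status string is an assumption for illustration: the struct, the `run` wrapper, the `"loaded"` status value, and the use of `String` as the error type.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct LoadedModel {
    pub id: String,
    poisoned: AtomicBool,
}

impl LoadedModel {
    pub fn new(id: &str) -> Self {
        Self { id: id.to_string(), poisoned: AtomicBool::new(false) }
    }

    /// Status string as a `/models` listing might report it.
    pub fn status(&self) -> &'static str {
        if self.poisoned.load(Ordering::Acquire) { "poisoned" } else { "loaded" }
    }

    /// Run one forward-like operation. Any error marks the model poisoned,
    /// and a poisoned model rejects later calls without touching the device.
    pub fn run<T>(&self, forward: impl FnOnce() -> Result<T, String>) -> Result<T, String> {
        if self.poisoned.load(Ordering::Acquire) {
            return Err(format!("model {} is poisoned; unload and reload", self.id));
        }
        forward().map_err(|e| {
            self.poisoned.store(true, Ordering::Release);
            e
        })
    }
}
```

The early return is what turns a would-be hang on a broken context into an immediate error the HTTP layer can map to a 5xx.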
rob thijssen	1385979e3d	feat(neuron,candle): log per-device VRAM at chat_completion start Every "starting" log line now carries vram_free_mb / vram_total_mb for the request's serving device (the leader device on TP). On the 2026-05-26 incident this would have made the 14k-token prefill OOM diagnosable from the first log line: with ~412 MB free, that prompt was never going to fit, and the operator could have caught the imbalance before the CUDA context got poisoned. `device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is kept separate to avoid coupling the inference path to the TP module. TpLoadedModel gains a `leader_device: Device` clone so the request path reads the device without locking the leader model (which would contend with an in-flight forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:26:23 +03:00
rob thijssen	0a1cfcd4d0	feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings Every chat completion path (single-GPU + TP, streaming + non-streaming) now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber prefixes every event with that span so `grep req_id=…` over journalctl reconstructs one request even when dozens overlap. Every path also emits a terminal log line on both success ("done", with prompt_tokens/completion_tokens/finish_reason/total_ms) and failure ("failed", with full anyhow chain + total_ms). Failures used to vanish silently — a request that hit a CUDA OOM left "starting" in the journal and no further trace. New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await` in both TP paths. It warns at 2s ("still waiting on pool lock") and re-warns every 2s thereafter, so queued requests stuck behind a deadlocked holder are visible immediately instead of looking like idle silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:25:11 +03:00
rob thijssen	ea0e0f7911	fix(neuron,tp): log leader forward errors with full context Worker rank failures were already surfaced at WARN, but the leader's own forward Result::Err was silently coerced to a `leader_ok=false` bool. When the leader and a worker both fail together — the typical shape of a CUDA OOM cascading into an illegal-address — the journal showed only the worker side and an operator had to guess what hit rank 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:22:30 +03:00
rob thijssen	e71181499e	feat(stage-8e-3): quantize lm_head in TP Qwen3-Next (CI: all checks were successful) TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the load spec has quant set and tie_word_embeddings is false, lm_head's (vocab_size, hidden_size) weight is quantized in-situ at load time along with all the per-layer linears. The non-tied case on Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2 bytes = 2.42 GB -> ~700 MB at Q5K) and shaves a small amount of decode latency from the per-token logits matmul. Tied case (tie_word_embeddings=true) keeps the lm_head plain even when quant is set: quantizing the shared tensor would corrupt the embedding lookup, and the tied case already gets the memory win from only holding one copy.
This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path. The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's the bottleneck for a model that actually needs TP at consumer scale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:53:14 +03:00
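The lm_head memory arithmetic can be reproduced with two small helpers. The block layout is an assumption: the sketch takes Q5_K to store 256 weights in 176 bytes (about 5.5 bits per weight), as in GGML's block definition, and ignores any padding or per-tensor overhead the loader may add.

```rust
/// Bytes for a dense (rows x cols) weight at `bytes_per_elem` bytes per element.
pub fn dense_bytes(rows: u64, cols: u64, bytes_per_elem: u64) -> u64 {
    rows * cols * bytes_per_elem
}

/// Bytes for the same weight stored in fixed-size quantized blocks.
/// A partial trailing block is rounded up to a whole block.
pub fn block_quant_bytes(rows: u64, cols: u64, block_elems: u64, block_bytes: u64) -> u64 {
    let elems = rows * cols;
    let blocks = (elems + block_elems - 1) / block_elems;
    blocks * block_bytes
}

/// Bytes saved per rank by quantizing a bf16 weight, under the assumptions above.
pub fn lm_head_saving(vocab: u64, hidden: u64) -> u64 {
    dense_bytes(vocab, hidden, 2) - block_quant_bytes(vocab, hidden, 256, 176)
}
```

For vocab 248320 and hidden 5120 this gives 2,542,796,800 bytes at bf16 and 874,086,400 bytes at the assumed Q5_K layout, a saving of 1,668,710,400 bytes, which agrees with the ~1.7 GB per rank quoted in the message. The rounded intermediate figures in the message differ somewhat and may reflect a different unit convention or a measured value.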
rob thijssen	ee663e5e99	fix(stage-8e-2e): bump quant prefill threshold to M > 64 (CI: some checks failed) The M > 8 threshold from 8e-2d activated forward_via_f16 on the test case (M=30) and slightly regressed prefill (143 -> 133 T/s). The dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms) eats the cuBLAS GEMM speedup at small M. Move the crossover to M > 64 so short prefills (typical for the validate probe) stay on the GGUF GEMV kernel where per-call cost is comparable but the dequant tax is zero. Long prefills still get the dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the fixed dequant cost.
Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that needs either a dequant cache (gives back the ISQ memory win) or a fused dequant+gemm kernel. Both larger projects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:50:45 +03:00
rob thijssen	34f9b77d9d	feat(stage-8e-2d): route quantized matmul by M (prefill vs decode) (CI: all checks were successful) MaybeQuantLinear::forward picks between two QMatMul paths: - M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost is fixed per call, so it's amortised across the M tokens. - M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel on the quantized blocks directly. Requires f32 inputs so we still cast in/out at the boundary in that arm. Earlier 8e-2c sent everything through the GGUF GEMV kernel, which is excellent at GEMV (decode) but doesn't have a real batched GEMM path, so prefill regressed ~4x.
This restores prefill to roughly the bf16 cuBLAS GEMM throughput while keeping the decode gain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:15:32 +03:00
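The routing rule reduces to a comparison on M, the number of token rows in the activation. The sketch below models only the decision, not the matmuls; the enum and function names are hypothetical, and the threshold is a parameter because this commit uses 8 while the later ee663e5e99 raises it to 64.

```rust
/// Which quantized matmul path to take for an activation with `m` rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantPath {
    /// Dequantise the weight to f16 once, then run a dense GEMM (prefill).
    DequantGemm,
    /// Run the GEMV kernel directly on the quantized blocks (decode).
    Gemv,
}

/// Strictly greater than the threshold selects the dequantise-then-GEMM path.
pub fn pick_quant_path(m: usize, prefill_threshold: usize) -> QuantPath {
    if m > prefill_threshold {
        QuantPath::DequantGemm
    } else {
        QuantPath::Gemv
    }
}
```

With the threshold at 8 the M=30 test case takes the dequantise path; at 64 it stays on the GEMV kernel, which is the change the follow-up commit describes.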
rob thijssen	f084aaab8e	fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul

candle's QTensor::cuda_fwd requires f32 inputs because its on-the-fly GGUF dequantize accumulates in f32. The model dtype flowing into MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with "unexpected dtype, expected: F32, got: BF16".

Wrap the Quant arm to cast the activation to f32 before the matmul and cast the result back to the input dtype. The cast is a single launch on the activation tensor (small relative to weight traffic); it's the price of in-situ GGUF-style quantization, and what mistralrs does inside its own Linear wrapper. The Plain arm is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 20:05:19 +03:00
rob thijssen	68a606a79c	fix(stage-8e-2b): allow quant on the TP load path

The pre-existing guard in candle.rs rejected any spec.quant on the TP path with "GGUF quantized models are not supported in the TP path", written when quant only ever meant GGUF. With 8e-1/8e-2 in, quant != None on the TP path triggers in-situ quantization of the loaded safetensors shards. resolve_dense_files only looks for safetensors so a GGUF-source-file model with TP still errors out cleanly downstream.

validate-neuron.sh: rebuild the load payload incrementally so tp_size > 1 + non-empty quant produces both fields. Same script now covers all four combos (single/TP × dense/ISQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 19:17:14 +03:00
rob thijssen	4aa71902d0	feat(stage-8e-2): plumb quant config from ModelSpec to TP load path

- LoadDenseShard RPC gains an optional `quant` string field.
- WorkerPool::load_dense_shard takes a `quant: Option<String>`, passes it via the RPC to workers and via parse_quant_string to the leader's local load.
- The Qwen3-Next TP load chain (ForCausalLM → Model → DecoderLayer → Attention / GatedDeltaNet / MLP) takes `quant: Option<GgmlDType>` end-to-end, calling Column/RowParallelLinear::load_with_quant.
- The fused in_proj_qkv inside TpQwen3_5GatedDeltaNet is now a MaybeQuantLinear so it also picks up quantization.
- parse_quant_string accepts q4_0/q4_1/q5_0/q5_1/q8_0/q8_1, q2k..q8k (with or without underscore), and f16/bf16/f32. Empty / None means no quantization.

Callers from candle.rs forward spec.quant through pool.load_dense_shard. This means a `quant = "q5k"` in models.toml now flows end-to-end to a QTensor-backed QMatMul for every per-rank linear in the Qwen3-Next TP path. Leaves lm_head and the small replicated bias/log tensors in their loaded dtype (Stage 8e-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 18:03:36 +03:00
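The accepted spellings listed for `parse_quant_string` in commit 4aa71902d0 can be illustrated with a small stand-alone parser. This is a sketch, not the repository's function: `QuantKind` is a local stand-in for candle's `GgmlDType`, the `Result<Option<_>, String>` return shape is an assumption, and "q2k..q8k" is read here as the ggml K-quant set (q2k, q3k, q4k, q5k, q6k, q8k), which is also an assumption.

```rust
/// Local stand-in for candle's `GgmlDType` so the sketch has no dependencies.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum QuantKind {
    Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F16, BF16, F32,
}

/// Hypothetical parser: `None` or an empty string means "no quantization".
pub fn parse_quant(s: Option<&str>) -> Result<Option<QuantKind>, String> {
    let raw = match s {
        Some(v) => v.trim(),
        None => return Ok(None),
    };
    if raw.is_empty() {
        return Ok(None);
    }
    let kind = match raw {
        "q4_0" => QuantKind::Q4_0,
        "q4_1" => QuantKind::Q4_1,
        "q5_0" => QuantKind::Q5_0,
        "q5_1" => QuantKind::Q5_1,
        "q8_0" => QuantKind::Q8_0,
        "q8_1" => QuantKind::Q8_1,
        // K-quants are accepted with or without the underscore.
        "q2k" | "q2_k" => QuantKind::Q2K,
        "q3k" | "q3_k" => QuantKind::Q3K,
        "q4k" | "q4_k" => QuantKind::Q4K,
        "q5k" | "q5_k" => QuantKind::Q5K,
        "q6k" | "q6_k" => QuantKind::Q6K,
        "q8k" | "q8_k" => QuantKind::Q8K,
        "f16" => QuantKind::F16,
        "bf16" => QuantKind::BF16,
        "f32" => QuantKind::F32,
        other => return Err(format!("unknown quant type: {other}")),
    };
    Ok(Some(kind))
}
```

With this shape, `parse_quant(Some("q5k"))` and `parse_quant(Some("q5_k"))` resolve to the same variant, and an unset field falls through as `Ok(None)` rather than an error.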
rob thijssen	bef159b21c	feat(stage-8e-1): MaybeQuantLinear primitive + parallel-linear quant variants

Introduces MaybeQuantLinear, which wraps either a plain candle Linear or a candle QMatMul backed by a freshly-quantized QTensor. Forward dispatches identically through the Module trait so downstream code doesn't care which arm is active.

ColumnParallelLinear and RowParallelLinear gain `load_with_quant` methods. The existing `load` methods stay as backward-compatible no-quantization wrappers, so there is no churn at the 27 existing call sites.

This is the foundation for in-situ quantization at load time. Wiring the user-facing quant config and switching call sites to load_with_quant follow in stages 8e-2 / 8e-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 17:55:26 +03:00
rob thijssen	8d7b099b36	feat(stage-8d-7): direct safetensors fused-region loader

Replaces load_fused_qkv_slice_2d/_3d with reads from a separate MmapedSafetensors handle. Each per-rank fused tensor is built by reading the three region byte-slices directly from the mmap, concatenating them host-side, and uploading as one device allocation, with no full-fused-tensor device materialisation.

The prior approach allocated a ~100 MB transient device tensor per linear-attention layer; on Qwen3.6-27B with 48 linear-attn layers that's ~4.8 GB of allocator churn during load, enough to fragment the cuda caching allocator on a tight-VRAM 32 GB consumer GPU, which is what triggered the layer-22 up_proj OOM seen on beast.

Threading: MmapedSafetensors flows worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs) and worker (worker.rs) construct their own mmap; Linux's page cache shares the underlying pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 17:49:35 +03:00
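The per-rank region reads described in commit 8d7b099b36 can be sketched on a plain byte buffer. The sketch assumes a row-major fused tensor whose dim 0 is laid out as `[first key_dim | second key_dim | value_dim]`, that each region splits evenly and contiguously across ranks, and that a `&[u8]` stands in for the mmap; the function names are hypothetical.

```rust
use std::ops::Range;

/// Row ranges along dim 0 that one rank owns inside the fused tensor:
/// one slice from each of the three regions.
pub fn rank_row_ranges(
    key_dim: usize,
    value_dim: usize,
    rank: usize,
    world_size: usize,
) -> [Range<usize>; 3] {
    assert!(world_size > 0 && rank < world_size);
    assert!(key_dim % world_size == 0 && value_dim % world_size == 0);
    let k = key_dim / world_size; // rows per rank in each key region
    let v = value_dim / world_size; // rows per rank in the value region
    let first = rank * k;
    let second = key_dim + rank * k;
    let value = 2 * key_dim + rank * v;
    [first..first + k, second..second + k, value..value + v]
}

/// Copy the three per-rank regions out of the fused bytes and concatenate
/// them host-side, so only the per-rank result needs uploading.
pub fn gather_rank_bytes(
    fused: &[u8],
    row_bytes: usize,
    ranges: &[Range<usize>; 3],
) -> Vec<u8> {
    let total_rows: usize = ranges.iter().map(|r| r.len()).sum();
    let mut out = Vec::with_capacity(total_rows * row_bytes);
    for r in ranges {
        out.extend_from_slice(&fused[r.start * row_bytes..r.end * row_bytes]);
    }
    out
}
```

For `key_dim = 4`, `value_dim = 8` and two ranks, rank 1 owns rows `2..4`, `6..8` and `12..16`. A uniform split of all 16 rows would instead hand it rows `8..16`, which is only the value region.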
rob thijssen	89d98d1fb2	diag(stage-8d-6): per-layer VRAM logging in TP load path

Wraps each TpQwen3_5DecoderLayer::load in a with_context that captures free/total VRAM on failure, plus an info-level log after every layer that succeeds. Uses cudarc::driver::result::mem_get_info, the same API mistralrs uses.

Diagnostic only: forward path is unchanged. Helps distinguish true VRAM exhaustion from allocator fragmentation when loading large models at BF16 on 2x consumer GPUs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:54:05 +03:00
rob thijssen	cc95fe28d9	feat(stage-8d-5b): wire fused_gdn_gating CUDA kernel

run_fused_gating helper consolidates the per-layer gating math:

    beta = sigmoid(b)
    g = -exp(a_log) * softplus(a + dt_bias)

CUDA path issues a single launch via fused_gdn_gating_cuda; cpu path falls back to the original per-op Rust sequence. Replaces ~10 candle launches per linear-attention layer (sigmoid + 2× to_dtype + exp + neg + broadcast_add + softplus + 2× unsqueeze + broadcast_mul) across both single-GPU and TP forward paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:52:38 +03:00
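The gating math in commit cc95fe28d9 reduces to two scalar formulas per element. Below is a minimal scalar reference in Rust, useful as a check against a fused kernel's output. It is not the repository's `run_fused_gating`; the function names and the overflow cut-off of 20.0 in `softplus` are choices made for this sketch.

```rust
/// Logistic sigmoid.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x); for large x it is numerically equal to x,
/// so the exponential is skipped there (the 20.0 cut-off is an assumption).
pub fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Per-element gating, returning `(beta, g)`:
///   beta = sigmoid(b)
///   g    = -exp(a_log) * softplus(a + dt_bias)
pub fn gdn_gating(a: f32, b: f32, a_log: f32, dt_bias: f32) -> (f32, f32) {
    let beta = sigmoid(b);
    let g = -a_log.exp() * softplus(a + dt_bias);
    (beta, g)
}
```

Since `exp` and `softplus` are both positive, `g` is always negative, and `beta` always lies strictly between 0 and 1.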
rob thijssen	09c945f81e	feat(stage-8d-4): dispatch chunked_gated_delta_rule_recurrence at prefill

run_delta_rule_cuda now picks between the per-token kernel and the BT=64 chunked variant based on seq_len. Threshold = 64 matches mistralrs. Prefill on Qwen3.6-27B (typical seq_len in the hundreds) drops from one block-launch per token to one per 64-token chunk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:50:30 +03:00
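To put a number on the "one per token" versus "one per 64-token chunk" comparison in commit 09c945f81e, here is a small counting sketch. It assumes a trailing partial chunk still costs one step; the names are hypothetical, and it deliberately does not model which side of the threshold `seq_len == 64` falls on, since the commit message does not say.

```rust
/// Chunk length along the sequence axis (BT=64 in the commit message).
pub const CHUNK_LEN: usize = 64;

/// Sequential steps for the per-token variant: one per token.
pub fn per_token_steps(seq_len: usize) -> usize {
    seq_len
}

/// Sequential steps for the chunked variant: one per started chunk
/// (assumption: a trailing partial chunk still costs one step).
pub fn chunked_steps(seq_len: usize) -> usize {
    (seq_len + CHUNK_LEN - 1) / CHUNK_LEN
}
```

For a 300-token prompt this gives 300 per-token steps against 5 chunked steps.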
rob thijssen	05dc0bad18	feat(stage-8d-3): wire causal_conv1d_update/full CUDA kernels

Replaces the per-layer conv1d + silu sequence in both single-GPU and TP linear-attention forward paths with a shared run_causal_conv1d helper that dispatches to:

- causal_conv1d_update for decode (seq_len=1 with existing conv_state)
- causal_conv1d_full for prefill / fresh start (zero-pads internally)

Both kernels fuse the depthwise conv + SiLU into a single launch, giving 4× fewer cuda launches per linear-attention layer vs the candle conv1d + candle_nn::ops::silu combo. Falls back to the original Rust path on cpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:49:41 +03:00
rob thijssen	10c151efa5	feat(stage-8d-5): wire gated_delta_rule_recurrence kernel into tp_qwen3_5

TP per-token Rust loop replaced with shared run_delta_rule dispatch from arch/qwen3_5/linear_attn.rs. Both single-GPU and TP variants now use the cuda kernel when available, per-token Rust fallback otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:44:12 +03:00
rob thijssen	44ae927e38	feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5

Replaces the per-token Rust delta-rule loop in `arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single dispatch to the `gated_delta_rule_recurrence` kernel imported from mistralrs in `1ebbe87`.

The kernel is V-tiled with compile-time BK (one block per (V-tile, batch*head), one thread per V-column, BK state floats in registers). For Qwen3.6's per-rank `(B=1, H=24, D_k=128, D_v=128)` shape this collapses ~6 candle tensor-op launches per token per layer (each ~50µs CUDA dispatch overhead, so ~300µs/token/layer × 48 linear-attention layers ≈ 14.4ms per token in launch overhead alone) to a single kernel launch with full ILP / register residency.

New free function `run_delta_rule`:

- cuda branch (when q is on a CUDA device): flattens `(B, H, ...)` → `(BH, ...)`, dispatches the kernel via `crate::cuda::gdn::gated_delta_rule_recurrence_cuda`, reshapes outputs back to `(B, H, L, D_v)` and state to `(B, H, D_k, D_v)`.
- cpu fallback: the original per-token Rust loop, unchanged. Keeps cargo test --workspace passing on hosts without cuda.

Dispatch decision lives in the wrapper (`q.device().is_cuda()`).

Build: `cargo build -p neuron --features cuda` compiles + links; clippy clean on both CPU and cuda paths. 32 lib tests still pass (none of them exercise this code path on cuda; smoke test for the TP variant is the deployed Tbilisi probe).

Stage 8d-3 wires the conv1d kernels; 8d-4 the chunked prefill; 8d-5 the same wiring for `tp/tp_qwen3_5.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:39:30 +03:00
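The launch-overhead estimate in commit 44ae927e38 is a three-factor product, spelled out below. The inputs are the approximate figures quoted in the commit message (about 6 launches, about 50µs each, 48 linear-attention layers); the function name is hypothetical.

```rust
/// Launch overhead per generated token, in microseconds:
/// launches per token per layer × dispatch cost per launch × layers.
pub fn launch_overhead_us(launches_per_layer: u64, dispatch_us: u64, layers: u64) -> u64 {
    launches_per_layer * dispatch_us * layers
}

/// Convert microseconds to milliseconds for reporting.
pub fn us_to_ms(us: u64) -> f64 {
    us as f64 / 1000.0
}
```

With the quoted figures, `launch_overhead_us(6, 50, 48)` is 14,400µs, about 14.4ms per token. Collapsing to one launch per layer under the same 50µs assumption gives `launch_overhead_us(1, 50, 48)`, or 2.4ms.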
rob thijssen	1ebbe87651	feat(stage-8d-1): import mistralrs GDN CUDA kernels, build infra only

Stage 8d (new): port the Gated DeltaNet CUDA kernels from EricLBuehler/mistral.rs to close the ~500x decode performance gap we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path vs ~37 T/s in mistralrs on the same hardware).

This commit lays the build infrastructure with zero behavioural change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the qwen3_5 architecture and TP variant.

Added:

- `crates/neuron/build.rs`: uses `cudaforge::KernelBuilder` to compile every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda` feature, then links it + `cudart`. Mirrors mistralrs's `mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu`: five kernels ported verbatim from upstream:
  * `gated_delta_rule_recurrence` (V-tiled per-token decode)
  * `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
  * `causal_conv1d_update` (single-token conv decode)
  * `causal_conv1d_full` (multi-token conv prefill)
  * `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) * softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs`: Rust wrappers around the kernels, cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs`: `extern "C"` decls (subset of upstream's ffi.rs covering only the five GDN kernels; MoE / SSM / top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs`: re-exports + module docs.

Cargo wiring: `cudaforge` added as an optional build-dep, activated by the `cuda` feature. CPU build is unchanged (the `cuda/` module is fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the patched container compiles `gdn.cu` (1 of 1 kernels) and links clean.

Licensing: upstream files preserve their MIT origin via per-file comment banners pointing to the mistralrs path. No behaviour-relevant edits to the .cu kernels; local diff against upstream is just the banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream; their structure (module path `crate::cuda::ffi::*`) matches identically so future kernel imports drop in unchanged.

CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean inside the runner container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:34:11 +03:00
rob thijssen	70eb6af42b	feat(tp): cancellation-safe inference + structured tracing

Two changes addressing operator visibility into TP inference + the HTTP-cancellation poisoning chain:

1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When the HTTP client disconnects (curl --max-time, browser nav, etc.) the future returned from `chat_completion_tp` gets dropped, but the spawned task keeps running to completion, finishing every `pool.generate_step` / `pool.clear_kv_cache` to drain the worker pipes. The next inference request then finds a clean pool.

   Previously: dropped future left workers still processing the in-flight request, the next call's `ClearKvCache` recv would read the stale `GenerateStepOk` from the abandoned step ("rank N expected KvCacheCleared, got GenerateStepOk"). The drain-on-leader-error fix from `d1a4aad` covered Rust-side leader failures but not HTTP-layer cancellation, which is what we actually hit on the user's Qwen3.6 test.

2. Tracing throughout the TP path so journalctl shows where an inference spends its time without needing to surface harness internals via the HTTP error body:

   - `chat_completion_tp_inner` (now a free fn so it can run inside spawn): `info` at request start (prompt_len, max_new, temp, top_p, eos_id), `info` per major phase (prefill complete with elapsed_ms, decode complete with elapsed_ms + token count), `info` at completion (total_ms, finish_reason). `debug` for pool-lock acquisition + kv-cache clear timing. `trace` per decode step (next_token, step_ms).
   - `WorkerPool::generate_step` (leader side): `debug` at fan-out, `debug` after leader forward returns with elapsed_ms + ok flag, `debug` after drain with errors count + total_ms.
   - `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain.
   - `worker::handle_generate_step`: `debug` at forward start + done with elapsed_ms, `warn` on forward failure with the full error.

The default log filter is already `info,neuron=debug` so the operator gets every `info` and `debug` line by default; `trace` needs RUST_LOG=trace for per-step decode timing.

Stage 7c-ii crash-detection is still future work; this is the minimum that makes the "where did the 120s go" question answerable from the logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 08:22:00 +03:00
rob thijssen	d1a4aad91d	fix(tp): always drain worker responses on leader failure

The TP-2 inference probe against Qwen3.6-27B surfaced:

    worker rank 1 ClearKvCache: expected KvCacheCleared, got GenerateStepOk

Caused by pipe poisoning. The previous shape of `generate_step`:

    for w in workers { w.send_only(GenerateStep) }   // 1. fan-out
    let logits = spawn_blocking(leader.forward)??;   // 2. early return on err
    for w in workers { w.recv_only() }               // 3. drain (skipped on 2's err)

When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen before, an OOM, a downstream squeeze that didn't match the shape), the function bailed before step 3, but workers had already written `GenerateStepOk` to their stdout pipes, since their forwards (and the NCCL collectives inside) completed independently of the leader's post-collective Rust-side work. The next call (typically `ClearKvCache` at the start of the next inference request) would then send a fresh request and read those stale replies as if they were the new operation's. Once a pipe is poisoned, every subsequent call surfaces the same shape of error even though nothing's actually broken.

Fix: introduce two helpers in `tp/mod.rs`:

- `drain_workers(workers, check)`: reads exactly one response from every worker regardless of individual outcomes. Returns `Vec<String>` of `rank N: detail` strings for any non-OK reply.
- `combine_leader_workers(leader, worker_errs, op)`: folds the leader's `Result<Result<T>>` (the spawn_blocking shape) with the worker drain into a single `Result<T>`. Leader failure takes precedence but worker errors get appended so both halves surface.

`generate_step` and `clear_kv_cache` now use this pattern. Worst case: both halves fail and the operator sees a combined error message; either way the pipes are always drained so the next call's recv matches the request it sent.

Note: the model is still poisoned in the current state: the operator needs to either `POST /models/unload` + reload, or `systemctl restart neuron`, to recover. The fix prevents future desync; it doesn't repair existing stale pipe state.

Stage 7c-ii crash detection was tracked as the canonical solution to this class of issue; this is the minimum-viable subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 07:39:36 +03:00
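The drain-then-combine pattern from commit d1a4aad91d can be sketched without any of the real pipe or RPC types. Everything below is illustrative: the `WorkerPipe` trait and `ScriptedWorker` test double are hypothetical, the leader result is simplified from the `Result<Result<T>>` spawn_blocking shape to a single `Result<T, String>`, and the `check` argument of the real `drain_workers` is folded into `recv_reply`.

```rust
/// Hypothetical stand-in for a worker pipe. `recv_reply` returns Ok(()) when
/// the reply matches the request that was sent, Err(detail) otherwise.
pub trait WorkerPipe {
    fn rank(&self) -> usize;
    fn recv_reply(&mut self) -> Result<(), String>;
}

/// Read exactly one reply from every worker, collecting failures instead of
/// returning on the first one, so no pipe is left with an unread reply.
pub fn drain_workers<W: WorkerPipe>(workers: &mut [W]) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers.iter_mut() {
        if let Err(detail) = w.recv_reply() {
            errs.push(format!("rank {}: {}", w.rank(), detail));
        }
    }
    errs
}

/// Fold the leader's result and the drained worker errors into one result.
/// The leader failure is reported first; worker errors are appended.
pub fn combine_leader_workers<T>(
    leader: Result<T, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    match (leader, worker_errs.is_empty()) {
        (Ok(v), true) => Ok(v),
        (Ok(_), false) => Err(format!("{op}: workers failed: {}", worker_errs.join("; "))),
        (Err(e), true) => Err(format!("{op}: leader failed: {e}")),
        (Err(e), false) => Err(format!(
            "{op}: leader failed: {e}; workers failed: {}",
            worker_errs.join("; ")
        )),
    }
}

/// Test double that replays one scripted reply and counts how often it is read.
pub struct ScriptedWorker {
    pub rank: usize,
    pub reply: Result<(), String>,
    pub reads: usize,
}

impl WorkerPipe for ScriptedWorker {
    fn rank(&self) -> usize {
        self.rank
    }
    fn recv_reply(&mut self) -> Result<(), String> {
        self.reads += 1;
        self.reply.clone()
    }
}
```

The point of the design is the ordering: the caller runs the leader step, then always calls `drain_workers`, and only afterwards decides what to return. An early `?` on the leader result before the drain is what left stale replies in the pipes.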
rob thijssen	95dc8745eb	feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5) Adds `harness/tp/tp_qwen3_5.rs`, the tensor-parallel variant of the Qwen3-Next architecture, plus the dispatch wiring needed to route a load through it on both the leader and the workers. Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for the full-attention layers + a new pattern for linear-attention): - TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size` V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`, `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along the V-head dim. `out_proj` is row-parallel + AllReduce (the only collective inside the block).
The recurrent state shards 1:1 with V-heads — no cross-rank sync inside the delta-rule loop. `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three regions along dim 0 (`[first key_dim, second key_dim, value_dim]`). Standard uniform-slicing doesn't align with the head boundaries — rank 0 would end up with `[first half of K_0, full K_1, first half of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full tensor, narrow per-region per-rank, and `Tensor::cat` the three slices into a per-rank fused weight. Transient peak of one full tensor per layer during construction; net memory is properly per- rank after the full drops. - TpQwen3_5Attention: column-parallel `q_proj` (the widened `2 * num_heads * head_dim` output, including the gate half — shards along the head axis so both query AND gate halves stay consistent per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce. Otherwise mirrors `tp_qwen3.rs`'s attention. - TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types), TpQwen3_5Model (with `model.language_model.` prefix), and TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level). Dispatch wiring: - New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant. `WorkerPool::load_dense_shard` now dispatches on `model_type` from the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two downstream methods (`generate_step`, `clear_kv_cache`) thread this enum through — the inner forward+clear_kv_cache dispatch happens via the enum's pub methods. Adding another TP architecture later is one more enum variant + match arms. - Worker side gets a parallel `WorkerModel` enum + dispatch in `handle_load_dense_shard`, branching on the same `model_type`. - Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`. `TpLoadedModel.leader_model` retyped to the enum. Helpers in `arch/qwen3_5/linear_attn.rs`: - `softplus` and `repeat_interleave` made `pub(crate)` so the TP module reuses them rather than duplicating. 
Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated `Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial RoPE with `partial_rotary_factor` already correct). CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda` also clean inside the patched runner container. Single inflight risk to call out: tensor names. For full-attention layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.` and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*` — the same as the single-GPU path. lm_head sits at the top level (not under `language_model`) — consistent with the single-GPU path that validated against Qwen3.5-0.8B. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:02:42 +03:00
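The per-region, per-rank slicing of the fused `in_proj_qkv` / `conv1d.weight` tensors described in the commit above comes down to index arithmetic along dim 0. A minimal sketch, assuming both `key_dim` and `value_dim` divide evenly by `world_size`; the function name `fused_qkv_ranges` is hypothetical and is not one of the `load_fused_qkv_slice_{2d,3d}` helpers.

```rust
/// For a fused tensor laid out along dim 0 as
/// `[first key_dim, second key_dim, value_dim]`, return the three
/// `(start, len)` row ranges that belong to `rank`.
/// Returns `None` if the dimensions do not shard evenly.
pub fn fused_qkv_ranges(
    key_dim: usize,
    value_dim: usize,
    rank: usize,
    world_size: usize,
) -> Option<[(usize, usize); 3]> {
    if world_size == 0
        || rank >= world_size
        || key_dim % world_size != 0
        || value_dim % world_size != 0
    {
        return None;
    }
    let k_len = key_dim / world_size; // rows per rank in each key region
    let v_len = value_dim / world_size; // rows per rank in the value region
    Some([
        (rank * k_len, k_len),                 // slice of the first key region
        (key_dim + rank * k_len, k_len),       // slice of the second key region
        (2 * key_dim + rank * v_len, v_len),   // slice of the value region
    ])
}
```

Each rank would narrow the full tensor to these three ranges and concatenate them, which is why a single uniform slice over dim 0 cannot produce the same result.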
rob thijssen	495d3f7c05	fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule

The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs: BF16`. Trace through the recurrent delta-rule loop:

    let q = (q.to_dtype(F32)? * scale)?;   // F32
    let k = k.to_dtype(F32)?;              // F32
    let v = v.to_dtype(F32)?;              // F32
    // g built from A_log/dt_bias          // F32
    // beta = sigmoid(b)                   // BF16 (sigmoid preserves dtype)
    ...
    let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
                ^^^^^^^^^^^^^                 ^^^^^^^^^
                F32                           BF16 ← mismatch

`g` was already F32 because it was constructed from `a_log.to_dtype(F32)` + `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from `sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16 and the multiplication tripped candle's dtype-mismatch check. Promote beta to F32 at the same point we promote q/k/v.

Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on beast: load returned 200, then `POST /v1/chat/completions` returned the dtype error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 21:13:19 +03:00
rob thijssen	5c4c8e0eba	fix(qwen3_5): tensor names are under `model.language_model.`, not `model.`

Qwen3-Next is a multimodal architecture whose text core sits under `model.language_model.`, sibling to `model.visual.` (vision tower) and to top-level `lm_head` / `mtp.`. Every text-side tensor in the safetensors files carries that prefix:

    model.language_model.embed_tokens.weight
    model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
    model.language_model.layers.{i}.linear_attn.{in_proj_, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
    model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
    model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
    model.language_model.norm.weight
    lm_head.weight (top-level; not under language_model)

The single pre-emptive fix is in Qwen3_5Model::load: derive a `text_vb = vb.pp("model.language_model")` once and walk embed_tokens / layers / norm from there. `lm_head` stays at the top-level VB; that path was already correct.

The non-text tensors (`model.visual.`, `mtp.`) are ignored: we don't reference them, so the safetensors mmap is fine even though the bytes are mapped into the address space.

After this, the load that was failing at "cannot find tensor model.embed_tokens.weight" should proceed to materialising the actual layer weights, where any further bugs will be substantive architecture issues rather than naming ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 16:48:16 +03:00
rob thijssen	07c44d5db1	fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25 Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and the same applies to Qwen/Qwen3.6-27B): 1. Qwen3-Next config.json does NOT have a top-level `rope_theta`. It lives inside `rope_parameters: { rope_theta, partial_rotary_factor, rope_type, mrope_section, mrope_interleaved }`. Our TextConfig declared `rope_theta` as a non-optional top-level field, so the deserializer bailed with the misleading "missing field `rope_theta` at line 74 col 5". Replaced with a nested `RopeParameters` struct that mirrors the upstream shape.
Defaults are conservative (rope_theta=10000, partial_rotary_factor=1.0) so a missing or partial block degrades to standard full-rotation RoPE rather than failing. 2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of the 256 head_dim values get RoPE applied — the rest pass through unchanged. Our RotaryEmbedding was building the inv_freq table for the full head_dim and rotating everything. Silently wrong for every full-attention layer. `RotaryEmbedding` now derives `rotary_dim` from `head_dim * partial_rotary_factor`, builds its cos/sin tables at that smaller size, and in `apply()` splits q/k into (rotate, pass) on the last dim, only `rope_slow`-rotates the rotate half, and re-concatenates. Mirrors the reference Python's `apply_rotary_pos_emb` exactly for the non-trivial `partial_rotary_factor` case. Tests updated: config-deserialise fixture uses the real `rope_parameters` shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The linear-attention forward-smoke test was already using full rotation which still works; just shifted to the nested struct. After this, the load that previously failed at "parse Qwen3-Next (qwen3_5) config.json: missing field rope_theta" should reach the actual safetensors materialisation step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 16:18:52 +03:00
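The partial-rotation behaviour described in the commit above (rotate only the first `head_dim * partial_rotary_factor` values, pass the rest through) can be sketched on a single head vector without any tensor library. This is an assumption-laden illustration, not the `RotaryEmbedding` code: the function names are hypothetical, and it uses the rotate-half pairing `(i, i + rotary_dim / 2)` with `inv_freq[i] = theta^(-2i / rotary_dim)`.

```rust
/// Number of leading head-dim values that receive RoPE.
pub fn rotary_dim(head_dim: usize, partial_rotary_factor: f64) -> usize {
    (head_dim as f64 * partial_rotary_factor) as usize
}

/// Apply rotate-half RoPE to the first `rotary_dim` values of one head vector
/// at position `pos`; values past `rotary_dim` are returned unchanged.
pub fn apply_partial_rope(x: &[f32], pos: usize, rotary_dim: usize, theta: f32) -> Vec<f32> {
    assert!(rotary_dim <= x.len() && rotary_dim % 2 == 0);
    let half = rotary_dim / 2;
    let mut out = x.to_vec(); // the pass-through tail is already in place
    for i in 0..half {
        let inv_freq = theta.powf(-(2.0 * i as f32) / rotary_dim as f32);
        let angle = pos as f32 * inv_freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        out[i] = a * cos - b * sin;
        out[i + half] = b * cos + a * sin;
    }
    out
}
```

With `head_dim = 256` and `partial_rotary_factor = 0.25`, `rotary_dim` evaluates to 64, matching the figure quoted in the commit message.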
rob thijssen	e7eb3dab6a	feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5 Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's architecture). The four new modules wrap the substantive `linear_attn.rs` (landed previously) with the rest of the transformer: - `arch/qwen3_5/rope.rs`: text-side rotary embedding. MRoPE is simplified to plain RoPE (the three position grids collapse to one for text-only inference); uses candle's `rope_slow` for the GLM-style rotate-half rotation. - `arch/qwen3_5/mlp.rs`: Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two Qwen3-Next quirks: - `q_proj` widened to `2 * num_heads * head_dim`; second half sigmoid'd and multiplied into the attention output before `o_proj`. - q_norm/k_norm use the `(1+w)*x` RmsNorm variant. - `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on `layer_types[i]` to either Full attention or GatedDeltaNet. `arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's `squeeze_to_vocab` handles either shape. Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now builds a `ShardedVarBuilder` instead of a plain VarBuilder. The sharded backend falls through to the unsharded path when `world_size=1`, so single-GPU load is zero-cost; this lets the forthcoming `tp_qwen3_5.rs` reuse the same load functions without a second copy. Verified: cargo build CPU + --features cuda inside the patched container; clippy clean on both; 32 lib tests still pass. The ForCausalLM forward no longer bails — but numerical correctness vs the Python reference hasn't been validated yet (that's the next step, with the Tbilisi probe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 15:52:33 +03:00
rob thijssen	180274548d	feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet) Implements the recurrent-path Gated DeltaNet block that occupies 48 of Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`). Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py` (`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`, `Qwen3_5RMSNormGated`, `l2norm`).
Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules

- `mod.rs`: Config + (still-stub) ForCausalLM
- `linear_attn.rs`: GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs`: Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm

Architecture pieces in this commit:

- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj (all bias=False); depthwise causal Conv1d (k=4) with state-aware prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b). All computed in f32 to avoid the -inf underflow in fp16 that the reference notes.
- Delta rule (recurrent, per-token):

      state *= exp(g_t)
      kv_mem = state^T · k_t
      delta  = (v_t - kv_mem) * beta_t
      state += outer(k_t, delta)
      out_t  = state^T · q_t

- Output: RMSNormGated(core_attn_out, z), then reshape, then out_proj.

State (`GatedDeltaNetState`) lives inline on the layer:

- conv_state: (B, conv_dim, conv_kernel_size), the left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim), the delta-rule outer-product memory.

Cleared via `clear_kv_cache` at the start of every new request.

Config extended with the qwen3_5-specific fields:

- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads (16)
- linear_key_head_dim (128)
- linear_value_head_dim (128)
- linear_conv_kernel_dim (4)
- hidden_act ("silu")

Performance note: this is the recurrent delta-rule (PyTorch's `torch_recurrent_gated_delta_rule`), correct for any seq_len but O(L) prefill. The chunked algorithm (`torch_chunk_gated_delta_rule`, chunk_size=64) is a follow-up perf optimisation; surface stays the same.

8 unit tests:

- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture): verifies shape + no NaN/Inf propagation through the f32-promotion pipeline. Doesn't validate numerical correctness against the Python reference; that requires a fixed-weight fixture and is the next step.
cargo clippy CPU + --features cuda both clean; 32 lib tests pass. The ForCausalLM stub still bails on forward — wrapping attention/MLP/decoder layer + lm_head is the next sub-stage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:29:52 +03:00
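The per-token delta-rule recurrence listed in the commit above can be written out for a single head on plain `Vec<f32>` state. This is a sketch of the update equations only, assuming a `[head_k_dim][head_v_dim]` state layout; the function name `delta_rule_step` is hypothetical, and batching, multiple heads, and the candle tensor types are all left out.

```rust
/// One recurrent step of the gated delta rule for a single head.
/// `state` is `[head_k_dim][head_v_dim]`; `q`/`k` have `head_k_dim` values,
/// `v` has `head_v_dim` values. Returns `out_t` (`head_v_dim` values).
pub fn delta_rule_step(
    state: &mut [Vec<f32>],
    q: &[f32],
    k: &[f32],
    v: &[f32],
    g: f32,
    beta: f32,
) -> Vec<f32> {
    let dv = v.len();

    // state *= exp(g_t)
    let decay = g.exp();
    for row in state.iter_mut() {
        for s in row.iter_mut() {
            *s *= decay;
        }
    }

    // kv_mem = state^T · k_t
    let mut kv_mem = vec![0.0f32; dv];
    for (row, &ki) in state.iter().zip(k) {
        for (m, &s) in kv_mem.iter_mut().zip(row) {
            *m += s * ki;
        }
    }

    // delta = (v_t - kv_mem) * beta_t
    let delta: Vec<f32> = v
        .iter()
        .zip(&kv_mem)
        .map(|(&vj, &mj)| (vj - mj) * beta)
        .collect();

    // state += outer(k_t, delta)
    for (row, &ki) in state.iter_mut().zip(k) {
        for (s, &dj) in row.iter_mut().zip(&delta) {
            *s += ki * dj;
        }
    }

    // out_t = state^T · q_t
    let mut out = vec![0.0f32; dv];
    for (row, &qi) in state.iter().zip(q) {
        for (o, &s) in out.iter_mut().zip(row) {
            *o += s * qi;
        }
    }
    out
}
```

Calling this once per token in sequence order is what makes the recurrent path linear in sequence length during prefill.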
rob thijssen	a70f317729	feat(stage-8c): scaffold qwen3_5 (Qwen3.6): dispatch + stubs + TP gate Lays the wiring for the top-priority TP-2 target without doing the substantive architecture work yet. After this commit, attempting to load a Qwen3.6 (`model_type = "qwen3_5"`) model: - Passes config.json parse: the real upstream shape (text_config wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round-trips through a typed Config (unit test included). - Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a ModelArch::Qwen3_5Dense variant, registers it in the loaded set.
- Fails on the first inference forward with a clear "Qwen3-Next forward not implemented yet (Stage 8c, TP-2 motivator)" — the point where the real architecture work begins. New layout: - `harness/arch/` for custom architectures candle-transformers doesn't ship. Each architecture is one module: Config + ForCausalLM + impl. - `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the open work: layer_types dispatch (full_attention vs linear_attention, the latter being the hard part with no candle precedent), attn_output_gate, text_config nesting, recurrent state lifecycle. - DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a branch that constructs the stub. TP-side gate: - New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the worker pool's `load_dense_shard` reconstructs the config as Qwen3 on every rank — silently misrouting a non-Qwen3 dense load through it would surface as a cryptic per-rank deserialise error. - TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else bails before the worker pool spawns and NCCL handshake costs are paid, with a marker pointing at the `tp_<family>.rs` module a contributor would need to add. qwen3_5 specifically lands here until its architecture is real. The naming choice: keep "qwen3_5" from the model's own config.json rather than mistralrs's "qwen3_next" — the latter ages poorly the moment Qwen ship another architecture revision. Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate); the previously-rejecting test for qwen3_5 swapped to a fictional arch so it stays meaningful as the supported set grows. 26 lib tests pass; cargo clippy CPU + --features cuda both clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:58:01 +03:00
rob thijssen	c6022aa6b9	feat(stage-8b): Llama + Qwen3 MoE families on the candle harness Broadens the single-GPU dense and quantized paths to cover three non-Qwen3 architectures already shipped by candle-transformers. TP for these is a separate stage (each family would need its own tp_*.rs mirroring tp_qwen3.rs).
`ModelArch` gains four variants: - LlamaDense (boxed — wraps Llama + an inline Cache + the config it takes to rebuild the cache, since candle::llama::Cache has no reset) - LlamaQuantized (candle_transformers::models::quantized_llama) - Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM) - Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE — takes an explicit compute dtype; F16 by default for best consumer-GPU throughput) The dispatch is method-based now: - `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>` with a shared `squeeze_to_vocab` normalising shape differences (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families may differ again — the helper handles all of them). - `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs a Cache rebuild because its Cache has no in-place reset; the new `LlamaDense` wrapper holds the bits needed to do it. `run_inference` / `run_inference_streaming` collapse to a single dispatch path: no more per-variant match arms in the hot loop, and new architectures pick up streaming + non-streaming for free with zero changes outside `ModelArch`. DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"]. GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with no underscore matches llama.cpp's general.architecture convention). Stage 8a's diagnostic auto-reports the new supported set. The `LlamaDense` variant is boxed because the wrapper's inline Cache + Config makes it 544 bytes vs ~300 for everything else (clippy::large_enum_variant). Verified: cargo test --workspace passes 66 tests; cargo clippy CPU and `--features cuda` both clean (the cuda check ran inside the locally-built `neuron-build-local` container with the math_functions.h patch applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:36:22 +03:00
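The method-based dispatch with a shared shape normaliser, as described in the commit above, might look roughly like this when reduced to shapes alone. Everything here is a stand-in: the two variants carry only a vocabulary size, `forward_shape` is a hypothetical substitute for the real `forward`, and shapes are plain `usize` slices instead of tensors.

```rust
/// Normalise logits shapes `[B, V]` and `[B, 1, V]` to `[B, V]`.
pub fn squeeze_to_vocab(shape: &[usize]) -> Result<[usize; 2], String> {
    match shape {
        [b, v] => Ok([*b, *v]),
        [b, 1, v] => Ok([*b, *v]),
        other => Err(format!("unexpected logits shape {:?}", other)),
    }
}

/// Hypothetical two-variant stand-in for the real `ModelArch` enum.
pub enum ModelArch {
    Qwen3Dense { vocab: usize },
    Qwen3Quantized { vocab: usize },
}

impl ModelArch {
    /// The single dispatch point: each variant reports its raw output shape,
    /// and the shared helper hides the difference from callers.
    pub fn forward_shape(&mut self, batch: usize) -> Result<[usize; 2], String> {
        let raw: Vec<usize> = match self {
            ModelArch::Qwen3Dense { vocab } => vec![batch, 1, *vocab],
            ModelArch::Qwen3Quantized { vocab } => vec![batch, *vocab],
        };
        squeeze_to_vocab(&raw)
    }
}
```

Because the inference loop only ever calls the enum's methods, adding a variant touches the `match` arms and nothing else, which is the property the commit message calls out.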
rob thijssen	9e31d8deca	feat(stage-8a): pre-flight architecture check for dense model loads A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the dense path was failing deep inside serde with: missing field `vocab_size` at line 140 column 1 …because Qwen3.6 wraps its actual hyperparameters under `text_config`, so none of `qwen3::Config`'s expected top-level fields are present. The error gave no hint that the architecture was the problem. `check_dense_config_supported` parses `config.json` as an untyped JSON Value, inspects `model_type` (with `architectures` as bonus context), and bails cleanly when it's not in the supported set (currently `["qwen3"]`).
The error names the rejected type, the supported set, and points at the files a contributor needs to touch to extend coverage — both the single-process `ModelArch` variants in `candle.rs` and the TP analogue in `tp_qwen3.rs`. Wired into both load paths: - `load_arch_dense` (single-GPU), before the typed deserialize. - `load_tp`, before spawning the worker pool — TP loads of an unsupported arch now fail before NCCL/init costs are paid. 4 unit tests cover the accept/reject/missing-field/malformed cases. Bonus: makes Stage 8b/8c work easier — adding a new architecture is now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load branch, with the diagnostic auto-correctly listing the supported set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:27:29 +03:00
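The core of the pre-flight check described in the commit above is a membership test that fails early with an actionable message. A minimal sketch, assuming the `model_type` string has already been pulled out of `config.json`; the function name `check_model_type_supported` and the error wording are hypothetical, and the JSON parsing done by `check_dense_config_supported` is omitted.

```rust
/// Supported set as of this commit, per the commit message.
pub const DENSE_SUPPORTED_MODEL_TYPES: &[&str] = &["qwen3"];

/// Reject unsupported or missing `model_type` values before any typed
/// deserialisation is attempted, naming both the rejected type and the
/// supported set in the error.
pub fn check_model_type_supported(model_type: Option<&str>) -> Result<(), String> {
    match model_type {
        Some(t) if DENSE_SUPPORTED_MODEL_TYPES.contains(&t) => Ok(()),
        Some(t) => Err(format!(
            "unsupported model_type `{}`; supported: {:?}",
            t, DENSE_SUPPORTED_MODEL_TYPES
        )),
        None => Err(format!(
            "config.json has no `model_type`; supported: {:?}",
            DENSE_SUPPORTED_MODEL_TYPES
        )),
    }
}
```

Building the message from the constant itself is what lets the diagnostic list the supported set correctly as that set grows in later stages.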


79 Commits