cortex

Author	SHA1	Message	Date
rob thijssen	e267f583e1	chore(neuron): rustfmt drift in is_device_fault test One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:55 +03:00
rob thijssen	e23d5011d0	feat(helexa-acp): scaffold ACP bridge with provider trait + OpenAI chat Adds a new workspace crate `helexa-acp` (binary, Apache-2.0) — the start of "the missing ACP binary" for multi-endpoint LLM setups mixing public APIs, private LAN deployments, and various wire formats. Today it speaks OpenAI /v1/chat/completions; the Provider trait is the seam that lets OpenAI Responses, Anthropic /v1/messages, and other wire formats slot in later without touching the agent loop. The crate is intentionally self-contained — no dependencies on the other workspace crates (cortex-core, cortex-gateway, neuron) — so a future migration to a dedicated GitHub repo is a Cargo.toml-only change. All deps come from crates.io. This commit lands: * `config.rs` — TOML config at $XDG_CONFIG_HOME/helexa-acp/config.toml with multi-endpoint support (each `[[endpoints]]` declares its name, base_url, wire_api, default_model, optional API key / api_key_env). Falls back to env-only single-endpoint config when no TOML exists (HELEXA_ACP_BASE_URL, HELEXA_ACP_MODEL, etc.). The `endpoint:model` selector syntax is validated and tested. * `provider/mod.rs` — `Provider` trait + provider-agnostic types (`CompletionRequest`, `CompletionEvent`, `Message`, `ToolCall`, `ToolSpec`, `Role`, `UsageStats`). Agent loop consumes these without knowing the wire format on the other side. * `provider/openai_chat.rs` — `OpenAIChatProvider` impl. Compatible with cortex, LM Studio, Ollama (compat mode), OpenRouter, OpenAI itself. Streams via reqwest + eventsource-stream + async-stream. Surfaces text deltas, reasoning deltas (for models that emit `reasoning_content`), tool-call lifecycle (start, args-delta, completion), usage, finish reason. Cancellation-token aware. * `main.rs` — tokio + stderr-only tracing-subscriber + Stdio transport. Builds a provider per configured endpoint at startup, surfacing config mistakes before the editor even initializes. 
Currently responds to `initialize`; everything else stubs to `not implemented yet` until the agent loop lands in the next commit. 12 unit tests pass — encoder shape, decoder shape (text-only, tool-call progressive, cancellation, malformed-chunk recovery), config parsing (multi-endpoint TOML, env fallback, validation). The `#![allow(dead_code)]` on `provider/mod.rs` is temporary — the agent loop in the next commit reads every field. It's noted in the module-level docstring so the next reader knows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 08:13:47 +03:00
rob thijssen	249b2e5c98	fix(neuron): only poison the model on actual device faults Previously every inference Err — shape mismatch, NaN logits, tokenizer error, missing handle — marked the model poisoned and rejected every subsequent request until an operator unload+reloaded. The benjy incident on 2026-05-27 showed how this misfires: a concurrency bug produced a `broadcast_add: shape mismatch` error that had nothing to do with CUDA, but the model was taken down anyway. Add `is_device_fault(err_chain: &str)` — a conservative classifier that returns false only for errors we know are pre-kernel / CPU-side (shape mismatches, NaN logits, tokenize/detokenize, missing handle, DecodeStream, empty prompt).
Everything else defaults to true so a genuine driver fault still poisons. Applied at all six poisoning sites: - chat_completion CUDA worker path - chat_completion CPU spawn_blocking path - chat_completion_stream CUDA worker path - chat_completion_stream CPU spawn_blocking path - chat_completion_tp non-streaming wrapper - chat_completion_tp_stream spawned task Each site now logs either "model marked poisoned" (device fault) or "model NOT marked poisoned" (non-device) so the journal makes the classification visible. Tests cover the known non-device patterns and a couple of real CUDA driver messages. Pairs with the inference_lock commit (`c59da83`): together they eliminate both the cause of the spurious-poisoning we just observed (the shape mismatch) AND the over-reaction to it (the unconditional poison). Each fix is independently useful but the combination is what makes the system actually robust to concurrent agent workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:57:48 +03:00
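A minimal sketch of a classifier in the spirit of `is_device_fault`, assuming it works by substring matching on the error chain. The pattern list and the matching strategy are illustrative guesses; the commit message does not show the real implementation.

```rust
/// Hypothetical substrings assumed to mark pre-kernel / CPU-side failures.
/// The real list in neuron may differ.
const NON_DEVICE_PATTERNS: &[&str] = &[
    "shape mismatch",
    "logits unhealthy",
    "nan logits",
    "tokeniz", // covers tokenize, detokenize, tokenizer
    "missing handle",
    "decodestream",
    "empty prompt",
];

/// Returns false only when the error chain matches a known CPU-side pattern.
/// Anything unrecognised is conservatively treated as a device fault.
pub fn is_device_fault(err_chain: &str) -> bool {
    let lowered = err_chain.to_lowercase();
    !NON_DEVICE_PATTERNS.iter().any(|p| lowered.contains(p))
}
```

The default-to-true shape is the point: a new, unclassified error still poisons the model, so a misclassification can only cost an unnecessary reload, never a request served on a broken device context.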
rob thijssen	c59da83636	fix(neuron): serialise single-GPU inference per loaded model Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:54:04 +03:00
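The sequence-boundary problem can be sketched with the standard library alone. This is an analogy, not the neuron code: `std::sync::Mutex` and OS threads stand in for `tokio::sync::Mutex` and async tasks, and the `LoadedModel` struct here is a hypothetical stand-in whose KV cache is just a `Vec<u32>`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a loaded model (hypothetical shape).
struct LoadedModel {
    /// Each cache operation is atomic on its own, like jobs on the worker channel.
    kv_cache: Mutex<Vec<u32>>,
    /// Guards the whole clear -> chunk0 -> chunk1 -> ... sequence.
    inference_lock: Mutex<()>,
}

/// Clears the cache and prefills it in chunks while holding the sequence lock.
/// Returns the cache length observed at the end of this request's prefill.
fn prefill(model: &LoadedModel, prompt: &[u32], chunk: usize) -> usize {
    let _guard = model.inference_lock.lock().unwrap();
    model.kv_cache.lock().unwrap().clear();
    for window in prompt.chunks(chunk) {
        model.kv_cache.lock().unwrap().extend_from_slice(window);
        thread::yield_now(); // invite interleaving; the sequence lock prevents it
    }
    let len = model.kv_cache.lock().unwrap().len();
    len
}

/// Fires two prompts concurrently and returns (cache_len_seen, prompt_len) per request.
fn run_concurrent(a: Vec<u32>, b: Vec<u32>) -> Vec<(usize, usize)> {
    let model = Arc::new(LoadedModel {
        kv_cache: Mutex::new(Vec::new()),
        inference_lock: Mutex::new(()),
    });
    let handles: Vec<_> = vec![a, b]
        .into_iter()
        .map(|prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || (prefill(&model, &prompt, 512), prompt.len()))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With the `_guard` line removed, the two requests can interleave their clear and extend steps, and a request may finish with a cache length that does not match its own prompt, which is the analogue of the score/mask shape mismatch described in the commit.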
rob thijssen	f05882369d	fix(neuron): don't poison the model on tokio JoinError panics CUDA driver failures propagate as Err through `?` and become `Ok(Err(InferenceError::Other(_)))` from the spawned task — those are real device faults and still poison the model. Tokio JoinError is different: it fires on Rust-level panic (tokenizer bug, sampler bug, serialisation, the UTF-8 slice that landed in commit `bd04d7f` before the fix) or task cancellation. Those don't touch the device context, so failing the one request without tearing down the model is correct. Two sites changed: - chat_completion's CPU spawn_blocking handler — JoinError no longer sets loaded.poisoned.
- chat_completion_tp's tokio::spawn wrapper — JoinError no longer sets tp_for_marker.poisoned. The inner-Err case still does. Each path logs the cause (panicked / was cancelled / ended abnormally) explicitly so the journal makes the new behaviour obvious — search for "model NOT marked poisoned" to find these events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:02:52 +03:00
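The three-way split between a clean result, an inner `Err`, and a join failure can be illustrated with `std::thread`, whose `join()` also reports a panic as an outer `Err`. This is an analogy under stated assumptions: the `Outcome` enum and `run_request` helper are hypothetical, and tokio's `JoinError` additionally covers cancellation, which plain threads do not model.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical request outcome, for illustration only.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed(String),
    /// The task panicked: fail this request, leave the model alone.
    RequestFailed,
    /// The task returned an error: treated here as a device fault.
    ModelPoisoned,
}

/// Runs `work` on its own thread and maps the result onto the poisoning policy.
pub fn run_request<F>(poisoned: &AtomicBool, work: F) -> Outcome
where
    F: FnOnce() -> Result<String, String> + Send + 'static,
{
    match thread::spawn(work).join() {
        Ok(Ok(text)) => Outcome::Completed(text),
        Ok(Err(_device_err)) => {
            // Inner Err: the work itself reported a failure, so poison.
            poisoned.store(true, Ordering::SeqCst);
            Outcome::ModelPoisoned
        }
        // Outer Err: a Rust-level panic that never reached the device.
        Err(_panic) => Outcome::RequestFailed,
    }
}
```

A panicking closure yields `Outcome::RequestFailed` and leaves the flag untouched, while a closure returning `Err(...)` sets it; the panic message still appears on stderr, much as the commit's explicit log lines keep the cause visible in the journal.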
rob thijssen	bd04d7f580	fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 18:01:24 +03:00
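The buffering idea behind this fix can be sketched without the tokenizers crate. The `Utf8Stream` type below is hypothetical and operates on raw bytes rather than token ids; it is not the `DecodeStream` API, only an illustration of holding back an incomplete codepoint instead of slicing a re-decoded string.

```rust
/// Buffers raw token bytes and releases text only once it forms complete UTF-8.
/// Hypothetical helper for illustration; not the tokenizers `DecodeStream` type.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Feeds one token's bytes and returns any newly completed text.
    pub fn step(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    out.push_str(s);
                    self.pending.clear();
                    break;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // Bytes up to `valid` were just validated, so this cannot fail.
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Truncated codepoint at the end: keep it for the next token.
                        None => {
                            self.pending.drain(..valid);
                            break;
                        }
                        // Genuinely invalid bytes: substitute U+FFFD and keep scanning.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + bad);
                        }
                    }
                }
            }
        }
        if out.is_empty() { None } else { Some(out) }
    }
}
```

Feeding the four bytes of 🌫 (`F0 9F 8C AB`) as two tokens of two bytes each emits nothing on the first step and the whole character on the second, so no caller ever indexes into the middle of a codepoint.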
rob thijssen	1e13889392	feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload cycle observed on beast under agent-zero workloads. Three changes, all keyed off env-driven knobs so an operator can tune without a rebuild: 1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The initial forward is split into N-token windows, each with a monotonically growing offset. KV cache accumulates across chunks exactly as it would under one big prefill; only the final chunk's logits are kept for sampling. Activation memory now scales with chunk size instead of prompt length, so a 13 k-token prompt stops holding tens of GB of intermediate activations live at once.
Wired into all six prefill call sites: - run_inference / run_inference_streaming (CPU path) - run_inference_via_worker / stream_inference_via_worker (CUDA single-GPU through device worker) - chat_completion_tp_inner / chat_completion_tp_stream (TP via WorkerPool) Three helpers — chunked_prefill_local, chunked_prefill_via_worker, chunked_prefill_tp — own the loop shape so the chunking semantics stay identical across paths. Per-chunk debug log shows progress. 2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384). Requests above the cap return a structured 400 with `code: prompt_too_long` rather than going through the prefill and discovering the limit by OOMing partway through. New InferenceError::PromptTooLong variant. 3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500). If `vram_free_mb` is below the threshold at request start (e.g. another concurrent request is mid-prefill), reject with a clean 503 + `code: insufficient_vram` rather than starting work that will OOM. New InferenceError::InsufficientVram variant. CPU loads (vram=0 sentinel) skip this check. All three gates fire BEFORE any device work, so a rejected request costs ~one tokenisation pass and never touches the worker thread — poison cascades from rejected work are now impossible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:46:54 +03:00
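The three knobs above can be sketched as follows. The environment variable names, defaults, and error variant names come from the commit message; the struct layout, the variant fields, and the helper names (`Limits`, `env_or`, `preflight`, `prefill_windows`) are assumptions made for illustration.

```rust
use std::env;
use std::str::FromStr;

/// Variant names follow the commit message; the fields are assumed.
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    PromptTooLong { tokens: usize, max: usize },
    InsufficientVram { free_mb: u64, min_mb: u64 },
}

/// Hypothetical container for the three env-driven knobs.
pub struct Limits {
    pub chunk_tokens: usize,
    pub max_prompt_tokens: usize,
    pub min_free_vram_mb: u64,
}

/// Reads a numeric knob from the environment, falling back to a default.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

impl Limits {
    pub fn from_env() -> Self {
        Limits {
            chunk_tokens: env_or("NEURON_PREFILL_CHUNK_TOKENS", 512),
            max_prompt_tokens: env_or("NEURON_MAX_PROMPT_TOKENS", 16384),
            min_free_vram_mb: env_or("NEURON_MIN_FREE_VRAM_MB", 1500),
        }
    }
}

/// Gates that run before any device work. `vram_free_mb == 0` is the
/// CPU sentinel and skips the VRAM gate.
pub fn preflight(limits: &Limits, prompt_tokens: usize, vram_free_mb: u64) -> Result<(), InferenceError> {
    if prompt_tokens > limits.max_prompt_tokens {
        return Err(InferenceError::PromptTooLong { tokens: prompt_tokens, max: limits.max_prompt_tokens });
    }
    if vram_free_mb != 0 && vram_free_mb < limits.min_free_vram_mb {
        return Err(InferenceError::InsufficientVram { free_mb: vram_free_mb, min_mb: limits.min_free_vram_mb });
    }
    Ok(())
}

/// Splits a prompt into (offset, window) pairs with monotonically growing offsets.
pub fn prefill_windows(prompt: &[u32], chunk_tokens: usize) -> Vec<(usize, &[u32])> {
    let chunk = chunk_tokens.max(1); // guard against a zero chunk size
    prompt.chunks(chunk).enumerate().map(|(i, w)| (i * chunk, w)).collect()
}
```

A caller would run `preflight` first, then feed each window to the forward pass with its offset and keep only the logits from the last window. A 1300-token prompt at the default chunk size gives windows at offsets 0, 512, and 1024.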
rob thijssen	6e1c1dd0fc	ci: retry clippy + test up to 3 times on spurious sccache failures sccache occasionally fails mid-compile with race-condition errors that clear on a re-run without any code changes. Rather than tracking that down right now, wrap the two affected steps in a bash loop that retries up to three times with a 5-second pause. Real failures still surface; they just take ~10s longer to fail. fmt is left as a single invocation — it's a one-shot syntactic check, not a build, and isn't subject to the same sccache races. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:55:18 +03:00
rob thijssen	35876954cd	chore(neuron): default tracing filter to info (was info,neuron=debug) Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:47:30 +03:00
rob thijssen	740299bd9d	chore(neuron/beast): switch default-model quant from q5k to q6k q5k produced NaN logits on Qwen/Qwen3.6-27B under candle TP=2 (sampler fell over with "logits unhealthy nan: 248320/248320"). q6k is the quant that worked well in production under mistral.rs on the same hardware, so it's the right baseline for verifying the mempool-trim fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:18 +03:00
rob thijssen	cdf0f4e66d	fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:36:13 +03:00
rob thijssen	c4954e0eed	docs: per-device worker thread architecture (phase 5 of refactor) Closes the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. CLAUDE.md: - New "Per-device worker thread (neuron)" section under Key design decisions, covering the three load-bearing properties (context locality, drop safety, poisoning blast radius), the CPU-fallback exception, and pointers to the canonical narrative in crates/neuron/src/harness/device_worker/mod.rs's module doc-comment. - New 2026-05-27 addendum dating the migration and naming the four PR commits (Phase 1: `081b532`, Phase 2: `b179204`, Phase 3: `76ab24d`, Phase 4: `b4f3576`).
Same convention as the 2026-04-15 and 2026-05-18 addenda. README.md: - One paragraph in "Node setup" noting the per-device thread pattern with a pointer to CLAUDE.md and the device_worker module. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:15:43 +03:00
rob thijssen	b4f3576d82	refactor(neuron): phase 4 — model loads move onto the device worker Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path.
The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:24:38 +03:00
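The worker-thread ownership pattern these phases converge on can be sketched with `std::sync::mpsc`. Everything here is a simplified stand-in: the `Job` variants borrow names from the commit messages but carry assumed fields, a "model" is just a `Vec<f32>` of weights, and no CUDA context is involved.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Opaque key into the worker's slab; the model itself never leaves the worker thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ModelHandle(u64);

/// Hypothetical job set; variant fields are assumptions for illustration.
pub enum Job {
    LoadDense { weights: Vec<f32>, reply: Sender<ModelHandle> },
    ForwardLogits { handle: ModelHandle, input: f32, reply: Sender<Option<Vec<f32>>> },
    DropArch { handle: ModelHandle },
}

/// Spawns one worker thread that owns every model it loads.
pub fn spawn_device_worker() -> (Sender<Job>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let join = thread::spawn(move || {
        // State owned by this thread only. A real worker would also bind
        // the device context here, before entering the loop.
        let mut slab: HashMap<ModelHandle, Vec<f32>> = HashMap::new();
        let mut next_id = 0u64;
        for job in rx {
            match job {
                Job::LoadDense { weights, reply } => {
                    let handle = ModelHandle(next_id);
                    next_id += 1;
                    slab.insert(handle, weights); // constructed and stored on the worker
                    let _ = reply.send(handle);
                }
                Job::ForwardLogits { handle, input, reply } => {
                    // Reply with a host-side Vec<f32> so the caller never touches worker state.
                    let logits: Option<Vec<f32>> =
                        slab.get(&handle).map(|w| w.iter().map(|x| x * input).collect());
                    let _ = reply.send(logits);
                }
                Job::DropArch { handle } => {
                    slab.remove(&handle); // released on the thread that allocated it
                }
            }
        }
    });
    (tx, join)
}

/// Convenience wrapper: load a model and wait for its handle.
pub fn load(tx: &Sender<Job>, weights: Vec<f32>) -> Option<ModelHandle> {
    let (reply, rx) = mpsc::channel();
    tx.send(Job::LoadDense { weights, reply }).ok()?;
    rx.recv().ok()
}

/// Convenience wrapper: run a forward step and wait for host-side logits.
pub fn forward(tx: &Sender<Job>, handle: ModelHandle, input: f32) -> Option<Vec<f32>> {
    let (reply, rx) = mpsc::channel();
    tx.send(Job::ForwardLogits { handle, input, reply }).ok()?;
    rx.recv().ok()?
}
```

Because callers only ever hold a `ModelHandle`, allocation, use, and drop of a model all happen on one OS thread, and dropping the last `Sender` ends the worker loop cleanly.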
rob thijssen	76ab24d98c	refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key.
- `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl` / `Job::Tp` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. 
- `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:16:02 +03:00
rob thijssen	b179204fd3	refactor(neuron): phase 2 - single-GPU forward + clear_kv route through device worker Second slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two spawn_blocking sites in `chat_completion` and `chat_completion_stream` now route through the device worker thread on CUDA loads. CPU loads keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's no context to own and the channel hop would only add latency. What this phase changes: - `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. 
The worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>` slab and a `next_handle` counter for minting opaque handles. - `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`, plus a new `arch_handle: Option<ArchHandle>` field. The two are mutually exclusive: CUDA loads set `arch_handle = Some(_)` after transferring the boxed arch into the worker's slab; CPU loads keep `arch = Some(_)` for the legacy spawn_blocking path. - New `run_inference_via_worker` and `stream_inference_via_worker` drive the prefill + decode loop by sending `Job::ForwardLogits` per step; the worker copies the resulting `[vocab]` logits to a CPU-side `Vec<f32>` before reply, so the async caller never holds a device-resident tensor. `apply_repeat_penalty` and `LogitsProcessor::sample` run on a CPU candle tensor; no context binding side-effects on tokio worker threads. - `logits_health_slice(&[f32])` complements the existing `logits_health(&Tensor)` so the new worker paths can compute health stats directly from the CPU vec. - `unload_model` for the single-GPU CUDA path now sends `Job::DropArch { handle }` to the worker so the `Box<ModelArch>` drops on the thread that allocated its CUDA tensors. The `Drop` runs with the bound context, freeing memory on the right context. What this phase doesn't touch (yet): - TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3. - Single-GPU model load — still spawn_blocking, followed by a `Job::TransferIn` to move the freshly-built `ModelArch` into the worker slab. Phase 4 moves the load itself onto the worker thread and eliminates the bootstrap TransferIn. - The `device_vram_mb` / `cuda_mem_mb` helpers — still present and used by the construction-time logs running inside spawn_blocking loads. Phase 4 cleanup folds them into `dispatch.rs`. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:55:08 +03:00
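The handle slab described in this commit can be sketched as follows. This is a minimal illustration, not the crate's code: the generic `Slab<T>` type and the `transfer_in` / `drop_item` method names are hypothetical, and only `ArchHandle`, the `HashMap` of boxed values, and the `next_handle` counter come from the commit message.

```rust
use std::collections::HashMap;

/// Opaque key handed to callers; they never touch the boxed value directly.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ArchHandle(u64);

/// Hypothetical slab owned by the worker thread.
pub struct Slab<T> {
    next_handle: u64,
    items: HashMap<ArchHandle, Box<T>>,
}

impl<T> Slab<T> {
    pub fn new() -> Self {
        Self { next_handle: 0, items: HashMap::new() }
    }

    /// Move a boxed value into the slab and mint a fresh handle for it.
    pub fn transfer_in(&mut self, item: Box<T>) -> ArchHandle {
        let handle = ArchHandle(self.next_handle);
        self.next_handle += 1;
        self.items.insert(handle, item);
        handle
    }

    /// Borrow the value for one operation (for example a forward step).
    pub fn get_mut(&mut self, handle: ArchHandle) -> Option<&mut T> {
        self.items.get_mut(&handle).map(|boxed| &mut **boxed)
    }

    /// Drop the value on whichever thread owns the slab.
    /// Returns false if the handle was unknown or already dropped.
    pub fn drop_item(&mut self, handle: ArchHandle) -> bool {
        self.items.remove(&handle).is_some()
    }
}
```

Because the slab lives inside the worker's dispatch state, removing an entry runs the value's `Drop` on the worker thread, which is the property the commit relies on for freeing CUDA memory on the bound context.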
rob thijssen	081b532387	refactor(neuron): phase 1 - per-device worker thread, VRAM queries route through it First slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the infrastructure for a dedicated OS thread per CUDA device that owns the device's `CudaContext` for the daemon's lifetime, and routes the 8 async-context `device_vram_mb()` call sites in candle.rs through it. What this phase changes: - New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs). `DeviceWorkerHandle::spawn(idx)` creates a named OS thread (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel. 
Replies cross back via `tokio::sync::oneshot::Sender` so async callers await without parking a tokio worker. - Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward, ClearKv, NCCL init/sanity, and load variants. - `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at load time by a new `CandleHarness::ensure_device_worker(idx)` method that lazily spawns + caches one worker per device index. - Per-model `query_vram()` convenience method on both struct types so the 8 call sites in chat_completion / chat_completion_stream / chat_completion_tp_inner / chat_completion_tp_stream become `loaded.query_vram().await` (or `tp.query_vram().await`): same field values logged, just sourced from the owner thread instead of the caller thread. What this phase doesn't touch (yet): - Forward, kv-cache clear, model load, NCCL: still on `spawn_blocking`. Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP forward + NCCL bring-up; Phase 4 moves the loads and deletes the now-unused `device_vram_mb` / `cuda_mem_mb` helpers. - Public API: unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical shapes. Tests: - 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query → shutdown round-trip, thread naming, post-shutdown submit returns `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a Shutdown. CPU build (the only one CI runs) is enough to exercise channel mechanics. - All 37 lib tests + all integration tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:40:34 +03:00
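The worker-thread pattern from this commit can be sketched with the standard library alone. Assumptions: the reply travels over a second `std::sync::mpsc` channel rather than `tokio::sync::oneshot` (swapped in so the sketch has no dependencies), the VRAM numbers are placeholders instead of a driver query, no CUDA context is bound, and `query_vram` returns `None` where the commit describes a `Gone` error.

```rust
use std::sync::mpsc;
use std::thread;

/// The two Phase 1 job variants named in the commit message.
pub enum Job {
    /// Reply carries (free_mb, total_mb).
    QueryVram { reply: mpsc::Sender<(u64, u64)> },
    Shutdown,
}

pub struct DeviceWorkerHandle {
    tx: mpsc::Sender<Job>,
    join: Option<thread::JoinHandle<()>>,
}

impl DeviceWorkerHandle {
    /// Spawn a named OS thread that owns the device for its lifetime.
    pub fn spawn(idx: usize) -> std::io::Result<Self> {
        let (tx, rx) = mpsc::channel::<Job>();
        let join = thread::Builder::new()
            .name(format!("cuda-dev-{idx}"))
            .spawn(move || {
                // A real worker would bind the device context here, once.
                for job in rx {
                    match job {
                        Job::QueryVram { reply } => {
                            // Placeholder values standing in for a driver query.
                            let _ = reply.send((1024, 24576));
                        }
                        Job::Shutdown => break,
                    }
                }
            })?;
        Ok(Self { tx, join: Some(join) })
    }

    /// Returns None once the worker has exited.
    pub fn query_vram(&self) -> Option<(u64, u64)> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Job::QueryVram { reply }).ok()?;
        rx.recv().ok()
    }

    /// Ask the worker to stop and wait for the thread to finish.
    pub fn shutdown(&mut self) {
        let _ = self.tx.send(Job::Shutdown);
        if let Some(handle) = self.join.take() {
            let _ = handle.join();
        }
    }
}
```

Every operation on the device funnels through one thread, so whatever state that thread binds at startup is the state every job sees. Callers only hold a channel sender.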
rob thijssen	7c19da9361	feat(neuron): construction-complete vram/config dump + logits health + per-step vram Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B incident from "guess at KV cache / quant sizes" into "read the journal": 1. Construction-complete summary in TpQwen3_5ForCausalLM::load and TpQwen3ForCausalLM::load. After the last "after layer N" log fires, each rank emits a single info line with: free_mb/total_mb (the number that drops by ~9 GB between per-layer and first-request on beast, with no inference traffic), every resolved config knob (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads, max_position_embeddings), and a per-token KV-cache byte estimate. 
For Qwen3-Next also includes the linear/full-attention layer split so the hybrid architecture's cache cost is unambiguous. 2. Logits health snapshot on sample failure. Today the failure logs "A weight is negative, too large or not a valid number" with no context — was it a NaN cascade, an Inf, a negative weight? `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts plus finite_min/max/mean on the failure path (zero cost on the success path) and emits a warn line just before the wrapper's terminal "failed, model marked poisoned" log. Wired into both the prefill and decode sample sites of the non-streaming AND streaming TP chat paths. 3. VRAM snapshot at prefill complete + every decode step. The "prefill complete" info line now carries vram_free_mb so the activations + KV growth from the prefill itself is visible. The per-step trace line gets vram_free_mb too, so an operator running with RUST_LOG=trace can watch headroom shrink token by token. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:04:55 +03:00
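The logits health snapshot can be illustrated on a plain CPU slice. This is a sketch under assumptions: the `LogitsHealth` struct and its field layout are hypothetical, `neg` is taken to mean finite negative values, and the function operates on `&[f32]` rather than on a candle `Tensor`.

```rust
/// Hypothetical summary mirroring the counters listed in the commit message.
#[derive(Debug)]
pub struct LogitsHealth {
    pub nan: usize,
    pub pos_inf: usize,
    pub neg_inf: usize,
    pub neg: usize,
    pub finite_min: f32,
    pub finite_max: f32,
    pub finite_mean: f32,
}

pub fn logits_health_slice(logits: &[f32]) -> LogitsHealth {
    let (mut nan, mut pos_inf, mut neg_inf, mut neg) = (0usize, 0usize, 0usize, 0usize);
    let (mut min, mut max) = (f32::INFINITY, f32::NEG_INFINITY);
    let (mut sum, mut finite) = (0.0f64, 0usize);

    for &x in logits {
        if x.is_nan() {
            nan += 1;
        } else if x == f32::INFINITY {
            pos_inf += 1;
        } else if x == f32::NEG_INFINITY {
            neg_inf += 1;
        } else {
            // Only finite values contribute to min / max / mean.
            if x < 0.0 {
                neg += 1;
            }
            min = min.min(x);
            max = max.max(x);
            sum += x as f64;
            finite += 1;
        }
    }

    // With no finite values the statistics are undefined; report NaN.
    let (finite_min, finite_max, finite_mean) = if finite == 0 {
        (f32::NAN, f32::NAN, f32::NAN)
    } else {
        (min, max, (sum / finite as f64) as f32)
    };

    LogitsHealth { nan, pos_inf, neg_inf, neg, finite_min, finite_max, finite_mean }
}
```

Calling this only on the failure path keeps the success path free of the extra pass over the vocabulary, which matches the "zero cost on the success path" note in the commit.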
rob thijssen	24e20dcb5c	feat(catalogue,gateway): model aliases (helexa/small, helexa/balanced, helexa/large) Operators can now define tier aliases in models.toml: [aliases] "helexa/small" = "Qwen/Qwen3-1.7B" "helexa/balanced" = "Qwen/Qwen3-8B" "helexa/large" = "Qwen/Qwen3.6-27B" A client request for `model: "helexa/small"` is resolved to the concrete model id at routing time. The gateway also rewrites the proxied body's `model` field to the concrete id so neuron sees a name that matches its loaded handle (otherwise the harness rejects the request). Motivated by the finger-in-the-wind benchmark: same "what's the capital of Georgia" probe runs in 2.5s on the 1.7B vs 6.7s on the 27B with identical correctness. 
Aliases let clients pick a latency tier without hardcoding model ids, and let operators swap targets without changing client code. Changes: * cortex-core: `ModelCatalogue` gains `aliases: HashMap<String, String>` + `resolve_alias(&str) -> &str`. Unit tests cover the basic resolution + TOML round-trip. * cortex-gateway: * `RouteDecision` gains `resolved_model_id: String`. `router::resolve` consumes aliases at entry and threads the concrete id through. * Handlers (chat_completions, completions, anthropic_messages streaming + non-streaming) rewrite the body's `model` field with `rewrite_model_in_body` before proxying, using the resolved id for metrics labels, LRU touch, and the body itself. * `/v1/models` (Pass 4) emits each alias as its own entry mirroring the target's `loaded` flag, feasible_on, and locations — clients browsing the endpoint see both names and can pick either. * `models.toml` declares the three tier aliases; `models.example.toml` documents the section as opt-in. * Integration tests verify: end-to-end alias→concrete request flow, alias surfacing in /v1/models, and no-op fall-through for non-alias model ids. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 16:10:41 +03:00
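Alias resolution with fall-through for non-alias ids might look like the sketch below. The struct is reduced to the one field the commit mentions. Single-hop lookup (an alias never points at another alias) is an assumption, and so is the explicit lifetime, which this sketch needs because the function may return either the stored target or the caller's own string.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the catalogue; only the alias table is modelled.
pub struct ModelCatalogue {
    pub aliases: HashMap<String, String>,
}

impl ModelCatalogue {
    /// Map an alias to its concrete model id; unknown names pass through unchanged.
    pub fn resolve_alias<'a>(&'a self, model: &'a str) -> &'a str {
        self.aliases
            .get(model)
            .map(String::as_str)
            .unwrap_or(model)
    }
}
```

Returning the input unchanged for unknown names gives the no-op fall-through that the integration tests in the commit check for.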
rob thijssen	becf61b9c1	feat(script): validate-neuron.sh waits for /health activation=ready Adds wait_for_ready() that polls /health until activation.state flips to "ready" (or the NEURON_LOAD_TIMEOUT deadline). Inserted between probe_health and the is_loaded/trigger_load step. Before this, running validate-neuron.sh right after deploy.sh raced the background pre-warm and failed in ~9 ms with "neuron not reachable" (the pre-2026-05-26 build) or with a partial-load error (the new build, where the listener binds before default_models finishes). The poll prints the in_progress model on each tick so an operator watching the log can see which model is delaying readiness. 
Backs off from 2s to 10s after the first few iterations so a long TP load doesn't spam. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:26:21 +03:00
rob thijssen	b9e7a76a7a	feat(gateway): surface mid-prewarm models as Loading on /v1/models The poller now fetches /health alongside /models on each neuron and stashes the activation snapshot on NodeState. The /v1/models handler gains a Pass 3 that synthesises Loading locations from each neuron's activation.in_progress and activation.pending lists, so a catalogued model that's mid-prewarm surfaces as `status: "loading"` rather than appearing absent (loaded=false, locations=[]). Without this, a client polling /v1/models during a beast restart sees Qwen3.6-27B disappear for the ~5 minutes the q5k load takes, then reappear. Now it stays visible the whole time with a clear status. Adds ModelStatus::Loading to cortex-core. The router's per-node priority loop gets an explicit (no-op) arm: Loading models aren't routable yet, and falling through to the catalogue cold-load path is the existing race — no worse than before, but tagged as a known follow-up needing neuron-side in-flight tracking on /models/load. New test_poller_captures_activation_from_health exercises the full round-trip: mock neuron with empty /models but a pre_warming /health → poller writes node.activation. Common test helpers gain spawn_mock_neuron_with_models_and_health and default_health_response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:26:12 +03:00
rob thijssen	800498f530	feat(neuron): bind listener before pre-warm, surface activation in /health Two coupled changes addressing the 2026-05-26 validate-neuron failure where a fresh deploy of beast had /health unreachable for ~5 minutes while Qwen3.6-27B q5k materialised, even though systemd reported the unit as active. 1. main.rs no longer awaits load_default_models before binding axum. The listener binds first; pre-warm runs in a spawned background task that holds a read lock on the harness registry for the duration of its sequential load loop. Concurrent on-demand /models/load and /v1/chat/completions traffic still flow. 2. 
/health gains an `activation` field carrying: state: pre_warming | ready; pending: model ids queued but not started; in_progress: model id currently loading (Option); completed: model ids loaded successfully this activation; failed: [{model_id, error}] for failed entries. The field is `#[serde(default)]` so a pre-change cortex polling a new neuron, or vice versa, keeps working. `ActivationTracker` (new module `neuron::activation`) owns the RwLock-wrapped state; load_default_models takes a tracker reference and updates it per-model. NeuronState holds an Arc clone for the /health handler. Tests updated to construct trackers and assert state transitions (empty noop, two failures → ready with both in `failed`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 15:18:04 +03:00
rob thijssen	d3f2d50749	feat(deploy): per-host neuron config + pre-warm headline models Adds asset/neuron/{beast,benjy,quadbrat}.toml: per-host neuron.toml files keyed by the first dot-component of the host. deploy.sh now rsyncs the matching file to /etc/neuron/neuron.toml on each neuron and stops+starts the service so default_models is re-read. Headline model per host (drives /v1/models output immediately after a clean deploy): beast: Qwen/Qwen3.6-27B (q5k, tp=2, devices=[0,1]); benjy: Qwen/Qwen3-8B (bf16, devices=[0]); quadbrat: Qwen/Qwen3-1.7B (bf16, devices=[0]). Removes the need to follow deploy.sh with `validate-neuron.sh beast Qwen/Qwen3.6-27B q5k 2` to surface the 27B in the catalogue; the neuron loads it itself on activation. 
The neuron loop now mirrors the cortex flow (stop → install/upgrade → sync config → start) so config-only changes pick up on subsequent deploys; previously a no-package-change deploy would silently leave the host on the old default_models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:05:54 +03:00
rob thijssen	2740e61a23	fix(neuron,candle): name lifetime on acquire_pool_lock Lifetime elision fails when a function has two reference parameters and returns a borrow: rustc can't infer whether the MutexGuard's lifetime ties to `pool` or `model_id`. The non-CUDA build skipped this code path (cfg-gated), so the error only surfaced on the GPU build at https://git.lair.cafe/helexa/cortex/actions/runs/162. The guard borrows the pool, so name the lifetime on `pool` and the return type. `model_id` keeps its independent (elided) lifetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:37:32 +03:00
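The elision problem and its fix can be reproduced with a synchronous `std::sync::Mutex`. Assumptions: the actual helper is presumably async and wraps an async mutex; the `Pool` struct and its `in_flight` field are hypothetical stand-ins. Only the function name and the two parameter names come from the commit.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical pool payload; the real type is not shown in the log.
pub struct Pool {
    pub in_flight: u32,
}

// Without the named lifetime, two reference parameters plus a borrowed
// return type give "error[E0106]: missing lifetime specifier", because
// elision cannot choose between `pool` and `model_id`.
pub fn acquire_pool_lock<'a>(pool: &'a Mutex<Pool>, model_id: &str) -> MutexGuard<'a, Pool> {
    // `model_id` would only feed log lines; it keeps its own elided lifetime.
    let _ = model_id;
    // Recover the guard even if a previous holder panicked.
    pool.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Tying `'a` to `pool` alone means the guard may outlive the `model_id` string, so callers can pass a temporary name without extending its lifetime.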
rob thijssen	67f79c868f	fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain Two failure modes from the 2026-05-26 beast incident: 1. `unload_all_models` looped through models calling `unload_model`, logging individual failures at warn. The cumulative effect was a single warn line for the failed unload then "shutdown complete", with no signal that the model was actually still loaded. Now each unload is bounded by a 20s timeout, failures escalate to error, and a summary "leaving N model(s) loaded" line fires when anything is stuck so the operator knows the OS will reclaim VRAM after exit. 2. 
Returning `Ok(())` from `main` after the unload sweep dropped the tokio runtime, which then waited indefinitely on a CUDA-stuck spawn_blocking thread (the journal's "Stack trace of thread 2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec fired 2 minutes later, SIGABRT, core dump. Replacing the return with `std::process::exit(0)` skips the runtime drain and hands the OS a clean exit code; stuck threads get reaped with the process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:30:06 +03:00
rob thijssen	fc6ef0ee0f	feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups Once a CUDA driver error has hit a forward or kv-cache call, the device's context is unrecoverable in-process — subsequent kernels can hang (the failure mode seen on beast on 2026-05-26), return garbage, or trip another illegal-address. The harness now marks the model poisoned on any forward / spawn_blocking / TP-task failure, refuses further inference against it with a clear "unload and reload" error, and surfaces `status: "poisoned"` on `/models` so an operator running `curl beast:13131/models` (or cortex polling) can see the bad state. Without this, a single OOM on a too-large prefill quietly turned every subsequent request into a stuck wait on the pool lock; with it, the first request fails fast with the driver error in the journal and the client gets a usable 5xx instead of a hung connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:28:42 +03:00
rob thijssen	1385979e3d	feat(neuron,candle): log per-device VRAM at chat_completion start Every "starting" log line now carries vram_free_mb / vram_total_mb for the request's serving device (the leader device on TP). On the 2026-05-26 incident this would have made the 14k-token prefill OOM diagnosable from the first log line: with ~412 MB free, that prompt was never going to fit, and the operator could have caught the imbalance before the CUDA context got poisoned. `device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is kept separate to avoid coupling the inference path to the TP module. TpLoadedModel gains a `leader_device: Device` clone so the request path reads the device without locking the leader model (which would contend with an in-flight forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:26:23 +03:00
rob thijssen	0a1cfcd4d0	feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings Every chat completion path (single-GPU + TP, streaming + non-streaming) now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber prefixes every event with that span so `grep req_id=…` over journalctl reconstructs one request even when dozens overlap. Every path also emits a terminal log line on both success ("done", with prompt_tokens/completion_tokens/finish_reason/total_ms) and failure ("failed", with full anyhow chain + total_ms). Failures used to vanish silently — a request that hit a CUDA OOM left "starting" in the journal and no further trace. New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await` in both TP paths. It warns at 2s ("still waiting on pool lock") and re-warns every 2s thereafter, so queued requests stuck behind a deadlocked holder are visible immediately instead of looking like idle silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:25:11 +03:00
rob thijssen	ea0e0f7911	fix(neuron,tp): log leader forward errors with full context Worker rank failures were already surfaced at WARN, but the leader's own forward Result::Err was silently coerced to a `leader_ok=false` bool. When the leader and a worker both fail together — the typical shape of a CUDA OOM cascading into an illegal-address — the journal showed only the worker side and an operator had to guess what hit rank 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 12:22:30 +03:00
rob thijssen	aa88d37509	fix(gateway): full observability + stop leaking upstream bodies Comprehensive sweep across cortex-gateway's request handling. Every failure path now emits exactly one structured warn (or error) event on the cortex side with the wire-level detail an operator needs; the API response carries only a generic message plus, where useful, the upstream status code. proxy.rs::forward_request: - warn on network failure (network error, target URL). - warn on upstream non-2xx (status, target URL). Streaming body still passes through to the client; we just can't snippet without breaking the stream. - warn on response-build failure. 
- ProxyError::into_response no longer interpolates the inner error into the API body — generic "upstream request failed" / "failed to build response" instead. handlers.rs::chat_completions, handlers.rs::completions: - warn on missing model field, with handler= label. - warn on route resolve failure with model + error chain. The user-facing 404 keeps the RouteError Display string (which is short, informative, and contains no internal detail beyond the model id and config'd node names). handlers.rs::anthropic_messages: - warn on invalid Anthropic body, on translated-OpenAI serialise failure (which is internal), on route resolve, on upstream network error, on upstream non-2xx (with 512-char body snippet for parse errors), on upstream body read, on response parse. - All warns share consistent field shape: handler, model, node, url, status / error / body as applicable. - API response messages are now uniformly generic. - Adds an info-level "proxying request" log on the non-streaming path so successful proxies are also visible. handlers.rs::proxy_with_metrics: - still calls e.into_response() but proxy::forward_request already warn'd at the wire layer, so no double-log here. Tests: - All 32 existing unit tests + 22 gateway integration tests + 4 new router tests pass. - Tests that asserted on the "no healthy nodes" / "not found" strings still match because RouteError messages are preserved in the 404 user-facing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:17:26 +03:00
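The split between what is logged and what is returned to the caller can be sketched as follows. This is a minimal illustration, not the gateway's code: the `ProxyError` variants and the `public_message` method are hypothetical names, and only the two generic strings come from the commit message.

```rust
use std::fmt;

/// Hypothetical stand-in for the gateway's proxy error type.
#[derive(Debug)]
pub enum ProxyError {
    /// The upstream request failed; the payload is the internal detail.
    Upstream(String),
    /// Building the response failed; the payload is the internal detail.
    ResponseBuild(String),
}

impl ProxyError {
    /// Text that is safe to return in the API body: no inner error detail.
    pub fn public_message(&self) -> &'static str {
        match self {
            ProxyError::Upstream(_) => "upstream request failed",
            ProxyError::ResponseBuild(_) => "failed to build response",
        }
    }
}

/// Display carries the full detail and is meant for the log line only.
impl fmt::Display for ProxyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProxyError::Upstream(e) => write!(f, "upstream request failed: {e}"),
            ProxyError::ResponseBuild(e) => write!(f, "failed to build response: {e}"),
        }
    }
}
```

A handler would log `err.to_string()` at warn level and put only `err.public_message()` into the response body.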
rob thijssen	0f00f72b47	fix(router,handlers): strip trailing slash from rewritten URL + log upstream failures Two coupled bugs surfaced after `9b0ed0b`: 1. url::Url::parse("http://host:port").to_string() normalises the empty path to "/", so rewrite_loopback_host was returning "http://beast:13131/". Downstream callers then did format!("{endpoint}/v1/chat/completions") and produced a double-slash path that neuron's axum router 404'd with an empty body. Strip the trailing slash in the rewriter so the endpoint is a clean base string for concatenation. 2. The anthropic_messages handler returned the upstream's empty body to the API caller as `"upstream error: "` with no journal log on the cortex side. Operators had no way to see what happened. 
Add warn-level tracing on both upstream failure paths (network error and non-2xx) with model, node, target URL, status, and a 512-char body snippet. The API response now carries just `"upstream returned <status>"` — the implementation detail lives in the log. Updates the two existing rewrite tests for the no-trailing-slash output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 07:10:39 +03:00
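The double-slash bug from point 1 is easy to reproduce with plain string handling. The sketch below uses only the standard library; `clean_base` and `chat_completions_url` are hypothetical helper names, not functions from the repository.

```rust
/// Normalise a base URL so it can be concatenated with an absolute path.
/// "http://beast:13131/" and "http://beast:13131" both become the latter.
pub fn clean_base(endpoint: &str) -> &str {
    endpoint.trim_end_matches('/')
}

/// Build the chat-completions URL from a base that may or may not
/// carry a trailing slash.
pub fn chat_completions_url(endpoint: &str) -> String {
    format!("{}/v1/chat/completions", clean_base(endpoint))
}
```

Without the trim, a base of `http://beast:13131/` would yield `http://beast:13131//v1/chat/completions`, which is the path the commit reports as returning a 404.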
rob thijssen	9b0ed0b57f	fix(router): rewrite loopback inference URLs to use neuron's host Neuron hardcodes its bind_url as `http://localhost:13131` (it can't reliably know its own externally-resolvable name). When cortex runs on a different host than the neuron it's routing to, blindly proxying to that URL hits localhost on the cortex box instead of the neuron. Cortex already knows each neuron's reachable host from cortex.toml. After fetching the inference URL from `/models/{id}/endpoint`, if the host is a loopback name (localhost / 127.0.0.1 / 0.0.0.0 / ::1), swap it for the configured neuron host. Preserve the port and path from neuron's URL so a future harness serving inference on a different port than the management API still works. 
Adds `url` (already a transitive dep via reqwest) as a direct dep for the URL parsing. Tests cover: localhost rewrite, distinct inference port preservation, non-loopback passthrough, malformed input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:23:47 +03:00
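A rough sketch of the loopback rewrite follows. The commit uses the `url` crate; to stay self-contained this version parses the string by hand, so it is an approximation that ignores userinfo and other URL corner cases. The signature and the `neuron_host` parameter are assumptions; only the function name and the list of loopback spellings come from the commit message.

```rust
/// Loopback spellings named in the commit message. An IPv6 literal
/// appears in brackets inside a URL authority, hence "[::1]".
fn is_loopback(host: &str) -> bool {
    matches!(host, "localhost" | "127.0.0.1" | "0.0.0.0" | "[::1]")
}

/// Swap a loopback host for `neuron_host`, keeping scheme, port and path.
/// Returns None when the input has no `scheme://` prefix or no host.
pub fn rewrite_loopback_host(url: &str, neuron_host: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', if there is one.
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    // Split host from ":port"; bracketed IPv6 literals contain ':' themselves.
    let (host, port) = if authority.starts_with('[') {
        let end = authority.find(']')?;
        (&authority[..=end], &authority[end + 1..])
    } else {
        match authority.rfind(':') {
            Some(i) => authority.split_at(i),
            None => (authority, ""),
        }
    };
    if host.is_empty() {
        return None;
    }
    let new_host = if is_loopback(host) { neuron_host } else { host };
    Some(format!("{scheme}://{new_host}{port}{path}"))
}
```

The four cases the commit lists as tested (localhost rewrite, distinct inference port, non-loopback passthrough, malformed input) map onto the branches above.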
rob thijssen	dc2a803266	fix(rpm): migrate legacy helexa-cortex firewalld service to `cortex` Adds a %posttrans scriptlet to cortex.spec that: - Removes the stale /etc/firewalld/services/helexa-cortex.xml left behind by an older packaging stream that named the service `helexa-cortex` and (in some build streams) carried wrong port numbers (9301/9302/9304). - Walks every active firewalld zone; for any zone where the legacy helexa-cortex service was enabled, swaps it out for the new `cortex` service (which the RPM ships at /usr/lib/firewalld/services/cortex.xml with the right 31313/31314 ports). - Reloads firewalld so the change takes effect without operator intervention. 
Operators on whom this happened were silently dropping inbound connections to cortex on 31313 — the active zone advertised a helexa-cortex service that listed unrelated ports, masking the correctly-defined vendor cortex service. helexa-neuron is unaffected: that spec already ships the vendor service as helexa-neuron.xml (namespaced from day one) and no stale /etc override files exist in the fleet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 06:12:51 +03:00
rob thijssen	e71181499e	feat(stage-8e-3): quantize lm_head in TP Qwen3-Next TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the load spec has quant set and tie_word_embeddings is false, lm_head's (vocab_size, hidden_size) weight is quantized in-situ at load time along with all the per-layer linears. The non-tied case on Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2 bytes = 2.54 GB -> ~0.87 GB at Q5K) and shaves a small amount of decode latency from the per-token logits matmul. Tied case (tie_word_embeddings=true) keeps the lm_head plain even when quant is set: quantizing the shared tensor would corrupt the embedding lookup, and the tied case already gets the memory win from only holding one copy. 
This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path. The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's the bottleneck for a model that actually needs TP at consumer scale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:53:14 +03:00
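The lm_head size estimate can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes) and the ggml Q5_K layout of 176 bytes per 256-weight super-block (5.5 bits per weight); the helper names are made up for illustration.

```rust
/// Bytes for a (rows x cols) weight stored as bf16, 2 bytes per element.
pub fn bf16_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols * 2
}

/// Bytes for the same weight in the Q5_K layout.
/// Assumes `cols` is a multiple of 256, as K-quants require.
pub fn q5k_bytes(rows: u64, cols: u64) -> u64 {
    const BLOCK_WEIGHTS: u64 = 256; // weights per super-block
    const BLOCK_BYTES: u64 = 176; // bytes per super-block
    rows * (cols / BLOCK_WEIGHTS) * BLOCK_BYTES
}

/// Bytes saved per rank by quantizing the weight from bf16 to Q5_K.
pub fn q5k_saving(rows: u64, cols: u64) -> u64 {
    bf16_bytes(rows, cols) - q5k_bytes(rows, cols)
}
```

For the 248320 x 5120 shape this gives 2,542,796,800 bytes in bf16 and 874,086,400 bytes in Q5_K, a difference of roughly 1.67 GB, in line with the ~1.7 GB saving quoted in the commit.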
rob thijssen	ee663e5e99	fix(stage-8e-2e): bump quant prefill threshold to M > 64 The M > 8 threshold from 8e-2d activated forward_via_f16 on the test case (M=30) and slightly regressed prefill (143 -> 133 T/s). The dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms) eats the cuBLAS GEMM speedup at small M. Move the crossover to M > 64 so short prefills (typical for the validate probe) stay on the GGUF GEMV kernel where per-call cost is comparable but the dequant tax is zero. Long prefills still get the dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the fixed dequant cost. 
Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that needs either a dequant cache (gives back the ISQ memory win) or a fused dequant+gemm kernel. Both larger projects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:50:45 +03:00
rob thijssen	34f9b77d9d	feat(stage-8e-2d): route quantized matmul by M (prefill vs decode) MaybeQuantLinear::forward picks between two QMatMul paths: - M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost is fixed per call, so it's amortised across the M tokens. - M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel on the quantized blocks directly. Requires f32 inputs so we still cast in/out at the boundary in that arm. Earlier 8e-2c sent everything through the GGUF GEMV kernel, which is excellent at GEMV (decode) but doesn't have a real batched GEMM path, so prefill regressed ~4x. 
This restores prefill to roughly the bf16 cuBLAS GEMM throughput while keeping the decode gain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 21:15:32 +03:00
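The routing rule reduces to a single comparison on M. The sketch below is illustrative only: `QuantPath`, `pick_path` and `rows_of` are hypothetical names, and treating M as the product of all leading activation dimensions is an assumption, not something the commit states.

```rust
/// Which matmul path a quantized linear takes for a given call.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum QuantPath {
    /// Dequantise the weight to f16, then run a dense GEMM (prefill).
    DequantGemm,
    /// Run the GEMV kernel directly on the quantized blocks (decode).
    QuantGemv,
}

/// Pick the path for `m` activation rows; the threshold is exclusive,
/// so `m == threshold` still takes the GEMV path.
pub fn pick_path(m: usize, threshold: usize) -> QuantPath {
    if m > threshold {
        QuantPath::DequantGemm
    } else {
        QuantPath::QuantGemv
    }
}

/// M for an activation of shape (..., hidden): every dimension except
/// the last, multiplied together. An empty shape counts as zero rows.
pub fn rows_of(shape: &[usize]) -> usize {
    match shape.split_last() {
        Some((_, leading)) => leading.iter().product(),
        None => 0,
    }
}
```

With a threshold of 8, the M=30 test case takes the dequantise-then-GEMM path; with a threshold of 64 the same call stays on the GEMV path.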
rob thijssen	f084aaab8e	fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul candle's QTensor::cuda_fwd requires f32 inputs: its on-the-fly GGUF dequantize accumulates in f32. The model dtype flowing into MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with "unexpected dtype, expected: F32, got: BF16". Wrap the Quant arm to cast the activation to f32 before the matmul and cast the result back to the input dtype. The cast is a single launch on the activation tensor (small relative to weight traffic); it's the price of in-situ GGUF-style quantization, and what mistralrs does inside its own Linear wrapper. The Plain arm is unchanged. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 20:05:19 +03:00
rob thijssen	68a606a79c	fix(stage-8e-2b): allow quant on the TP load path The pre-existing guard in candle.rs rejected any spec.quant on the TP path with "GGUF quantized models are not supported in the TP path", written when quant only ever meant GGUF. With 8e-1/8e-2 in, quant != None on the TP path triggers in-situ quantization of the loaded safetensors shards. resolve_dense_files only looks for safetensors so a GGUF-source-file model with TP still errors out cleanly downstream. validate-neuron.sh: rebuild the load payload incrementally so tp_size > 1 + non-empty quant produces both fields. Same script now covers all four combos (single/TP × dense/ISQ). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 19:17:14 +03:00
rob thijssen	4aa71902d0	feat(stage-8e-2): plumb quant config from ModelSpec to TP load path - LoadDenseShard RPC gains an optional `quant` string field. - WorkerPool::load_dense_shard takes a `quant: Option<String>`, passes it via the RPC to workers and via parse_quant_string to the leader's local load. - The Qwen3-Next TP load chain (ForCausalLM → Model → DecoderLayer → Attention / GatedDeltaNet / MLP) takes `quant: Option<GgmlDType>` end-to-end, calling Column/RowParallelLinear::load_with_quant. - The fused in_proj_qkv inside TpQwen3_5GatedDeltaNet is now a MaybeQuantLinear so it also picks up quantization. - parse_quant_string accepts q4_0/q4_1/q5_0/q5_1/q8_0/q8_1, q2k..q8k (with or without underscore), and f16/bf16/f32. 
Empty / None means no quantization. Callers from candle.rs forward spec.quant through pool.load_dense_shard. This means a `quant = "q5k"` in models.toml now flows end-to-end to a QTensor-backed QMatMul for every per-rank linear in the Qwen3-Next TP path. Leaves lm_head and the small replicated bias/log tensors in their loaded dtype (Stage 8e-3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 18:03:36 +03:00
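A parser with the accepted spellings listed above might look like the sketch below. It is a guess at the shape, not the repository's function: the real one produces candle's `GgmlDType`, whereas `QuantKind` here is a local stand-in so the example needs no external crate. Reading "q2k..q8k" as the six ggml K-quant types, the `Result<Option<_>, String>` return type, and the whitespace trimming are all assumptions.

```rust
/// Local stand-in for a quantization dtype enum.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum QuantKind {
    Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F16, BF16, F32,
}

/// Parse a quant string. An empty string means "no quantization".
pub fn parse_quant_string(s: &str) -> Result<Option<QuantKind>, String> {
    let s = s.trim();
    if s.is_empty() {
        return Ok(None);
    }
    let kind = match s {
        "q4_0" => QuantKind::Q4_0,
        "q4_1" => QuantKind::Q4_1,
        "q5_0" => QuantKind::Q5_0,
        "q5_1" => QuantKind::Q5_1,
        "q8_0" => QuantKind::Q8_0,
        "q8_1" => QuantKind::Q8_1,
        // K-quants are accepted with or without the underscore.
        "q2k" | "q2_k" => QuantKind::Q2K,
        "q3k" | "q3_k" => QuantKind::Q3K,
        "q4k" | "q4_k" => QuantKind::Q4K,
        "q5k" | "q5_k" => QuantKind::Q5K,
        "q6k" | "q6_k" => QuantKind::Q6K,
        "q8k" | "q8_k" => QuantKind::Q8K,
        "f16" => QuantKind::F16,
        "bf16" => QuantKind::BF16,
        "f32" => QuantKind::F32,
        other => return Err(format!("unknown quant type: {other}")),
    };
    Ok(Some(kind))
}
```

Under these assumptions, `quant = "q5k"` and `quant = "q5_k"` resolve to the same variant, and an absent or empty value leaves the weights in their loaded dtype.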
rob thijssen	bef159b21c	feat(stage-8e-1): MaybeQuantLinear primitive + parallel-linear quant variants Introduces MaybeQuantLinear, which wraps either a plain candle Linear or a candle QMatMul backed by a freshly-quantized QTensor. Forward dispatches identically through the Module trait so downstream code doesn't care which arm is active. ColumnParallelLinear and RowParallelLinear gain `load_with_quant` methods. The existing `load` methods stay as backward-compatible no-quantization wrappers, so there is no churn at the 27 existing call sites. This is the foundation for in-situ quantization at load time. Wiring the user-facing quant config and switching call sites to load_with_quant follow in stages 8e-2 / 8e-3. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 17:55:26 +03:00
rob thijssen	8d7b099b36	feat(stage-8d-7): direct safetensors fused-region loader Replaces load_fused_qkv_slice_2d/_3d with reads from a separate MmapedSafetensors handle. Each per-rank fused tensor is built by reading the three region byte-slices directly from the mmap, concatenating them host-side, and uploading as one device allocation, with no full-fused-tensor device materialisation. The prior approach allocated a ~100 MB transient device tensor per linear-attention layer; on Qwen3.6-27B with 48 linear-attn layers that's ~4.8 GB of allocator churn during load, enough to fragment the cuda caching allocator on a tight-VRAM 32 GB consumer GPU, which is what triggered the layer-22 up_proj OOM seen on beast. 
Threading: MmapedSafetensors flows worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs) and worker (worker.rs) construct their own mmap; Linux's page cache shares the underlying pages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 17:49:35 +03:00
rob thijssen	89d98d1fb2	diag(stage-8d-6): per-layer VRAM logging in TP load path Wraps each TpQwen3_5DecoderLayer::load in a with_context that captures free/total VRAM on failure, plus an info-level log after every layer that succeeds. Uses cudarc::driver::result::mem_get_info, the same API mistralrs uses. Diagnostic only: forward path is unchanged. Helps distinguish true VRAM exhaustion from allocator fragmentation when loading large models at BF16 on 2x consumer GPUs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:54:05 +03:00
rob thijssen	cc95fe28d9	feat(stage-8d-5b): wire fused_gdn_gating CUDA kernel run_fused_gating helper consolidates the per-layer gating math: beta = sigmoid(b); g = -exp(a_log) * softplus(a + dt_bias). CUDA path issues a single launch via fused_gdn_gating_cuda; cpu path falls back to the original per-op Rust sequence. Replaces ~10 candle launches per linear-attention layer (sigmoid + 2× to_dtype + exp + neg + broadcast_add + softplus + 2× unsqueeze + broadcast_mul) across both single-GPU and TP forward paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:52:38 +03:00
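The gating math from the stage-8d-5b commit can be written as a scalar CPU reference. This is a sketch of the two formulas only, not the fused kernel or the repository's fallback path; `gdn_gating` is a hypothetical name, and the softplus overflow guard at 20.0 is an implementation choice made here.

```rust
/// Logistic sigmoid.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x). For large x the result is x to within
/// f32 precision, so return x directly and avoid overflow in exp.
pub fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Per-element gating values:
///   beta = sigmoid(b)
///   g    = -exp(a_log) * softplus(a + dt_bias)
pub fn gdn_gating(a: f32, b: f32, a_log: f32, dt_bias: f32) -> (f32, f32) {
    let beta = sigmoid(b);
    let g = -a_log.exp() * softplus(a + dt_bias);
    (beta, g)
}
```

Because `exp` and `softplus` are both positive, `g` is never positive, and `beta` always lies between 0 and 1.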
rob thijssen	09c945f81e	feat(stage-8d-4): dispatch chunked_gated_delta_rule_recurrence at prefill run_delta_rule_cuda now picks between the per-token kernel and the BT=64 chunked variant based on seq_len. Threshold = 64 matches mistralrs. Prefill on Qwen3.6-27B (typical seq_len in the hundreds) drops from one block-launch per token to one per 64-token chunk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:50:30 +03:00
rob thijssen	05dc0bad18	feat(stage-8d-3): wire causal_conv1d_update/full CUDA kernels Replaces the per-layer conv1d + silu sequence in both single-GPU and TP linear-attention forward paths with a shared run_causal_conv1d helper that dispatches to: - causal_conv1d_update for decode (seq_len=1 with existing conv_state) - causal_conv1d_full for prefill / fresh start (zero-pads internally) Both kernels fuse the depthwise conv + SiLU into a single launch, giving 4× fewer cuda launches per linear-attention layer vs the candle conv1d + candle_nn::ops::silu combo. Falls back to the original Rust path on cpu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:49:41 +03:00
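The relationship between the two conv kernels in the stage-8d-3 commit can be illustrated with a single-channel CPU reference. This is a sketch under stated assumptions, not the kernels' actual layout: the tap ordering (last weight applies to the newest input), the `Vec<f32>` state holding the previous k-1 inputs, and the function signatures are all chosen here for illustration.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Prefill: causal depthwise conv over one channel, then SiLU.
/// Inputs before t = 0 are treated as zero (implicit left padding).
pub fn causal_conv1d_full(x: &[f32], w: &[f32]) -> Vec<f32> {
    let k = w.len();
    (0..x.len())
        .map(|t| {
            let mut acc = 0.0;
            for j in 0..k {
                // Tap j looks back (k - 1 - j) steps from position t.
                let back = k - 1 - j;
                if t >= back {
                    acc += w[j] * x[t - back];
                }
            }
            silu(acc)
        })
        .collect()
}

/// Decode: `state` holds the previous k-1 inputs, oldest first.
/// Consumes one new input, returns one output, and slides the state.
pub fn causal_conv1d_update(state: &mut Vec<f32>, x_t: f32, w: &[f32]) -> f32 {
    state.push(x_t); // the window now holds k values
    let acc: f32 = state.iter().zip(w).map(|(s, wj)| s * wj).sum();
    state.remove(0); // drop the oldest input
    silu(acc)
}
```

Starting from an all-zero state of length k-1, feeding a sequence through `causal_conv1d_update` one element at a time produces the same outputs as a single `causal_conv1d_full` call, which is why the two can share one dispatch helper.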
rob thijssen	10c151efa5	feat(stage-8d-5): wire gated_delta_rule_recurrence kernel into tp_qwen3_5. TP per-token Rust loop replaced with shared run_delta_rule dispatch from arch/qwen3_5/linear_attn.rs. Both single-GPU and TP variants now use the cuda kernel when available, per-token Rust fallback otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:44:12 +03:00
rob thijssen	44ae927e38	feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5. Replaces the per-token Rust delta-rule loop in `arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single dispatch to the `gated_delta_rule_recurrence` kernel imported from mistralrs in `1ebbe87`. The kernel is V-tiled with compile-time BK (one block per (V-tile, batch*head), one thread per V-column, BK state floats in registers). For Qwen3.6's per-rank `(B=1, H=24, D_k=128, D_v=128)` shape this collapses ~6 candle tensor-op launches per token per layer (each ~50µs CUDA dispatch overhead, so ~300µs/token/layer × 48 linear-attention layers ≈ 14.4ms in launch overhead alone) to a single kernel launch with full ILP / register residency.
New free function `run_delta_rule`: - cuda branch (when q is on a CUDA device): flattens `(B, H, ...)` → `(BH, ...)`, dispatches the kernel via `crate::cuda::gdn::gated_delta_rule_recurrence_cuda`, reshapes outputs back to `(B, H, L, D_v)` and state to `(B, H, D_k, D_v)`. - cpu fallback: the original per-token Rust loop, unchanged. Keeps cargo test --workspace passing on hosts without cuda. Dispatch decision lives in the wrapper (`q.device().is_cuda()`). Build: `cargo build -p neuron --features cuda` compiles + links; clippy clean on both CPU and cuda paths. 32 lib tests still pass (none of them exercise this code path on cuda; smoke test for the TP variant is the deployed Tbilisi probe). Stage 8d-3 wires the conv1d kernels; 8d-4 the chunked prefill; 8d-5 the same wiring for `tp/tp_qwen3_5.rs`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:39:30 +03:00
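The launch-overhead estimate in this commit is simple arithmetic, reproduced below as a worked example. The figures (6 launches, about 50 µs each, 48 layers) come from the message; applying the same 50 µs cost to the single fused launch is an assumption made here for comparison, and `launch_overhead_us` is a hypothetical helper.

```rust
/// Launch overhead per generated token, in microseconds.
fn launch_overhead_us(launches_per_layer: u64, us_per_launch: u64, layers: u64) -> u64 {
    launches_per_layer * us_per_launch * layers
}

fn main() {
    // Figures quoted in the commit message.
    let before = launch_overhead_us(6, 50, 48);
    // Assumption: the fused kernel pays the same per-launch cost, once per layer.
    let after = launch_overhead_us(1, 50, 48);
    println!("before: {before} us/token, after: {after} us/token");
}
```

Under that assumption the per-token overhead falls from 14,400 µs to 2,400 µs.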
rob thijssen	1ebbe87651	feat(stage-8d-1): import mistralrs GDN CUDA kernels, build infra only. Stage 8d (new): port the Gated DeltaNet CUDA kernels from EricLBuehler/mistral.rs to close the ~500x decode performance gap we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path vs ~37 T/s in mistralrs on the same hardware). This commit lays the build infrastructure with zero behavioural change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the qwen3_5 architecture and TP variant. Added: - `crates/neuron/build.rs`: uses `cudaforge::KernelBuilder` to compile every `src/cuda/.cu` file into `libneuroncuda.a` under the `cuda` feature, then links it + `cudart`.
Mirrors mistralrs's `mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same sm_<80 bf16 gate). - `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from upstream: `gated_delta_rule_recurrence` (V-tiled per-token decode) * `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill) * `causal_conv1d_update` (single-token conv decode) * `causal_conv1d_full` (multi-token conv prefill) * `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) * softplus(a + dt_bias)) - `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels, cudarc::CudaSlice::device_ptr boilerplate identical to upstream. - `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of upstream's ffi.rs covering only the five GDN kernels; MoE / SSM / top-k decls land here when we absorb those too). - `crates/neuron/src/cuda/mod.rs` — re-exports + module docs. Cargo wiring: `cudaforge` added as an optional build-dep, activated by the `cuda` feature. CPU build is unchanged (the `cuda/` module is fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the patched container compiles `gdn.cu` (1 of 1 kernels) and links clean. Licensing: upstream files preserve their MIT origin via per-file comment banners pointing to the mistralrs path. No behaviour-relevant edits to the .cu kernels — local diff against upstream is just the banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream; their structure (module path `crate::cuda::ffi::*`) matches identically so future kernel imports drop in unchanged. CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean inside the runner container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 11:34:11 +03:00
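The gating formula quoted for `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) * softplus(a + dt_bias)) can be written as a scalar CPU reference. The function name `gdn_gating` and the softplus overflow cutoff of 20 are assumptions for this sketch; the actual kernel works elementwise on device tensors.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x); for large x it is numerically equal to x,
/// so return x directly to avoid overflow in exp (cutoff is an assumption).
fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Scalar form of the formula in the commit message:
/// beta = sigmoid(b); g = -exp(A_log) * softplus(a + dt_bias).
fn gdn_gating(b: f32, a: f32, a_log: f32, dt_bias: f32) -> (f32, f32) {
    let beta = sigmoid(b);
    let g = -a_log.exp() * softplus(a + dt_bias);
    (beta, g)
}

fn main() {
    let (beta, g) = gdn_gating(0.0, 0.0, 0.0, 0.0);
    println!("beta = {beta}, g = {g}");
}
```

Because softplus is positive and exp(A_log) is positive, g is never positive, while beta always lies strictly between 0 and 1.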
rob thijssen	70eb6af42b	feat(tp): cancellation-safe inference + structured tracing. Two changes addressing operator visibility into TP inference + the HTTP-cancellation poisoning chain: 1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When the HTTP client disconnects (curl --max-time, browser nav, etc.) the future returned from `chat_completion_tp` gets dropped, but the spawned task keeps running to completion, finishing every `pool.generate_step` / `pool.clear_kv_cache` to drain the worker pipes. The next inference request then finds a clean pool.
Previously, a dropped future left workers still processing the in-flight request, so the next call's `ClearKvCache` recv would read the stale `GenerateStepOk` from the abandoned step ("rank N expected KvCacheCleared, got GenerateStepOk"). The drain-on-leader-error fix from `d1a4aad` covered Rust-side leader failures but not HTTP-layer cancellation, which is what we actually hit on the user's Qwen3.6 test. 2. Tracing throughout the TP path so journalctl shows where an inference spends its time without needing to surface harness internals via the HTTP error body: - `chat_completion_tp_inner` (now a free fn so it can run inside spawn): `info` at request start (prompt_len, max_new, temp, top_p, eos_id), `info` per major phase (prefill complete with elapsed_ms, decode complete with elapsed_ms + token count), `info` at completion (total_ms, finish_reason). `debug` for pool-lock acquisition + kv-cache clear timing. `trace` per decode step (next_token, step_ms). - `WorkerPool::generate_step` (leader side): `debug` at fan-out, `debug` after leader forward returns with elapsed_ms + ok flag, `debug` after drain with errors count + total_ms. - `WorkerPool::clear_kv_cache`: matching `debug` at fan-out + drain. - `worker::handle_generate_step`: `debug` at forward start + done with elapsed_ms, `warn` on forward failure with the full error. The default log filter is already `info,neuron=debug` so the operator gets every `info` and `debug` line by default; `trace` needs RUST_LOG=trace for per-step decode timing. Stage 7c-ii crash-detection is still future work; this is the minimum that makes the "where did the 120s go" question answerable from the logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 08:22:00 +03:00
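The idea behind the first change, detaching the work from the caller's lifetime so a disconnect cannot leave it half-finished, can be illustrated with the standard library alone. This is an analogue, not the commit's code: it uses `std::thread::spawn` and an `mpsc` channel in place of `tokio::spawn` and a dropped future, and `run_detached` and the step counter are hypothetical stand-ins for the generate/drain loop.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Start the work on its own thread. The returned receiver plays the role of
/// the HTTP response stream; dropping it models a client disconnect.
fn run_detached(
    steps: usize,
    completed: Arc<AtomicUsize>,
) -> (mpsc::Receiver<usize>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for step in 0..steps {
            // Stand-in for one generate_step plus its worker drain.
            completed.fetch_add(1, Ordering::SeqCst);
            // The caller may already be gone; a failed send is not a reason
            // to stop, because the remaining steps still have to drain.
            let _ = tx.send(step);
        }
    });
    (rx, handle)
}

fn main() {
    let completed = Arc::new(AtomicUsize::new(0));
    let (rx, handle) = run_detached(8, Arc::clone(&completed));
    drop(rx); // the "client" disconnects immediately
    handle.join().unwrap();
    println!("steps completed after disconnect: {}", completed.load(Ordering::SeqCst));
}
```

All eight steps complete even though the receiver was dropped first, which is the property the spawned task provides for the worker pipes.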
rob thijssen	d1a4aad91d	fix(tp): always drain worker responses on leader failure. The TP-2 inference probe against Qwen3.6-27B surfaced: `worker rank 1 ClearKvCache: expected KvCacheCleared, got GenerateStepOk`. Caused by pipe poisoning. The previous shape of `generate_step`: (1) fan-out: `for w in workers { w.send_only(GenerateStep) }`; (2) early return on err: `let logits = spawn_blocking(leader.forward)??;`; (3) drain (skipped on 2's err): `for w in workers { w.recv_only() }`. When step 2 returned `Err` (e.g.
a dtype mismatch we hadn't seen before, an OOM, a downstream squeeze that didn't match the shape), the function bailed before step 3 — but workers had already written `GenerateStepOk` to their stdout pipes, since their forwards (and the NCCL collectives inside) completed independently of the leader's post-collective Rust-side work. The next call (typically `ClearKvCache` at the start of the next inference request) would then send a fresh request and read those stale replies as if they were the new operation's. Once a pipe is poisoned, every subsequent call surfaces the same shape of error even though nothing's actually broken. Fix: introduce two helpers in `tp/mod.rs`: - `drain_workers(workers, check)` — reads exactly one response from every worker regardless of individual outcomes. Returns `Vec<String>` of `rank N: detail` strings for any non-OK reply. - `combine_leader_workers(leader, worker_errs, op)` — folds the leader's `Result<Result<T>>` (the spawn_blocking shape) with the worker drain into a single `Result<T>`. Leader failure takes precedence but worker errors get appended so both halves surface. `generate_step` and `clear_kv_cache` now use this pattern. Worst case: both halves fail and the operator sees a combined error message; either way the pipes are always drained so the next call's recv matches the request it sent. Note: the model is still poisoned in the current state — the operator needs to either `POST /models/unload` + reload, or `systemctl restart neuron`, to recover. The fix prevents future desync; it doesn't repair existing stale pipe state. Stage 7c-ii crash detection was tracked as the canonical solution to this class of issue; this is the minimum-viable subset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 07:39:36 +03:00
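The drain-then-combine pattern described in this commit can be sketched with in-process channels standing in for the worker pipes. The helper names follow the commit message, but their signatures, the `Reply` and `Worker` types, the `expect` checker and the exact error wording are assumptions made for illustration.

```rust
use std::sync::mpsc::{channel, Receiver};

#[derive(Debug, Clone, PartialEq)]
enum Reply {
    GenerateStepOk,
    KvCacheCleared,
    Error(String),
}

struct Worker {
    rank: usize,
    rx: Receiver<Reply>,
}

/// Build a checker that accepts exactly one kind of reply (hypothetical helper).
fn expect(expected: Reply) -> impl Fn(&Reply) -> Result<(), String> {
    move |got| match got {
        r if *r == expected => Ok(()),
        Reply::Error(e) => Err(e.clone()),
        other => Err(format!("expected {expected:?}, got {other:?}")),
    }
}

/// Read exactly one reply from every worker, whatever the individual outcomes.
fn drain_workers(workers: &[Worker], check: impl Fn(&Reply) -> Result<(), String>) -> Vec<String> {
    let mut errs = Vec::new();
    for w in workers {
        match w.rx.recv() {
            Ok(reply) => {
                if let Err(detail) = check(&reply) {
                    errs.push(format!("rank {}: {}", w.rank, detail));
                }
            }
            Err(_) => errs.push(format!("rank {}: pipe closed", w.rank)),
        }
    }
    errs
}

/// Fold the leader's nested result with the worker errors: the leader's
/// failure comes first, worker errors are appended.
fn combine_leader_workers<T>(
    leader: Result<Result<T, String>, String>,
    worker_errs: Vec<String>,
    op: &str,
) -> Result<T, String> {
    let leader_err = match leader {
        Ok(Ok(value)) if worker_errs.is_empty() => return Ok(value),
        Ok(Ok(_)) => None,
        Ok(Err(e)) => Some(format!("leader: {e}")),
        Err(join) => Some(format!("leader task: {join}")),
    };
    let mut parts: Vec<String> = leader_err.into_iter().collect();
    parts.extend(worker_errs);
    Err(format!("{op} failed: {}", parts.join("; ")))
}

fn main() {
    let (tx1, rx1) = channel();
    let (tx2, rx2) = channel();
    let workers = vec![Worker { rank: 1, rx: rx1 }, Worker { rank: 2, rx: rx2 }];

    // The leader fails, but both workers have already replied.
    tx1.send(Reply::GenerateStepOk).unwrap();
    tx2.send(Reply::Error("oom".into())).unwrap();
    let errs = drain_workers(&workers, expect(Reply::GenerateStepOk));
    let step: Result<u32, String> =
        combine_leader_workers(Ok(Err("dtype mismatch".into())), errs, "generate_step");
    println!("{step:?}");

    // The pipes are empty again, so the next operation reads its own replies.
    tx1.send(Reply::KvCacheCleared).unwrap();
    tx2.send(Reply::KvCacheCleared).unwrap();
    println!("{:?}", drain_workers(&workers, expect(Reply::KvCacheCleared)));
}
```

The second drain reports no errors because the first one consumed every stale reply, including those produced while the leader was failing.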
rob thijssen	95dc8745eb	feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5). Adds `harness/tp/tp_qwen3_5.rs`, the tensor-parallel variant of the Qwen3-Next architecture, plus the dispatch wiring needed to route a load through it on both the leader and the workers. Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for the full-attention layers + a new pattern for linear-attention): - TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size` V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`, `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along the V-head dim. `out_proj` is row-parallel + AllReduce (the only collective inside the block).
The recurrent state shards 1:1 with V-heads, so there is no cross-rank sync inside the delta-rule loop. `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three regions along dim 0 (`[first key_dim, second key_dim, value_dim]`). Standard uniform-slicing doesn't align with the head boundaries: rank 0 would end up with `[first half of K_0, full K_1, first half of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full tensor, narrow per-region per-rank, and `Tensor::cat` the three slices into a per-rank fused weight. Transient peak of one full tensor per layer during construction; net memory is properly per-rank after the full drops. - TpQwen3_5Attention: column-parallel `q_proj` (the widened `2 * num_heads * head_dim` output, including the gate half; shards along the head axis so both query AND gate halves stay consistent per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce. Otherwise mirrors `tp_qwen3.rs`'s attention. - TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types), TpQwen3_5Model (with `model.language_model.` prefix), and TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level). Dispatch wiring: - New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant. `WorkerPool::load_dense_shard` now dispatches on `model_type` from the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two downstream methods (`generate_step`, `clear_kv_cache`) thread this enum through; the inner forward+clear_kv_cache dispatch happens via the enum's pub methods. Adding another TP architecture later is one more enum variant + match arms. - Worker side gets a parallel `WorkerModel` enum + dispatch in `handle_load_dense_shard`, branching on the same `model_type`. - Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`. `TpLoadedModel.leader_model` retyped to the enum. Helpers in `arch/qwen3_5/linear_attn.rs`: - `softplus` and `repeat_interleave` made `pub(crate)` so the TP module reuses them rather than duplicating.
Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated `Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial RoPE with `partial_rotary_factor` already correct). CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda` also clean inside the patched runner container. Single inflight risk to call out: tensor names. For full-attention layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.` and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*` — the same as the single-GPU path. lm_head sits at the top level (not under `language_model`) — consistent with the single-GPU path that validated against Qwen3.5-0.8B. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:02:42 +03:00
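The per-region, per-rank narrowing used for the fused QKV tensors comes down to index arithmetic, sketched below. `fused_qkv_rows` is a hypothetical helper, not the `load_fused_qkv_slice_{2d,3d}` implementation; it assumes `key_dim` and `value_dim` divide evenly by `world_size` and only computes the dim-0 row ranges that would then be narrowed and concatenated.

```rust
use std::ops::Range;

/// Row ranges along dim 0 that `rank` takes from a fused tensor laid out as
/// [first key_dim rows, second key_dim rows, value_dim rows].
fn fused_qkv_rows(
    key_dim: usize,
    value_dim: usize,
    rank: usize,
    world_size: usize,
) -> [Range<usize>; 3] {
    assert!(rank < world_size, "rank out of range");
    assert!(
        key_dim % world_size == 0 && value_dim % world_size == 0,
        "regions must divide evenly across ranks (assumption of this sketch)"
    );
    let k = key_dim / world_size; // rows per rank in each key region
    let v = value_dim / world_size; // rows per rank in the value region
    // (region start, per-rank length) for the three fused regions.
    let regions = [(0, k), (key_dim, k), (2 * key_dim, v)];
    regions.map(|(start, len)| start + rank * len..start + (rank + 1) * len)
}

fn main() {
    // Toy sizes: key_dim = 4, value_dim = 8, two ranks.
    for rank in 0..2 {
        println!("rank {rank}: {:?}", fused_qkv_rows(4, 8, rank, 2));
    }
}
```

Concatenating the three ranges for a rank yields a per-rank fused weight whose internal layout matches the original `[key, key, value]` ordering, only narrower.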


203 Commits