1818dfb337d3c03c1362cdd0b84994d0c624ce6f
79 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
957f704efa
|
feat(neuron): OpenAI Responses API + ci cuda-check runner label
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / CUDA type-check (push) Failing after 11s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 32s
CI / Clippy (push) Successful in 2m31s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m42s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
Step 2 of the Responses rollout: native `/v1/responses` endpoint on
neuron that consumes the same InferenceEvent stream as
`/v1/chat/completions` but emits it as the Responses API's named
SSE event family. No gateway-side translation.
## Surface
- `cortex-core::responses` envelope types: `ResponsesRequest`,
`ResponsesInput` (text | items), `ResponsesInputItem` (message |
function_call | function_call_output | reasoning),
`ResponsesContentPart` (input_text | input_image | output_text),
`ResponsesResponse`, `ResponsesOutputItem`, `ResponsesUsage`. Plus
a `events::*` constant module so the projector and the wire shape
stay in sync without string-typos.
- `neuron::wire::openai_responses`:
- `request_to_chat(req)` flattens Responses input + instructions
into a `ChatCompletionRequest` the candle harness already
understands. Text-only Parts collapse to a string; mixed
text+image Parts go to chat's content-array shape; reasoning
items drop; function_call / function_call_output round-trip
via tool_calls / tool_call_id metadata so the surface is
consistent for the day the harness emits tool calls.
- `project_responses_stream(rx, meta)` reads InferenceEvents
and emits the eight named events that compose a Responses
stream: response.created → output_item.added → content_part.added
→ output_text.delta×N → output_text.done → content_part.done
→ output_item.done → response.completed. Synthesises start
frames if the producer skips Start (poisoned model, early
disconnect) so the stream stays coherent.
- `build_response(meta, text, reason, usage)` for the
non-streaming path.
- `CandleHarness::inference_stream(req)` extracted from
`chat_completion_stream`, returning a typed `InferenceStream`
(event receiver + id/created/model_id metadata). Both
`chat_completion_stream` and the new `responses_stream` are now
thin wrappers that pick their wire projection. TP path got the
same treatment (`chat_completion_tp_stream` → `inference_tp_stream`).
- `POST /v1/responses` route on neuron. Non-streaming returns one
buffered `ResponsesResponse`; streaming returns axum SSE with
both event names and JSON data per frame (Responses, unlike
chat completions, uses named `event:` lines). Reused
`inference_error_response` helper hoisted out so the chat and
responses handlers share the InferenceError → HTTP mapping.
## CI
Also bundles the `cuda-check` runner-label fix from feedback on
commit
|
|||
|
6927286cab
|
fix(neuron): clone id/model_id before TP spawn so wire projector can use them
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-ampere RPM (push) Blocked by required conditions
build-prerelease / Package helexa-neuron-blackwell RPM (push) Blocked by required conditions
CI / Format (push) Successful in 39s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m34s
CI / Test (push) Successful in 5m40s
build-prerelease / Build cortex binary (push) Successful in 5m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 5m49s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 7m38s
build-prerelease / Build neuron-ada (push) Successful in 5m34s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
The Step 1 refactor moved the InferenceEvent receiver wrap to *after* the orchestration spawn in chat_completion_tp_stream, but the spawn moves both `id` and `model_id` into its async closure (used heavily by acquire_pool_lock, NCCL ops, and tracing). Result: borrowck error E0382 use-of-moved-value on the wire_chat::project_chat_stream call. The non-CUDA build doesn't exercise this branch (it lives behind `#[cfg(feature = "cuda")]`) which is why the workspace clippy/test gate passed locally and on the regular CI workflow. The RPM build workflow, which compiles with --features cuda, caught it (run 244 jobs 2/3/4 against beast / ampere / ada respectively, all the same error). Fix: snapshot `id` and `model_id` into `projector_id` / `projector_model_id` before the spawn, use those at the projector call site. The originals stay free to be moved into the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
|||
|
302ccfb982
|
refactor(neuron): introduce InferenceEvent + wire projection layer
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 3m28s
build-prerelease / Build neuron-blackwell (push) Failing after 6m4s
build-prerelease / Build neuron-ampere (push) Failing after 7m20s
CI / Test (push) Successful in 7m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ada (push) Failing after 4m57s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been skipped
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been skipped
Step 1 of the OpenAI Responses API rollout. Pure refactor — no new
endpoints, no behaviour change on the wire. Lays the seam for
emitting Responses-shaped streaming events from the same harness
output as chat completions in Step 2.
- New `neuron::wire` module tree:
- `wire::event::InferenceEvent` — format-agnostic enum
(Start, TextDelta, ReasoningDelta, Finish) the candle harness
now emits as its native streaming currency.
- `wire::event::FinishReason` — typed reason that maps cleanly
onto OpenAI `finish_reason`, OpenAI Responses `status`, and
Anthropic `stop_reason` strings.
- `wire::openai_chat::project_chat_stream` — async task that
consumes an InferenceEvent receiver and produces a
ChatCompletionChunk receiver, stamping per-request metadata
(id, created, model_id) onto every chunk. Output matches the
pre-refactor wire shape bit-for-bit.
- candle.rs refactored to emit InferenceEvent on its internal
channel through all three streaming paths (CPU
run_inference_streaming, CUDA single-GPU stream_inference_via_worker,
CUDA TP chat_completion_tp_stream). The streaming functions lost
their id/created/model_id parameters since wire-format metadata
now lives in the projector.
- emit_delta + emit_delta_blocking simplified to single-purpose
TextDelta emitters with no wire-format coupling.
- chat_completion_stream wraps the InferenceEvent receiver in
wire_chat::project_chat_stream before returning so the
/v1/chat/completions HTTP handler keeps consuming
ChatCompletionChunks unchanged. External signature preserved.
Also fixes a pre-existing helexa-acp test race (three modules each
declared their own static LOCK for HOME mutation, so cross-module
parallelism flaked tests that read HOME at runtime). Consolidated
onto a single crate-wide path_util::ENV_LOCK.
122 helexa-acp tests + 44 neuron tests pass (5 new wire projection
tests). fmt + clippy --workspace -- -D warnings clean. Ran helexa-acp
suite 3x to confirm the env race is closed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|||
|
abbedf8d8a
|
chore(neuron): bump default max_tokens from 512 to 8192
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 44s
CI / Format (push) Successful in 45s
CI / Clippy (push) Successful in 2m41s
build-prerelease / Build neuron-blackwell (push) Successful in 5m35s
build-prerelease / Build cortex binary (push) Successful in 4m32s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 8m6s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
512 is too low for any modern coding model — clients that don't explicitly set max_tokens get clipped responses with no diagnostic. Bump the fallback at all four inference call sites (single-GPU streaming + non-streaming, TP leader + non-leader) to 8192, which fits comfortably within Qwen3-class context windows after a typical agent prompt and lines up with what helexa-acp / a0 / curl clients reasonably expect. Clients that explicitly set max_tokens (now including helexa-acp via HELEXA_ACP_MAX_TOKENS / per-endpoint TOML) override this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
e267f583e1
|
chore(neuron): rustfmt drift in is_device_fault test
Some checks failed
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 58s
CI / Clippy (push) Failing after 3m43s
CI / Test (push) Successful in 5m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m48s
build-prerelease / Build neuron-blackwell (push) Successful in 6m10s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 7m41s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 9m18s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
One assert! call grew past the line limit after the previous commits; cargo fmt --all picked it up. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
249b2e5c98
|
fix(neuron): only poison the model on actual device faults
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 38s
CI / Clippy (push) Successful in 2m22s
CI / Test (push) Successful in 4m55s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 5m49s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 8m7s
build-prerelease / Build neuron-ada (push) Successful in 5m0s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CI / Format (push) Failing after 33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Previously every inference Err — shape mismatch, NaN logits, tokenizer
error, missing handle — marked the model poisoned and rejected every
subsequent request until an operator unload+reloaded. The benjy
incident on 2026-05-27 showed how this misfires: a concurrency bug
produced a `broadcast_add: shape mismatch` error that had nothing to
do with CUDA, but the model was taken down anyway.
Add `is_device_fault(err_chain: &str)` — a conservative classifier
that returns false only for errors we know are pre-kernel / CPU-side
(shape mismatches, NaN logits, tokenize/detokenize, missing handle,
DecodeStream, empty prompt). Everything else defaults to true so a
genuine driver fault still poisons.
Applied at all six poisoning sites:
- chat_completion CUDA worker path
- chat_completion CPU spawn_blocking path
- chat_completion_stream CUDA worker path
- chat_completion_stream CPU spawn_blocking path
- chat_completion_tp non-streaming wrapper
- chat_completion_tp_stream spawned task
Each site now logs either "model marked poisoned" (device fault) or
"model NOT marked poisoned" (non-device) so the journal makes the
classification visible. Tests cover the known non-device patterns and
a couple of real CUDA driver messages.
Pairs with the inference_lock commit (
|
|||
|
c59da83636
|
fix(neuron): serialise single-GPU inference per loaded model
Two concurrent chat_completion requests against the same single-GPU model could interleave their `clear_kv_cache → forward(chunk0) → forward(chunk1) → ...` sequences. The device-worker channel serialises individual jobs but not the sequence boundary, so the cache could end up holding tokens from one request while another's mask was sized for its own prompt — producing a shape mismatch mid-prefill. Observed on benjy 2026-05-27 18:41:05: agent-zero's `memorize memories` and `memorize solutions` extensions fired 4ms apart against Qwen/Qwen3-8B (a0's utility model). Both prefilled into the same KV cache, and request a08b4a's chunk 0 forward produced scores of shape [1, 32, 512, 1024] against a mask of [1, 1, 512, 512] — broadcast_add failed, both requests bubbled the error up, both flipped the model to poisoned. Add `LoadedModel.inference_lock: tokio::sync::Mutex<()>`, mirroring the TpLoadedModel.pool lock that the TP path already held. Acquire it at the start of `chat_completion` and inside the spawned task of `chat_completion_stream` (so the role chunk goes out immediately and only the inference work queues behind the lock). The CPU branch uses `blocking_lock` from inside spawn_blocking; the CUDA branch uses async `.lock().await` inside tokio::spawn. Throughput impact: zero. The GPU was already serialised at the device-worker channel — multiple requests just produced corrupt KV cache state instead of clean serial throughput. The lock makes the existing serialisation honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
f05882369d
|
fix(neuron): don't poison the model on tokio JoinError panics
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m18s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m28s
build-prerelease / Build neuron-ampere (push) Successful in 8m25s
build-prerelease / Build neuron-ada (push) Successful in 8m54s
CI / Test (push) Successful in 4m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m5s
CUDA driver failures propagate as Err through `?` and become
`Ok(Err(InferenceError::Other(_)))` from the spawned task — those are
real device faults and still poison the model. Tokio JoinError is
different: it fires on Rust-level panic (tokenizer bug, sampler bug,
serialisation, the UTF-8 slice that landed in commit
|
|||
|
bd04d7f580
|
fix(neuron): stream tokens via DecodeStream to avoid UTF-8 panic
When BPE byte-fallback splits a multi-byte UTF-8 char (e.g. an emoji) across multiple tokens, the previous "decode the cumulative token list, byte-slice the delta against a stored prefix" pattern would panic with 'start byte index N is not a char boundary; it is inside <emoji>'. The race: at step N the tokenizer renders the partial bytes as U+FFFD (3 bytes); at step N+1 it can decode the complete codepoint (e.g. 4 bytes for 🌫). `decoded_prefix.len()` from step N then lands inside the codepoint in step N+1's `full` string, and `&str[start..]` panics. Replace with tokenizers' `DecodeStream::step(id)` which maintains an internal byte buffer across token boundaries and only emits when a clean codepoint completes. Applied at all three SSE emission sites: - stream_inference_via_worker (single-GPU CUDA stream) - chat_completion_tp_stream's spawned task (TP stream) - run_inference_streaming (CPU stream) The shared emit helper splits into emit_delta (async, mpsc::send) and emit_delta_blocking (sync, mpsc::blocking_send) so each path keeps its existing send semantics. The old emit_chunk helper that did the unsafe full-decode-and-slice is removed entirely. Observed on beast 2026-05-27 17:49:55 — model emitted 🌫 in a tool-call response after a long agent-zero session; the spawned TP stream task panicked at candle.rs:2648. The model itself stayed healthy (no CUDA fault), only the one streaming request died. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
1e13889392
|
feat(neuron): chunked prefill + VRAM/prompt-length pre-flight checks
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m9s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 5m1s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-blackwell (push) Successful in 11m7s
build-prerelease / Build neuron-ampere (push) Successful in 12m16s
build-prerelease / Build neuron-ada (push) Successful in 12m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m3s
Prevents the OOM-during-prefill → poisoned-context → 5-minute-reload
cycle observed on beast under agent-zero workloads. Three changes,
all keyed off env-driven knobs so an operator can tune without a
rebuild:
1. Chunked prefill (NEURON_PREFILL_CHUNK_TOKENS, default 512). The
initial forward is split into N-token windows, each with a
monotonically growing offset. KV cache accumulates across chunks
exactly as it would under one big prefill; only the final chunk's
logits are kept for sampling. Activation memory now scales with
chunk size instead of prompt length, so a 13 k-token prompt stops
holding tens of GB of intermediate activations live at once.
Wired into all six prefill call sites:
- run_inference / run_inference_streaming (CPU path)
- run_inference_via_worker / stream_inference_via_worker (CUDA
single-GPU through device worker)
- chat_completion_tp_inner / chat_completion_tp_stream (TP via
WorkerPool)
Three helpers — chunked_prefill_local, chunked_prefill_via_worker,
chunked_prefill_tp — own the loop shape so the chunking semantics
stay identical across paths. Per-chunk debug log shows progress.
2. Max prompt length (NEURON_MAX_PROMPT_TOKENS, default 16384).
Requests above the cap return a structured 400 with
`code: prompt_too_long` rather than going through the prefill and
discovering the limit by OOMing partway through. New
InferenceError::PromptTooLong variant.
3. Minimum free VRAM gate (NEURON_MIN_FREE_VRAM_MB, default 1500).
If `vram_free_mb` is below the threshold at request start (e.g.
another concurrent request is mid-prefill), reject with a clean
503 + `code: insufficient_vram` rather than starting work that
will OOM. New InferenceError::InsufficientVram variant. CPU loads
(vram=0 sentinel) skip this check.
All three gates fire BEFORE any device work, so a rejected request
costs ~one tokenisation pass and never touches the worker thread —
poison cascades from rejected work are now impossible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
35876954cd
|
chore(neuron): default tracing filter to info (was info,neuron=debug)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m17s
CI / Test (push) Successful in 4m43s
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 5m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Production deployments that want neuron-internal debug detail (e.g. trim_device_pool's per-clear-kv line, slab inserts/drops) override RUST_LOG explicitly via systemd. Defaulting to debug for the whole neuron target produced a lot of journal volume that wasn't useful in the common case. beast already sets RUST_LOG=debug in /etc/systemd/system/neuron.service.d/local.conf, so beast's verbosity is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
cdf0f4e66d
|
fix(neuron): trim cudarc mempool after clear_kv_cache to release VRAM
cudarc's stream-ordered memory pool retains freed blocks (cuMemFreeAsync returns memory to the device's default mempool, not to the OS), so mem_get_info under-reports free VRAM between requests. With Qwen/Qwen3.6-27B TP=2, the second consecutive chat completion saw ~4.5 GB of "missing" free VRAM and either OOMed or tripped cuBLAS into CUBLAS_STATUS_INTERNAL_ERROR depending on quant. Add a cuda-gated trim_device_pool helper that, after each successful clear_kv_cache, synchronizes the context and calls cuMemPoolTrimTo(pool, 0) against the device's default mempool. Failures (no async-alloc support, transient driver errors) are non-fatal and log at debug. The before/after free-VRAM delta is logged so an operator can correlate the trim with the next request's prefill VRAM. ConcatKvCache::reset() in candle-nn 0.10.2 already drops its tensors correctly; the leak was strictly at the cudarc pool layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
b4f3576d82
|
refactor(neuron): phase 4 — model loads move onto the device worker
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m25s
CI / Test (push) Successful in 4m40s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Package cortex RPM (push) Successful in 1m20s
build-prerelease / Build neuron-ampere (push) Successful in 5m7s
build-prerelease / Build neuron-ada (push) Successful in 5m19s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Final structural slice of the per-device CUDA context-ownership refactor. The four remaining spawn_blocking sites that did CUDA work on the leader are gone: - Single-GPU GGUF load (`load_arch_gguf` spawn_blocking) → `Job::LoadGguf` dispatched on the worker. - Single-GPU dense load (`load_arch_dense` spawn_blocking) → `Job::LoadDense` on the worker. - TP shard load (`WorkerPool::load_dense_shard` spawn_blocking) → `Job::TpLoadShard`. The dispatch handler reads `state.nccl.comm()` directly — no cross-thread `Arc<Comm>` transfer, no `SendComm` wrapper for this path. The Phase 2 / Phase 3 bridges that moved freshly-built models across the channel boundary (`Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm`) are removed. Models are now constructed on the worker thread directly; the slab gets populated by `insert_arch` / the inline `tp_models.insert` in dispatch handlers. What this phase preserves: - CPU loads still use `tokio::task::spawn_blocking` against `Arc<Mutex<ModelArch>>`. There's no CUDA context to own on CPU and channel overhead would only add latency. Four `spawn_blocking` references remain in `candle.rs` (load_arch_gguf, load_arch_dense, chat_completion, chat_completion_stream) and all are deliberate CPU-only fallback. - Public API unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical signatures. What this phase removes: - `SendComm` wrapper is no longer used in the load path (the Phase 3 bridge that justified it). It remains in `nccl_state.rs` for the Phase 1–3 era and any future cross-thread Comm move; consider deleting in a follow-up. - `Job::TransferIn`, `Job::TransferInTp`, `Job::CloneLeaderComm` and their handle convenience methods deleted. - The leader_device parameter on `load_dense_shard` is now `_` — unused since the worker has its own bound device. Removing the arg outright is a public-API change; keeping the underscore prefix preserves the signature and signals deadness without churn. Helper relocation: - `LlamaDense::from_parts` is a new pub(crate) constructor so the worker-thread loader can build a `LlamaDense` without going through the original `load_arch_dense` async function. - `check_dense_config_supported` is bumped to `pub(crate)` for the same reason. Sweep verified: `grep -rn spawn_blocking crates/neuron/src/harness/` returns only CPU-fallback hits in `candle.rs` + doc-comment references to the old design. All four leader-side CUDA `spawn_blocking` sites are gone. fmt + clippy clean; 37 lib tests + all integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
76ab24d98c
|
refactor(neuron): phase 3 — TP forward + NCCL state move onto device worker
Some checks failed
CI / Format (push) Successful in 29s
build-prerelease / Resolve version stamps (push) Successful in 32s
CI / Test (push) Failing after 58s
CI / Clippy (push) Successful in 2m31s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m13s
build-prerelease / Build neuron-blackwell (push) Successful in 4m1s
build-prerelease / Package cortex RPM (push) Successful in 1m30s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Third slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The leader's `NcclState`, every `Comm::all_reduce` issued by the TP layers, the leader-side KV cache reset, and the TP forward step itself now all run on the per-device worker thread — the same OS thread that bound the leader's `CudaContext` at startup. What this phase changes: - `Job` gains `NcclInit`, `NcclSanity`, `CloneLeaderComm` (Phase 3 bridge — Phase 4 removes), `TransferInTp`, `DropTp`, `TpClearKv`, `TpForwardLogits`. Plus a new `TpHandle(u64)` opaque key. - `DeviceWorkerState` gains `nccl: NcclState` and `tp_models: HashMap<TpHandle, Box<TpLeaderModel>>` (+ counter). - `WorkerPool` loses its `leader_nccl` field; gains a `leader_worker: Arc<DeviceWorkerHandle>` passed at construction. `init_nccl`, `nccl_sanity_check`, `load_dense_shard`, `generate_step`, `clear_kv_cache` all route their leader-side ops through `Job::Nccl*` / `Job::Tp*` instead of spawn_blocking against a Mutex-wrapped state. `generate_step` returns `Vec<f32>` instead of a device-resident `Tensor` — the worker copies logits to CPU before reply so the async caller can sample on a CPU candle tensor with zero device-context touch. - `TpLoadedModel.leader_model: Arc<Mutex<TpLeaderModel>>` → opaque `leader_handle: TpHandle`. The boxed `TpLeaderModel` lives in the worker thread's slab; both the model's CUDA tensors and the embedded `Arc<Comm>` clones release on the same thread that allocated them (the Drop semantics constraint cudarc forces). - `Job::CloneLeaderComm` is a Phase 3 bridge: the TP shard load still runs in spawn_blocking and needs the leader's `Arc<Comm>` to build the row-parallel layers' AllReduce ops. The Job clones the Comm out of the worker's NcclState and ships it back as `SendComm`. Phase 4 deletes this bridge when the load itself moves onto the worker. - `Job::NcclInit` and `Job::NcclSanity` are ungated by `cuda` so the no-cuda `NcclState` stubs (which reply with `cuda_feature_not_enabled`) still flow through the same channel uniformly; the cuda-only TP variants (CloneLeaderComm, Transfer/Drop/Clear/Forward Tp) remain gated. What this phase doesn't touch (yet): - TP shard load itself — still spawn_blocking, bridged via `CloneLeaderComm`. Phase 4 moves it to `Job::TpLoadShard` and reads `state.nccl.comm()` directly inside the worker. - Single-GPU model loads — still spawn_blocking, transferred via `Job::TransferIn`. Phase 4 moves them. - `device_vram_mb` / `cuda_mem_mb` / `log_construction_complete` helpers — still present, used inside spawn_blocking load closures. Phase 4 cleanup folds them into `dispatch.rs`. `tp/mod.rs::WorkerPool::spawn` gained a required `leader_worker: Arc<DeviceWorkerHandle>` argument. Three external callers were updated: `CandleHarness::load_tp` (passes the cached device worker), `main.rs::tp_smoke` (spawns a fresh worker), and the two `tp_worker_lifecycle*.rs` integration tests. Public API unchanged. fmt + clippy clean; 37 lib tests + all integration tests pass. CUDA-only TP integration smoke deferred to the next deploy on beast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
b179204fd3
|
refactor(neuron): phase 2 — single-GPU forward + clear_kv route through device worker
Some checks failed
build-prerelease / Package helexa-neuron-ada RPM (push) Blocked by required conditions
CI / Format (push) Successful in 34s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Resolve version stamps (push) Successful in 3m41s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m20s
build-prerelease / Build cortex binary (push) Successful in 12m20s
build-prerelease / Build neuron-ada (push) Successful in 5m17s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
Second slice of the per-device CUDA context-ownership refactor planned at
~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. The two
spawn_blocking sites in `chat_completion` and `chat_completion_stream`
now route through the device worker thread on CUDA loads. CPU loads
keep the existing spawn_blocking + `Arc<Mutex<ModelArch>>` path; there's
no context to own and the channel hop would only add latency.
What this phase changes:
- `Job` gains `TransferIn`, `DropArch`, `ClearKv`, `ForwardLogits`. The
worker's dispatch state grows a `HashMap<ArchHandle, Box<ModelArch>>`
slab and a `next_handle` counter for minting opaque handles.
- `LoadedModel.arch: Arc<Mutex<ModelArch>>` → `Option<Arc<Mutex<>>>`,
plus a new `arch_handle: Option<ArchHandle>` field. The two are
mutually exclusive: CUDA loads set `arch_handle = Some(_)` after
transferring the boxed arch into the worker's slab; CPU loads keep
`arch = Some(_)` for the legacy spawn_blocking path.
- New `run_inference_via_worker` and `stream_inference_via_worker`
drive the prefill + decode loop by sending `Job::ForwardLogits` per
step; the worker copies the resulting `[vocab]` logits to a
CPU-side `Vec<f32>` before reply, so the async caller never holds a
device-resident tensor. `apply_repeat_penalty` and
`LogitsProcessor::sample` run on a CPU candle tensor; no context
binding side-effects on tokio worker threads.
- `logits_health_slice(&[f32])` complements the existing
`logits_health(&Tensor)` so the new worker paths can compute
health stats directly from the CPU vec.
- `unload_model` for the single-GPU CUDA path now sends
`Job::DropArch { handle }` to the worker so the `Box<ModelArch>`
drops on the thread that allocated its CUDA tensors. The `Drop` runs
with the bound context, freeing memory on the right context.
What this phase doesn't touch (yet):
- TP forward, TP load, NCCL bring-up — still on spawn_blocking. Phase 3.
- Single-GPU model load — still spawn_blocking, followed by a
`Job::TransferIn` to move the freshly-built `ModelArch` into the
worker slab. Phase 4 moves the load itself onto the worker thread
and eliminates the bootstrap TransferIn.
- The `device_vram_mb` / `cuda_mem_mb` helpers — still present and
used by the construction-time logs running inside spawn_blocking
loads. Phase 4 cleanup folds them into `dispatch.rs`.
Public API unchanged. fmt + clippy clean; 37 lib tests + all
integration tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
081b532387
|
refactor(neuron): phase 1 — per-device worker thread, VRAM queries route through it
Some checks failed
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Clippy (push) Failing after 59s
build-prerelease / Build neuron-blackwell (push) Successful in 3m30s
CI / Test (push) Successful in 4m47s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m17s
build-prerelease / Package cortex RPM (push) Successful in 1m32s
build-prerelease / Build neuron-ampere (push) Successful in 5m16s
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
First slice of the per-device CUDA context-ownership refactor planned at ~/.claude/plans/plan-the-per-device-worker-abstract-micali.md. Adds the infrastructure for a dedicated OS thread per CUDA device that owns the device's `CudaContext` for the daemon's lifetime, and routes the 8 async-context `device_vram_mb()` call sites in candle.rs through it. What this phase changes: - New module `harness/device_worker/` (mod.rs, jobs.rs, dispatch.rs). `DeviceWorkerHandle::spawn(idx)` creates a named OS thread (`cuda-dev-N`), binds `CudaContext::new(idx)` once at startup, and enters a dispatch loop reading `Job`s off a `std::sync::mpsc` channel. Replies cross back via `tokio::sync::oneshot::Sender` so async callers await without parking a tokio worker. - Two Job variants: `QueryVram` and `Shutdown`. Phases 2–4 add Forward, ClearKv, NCCL init/sanity, and load variants. - `LoadedModel` and `TpLoadedModel` gain a `worker` field populated at load time by a new `CandleHarness::ensure_device_worker(idx)` method that lazily spawns + caches one worker per device index. - Per-model `query_vram()` convenience method on both struct types so the 8 call sites in chat_completion / chat_completion_stream / chat_completion_tp_inner / chat_completion_tp_stream become `loaded.query_vram().await` (or `tp.query_vram().await`) — same field values logged, just sourced from the owner thread instead of the caller thread. What this phase doesn't touch (yet): - Forward, kv-cache clear, model load, NCCL — still on `spawn_blocking`. Phase 2 moves the single-GPU forward + clear; Phase 3 moves the TP forward + NCCL bring-up; Phase 4 moves the loads and deletes the now- unused `device_vram_mb` / `cuda_mem_mb` helpers. - Public API — unchanged. `Harness::load_model`, `chat_completion`, HTTP routes all keep identical shapes. Tests: - 5 new unit tests in `device_worker/mod.rs::tests` cover spawn → query → shutdown round-trip, thread naming, post-shutdown submit returns `Gone`, poisoned flag fast-rejects, and concurrent jobs drain across a Shutdown. CPU build (the only one CI runs) is enough to exercise channel mechanics. - All 37 lib tests + all integration tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
7c19da9361
|
feat(neuron): construction-complete vram/config dump + logits health + per-step vram
All checks were successful
CI / Format (push) Successful in 40s
build-prerelease / Resolve version stamps (push) Successful in 45s
CI / Clippy (push) Successful in 2m27s
build-prerelease / Build cortex binary (push) Successful in 4m24s
build-prerelease / Build neuron-blackwell (push) Successful in 4m0s
build-prerelease / Package cortex RPM (push) Successful in 1m18s
build-prerelease / Build neuron-ampere (push) Successful in 5m10s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Test (push) Successful in 4m24s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
Three additive diagnostics that turn the 2026-05-27 q5k Qwen3.6-27B incident from "guess at KV cache / quant sizes" into "read the journal": 1. Construction-complete summary in TpQwen3_5ForCausalLM::load and TpQwen3ForCausalLM::load. After the last "after layer N" log fires, each rank emits a single info line with: free_mb/total_mb (the number that drops by ~9 GB between per-layer and first-request on beast, with no inference traffic), every resolved config knob (vocab_size, hidden_size, num_layers, head_dim, num_kv_heads, max_position_embeddings), and a per-token KV-cache byte estimate. For Qwen3-Next also includes the linear/full-attention layer split so the hybrid architecture's cache cost is unambiguous. 2. Logits health snapshot on sample failure. Today the failure logs "A weight is negative, too large or not a valid number" with no context — was it a NaN cascade, an Inf, a negative weight? `logits_health(&logits)` computes nan/pos_inf/neg_inf/neg counts plus finite_min/max/mean on the failure path (zero cost on the success path) and emits a warn line just before the wrapper's terminal "failed, model marked poisoned" log. Wired into both the prefill and decode sample sites of the non-streaming AND streaming TP chat paths. 3. VRAM snapshot at prefill complete + every decode step. The "prefill complete" info line now carries vram_free_mb so the activations + KV growth from the prefill itself is visible. The per-step trace line gets vram_free_mb too, so an operator running with RUST_LOG=trace can watch headroom shrink token by token. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
800498f530
|
feat(neuron): bind listener before pre-warm, surface activation in /health
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m26s
build-prerelease / Build neuron-blackwell (push) Successful in 3m34s
CI / Test (push) Successful in 4m44s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m29s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two coupled changes addressing the 2026-05-26 validate-neuron failure
where a fresh deploy of beast had /health unreachable for ~5 minutes
while Qwen3.6-27B q5k materialised, even though systemd reported the
unit as active.
1. main.rs no longer awaits load_default_models before binding axum.
The listener binds first; pre-warm runs in a spawned background
task that holds a read lock on the harness registry for the
duration of its sequential load loop. Concurrent on-demand
/models/load and /v1/chat/completions traffic still flow.
2. /health gains an `activation` field carrying:
state pre_warming | ready
pending model ids queued but not started
in_progress model id currently loading (Option)
completed model ids loaded successfully this activation
failed [{model_id, error}] for failed entries
The field is `#[serde(default)]` so a pre-change cortex polling a
new neuron — or vice versa — keeps working.
`ActivationTracker` (new module `neuron::activation`) owns the
RwLock-wrapped state; load_default_models takes a tracker reference
and updates it per-model. NeuronState holds an Arc clone for the
/health handler.
Tests updated to construct trackers and assert state transitions
(empty noop, two failures → ready with both in `failed`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
2740e61a23
|
fix(neuron,candle): name lifetime on acquire_pool_lock
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 46s
CI / Format (push) Successful in 46s
CI / Clippy (push) Successful in 2m15s
CI / Test (push) Successful in 5m8s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-ampere (push) Successful in 5m25s
build-prerelease / Build neuron-ada (push) Successful in 5m3s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 7m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Lifetime elision fails when a function has two reference parameters and returns a borrow: rustc can't infer whether the MutexGuard's lifetime ties to `pool` or `model_id`. The non-CUDA build skipped this code path (cfg-gated), so the error only surfaced on the GPU build at https://git.lair.cafe/helexa/cortex/actions/runs/162. The guard borrows the pool, so name the lifetime on `pool` and the return type. `model_id` keeps its independent (elided) lifetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
67f79c868f
|
fix(neuron,shutdown): time-bound unloads, fast-exit past tokio drain
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 42s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m46s
build-prerelease / Build neuron-blackwell (push) Failing after 3m32s
CI / Test (push) Successful in 4m25s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Two failure modes from the 2026-05-26 beast incident: 1. `unload_all_models` looped through models calling `unload_model`, logging individual failures at warn. The cumulative effect was a single warn line for the failed unload then "shutdown complete" — no signal that the model was actually still loaded. Now each unload is bounded by a 20s timeout, failures escalate to error, and a summary "leaving N model(s) loaded" line fires when anything is stuck so the operator knows the OS will reclaim VRAM after exit. 2. Returning `Ok(())` from `main` after the unload sweep dropped the tokio runtime, which then waited indefinitely on a CUDA-stuck spawn_blocking thread (the journal's "Stack trace of thread 2951308" — spinning on `cuCtxGetCurrent`). systemd's TimeoutStopSec fired 2 minutes later, SIGABRT, core dump. Replacing the return with `std::process::exit(0)` skips the runtime drain and hands the OS a clean exit code; stuck threads get reaped with the process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
fc6ef0ee0f
|
feat(neuron,candle): detect CUDA context poisoning and refuse follow-ups
Once a CUDA driver error has hit a forward or kv-cache call, the device's context is unrecoverable in-process — subsequent kernels can hang (the failure mode seen on beast on 2026-05-26), return garbage, or trip another illegal-address. The harness now marks the model poisoned on any forward / spawn_blocking / TP-task failure, refuses further inference against it with a clear "unload and reload" error, and surfaces `status: "poisoned"` on `/models` so an operator running `curl beast:13131/models` (or cortex polling) can see the bad state. Without this, a single OOM on a too-large prefill quietly turned every subsequent request into a stuck wait on the pool lock; with it, the first request fails fast with the driver error in the journal and the client gets a usable 5xx instead of a hung connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
1385979e3d
|
feat(neuron,candle): log per-device VRAM at chat_completion start
Every "starting" log line now carries vram_free_mb / vram_total_mb for the request's serving device (the leader device on TP). On the 2026-05-26 incident this would have made the 14k-token prefill OOM diagnosable from the first log line: with ~412 MB free, that prompt was never going to fit, and the operator could have caught the imbalance before the CUDA context got poisoned. `device_vram_mb` mirrors the existing helper in tp_qwen3_5.rs and is kept separate to avoid coupling the inference path to the TP module. TpLoadedModel gains a `leader_device: Device` clone so the request path reads the device without locking the leader model (which would contend with an in-flight forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
0a1cfcd4d0
|
feat(neuron,candle): req_id spans, terminal failure logs, pool-lock warnings
Every chat completion path (single-GPU + TP, streaming + non-streaming)
now opens an `info_span!("chat", req_id=…, model=…)`. The fmt subscriber
prefixes every event with that span so `grep req_id=…` over journalctl
reconstructs one request even when dozens overlap.
Every path also emits a terminal log line on both success ("done", with
prompt_tokens/completion_tokens/finish_reason/total_ms) and failure
("failed", with full anyhow chain + total_ms). Failures used to vanish
silently — a request that hit a CUDA OOM left "starting" in the journal
and no further trace.
New `acquire_pool_lock` helper replaces the bare `tp.pool.lock().await`
in both TP paths. It warns at 2s ("still waiting on pool lock") and
re-warns every 2s thereafter, so queued requests stuck behind a
deadlocked holder are visible immediately instead of looking like idle
silence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
ea0e0f7911
|
fix(neuron,tp): log leader forward errors with full context
Worker rank failures were already surfaced at WARN, but the leader's own forward Result::Err was silently coerced to a `leader_ok=false` bool. When the leader and a worker both fail together — the typical shape of a CUDA OOM cascading into an illegal-address — the journal showed only the worker side and an operator had to guess what hit rank 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
e71181499e
|
feat(stage-8e-3): quantize lm_head in TP Qwen3-Next
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 42s
build-prerelease / Build neuron-blackwell (push) Successful in 3m43s
build-prerelease / Build cortex binary (push) Successful in 4m25s
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m23s
build-prerelease / Build neuron-ada (push) Successful in 4m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m42s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
CI / Format (push) Successful in 30s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
TpQwen3_5ForCausalLM::lm_head is now a MaybeQuantLinear. When the load spec has quant set and tie_word_embeddings is false, lm_head's (vocab_size, hidden_size) weight is quantized in-situ at load time along with all the per-layer linears. The non-tied case on Qwen3.6-27B saves ~1.7 GB per rank vs bf16 (248320 x 5120 x 2 bytes = 2.42 GB -> ~700 MB at Q5K) and shaves a small amount of decode latency from the per-token logits matmul. Tied case (tie_word_embeddings=true) keeps the lm_head plain even when quant is set — quantizing the shared tensor would corrupt the embedding lookup, and the tied case already gets the memory win from only holding one copy. This is the last MaybeQuantLinear hookup in the Qwen3-Next TP path. The dense Qwen3 path (tp_qwen3.rs) is unchanged — defer until it's the bottleneck for a model that actually needs TP at consumer scale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
ee663e5e99
|
fix(stage-8e-2e): bump quant prefill threshold to M > 64
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
CI / Format (push) Successful in 34s
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Clippy (push) Successful in 2m20s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
The M > 8 threshold from 8e-2d activated forward_via_f16 on the test case (M=30) and slightly regressed prefill (143 -> 133 T/s). The dequant cost (~30 MB f16 per linear * ~480 calls per prefill = ~200 ms) eats the cuBLAS GEMM speedup at small M. Move the crossover to M > 64 so short prefills (typical for the validate probe) stay on the GGUF GEMV kernel where per-call cost is comparable but the dequant tax is zero. Long prefills still get the dequant-then-cuBLAS-GEMM path where the GEMM scaling amortises the fixed dequant cost. Doesn't close the gap to mistralrs's 423 T/s on Q5K prefill — that needs either a dequant cache (gives back the ISQ memory win) or a fused dequant+gemm kernel. Both larger projects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
34f9b77d9d
|
feat(stage-8e-2d): route quantized matmul by M (prefill vs decode)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 41s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m40s
build-prerelease / Build cortex binary (push) Successful in 4m20s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m58s
build-prerelease / Build neuron-ampere (push) Successful in 5m14s
build-prerelease / Package cortex RPM (push) Successful in 9m25s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
MaybeQuantLinear::forward picks between two QMatMul paths: - M > 8 (prefill): QMatMul::forward_via_f16 dequantises the weight once into f16 and runs a real cuBLAS-backed GEMM. The dequant cost is fixed per call, so it's amortised across the M tokens. - M <= 8 (decode): QMatMul::forward uses candle's GGUF GEMV kernel on the quantized blocks directly. Requires f32 inputs so we still cast in/out at the boundary in that arm. Earlier 8e-2c sent everything through the GGUF GEMV kernel, which is excellent at GEMV (decode) but doesn't have a real batched GEMM path — prefill regressed ~4x. This restores prefill to roughly the bf16 cuBLAS GEMM throughput while keeping the decode gain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
f084aaab8e
|
fix(stage-8e-2c): cast bf16/f16 activations to f32 around QMatMul
All checks were successful
CI / Format (push) Successful in 33s
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Clippy (push) Successful in 2m18s
CI / Test (push) Successful in 4m26s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Build cortex binary (push) Successful in 4m22s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 5m12s
build-prerelease / Build neuron-ada (push) Successful in 4m41s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m59s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m5s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m48s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
candle's QTensor::cuda_fwd requires f32 inputs — its on-the-fly GGUF dequantize accumulates in f32. The model dtype flowing into MaybeQuantLinear::forward is bf16, so QMatMul::forward errored with "unexpected dtype, expected: F32, got: BF16". Wrap the Quant arm to cast the activation to f32 before the matmul and cast the result back to the input dtype. The cast is a single launch on the activation tensor (small relative to weight traffic); it's the price of in-situ GGUF-style quantization, and what mistralrs does inside its own Linear wrapper. The Plain arm is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
68a606a79c
|
fix(stage-8e-2b): allow quant on the TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m29s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 8m37s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Build neuron-ada (push) Successful in 4m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m57s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 12m35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
The pre-existing guard in candle.rs rejected any spec.quant on the TP path with "GGUF quantized models are not supported in the TP path" — written when quant only ever meant GGUF. With 8e-1/8e-2 in, quant != None on the TP path triggers in-situ quantization of the loaded safetensors shards. resolve_dense_files only looks for safetensors so a GGUF-source-file model with TP still errors out cleanly downstream. validate-neuron.sh: rebuild the load payload incrementally so tp_size > 1 + non-empty quant produces both fields. Same script now covers all four combos (single/TP × dense/ISQ). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
4aa71902d0
|
feat(stage-8e-2): plumb quant config from ModelSpec to TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m7s
CI / Test (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m47s
build-prerelease / Build neuron-ampere (push) Successful in 5m17s
build-prerelease / Build neuron-ada (push) Successful in 5m14s
build-prerelease / Build cortex binary (push) Successful in 18m31s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m7s
- LoadDenseShard RPC gains an optional `quant` string field. - WorkerPool::load_dense_shard takes a `quant: Option<String>`, passes it via the RPC to workers and via parse_quant_string to the leader's local load. - The Qwen3-Next TP load chain (ForCausalLM → Model → DecoderLayer → Attention / GatedDeltaNet / MLP) takes `quant: Option<GgmlDType>` end-to-end, calling Column/RowParallelLinear::load_with_quant. - The fused in_proj_qkv inside TpQwen3_5GatedDeltaNet is now a MaybeQuantLinear so it also picks up quantization. - parse_quant_string accepts q4_0/q4_1/q5_0/q5_1/q8_0/q8_1, q2k..q8k (with or without underscore), and f16/bf16/f32. Empty / None means no quantization. Callers from candle.rs forward spec.quant through pool.load_dense_shard. This means a `quant = "q5k"` in models.toml now flows end-to-end to a QTensor-backed QMatMul for every per-rank linear in the Qwen3-Next TP path. Leaves lm_head and the small replicated bias/log tensors in their loaded dtype (Stage 8e-3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
bef159b21c
|
feat(stage-8e-1): MaybeQuantLinear primitive + parallel-linear quant variants
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 37s
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Build neuron-blackwell (push) Successful in 3m31s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
CI / Format (push) Waiting to run
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Introduces MaybeQuantLinear, which wraps either a plain candle Linear or a candle QMatMul backed by a freshly-quantized QTensor. Forward dispatches identically through the Module trait so downstream code doesn't care which arm is active. ColumnParallelLinear and RowParallelLinear gain `load_with_quant` methods. The existing `load` methods stay as backward-compatible no-quantization wrappers — no churn at the 27 existing call sites. This is the foundation for in-situ quantization at load time. Wiring the user-facing quant config and switching call sites to load_with_quant follow in stages 8e-2 / 8e-3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
8d7b099b36
|
feat(stage-8d-7): direct safetensors fused-region loader
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
CI / Format (push) Successful in 35s
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Clippy (push) Successful in 2m18s
CI / Test (push) Successful in 4m28s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Build cortex binary (push) Successful in 4m13s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
Replaces load_fused_qkv_slice_2d/_3d with reads from a separate MmapedSafetensors handle. Each per-rank fused tensor is built by reading the three region byte-slices directly from the mmap, concatenating them host-side, and uploading as one device allocation — no full-fused-tensor device materialisation. The prior approach allocated a ~100 MB transient device tensor per linear-attention layer; on Qwen3.6-27B with 48 linear-attn layers that's ~4.8 GB of allocator churn during load — enough to fragment the cuda caching allocator on a tight-VRAM 32 GB consumer GPU, which is what triggered the layer-22 up_proj OOM seen on beast. Threading: MmapedSafetensors flows worker → ForCausalLM → Model → DecoderLayer → GatedDeltaNet::load. Both leader (mod.rs) and worker (worker.rs) construct their own mmap; Linux's page cache shares the underlying pages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
89d98d1fb2
|
diag(stage-8d-6): per-layer VRAM logging in TP load path
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 33s
CI / Clippy (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 3m59s
CI / Test (push) Successful in 4m58s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m26s
build-prerelease / Build neuron-ampere (push) Successful in 4m52s
build-prerelease / Build neuron-ada (push) Successful in 5m11s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m52s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Wraps each TpQwen3_5DecoderLayer::load in a with_context that captures free/total VRAM on failure, plus an info-level log after every layer that succeeds. Uses cudarc::driver::result::mem_get_info — same API mistralrs uses. Diagnostic only: forward path is unchanged. Helps distinguish true VRAM exhaustion from allocator fragmentation when loading large models at BF16 on 2x consumer GPUs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
cc95fe28d9
|
feat(stage-8d-5b): wire fused_gdn_gating CUDA kernel
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 1m45s
build-prerelease / Build neuron-blackwell (push) Successful in 3m40s
build-prerelease / Build cortex binary (push) Successful in 4m27s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Build neuron-ampere (push) Successful in 5m30s
build-prerelease / Build neuron-ada (push) Successful in 5m24s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m49s
CI / Format (push) Successful in 35s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m7s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
run_fused_gating helper consolidates the per-layer gating math: beta = sigmoid(b) g = -exp(a_log) * softplus(a + dt_bias) CUDA path issues a single launch via fused_gdn_gating_cuda; cpu path falls back to the original per-op Rust sequence. Replaces ~10 candle launches per linear-attention layer (sigmoid + 2× to_dtype + exp + neg + broadcast_add + softplus + 2× unsqueeze + broadcast_mul) across both single-GPU and TP forward paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
09c945f81e
|
feat(stage-8d-4): dispatch chunked_gated_delta_rule_recurrence at prefill
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 31s
CI / Format (push) Successful in 44s
CI / Clippy (push) Failing after 52s
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
run_delta_rule_cuda now picks between the per-token kernel and the BT=64 chunked variant based on seq_len. Threshold = 64 matches mistralrs. Prefill on Qwen3.6-27B (typical seq_len in the hundreds) drops from one block-launch per token to one per 64-token chunk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
05dc0bad18
|
feat(stage-8d-3): wire causal_conv1d_update/full CUDA kernels
Some checks failed
CI / Clippy (push) Waiting to run
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 38s
build-prerelease / Build cortex binary (push) Has started running
build-prerelease / Build neuron-blackwell (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Replaces the per-layer conv1d + silu sequence in both single-GPU and TP linear-attention forward paths with a shared run_causal_conv1d helper that dispatches to: - causal_conv1d_update for decode (seq_len=1 with existing conv_state) - causal_conv1d_full for prefill / fresh start (zero-pads internally) Both kernels fuse the depthwise conv + SiLU into a single launch — 4× fewer cuda launches per linear-attention layer vs the candle conv1d + candle_nn::ops::silu combo. Falls back to the original Rust path on cpu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
10c151efa5
|
feat(stage-8d-5): wire gated_delta_rule_recurrence kernel into tp_qwen3_5
Some checks failed
build-prerelease / Package cortex RPM (push) Blocked by required conditions
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m21s
build-prerelease / Build neuron-blackwell (push) Successful in 3m36s
CI / Test (push) Successful in 4m39s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m34s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
TP per-token Rust loop replaced with shared run_delta_rule dispatch from arch/qwen3_5/linear_attn.rs. Both single-GPU and TP variants now use the cuda kernel when available, per-token Rust fallback otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
44ae927e38
|
feat(stage-8d-2): wire gated_delta_rule_recurrence kernel into qwen3_5
Some checks failed
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Format (push) Successful in 38s
CI / Test (push) Failing after 45s
CI / Clippy (push) Successful in 2m16s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-blackwell (push) Has been cancelled
Replaces the per-token Rust delta-rule loop in
`arch/qwen3_5/linear_attn.rs::GatedDeltaNet::forward` with a single
dispatch to the `gated_delta_rule_recurrence` kernel imported from
mistralrs in
|
|||
|
1ebbe87651
|
feat(stage-8d-1): import mistralrs GDN CUDA kernels — build infra only
Some checks failed
build-prerelease / Build cortex binary (push) Blocked by required conditions
CI / Test (push) Waiting to run
build-prerelease / Resolve version stamps (push) Successful in 29s
CI / Format (push) Successful in 40s
CI / Clippy (push) Successful in 2m23s
build-prerelease / Build neuron-blackwell (push) Has started running
build-prerelease / Build neuron-ampere (push) Has been cancelled
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package cortex RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
CI / Build cortex SRPM (push) Has been cancelled
CI / Build neuron SRPM (push) Has been cancelled
CI / Publish cortex to COPR (push) Has been cancelled
CI / Publish neuron to COPR (push) Has been cancelled
CI / Bump version in source (push) Has been cancelled
Stage 8d (new): port the Gated DeltaNet CUDA kernels from
EricLBuehler/mistral.rs to close the ~500x decode performance gap
we measured on Qwen3.6-27B TP-2 (~12s/token in our pure-candle path
vs ~37 T/s in mistralrs on the same hardware).
This commit lays the build infrastructure with zero behavioural
change. Subsequent commits (8d-2 .. 8d-5) wire each kernel into the
qwen3_5 architecture and TP variant.
Added:
- `crates/neuron/build.rs` — uses `cudaforge::KernelBuilder` to compile
every `src/cuda/*.cu` file into `libneuroncuda.a` under the `cuda`
feature, then links it + `cudart`. Mirrors mistralrs's
`mistralrs-core/build.rs` setup verbatim (same NVCC flag set, same
sm_<80 bf16 gate).
- `crates/neuron/src/cuda/gdn.cu` — five kernels ported verbatim from
upstream:
* `gated_delta_rule_recurrence` (V-tiled per-token decode)
* `chunked_gated_delta_rule_recurrence` (BT=64 chunked prefill)
* `causal_conv1d_update` (single-token conv decode)
* `causal_conv1d_full` (multi-token conv prefill)
* `fused_gdn_gating` (beta = sigmoid(b); g = -exp(A_log) *
softplus(a + dt_bias))
- `crates/neuron/src/cuda/gdn.rs` — Rust wrappers around the kernels,
cudarc::CudaSlice::device_ptr boilerplate identical to upstream.
- `crates/neuron/src/cuda/ffi.rs` — `extern "C"` decls (subset of
upstream's ffi.rs covering only the five GDN kernels; MoE / SSM /
top-k decls land here when we absorb those too).
- `crates/neuron/src/cuda/mod.rs` — re-exports + module docs.
Cargo wiring: `cudaforge` added as an optional build-dep, activated
by the `cuda` feature. CPU build is unchanged (the `cuda/` module is
fully `#[cfg(feature = "cuda")]`). The cuda feature build inside the
patched container compiles `gdn.cu` (1 of 1 kernels) and links
clean.
Licensing: upstream files preserve their MIT origin via per-file
comment banners pointing to the mistralrs path. No behaviour-relevant
edits to the .cu kernels — local diff against upstream is just the
banner. The `.rs` wrappers and `ffi.rs` subset are also from upstream;
their structure (module path `crate::cuda::ffi::*`) matches identically
so future kernel imports drop in unchanged.
CPU clippy + 32 lib tests pass; `cargo clippy --features cuda` clean
inside the runner container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
70eb6af42b
|
feat(tp): cancellation-safe inference + structured tracing
All checks were successful
CI / Format (push) Successful in 30s
build-prerelease / Resolve version stamps (push) Successful in 35s
CI / Clippy (push) Successful in 2m14s
build-prerelease / Build neuron-blackwell (push) Successful in 3m44s
build-prerelease / Build cortex binary (push) Successful in 4m13s
CI / Test (push) Successful in 4m38s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m47s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m54s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m1s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m41s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Two changes addressing operator visibility into TP inference + the
HTTP-cancellation poisoning chain:
1. `chat_completion_tp` now runs its body inside `tokio::spawn`. When
the HTTP client disconnects (curl --max-time, browser nav, etc.)
the future returned from `chat_completion_tp` gets dropped, but
the spawned task keeps running to completion — finishing every
`pool.generate_step` / `pool.clear_kv_cache` to drain the worker
pipes. The next inference request then finds a clean pool.
Previously: dropped future left workers still processing the
in-flight request, the next call's `ClearKvCache` recv would
read the stale `GenerateStepOk` from the abandoned step ("rank N
expected KvCacheCleared, got GenerateStepOk"). The drain-on-
leader-error fix from
|
|||
|
d1a4aad91d
|
fix(tp): always drain worker responses on leader failure
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 1m6s
CI / Clippy (push) Successful in 2m56s
build-prerelease / Build neuron-blackwell (push) Successful in 3m40s
CI / Test (push) Successful in 5m1s
build-prerelease / Build cortex binary (push) Successful in 4m36s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ampere (push) Successful in 4m29s
build-prerelease / Build neuron-ada (push) Successful in 4m51s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m9s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
The TP-2 inference probe against Qwen3.6-27B surfaced:
worker rank 1 ClearKvCache: expected KvCacheCleared, got
GenerateStepOk
Caused by pipe poisoning. The previous shape of `generate_step`:
for w in workers { w.send_only(GenerateStep) } // 1. fan-out
let logits = spawn_blocking(leader.forward)??; // 2. early return on err
for w in workers { w.recv_only() } // 3. drain (skipped on 2's err)
When step 2 returned `Err` (e.g. a dtype mismatch we hadn't seen
before, an OOM, a downstream squeeze that didn't match the shape),
the function bailed before step 3 — but workers had already written
`GenerateStepOk` to their stdout pipes, since their forwards (and
the NCCL collectives inside) completed independently of the leader's
post-collective Rust-side work.
The next call (typically `ClearKvCache` at the start of the *next*
inference request) would then send a fresh request and read those
stale replies as if they were the new operation's. Once a pipe is
poisoned, every subsequent call surfaces the same shape of error
even though nothing's actually broken.
Fix: introduce two helpers in `tp/mod.rs`:
- `drain_workers(workers, check)` — reads exactly one response from
every worker regardless of individual outcomes. Returns
`Vec<String>` of `rank N: detail` strings for any non-OK reply.
- `combine_leader_workers(leader, worker_errs, op)` — folds the
leader's `Result<Result<T>>` (the spawn_blocking shape) with the
worker drain into a single `Result<T>`. Leader failure takes
precedence but worker errors get appended so both halves surface.
`generate_step` and `clear_kv_cache` now use this pattern. Worst case:
both halves fail and the operator sees a combined error message;
either way the pipes are always drained so the next call's recv
matches the request it sent.
Note: the model is still poisoned in the current state — the
operator needs to either `POST /models/unload` + reload, or
`systemctl restart neuron`, to recover. The fix prevents *future*
desync; it doesn't repair existing stale pipe state.
Stage 7c-ii crash detection was tracked as the canonical solution to
this class of issue; this is the minimum-viable subset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
95dc8745eb
|
feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m13s
build-prerelease / Build neuron-blackwell (push) Successful in 3m37s
CI / Test (push) Successful in 4m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m18s
build-prerelease / Package cortex RPM (push) Successful in 7m6s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 5m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the
Qwen3-Next architecture — plus the dispatch wiring needed to route a
load through it on both the leader and the workers.
Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for
the full-attention layers + a new pattern for linear-attention):
- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size`
V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`,
`in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along
the V-head dim. `out_proj` is row-parallel + AllReduce (the only
collective inside the block). The recurrent state shards 1:1 with
V-heads — no cross-rank sync inside the delta-rule loop.
`in_proj_qkv` and `conv1d.weight` are FUSED tensors with three
regions along dim 0 (`[first key_dim, second key_dim, value_dim]`).
Standard uniform-slicing doesn't align with the head boundaries —
rank 0 would end up with `[first half of K_0, full K_1, first half
of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full
tensor, narrow per-region per-rank, and `Tensor::cat` the three
slices into a per-rank fused weight. Transient peak of one full
tensor per layer during construction; net memory is properly per-
rank after the full drops.
- TpQwen3_5Attention: column-parallel `q_proj` (the widened
`2 * num_heads * head_dim` output, including the gate half — shards
along the head axis so both query AND gate halves stay consistent
per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce.
Otherwise mirrors `tp_qwen3.rs`'s attention.
- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types),
TpQwen3_5Model (with `model.language_model.*` prefix), and
TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).
Dispatch wiring:
- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant.
`WorkerPool::load_dense_shard` now dispatches on `model_type` from
the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two
downstream methods (`generate_step`, `clear_kv_cache`) thread this
enum through — the inner forward+clear_kv_cache dispatch happens
via the enum's pub methods. Adding another TP architecture later is
one more enum variant + match arms.
- Worker side gets a parallel `WorkerModel` enum + dispatch in
`handle_load_dense_shard`, branching on the same `model_type`.
- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`.
`TpLoadedModel.leader_model` retyped to the enum.
Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP
module reuses them rather than duplicating.
Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated
`Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial
RoPE with `partial_rotary_factor` already correct).
CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda`
also clean inside the patched runner container.
Single inflight risk to call out: tensor names. For full-attention
layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*`
and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`
— the same as the single-GPU path. lm_head sits at the top level (not
under `language_model`) — consistent with the single-GPU path that
validated against Qwen3.5-0.8B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
495d3f7c05
|
fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build neuron-ampere (push) Successful in 4m46s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m9s
The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first
inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs:
BF16`. Trace through the recurrent delta-rule loop:
let q = (q.to_dtype(F32)? * scale)?; // F32
let k = k.to_dtype(F32)?; // F32
let v = v.to_dtype(F32)?; // F32
// g built from A_log/dt_bias // F32
// beta = sigmoid(b) // BF16 (sigmoid preserves dtype)
...
let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
^^^^^^^^^^^^^ ^^^^^^^^^
F32 BF16 ← mismatch
`g` was already F32 because it was constructed from `a_log.to_dtype(F32)`
+ `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from
`sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16
and the multiplication tripped candle's dtype-mismatch check.
Promote beta to F32 at the same point we promote q/k/v.
Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on
beast — load returned 200, then `POST /v1/chat/completions` returned
the dtype error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
5c4c8e0eba
|
fix(qwen3_5): tensor names are under model.language_model.*, not model.*
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 4m50s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Qwen3-Next is a multimodal architecture whose text core sits under
`model.language_model.*` — sibling to `model.visual.*` (vision tower)
and to top-level `lm_head` / `mtp.*`. Every text-side tensor in the
safetensors files carries that prefix:
model.language_model.embed_tokens.weight
model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
model.language_model.layers.{i}.linear_attn.{in_proj_*, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
model.language_model.norm.weight
lm_head.weight (top-level; not under language_model)
The single-pre-emptive fix is in Qwen3_5Model::load — derive a
`text_vb = vb.pp("model.language_model")` once and walk
embed_tokens / layers / norm from there. `lm_head` stays at the
top-level VB; that path was already correct.
The non-text tensors (`model.visual.*`, `mtp.*`) are ignored: we
don't reference them, so the safetensors mmap is fine even though
the bytes are loaded into the address space.
After this, the load that was failing at
"cannot find tensor model.embed_tokens.weight" should proceed to
materialising the actual layer weights — where any further bugs
will be substantive architecture issues rather than naming ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
07c44d5db1
|
fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m2s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m11s
Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and
the same applies to Qwen/Qwen3.6-27B):
1. Qwen3-Next config.json does NOT have a top-level `rope_theta`.
It lives inside `rope_parameters: { rope_theta, partial_rotary_factor,
rope_type, mrope_section, mrope_interleaved }`. Our TextConfig
declared `rope_theta` as a non-optional top-level field, so the
deserializer bailed with the misleading "missing field
`rope_theta` at line 74 col 5".
Replaced with a nested `RopeParameters` struct that mirrors the
upstream shape. Defaults are conservative (rope_theta=10000,
partial_rotary_factor=1.0) so a missing or partial block degrades
to standard full-rotation RoPE rather than failing.
2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of
the 256 head_dim values get RoPE applied — the rest pass through
unchanged. Our RotaryEmbedding was building the inv_freq table
for the full head_dim and rotating everything. Silently wrong
for every full-attention layer.
`RotaryEmbedding` now derives `rotary_dim` from
`head_dim * partial_rotary_factor`, builds its cos/sin tables at
that smaller size, and in `apply()` splits q/k into (rotate, pass)
on the last dim, only `rope_slow`-rotates the rotate half, and
re-concatenates. Mirrors the reference Python's
`apply_rotary_pos_emb` exactly for the non-trivial
`partial_rotary_factor` case.
Tests updated: config-deserialise fixture uses the real `rope_parameters`
shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The
linear-attention forward-smoke test was already using full rotation
which still works; just shifted to the nested struct.
After this, the load that previously failed at "parse Qwen3-Next
(qwen3_5) config.json: missing field rope_theta" should reach the
actual safetensors materialisation step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
e7eb3dab6a
|
feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 4m58s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's
architecture). The four new modules wrap the substantive
`linear_attn.rs` (landed previously) with the rest of the
transformer:
- `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is
simplified to plain RoPE (the three position grids collapse to one
for text-only inference); uses candle's `rope_slow` for the
GLM-style rotate-half rotation.
- `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two
Qwen3-Next quirks:
- `q_proj` widened to `2 * num_heads * head_dim`; second half
sigmoid'd and multiplied into the attention output before `o_proj`.
- q_norm/k_norm use the `(1+w)*x` RmsNorm variant.
- `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on
`layer_types[i]` to either Full attention or GatedDeltaNet.
`arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer
stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The
forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's
`squeeze_to_vocab` handles either shape.
Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now
builds a `ShardedVarBuilder` instead of a plain VarBuilder. The
sharded backend falls through to the unsharded path when
`world_size=1`, so single-GPU load is zero-cost; this lets the
forthcoming `tp_qwen3_5.rs` reuse the same load functions without a
second copy.
Verified: cargo build CPU + --features cuda inside the patched
container; clippy clean on both; 32 lib tests still pass. The
ForCausalLM forward no longer bails — but numerical correctness vs
the Python reference hasn't been validated yet (that's the next
step, with the Tbilisi probe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
180274548d
|
feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m17s
build-prerelease / Build neuron-blackwell (push) Successful in 3m48s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Implements the recurrent-path Gated DeltaNet block that occupies 48 of
Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`).
Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py`
(`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`,
`Qwen3_5RMSNormGated`, `l2norm`).
Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules
- `mod.rs` — Config + (still-stub) ForCausalLM
- `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs` — Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm
Architecture pieces in this commit:
- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj
(all bias=False); depthwise causal Conv1d (k=4) with state-aware
prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b).
All computed in f32 to avoid the -inf underflow in fp16 that the
reference notes.
- Delta rule (recurrent, per-token):
state *= exp(g_t)
kv_mem = state^T · k_t
delta = (v_t - kv_mem) * beta_t
state += outer(k_t, delta)
out_t = state^T · q_t
- Output: RMSNormGated(core_attn_out, z) reshape out_proj.
State (`GatedDeltaNetState`) lives inline on the layer:
- conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the
delta-rule outer-product memory.
Cleared via `clear_kv_cache` at the start of every new request.
Config extended with the qwen3_5-specific fields:
- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads (16)
- linear_key_head_dim (128)
- linear_value_head_dim (128)
- linear_conv_kernel_dim (4)
- hidden_act ("silu")
Performance note: this is the **recurrent** delta-rule (PyTorch's
`torch_recurrent_gated_delta_rule`), correct for any seq_len but O(L)
prefill. The chunked algorithm (`torch_chunk_gated_delta_rule`,
chunk_size=64) is a follow-up perf optimisation; surface stays the
same.
8 unit tests:
- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture) — verifies shape +
no NaN/Inf propagation through the f32-promotion pipeline. Doesn't
validate numerical correctness against the Python reference; that
requires a fixed-weight fixture and is the next step.
cargo clippy CPU + --features cuda both clean; 32 lib tests pass.
The ForCausalLM stub still bails on forward — wrapping
attention/MLP/decoder layer + lm_head is the next sub-stage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
|
a70f317729
|
feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m14s
CI / Test (push) Successful in 4m29s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build cortex binary (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Lays the wiring for the top-priority TP-2 target without doing the substantive architecture work yet. After this commit, attempting to load a Qwen3.6 (`model_type = "qwen3_5"`) model: - Passes config.json parse — the real upstream shape (text_config wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round- trips through a typed Config (unit test included). - Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a ModelArch::Qwen3_5Dense variant, registers it in the loaded set. - Fails on the first inference forward with a clear "Qwen3-Next forward not implemented yet (Stage 8c, TP-2 motivator)" — the point where the real architecture work begins. New layout: - `harness/arch/` for custom architectures candle-transformers doesn't ship. Each architecture is one module: Config + ForCausalLM + impl. - `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the open work: layer_types dispatch (full_attention vs linear_attention, the latter being the hard part with no candle precedent), attn_output_gate, text_config nesting, recurrent state lifecycle. - DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a branch that constructs the stub. TP-side gate: - New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the worker pool's `load_dense_shard` reconstructs the config as Qwen3 on every rank — silently misrouting a non-Qwen3 dense load through it would surface as a cryptic per-rank deserialise error. - TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else bails *before* the worker pool spawns and NCCL handshake costs are paid, with a marker pointing at the `tp_<family>.rs` module a contributor would need to add. qwen3_5 specifically lands here until its architecture is real. The naming choice: keep "qwen3_5" from the model's own config.json rather than mistralrs's "qwen3_next" — the latter ages poorly the moment Qwen ship another architecture revision. Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate); the previously-rejecting test for qwen3_5 swapped to a fictional arch so it stays meaningful as the supported set grows. 26 lib tests pass; cargo clippy CPU + --features cuda both clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
c6022aa6b9
|
feat(stage-8b): Llama + Qwen3 MoE families on the candle harness
All checks were successful
CI / Format (push) Successful in 31s
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Clippy (push) Successful in 2m6s
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 4m54s
CI / Test (push) Successful in 4m58s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 4m43s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m0s
Broadens the single-GPU dense and quantized paths to cover three non-Qwen3 architectures already shipped by candle-transformers. TP for these is a separate stage (each family would need its own tp_*.rs mirroring tp_qwen3.rs). `ModelArch` gains four variants: - LlamaDense (boxed — wraps Llama + an inline Cache + the config it takes to rebuild the cache, since candle::llama::Cache has no reset) - LlamaQuantized (candle_transformers::models::quantized_llama) - Qwen3MoeDense (candle::models::qwen3_moe::ModelForCausalLM) - Qwen3MoeQuantized (candle::models::quantized_qwen3_moe::GGUFQWenMoE — takes an explicit compute dtype; F16 by default for best consumer-GPU throughput) The dispatch is method-based now: - `ModelArch::forward(&mut self, input, offset) -> Result<Tensor>` with a shared `squeeze_to_vocab` normalising shape differences (qwen3 returns [B,1,V]; quantized_qwen3 returns [B,V]; new families may differ again — the helper handles all of them). - `ModelArch::clear_kv_cache(&mut self) -> Result<()>`. Llama needs a Cache rebuild because its Cache has no in-place reset; the new `LlamaDense` wrapper holds the bits needed to do it. `run_inference` / `run_inference_streaming` collapse to a single dispatch path: no more per-variant match arms in the hot loop, and new architectures pick up streaming + non-streaming for free with zero changes outside `ModelArch`. DENSE_SUPPORTED_MODEL_TYPES is now ["llama", "qwen3", "qwen3_moe"]. GGUF arch switch grows "qwen3moe" + "llama" branches (qwen3moe with no underscore matches llama.cpp's general.architecture convention). Stage 8a's diagnostic auto-reports the new supported set. The `LlamaDense` variant is boxed because the wrapper's inline Cache + Config makes it 544 bytes vs ~300 for everything else (clippy::large_enum_variant). Verified: cargo test --workspace passes 66 tests; cargo clippy CPU and `--features cuda` both clean (the cuda check ran inside the locally-built `neuron-build-local` container with the math_functions.h patch applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
|
9e31d8deca
|
feat(stage-8a): pre-flight architecture check for dense model loads
Some checks failed
CI / Format (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Clippy (push) Successful in 2m21s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m50s
build-prerelease / Build cortex binary (push) Successful in 4m28s
build-prerelease / Package cortex RPM (push) Successful in 1m24s
build-prerelease / Build neuron-ada (push) Has been cancelled
build-prerelease / Package helexa-neuron-ada RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-ampere RPM (push) Has been cancelled
build-prerelease / Package helexa-neuron-blackwell RPM (push) Has been cancelled
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Has been cancelled
build-prerelease / Build neuron-ampere (push) Has been cancelled
A request to load Qwen/Qwen3.6-27B (model_type "qwen3_5") on the
dense path was failing deep inside serde with:
missing field `vocab_size` at line 140 column 1
…because Qwen3.6 wraps its actual hyperparameters under `text_config`,
so none of `qwen3::Config`'s expected top-level fields are present.
The error gave no hint that the *architecture* was the problem.
`check_dense_config_supported` parses `config.json` as an untyped
JSON Value, inspects `model_type` (with `architectures` as bonus
context), and bails cleanly when it's not in the supported set
(currently `["qwen3"]`). The error names the rejected type, the
supported set, and points at the files a contributor needs to touch
to extend coverage — both the single-process `ModelArch` variants in
`candle.rs` and the TP analogue in `tp_qwen3.rs`.
Wired into both load paths:
- `load_arch_dense` (single-GPU), before the typed deserialize.
- `load_tp`, before spawning the worker pool — TP loads of an
unsupported arch now fail before NCCL/init costs are paid.
4 unit tests cover the accept/reject/missing-field/malformed cases.
Bonus: makes Stage 8b/8c work easier — adding a new architecture is
now a `DENSE_SUPPORTED_MODEL_TYPES` edit + ModelArch variant + load
branch, with the diagnostic auto-correctly listing the supported set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|