cortex

Author	SHA1	Message	Date
rob thijssen	95dc8745eb	feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)	2026-05-20 22:02:42 +03:00

Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the Qwen3-Next architecture — plus the dispatch wiring needed to route a load through it on both the leader and the workers.

Architecture pieces (all per-rank; they follow the `tp_qwen3.rs` patterns for the full-attention layers, plus a new pattern for linear attention):

- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size` V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`, `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along the V-head dim. `out_proj` is row-parallel + AllReduce (the only collective inside the block). The recurrent state shards 1:1 with V-heads — no cross-rank sync inside the delta-rule loop.

  `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three regions along dim 0 (`[first key_dim, second key_dim, value_dim]`). Standard uniform slicing doesn't align with the head boundaries — rank 0 would end up with `[first half of K_0, full K_1, first half of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full tensor, narrow per-region per-rank, and `Tensor::cat` the three slices into a per-rank fused weight. Transient peak of one full tensor per layer during construction; net memory is properly per-rank after the full tensor drops.

- TpQwen3_5Attention: column-parallel `q_proj` (the widened `2 * num_heads * head_dim` output, including the gate half — shards along the head axis so both query AND gate halves stay consistent per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce. Otherwise mirrors `tp_qwen3.rs`'s attention.

- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types), TpQwen3_5Model (with `model.language_model.` prefix), and TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).

Dispatch wiring:

- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant. `WorkerPool::load_dense_shard` now dispatches on `model_type` from the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two downstream methods (`generate_step`, `clear_kv_cache`) thread this enum through — the inner forward+clear_kv_cache dispatch happens via the enum's pub methods. Adding another TP architecture later is one more enum variant + match arms.
- Worker side gets a parallel `WorkerModel` enum + dispatch in `handle_load_dense_shard`, branching on the same `model_type`.
- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`. `TpLoadedModel.leader_model` retyped to the enum.

Helpers in `arch/qwen3_5/linear_attn.rs`:

- `softplus` and `repeat_interleave` made `pub(crate)` so the TP module reuses them rather than duplicating.

Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated `Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial RoPE with `partial_rotary_factor` already correct).

CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda` also clean inside the patched runner container.

Single in-flight risk to call out: tensor names. For full-attention layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*` and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*` — the same as the single-GPU path. lm_head sits at the top level (not under `language_model`) — consistent with the single-GPU path that validated against Qwen3.5-0.8B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
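
The per-region slicing described for `in_proj_qkv` and `conv1d.weight` can be sketched with plain index arithmetic. This is a minimal illustration, not the `load_fused_qkv_slice_{2d,3d}` helpers themselves: `RowRange`, `fused_qkv_ranges` and `uniform_range` are hypothetical names, and the sketch assumes `key_dim` and `value_dim` are whole multiples of `world_size` with head boundaries falling on the cut points.

```rust
/// Half-open row range `[start, start + len)` along dim 0 of a fused tensor.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowRange {
    pub start: usize,
    pub len: usize,
}

/// Rows a rank would narrow out of a tensor fused along dim 0 as
/// `[first key_dim, second key_dim, value_dim]`, one range per region.
/// Returns `None` when the inputs cannot be split evenly (assumed policy).
pub fn fused_qkv_ranges(
    key_dim: usize,
    value_dim: usize,
    world_size: usize,
    rank: usize,
) -> Option<[RowRange; 3]> {
    if world_size == 0
        || rank >= world_size
        || key_dim % world_size != 0
        || value_dim % world_size != 0
    {
        return None;
    }
    let k_len = key_dim / world_size;
    let v_len = value_dim / world_size;
    Some([
        // Region 0: first key_dim block.
        RowRange { start: rank * k_len, len: k_len },
        // Region 1: second key_dim block, offset past region 0.
        RowRange { start: key_dim + rank * k_len, len: k_len },
        // Region 2: value_dim block, offset past both key_dim blocks.
        RowRange { start: 2 * key_dim + rank * v_len, len: v_len },
    ])
}

/// Naive uniform slice over the whole fused dim, shown for comparison:
/// it ignores region boundaries entirely.
pub fn uniform_range(total: usize, world_size: usize, rank: usize) -> RowRange {
    let len = total / world_size;
    RowRange { start: rank * len, len }
}
```

With `key_dim = 2048`, `value_dim = 6144` and `world_size = 2`, each rank gets three disjoint ranges totalling 5120 rows, whereas the uniform slice hands rank 0 one contiguous run of 5120 rows that crosses region boundaries. Concatenating the three narrowed pieces in region order gives the per-rank fused weight.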
rob thijssen	495d3f7c05	fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule	2026-05-20 21:13:19 +03:00

The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs: BF16`. Trace through the recurrent delta-rule loop:

    let q = (q.to_dtype(F32)? * scale)?;   // F32
    let k = k.to_dtype(F32)?;              // F32
    let v = v.to_dtype(F32)?;              // F32
    // g built from A_log/dt_bias          // F32
    // beta = sigmoid(b)                   // BF16 (sigmoid preserves dtype)
    ...
    let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
                ^^^^^^^^^^^^^                 ^^^^^^^^^
                F32                           BF16  ← mismatch

`g` was already F32 because it was constructed from `a_log.to_dtype(F32)` + `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from `sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16 and the multiplication tripped candle's dtype-mismatch check. Promote beta to F32 at the same point we promote q/k/v.

Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on beast — load returned 200, then `POST /v1/chat/completions` returned the dtype error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rob thijssen	5c4c8e0eba	fix(qwen3_5): tensor names are under `model.language_model.`, not `model.`	2026-05-20 16:48:16 +03:00

Qwen3-Next is a multimodal architecture whose text core sits under `model.language_model.` — sibling to `model.visual.` (vision tower) and to top-level `lm_head` / `mtp.`. Every text-side tensor in the safetensors files carries that prefix:

    model.language_model.embed_tokens.weight
    model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
    model.language_model.layers.{i}.linear_attn.{in_proj_*, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
    model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
    model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
    model.language_model.norm.weight
    lm_head.weight    (top-level; not under language_model)

The single pre-emptive fix is in Qwen3_5Model::load — derive a `text_vb = vb.pp("model.language_model")` once and walk embed_tokens / layers / norm from there. `lm_head` stays at the top-level VB; that path was already correct.

The non-text tensors (`model.visual.`, `mtp.`) are ignored: we don't reference them, so the safetensors mmap is fine even though the bytes are mapped into the address space.

After this, the load that was failing at "cannot find tensor model.embed_tokens.weight" should proceed to materialising the actual layer weights — where any further bugs will be substantive architecture issues rather than naming ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
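
The naming rule in this commit (text tensors under `model.language_model.`, `lm_head` at the top level, vision and MTP tensors ignored) can be expressed as two small string helpers. This is a sketch under stated assumptions: `text_tensor_name` and `is_referenced_tensor` are hypothetical names, not functions from the repository, and the real loader walks a VarBuilder rather than building strings.

```rust
/// Prefix that every text-side tensor carries in the checkpoint.
const TEXT_PREFIX: &str = "model.language_model";

/// Map a logical text-side name such as `layers.0.mlp.up_proj.weight`
/// to the name stored in the checkpoint. `lm_head.*` stays top-level.
/// Hypothetical helper for illustration.
pub fn text_tensor_name(logical: &str) -> String {
    if logical.starts_with("lm_head.") {
        logical.to_string()
    } else {
        format!("{TEXT_PREFIX}.{logical}")
    }
}

/// True for tensors the text-only path references; false for the
/// vision tower (`model.visual.`), `mtp.`, and anything else.
pub fn is_referenced_tensor(stored: &str) -> bool {
    stored.starts_with("model.language_model.") || stored.starts_with("lm_head.")
}
```

For example, `text_tensor_name("embed_tokens.weight")` gives `model.language_model.embed_tokens.weight`, which is the name the failing load was missing when it looked for `model.embed_tokens.weight`.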
rob thijssen	07c44d5db1	fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25	2026-05-20 16:18:52 +03:00

Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and the same applies to Qwen/Qwen3.6-27B):

1. Qwen3-Next config.json does NOT have a top-level `rope_theta`. It lives inside `rope_parameters: { rope_theta, partial_rotary_factor, rope_type, mrope_section, mrope_interleaved }`. Our TextConfig declared `rope_theta` as a non-optional top-level field, so the deserializer bailed with the misleading "missing field `rope_theta` at line 74 col 5". Replaced with a nested `RopeParameters` struct that mirrors the upstream shape. Defaults are conservative (rope_theta=10000, partial_rotary_factor=1.0) so a missing or partial block degrades to standard full-rotation RoPE rather than failing.

2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of the 256 head_dim values get RoPE applied — the rest pass through unchanged. Our RotaryEmbedding was building the inv_freq table for the full head_dim and rotating everything. Silently wrong for every full-attention layer.

   `RotaryEmbedding` now derives `rotary_dim` from `head_dim * partial_rotary_factor`, builds its cos/sin tables at that smaller size, and in `apply()` splits q/k into (rotate, pass) on the last dim, only `rope_slow`-rotates the rotate half, and re-concatenates. Mirrors the reference Python's `apply_rotary_pos_emb` exactly for the non-trivial `partial_rotary_factor` case.

Tests updated: config-deserialise fixture uses the real `rope_parameters` shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The linear-attention forward-smoke test was already using full rotation which still works; just shifted to the nested struct.

After this, the load that previously failed at "parse Qwen3-Next (qwen3_5) config.json: missing field rope_theta" should reach the actual safetensors materialisation step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
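
The partial-rotation behaviour from item 2 can be sketched on a single head vector in plain Rust. This is an illustration only, assuming the rotate-half pairing (element `i` paired with element `i + rotary_dim / 2`) and the usual `rope_theta^(-2i / rotary_dim)` inverse frequencies; `rotary_dim` and `apply_partial_rope` are hypothetical free functions, not the repository's `RotaryEmbedding` API, and no candle calls are used.

```rust
/// Number of leading head_dim values that receive RoPE,
/// e.g. 256 * 0.25 = 64.
pub fn rotary_dim(head_dim: usize, partial_rotary_factor: f64) -> usize {
    (head_dim as f64 * partial_rotary_factor) as usize
}

/// Rotate the first `rotary_dim` values of one head vector for the given
/// position and leave the remaining values untouched (the "pass" part).
pub fn apply_partial_rope(
    x: &[f32],
    rotary_dim: usize,
    position: usize,
    rope_theta: f32,
) -> Vec<f32> {
    assert!(rotary_dim <= x.len() && rotary_dim % 2 == 0);
    let half = rotary_dim / 2;
    // Start from a copy so the pass-through tail is already in place.
    let mut out = x.to_vec();
    for i in 0..half {
        let inv_freq = rope_theta.powf(-(2.0 * i as f32) / rotary_dim as f32);
        let (sin, cos) = (position as f32 * inv_freq).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        // Rotate-half form: out = x * cos + rotate_half(x) * sin.
        out[i] = a * cos - b * sin;
        out[i + half] = b * cos + a * sin;
    }
    out
}
```

With `partial_rotary_factor = 1.0` the same function rotates the whole vector, which matches the conservative default described in item 1.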
rob thijssen	e7eb3dab6a	feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5	2026-05-20 15:52:33 +03:00

Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's architecture). The four new modules wrap the substantive `linear_attn.rs` (landed previously) with the rest of the transformer:

- `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is simplified to plain RoPE (the three position grids collapse to one for text-only inference); uses candle's `rope_slow` for the GLM-style rotate-half rotation.
- `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two Qwen3-Next quirks:
  - `q_proj` widened to `2 * num_heads * head_dim`; second half sigmoid'd and multiplied into the attention output before `o_proj`.
  - q_norm/k_norm use the `(1+w)*x` RmsNorm variant.
- `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on `layer_types[i]` to either Full attention or GatedDeltaNet.

`arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's `squeeze_to_vocab` handles either shape.

Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now builds a `ShardedVarBuilder` instead of a plain VarBuilder. The sharded backend falls through to the unsharded path when `world_size=1`, so single-GPU load is zero-cost; this lets the forthcoming `tp_qwen3_5.rs` reuse the same load functions without a second copy.

Verified: cargo build CPU + --features cuda inside the patched container; clippy clean on both; 32 lib tests still pass. The ForCausalLM forward no longer bails — but numerical correctness vs the Python reference hasn't been validated yet (that's the next step, with the Tbilisi probe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
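
The two full-attention quirks named above can be sketched on plain slices. This is a minimal illustration, not the code in `full_attn.rs`: the function names are hypothetical, `eps` placement inside the square root is an assumption, and the sketch does not model how the widened `q_proj` output is split into its query and gate halves.

```rust
/// RmsNorm variant that scales by `(1 + w)` instead of `w`, so a zero
/// weight vector leaves the normalised input unchanged.
pub fn rms_norm_one_plus_w(x: &[f32], w: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty());
    assert_eq!(x.len(), w.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(w)
        .map(|(v, w)| v * inv_rms * (1.0 + w))
        .collect()
}

/// Logistic sigmoid.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Output gate: multiply the attention output elementwise by
/// sigmoid(gate) before it would be fed to `o_proj`.
pub fn gate_attention_output(attn_out: &[f32], gate: &[f32]) -> Vec<f32> {
    assert_eq!(attn_out.len(), gate.len());
    attn_out
        .iter()
        .zip(gate)
        .map(|(o, g)| o * sigmoid(*g))
        .collect()
}
```

A gate value of 0 halves the corresponding attention output, and a large positive gate lets it through almost unchanged.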
rob thijssen	180274548d	feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet)	2026-05-20 09:29:52 +03:00

Implements the recurrent-path Gated DeltaNet block that occupies 48 of Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`). Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py` (`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`, `Qwen3_5RMSNormGated`, `l2norm`).

Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules

- `mod.rs` — Config + (still-stub) ForCausalLM
- `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs` — Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm

Architecture pieces in this commit:

- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj (all bias=False); depthwise causal Conv1d (k=4) with state-aware prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b). All computed in f32 to avoid the -inf underflow in fp16 that the reference notes.
- Delta rule (recurrent, per-token):

      state  *= exp(g_t)
      kv_mem  = state^T · k_t
      delta   = (v_t - kv_mem) * beta_t
      state  += outer(k_t, delta)
      out_t   = state^T · q_t

- Output: RMSNormGated(core_attn_out, z) → reshape → out_proj.

State (`GatedDeltaNetState`) lives inline on the layer:

- conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the delta-rule outer-product memory.

Cleared via `clear_kv_cache` at the start of every new request.

Config extended with the qwen3_5-specific fields:

- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads (16)
- linear_key_head_dim (128)
- linear_value_head_dim (128)
- linear_conv_kernel_dim (4)
- hidden_act ("silu")

Performance note: this is the recurrent delta-rule (PyTorch's `torch_recurrent_gated_delta_rule`), correct for any seq_len but O(L) prefill. The chunked algorithm (`torch_chunk_gated_delta_rule`, chunk_size=64) is a follow-up perf optimisation; surface stays the same.

8 unit tests:

- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture) — verifies shape + no NaN/Inf propagation through the f32-promotion pipeline. Doesn't validate numerical correctness against the Python reference; that requires a fixed-weight fixture and is the next step.

cargo clippy CPU + --features cuda both clean; 32 lib tests pass. The ForCausalLM stub still bails on forward — wrapping attention/MLP/decoder layer + lm_head is the next sub-stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
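
The discretisation and the per-token delta rule from this commit can be written out for a single head in plain Rust. This is a sketch for reading along, not the candle implementation in `linear_attn.rs`: the function names are hypothetical, the state is a `head_k_dim x head_v_dim` matrix stored as rows, and the softplus cut-over at 20.0 is an assumption borrowed from PyTorch's default threshold.

```rust
/// softplus(x) = ln(1 + e^x), with a linear branch for large x to avoid
/// overflow. The 20.0 threshold is an assumed value.
pub fn softplus(x: f32) -> f32 {
    if x > 20.0 { x } else { x.exp().ln_1p() }
}

/// Log-decay g = -exp(A_log) * softplus(a + dt_bias); always <= 0.
pub fn log_decay(a_log: f32, a: f32, dt_bias: f32) -> f32 {
    -a_log.exp() * softplus(a + dt_bias)
}

/// One recurrent delta-rule step for one head.
/// `state` has `k.len()` rows of `v.len()` columns and is updated in place.
pub fn delta_rule_step(
    state: &mut [Vec<f32>],
    q: &[f32],
    k: &[f32],
    v: &[f32],
    g: f32,
    beta: f32,
) -> Vec<f32> {
    let dv = v.len();
    // state *= exp(g_t)
    let decay = g.exp();
    for row in state.iter_mut() {
        for s in row.iter_mut() {
            *s *= decay;
        }
    }
    // kv_mem = state^T . k_t
    let mut kv_mem = vec![0.0_f32; dv];
    for (row, &ki) in state.iter().zip(k) {
        for j in 0..dv {
            kv_mem[j] += row[j] * ki;
        }
    }
    // delta = (v_t - kv_mem) * beta_t
    let delta: Vec<f32> = v.iter().zip(&kv_mem).map(|(vj, m)| (vj - m) * beta).collect();
    // state += outer(k_t, delta)
    for (row, &ki) in state.iter_mut().zip(k) {
        for j in 0..dv {
            row[j] += ki * delta[j];
        }
    }
    // out_t = state^T . q_t
    let mut out = vec![0.0_f32; dv];
    for (row, &qi) in state.iter().zip(q) {
        for j in 0..dv {
            out[j] += row[j] * qi;
        }
    }
    out
}
```

With a unit-norm key, `beta = 1` and `g = 0`, writing a second value under the same key replaces the first one, which is the "delta" behaviour the rule is named for. Each token depends on the state left by the previous one, so prefill runs one step per token.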
rob thijssen	a70f317729	feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate	2026-05-20 08:58:01 +03:00

Lays the wiring for the top-priority TP-2 target without doing the substantive architecture work yet. After this commit, attempting to load a Qwen3.6 (`model_type = "qwen3_5"`) model:

- Passes config.json parse — the real upstream shape (text_config wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round-trips through a typed Config (unit test included).
- Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a ModelArch::Qwen3_5Dense variant, registers it in the loaded set.
- Fails on the first inference forward with a clear "Qwen3-Next forward not implemented yet (Stage 8c, TP-2 motivator)" — the point where the real architecture work begins.

New layout:

- `harness/arch/` for custom architectures candle-transformers doesn't ship. Each architecture is one module: Config + ForCausalLM + impl.
- `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the open work: layer_types dispatch (full_attention vs linear_attention, the latter being the hard part with no candle precedent), attn_output_gate, text_config nesting, recurrent state lifecycle.
- DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a branch that constructs the stub.

TP-side gate:

- New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the worker pool's `load_dense_shard` reconstructs the config as Qwen3 on every rank — silently misrouting a non-Qwen3 dense load through it would surface as a cryptic per-rank deserialise error.
- TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else bails before the worker pool spawns and NCCL handshake costs are paid, with a marker pointing at the `tp_<family>.rs` module a contributor would need to add. qwen3_5 specifically lands here until its architecture is real.

The naming choice: keep "qwen3_5" from the model's own config.json rather than mistralrs's "qwen3_next" — the latter ages poorly the moment Qwen ship another architecture revision.

Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate); the previously-rejecting test for qwen3_5 swapped to a fictional arch so it stays meaningful as the supported set grows. 26 lib tests pass; cargo clippy CPU + --features cuda both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
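
The TP-side gate described in this commit amounts to an allow-list check that runs before any worker is spawned. A minimal sketch follows, assuming a `Result<(), String>` return type and an error wording of my own; only the names `check_tp_arch_supported` and `TP_SUPPORTED_MODEL_TYPES` and the `["qwen3"]` value come from the commit message, and the real function's signature may differ.

```rust
/// Architectures with a tensor-parallel implementation, as of this commit.
pub const TP_SUPPORTED_MODEL_TYPES: &[&str] = &["qwen3"];

/// Reject a tensor-parallel load early when no `tp_<family>.rs` module
/// exists for the model type. Return type and message are assumptions.
pub fn check_tp_arch_supported(model_type: &str) -> Result<(), String> {
    if TP_SUPPORTED_MODEL_TYPES.iter().any(|t| *t == model_type) {
        return Ok(());
    }
    Err(format!(
        "tensor-parallel load is not supported for model_type `{}` \
         (supported: {:?}); a `tp_{}.rs` module would need to be added",
        model_type, TP_SUPPORTED_MODEL_TYPES, model_type
    ))
}
```

Calling it with `"qwen3_5"` at this point in the history yields an error that names `tp_qwen3_5.rs`, so the failure happens before the worker pool and the NCCL handshake rather than as a per-rank deserialise error.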

7 Commits