Commit Graph

7 Commits

Author SHA1 Message Date
95dc8745eb feat(stage-8c): TP-aware Qwen3-Next (tp_qwen3_5)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 36s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m13s
build-prerelease / Build neuron-blackwell (push) Successful in 3m37s
CI / Test (push) Successful in 4m49s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m26s
build-prerelease / Build neuron-ampere (push) Successful in 5m18s
build-prerelease / Package cortex RPM (push) Successful in 7m6s
build-prerelease / Build neuron-ada (push) Successful in 5m13s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 5m39s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Adds `harness/tp/tp_qwen3_5.rs` — the tensor-parallel variant of the
Qwen3-Next architecture — plus the dispatch wiring needed to route a
load through it on both the leader and the workers.

Architecture pieces (all per-rank, follow `tp_qwen3.rs` patterns for
the full-attention layers + a new pattern for linear-attention):

- TpQwen3_5GatedDeltaNet: V-head-dim sharded. `num_v_heads / world_size`
  V-heads per rank, `num_k_heads / world_size` K-heads. `in_proj_z`,
  `in_proj_b`, `in_proj_a`, `A_log`, `dt_bias` shard uniformly along
  the V-head dim. `out_proj` is row-parallel + AllReduce (the only
  collective inside the block). The recurrent state shards 1:1 with
  V-heads — no cross-rank sync inside the delta-rule loop.

  `in_proj_qkv` and `conv1d.weight` are FUSED tensors with three
  regions along dim 0 (`[first key_dim, second key_dim, value_dim]`).
  Standard uniform-slicing doesn't align with the head boundaries —
  rank 0 would end up with `[first half of K_0, full K_1, first half
  of V]`. New `load_fused_qkv_slice_{2d,3d}` helpers load the full
  tensor, narrow per-region per-rank, and `Tensor::cat` the three
  slices into a per-rank fused weight. Transient peak of one full
  tensor per layer during construction; net memory is properly per-
  rank after the full drops.

- TpQwen3_5Attention: column-parallel `q_proj` (the widened
  `2 * num_heads * head_dim` output, including the gate half — shards
  along the head axis so both query AND gate halves stay consistent
  per rank), `k_proj`, `v_proj`; row-parallel `o_proj` with AllReduce.
  Otherwise mirrors `tp_qwen3.rs`'s attention.

- TpQwen3_5MLP, TpQwen3_5DecoderLayer (dispatches on layer_types),
  TpQwen3_5Model (with `model.language_model.*` prefix), and
  TpQwen3_5ForCausalLM (with tied or separate `lm_head` at top level).

Dispatch wiring:

- New `tp::TpLeaderModel` enum holds either Qwen3 or Qwen3_5 variant.
  `WorkerPool::load_dense_shard` now dispatches on `model_type` from
  the config JSON and returns `Arc<Mutex<TpLeaderModel>>`. The two
  downstream methods (`generate_step`, `clear_kv_cache`) thread this
  enum through — the inner forward+clear_kv_cache dispatch happens
  via the enum's pub methods. Adding another TP architecture later is
  one more enum variant + match arms.

- Worker side gets a parallel `WorkerModel` enum + dispatch in
  `handle_load_dense_shard`, branching on the same `model_type`.

- Harness gate `TP_SUPPORTED_MODEL_TYPES` now `["qwen3", "qwen3_5"]`.
  `TpLoadedModel.leader_model` retyped to the enum.

Helpers in `arch/qwen3_5/linear_attn.rs`:
- `softplus` and `repeat_interleave` made `pub(crate)` so the TP
  module reuses them rather than duplicating.

Reuses unchanged: `Qwen3_5RmsNorm` (replicated weight), the gated
`Qwen3_5RmsNormGated` tail, `l2norm`, the `RotaryEmbedding` (partial
RoPE with `partial_rotary_factor` already correct).

CPU build + clippy + 32 lib tests pass; `cargo clippy --features cuda`
also clean inside the patched runner container.

Single inflight risk to call out: tensor names. For full-attention
layers the per-layer prefix is `model.language_model.layers.<i>.self_attn.*`
and for linear-attention layers `model.language_model.layers.<i>.linear_attn.*`
— the same as the single-GPU path. lm_head sits at the top level (not
under `language_model`) — consistent with the single-GPU path that
validated against Qwen3.5-0.8B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:02:42 +03:00
495d3f7c05 fix(qwen3_5): promote beta to F32 alongside q/k/v in delta rule
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 40s
CI / Format (push) Successful in 43s
CI / Clippy (push) Successful in 2m20s
CI / Test (push) Successful in 4m33s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m19s
build-prerelease / Package cortex RPM (push) Successful in 1m25s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build neuron-ampere (push) Successful in 4m46s
build-prerelease / Build neuron-ada (push) Successful in 5m9s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m58s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m9s
The single-GPU dense load of Qwen/Qwen3.5-0.8B succeeded but the first
inference forward bombed with `dtype mismatch in mul, lhs: F32, rhs:
BF16`. Trace through the recurrent delta-rule loop:

  let q = (q.to_dtype(F32)? * scale)?;        // F32
  let k = k.to_dtype(F32)?;                    // F32
  let v = v.to_dtype(F32)?;                    // F32
  // g built from A_log/dt_bias                 // F32
  // beta = sigmoid(b)                          // BF16 (sigmoid preserves dtype)
  ...
  let delta = (v_t - kv_mem)?.broadcast_mul(&beta_col)?;
                ^^^^^^^^^^^^^                    ^^^^^^^^^
                F32                              BF16   ← mismatch

`g` was already F32 because it was constructed from `a_log.to_dtype(F32)`
+ `dt_bias.to_dtype(F32)` earlier in the function. `beta` came from
`sigmoid(b)` where `b` was the model dtype (BF16), so beta stayed BF16
and the multiplication tripped candle's dtype-mismatch check.

Promote beta to F32 at the same point we promote q/k/v.

Caught by the validate-neuron.sh probe against Qwen/Qwen3.5-0.8B on
beast — load returned 200, then `POST /v1/chat/completions` returned
the dtype error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 21:13:19 +03:00
5c4c8e0eba fix(qwen3_5): tensor names are under model.language_model.*, not model.*
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 33s
CI / Format (push) Successful in 35s
CI / Clippy (push) Successful in 2m12s
build-prerelease / Build neuron-blackwell (push) Successful in 3m49s
CI / Test (push) Successful in 4m27s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-ampere (push) Successful in 4m50s
build-prerelease / Build neuron-ada (push) Successful in 5m12s
build-prerelease / Build cortex binary (push) Successful in 4m14s
build-prerelease / Package cortex RPM (push) Successful in 1m17s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 59s
Qwen3-Next is a multimodal architecture whose text core sits under
`model.language_model.*` — sibling to `model.visual.*` (vision tower)
and to top-level `lm_head` / `mtp.*`. Every text-side tensor in the
safetensors files carries that prefix:

  model.language_model.embed_tokens.weight
  model.language_model.layers.{i}.{input,post_attention}_layernorm.weight
  model.language_model.layers.{i}.linear_attn.{in_proj_*, conv1d.weight, A_log, dt_bias, norm.weight, out_proj.weight}
  model.language_model.layers.{i}.self_attn.{q,k,v,o}_proj.weight + {q,k}_norm.weight
  model.language_model.layers.{i}.mlp.{gate,up,down}_proj.weight
  model.language_model.norm.weight
  lm_head.weight              (top-level; not under language_model)

The single-pre-emptive fix is in Qwen3_5Model::load — derive a
`text_vb = vb.pp("model.language_model")` once and walk
embed_tokens / layers / norm from there. `lm_head` stays at the
top-level VB; that path was already correct.

The non-text tensors (`model.visual.*`, `mtp.*`) are ignored: we
don't reference them, so the safetensors mmap is fine even though
the bytes are loaded into the address space.

After this, the load that was failing at
"cannot find tensor model.embed_tokens.weight" should proceed to
materialising the actual layer weights — where any further bugs
will be substantive architecture issues rather than naming ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 16:48:16 +03:00
07c44d5db1 fix(qwen3_5): nested rope_parameters + partial_rotary_factor=0.25
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 36s
CI / Clippy (push) Successful in 2m16s
CI / Test (push) Successful in 4m37s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m21s
build-prerelease / Build neuron-blackwell (push) Successful in 3m51s
build-prerelease / Package cortex RPM (push) Successful in 1m21s
build-prerelease / Build neuron-ampere (push) Successful in 5m2s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m0s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m40s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m11s
Two interlocked bugs surfaced trying to load Qwen/Qwen3.5-0.8B (and
the same applies to Qwen/Qwen3.6-27B):

1. Qwen3-Next config.json does NOT have a top-level `rope_theta`.
   It lives inside `rope_parameters: { rope_theta, partial_rotary_factor,
   rope_type, mrope_section, mrope_interleaved }`. Our TextConfig
   declared `rope_theta` as a non-optional top-level field, so the
   deserializer bailed with the misleading "missing field
   `rope_theta` at line 74 col 5".

   Replaced with a nested `RopeParameters` struct that mirrors the
   upstream shape. Defaults are conservative (rope_theta=10000,
   partial_rotary_factor=1.0) so a missing or partial block degrades
   to standard full-rotation RoPE rather than failing.

2. `partial_rotary_factor: 0.25` means only `head_dim * 0.25 = 64` of
   the 256 head_dim values get RoPE applied — the rest pass through
   unchanged. Our RotaryEmbedding was building the inv_freq table
   for the full head_dim and rotating everything. Silently wrong
   for every full-attention layer.

   `RotaryEmbedding` now derives `rotary_dim` from
   `head_dim * partial_rotary_factor`, builds its cos/sin tables at
   that smaller size, and in `apply()` splits q/k into (rotate, pass)
   on the last dim, only `rope_slow`-rotates the rotate half, and
   re-concatenates. Mirrors the reference Python's
   `apply_rotary_pos_emb` exactly for the non-trivial
   `partial_rotary_factor` case.

Tests updated: config-deserialise fixture uses the real `rope_parameters`
shape (matching the Qwen3.6-27B and Qwen3.5-0.8B configs). The
linear-attention forward-smoke test was already using full rotation
which still works; just shifted to the nested struct.

After this, the load that previously failed at "parse Qwen3-Next
(qwen3_5) config.json: missing field rope_theta" should reach the
actual safetensors materialisation step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 16:18:52 +03:00
e7eb3dab6a feat(stage-8c): full-attention layer + decoder + Model + ForCausalLM for qwen3_5
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 37s
CI / Format (push) Successful in 39s
CI / Clippy (push) Successful in 2m19s
CI / Test (push) Successful in 4m50s
build-prerelease / Build cortex binary (push) Successful in 4m21s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 3m41s
build-prerelease / Package cortex RPM (push) Successful in 1m27s
build-prerelease / Build neuron-ampere (push) Successful in 4m58s
build-prerelease / Build neuron-ada (push) Successful in 5m8s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m53s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m52s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 58s
Completes the single-GPU dense path for Qwen3-Next (Qwen3.6's
architecture). The four new modules wrap the substantive
`linear_attn.rs` (landed previously) with the rest of the
transformer:

- `arch/qwen3_5/rope.rs` — text-side rotary embedding. MRoPE is
  simplified to plain RoPE (the three position grids collapse to one
  for text-only inference); uses candle's `rope_slow` for the
  GLM-style rotate-half rotation.
- `arch/qwen3_5/mlp.rs` — Qwen3_5MLP (SwiGLU: gate/up/down, bias=False).
- `arch/qwen3_5/full_attn.rs` — Qwen3_5Attention with the two
  Qwen3-Next quirks:
  - `q_proj` widened to `2 * num_heads * head_dim`; second half
    sigmoid'd and multiplied into the attention output before `o_proj`.
  - q_norm/k_norm use the `(1+w)*x` RmsNorm variant.
- `arch/qwen3_5/decoder.rs` — Qwen3_5DecoderLayer dispatching on
  `layer_types[i]` to either Full attention or GatedDeltaNet.

`arch/qwen3_5/mod.rs` gets the real `Qwen3_5Model` (embedding + layer
stack + final norm) and `Qwen3_5ForCausalLM` (model + lm_head). The
forward returns `[B, 1, vocab]` to match `qwen3_dense`; the harness's
`squeeze_to_vocab` handles either shape.

Switch: `candle.rs::load_arch_dense` for `model_type=qwen3_5` now
builds a `ShardedVarBuilder` instead of a plain VarBuilder. The
sharded backend falls through to the unsharded path when
`world_size=1`, so single-GPU load is zero-cost; this lets the
forthcoming `tp_qwen3_5.rs` reuse the same load functions without a
second copy.

Verified: cargo build CPU + --features cuda inside the patched
container; clippy clean on both; 32 lib tests still pass. The
ForCausalLM forward no longer bails — but numerical correctness vs
the Python reference hasn't been validated yet (that's the next
step, with the Tbilisi probe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 15:52:33 +03:00
180274548d feat(stage-8c): linear-attention layer (Qwen3-Next GatedDeltaNet)
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 39s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m17s
build-prerelease / Build neuron-blackwell (push) Successful in 3m48s
CI / Test (push) Successful in 5m1s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build cortex binary (push) Successful in 4m36s
build-prerelease / Package cortex RPM (push) Successful in 1m23s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 4m39s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 2m55s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m57s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m43s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m4s
Implements the recurrent-path Gated DeltaNet block that occupies 48 of
Qwen3.6's 64 decoder layers (`layer_types[i] == "linear_attention"`).
Ported from `huggingface/transformers/models/qwen3_5/modeling_qwen3_5.py`
(`Qwen3_5GatedDeltaNet`, `torch_recurrent_gated_delta_rule`,
`Qwen3_5RMSNormGated`, `l2norm`).

Layout: `arch/qwen3_5.rs` becomes `arch/qwen3_5/` with submodules
- `mod.rs`         — Config + (still-stub) ForCausalLM
- `linear_attn.rs` — GatedDeltaNet + GatedDeltaNetState
- `rmsnorm.rs`     — Qwen3_5RmsNorm `(1+w)*x`, Qwen3_5RmsNormGated, l2norm

Architecture pieces in this commit:
- Block: in_proj_qkv + in_proj_z + in_proj_b + in_proj_a + out_proj
  (all bias=False); depthwise causal Conv1d (k=4) with state-aware
  prepend; SiLU; per-head reshape; L2norm on q,k.
- Discretisation: g = -exp(A_log) * softplus(a + dt_bias); beta = σ(b).
  All computed in f32 to avoid the -inf underflow in fp16 that the
  reference notes.
- Delta rule (recurrent, per-token):
    state *= exp(g_t)
    kv_mem = state^T · k_t
    delta  = (v_t - kv_mem) * beta_t
    state += outer(k_t, delta)
    out_t  = state^T · q_t
- Output: RMSNormGated(core_attn_out, z) reshape out_proj.

State (`GatedDeltaNetState`) lives inline on the layer:
- conv_state: (B, conv_dim, conv_kernel_size) — left-padded tail.
- recurrent_state: (B, num_v_heads, head_k_dim, head_v_dim) — the
  delta-rule outer-product memory.
Cleared via `clear_kv_cache` at the start of every new request.

Config extended with the qwen3_5-specific fields:
- linear_num_value_heads (48 in Qwen3.6-27B)
- linear_num_key_heads   (16)
- linear_key_head_dim    (128)
- linear_value_head_dim  (128)
- linear_conv_kernel_dim (4)
- hidden_act             ("silu")

Performance note: this is the **recurrent** delta-rule (PyTorch's
`torch_recurrent_gated_delta_rule`), correct for any seq_len but O(L)
prefill. The chunked algorithm (`torch_chunk_gated_delta_rule`,
chunk_size=64) is a follow-up perf optimisation; surface stays the
same.

8 unit tests:
- softplus small/large branches
- l2norm hand-calc + zero-vector stability
- repeat_interleave round-trip
- forward_smoke on tiny dims (4-head fixture) — verifies shape +
  no NaN/Inf propagation through the f32-promotion pipeline. Doesn't
  validate numerical correctness against the Python reference; that
  requires a fixed-weight fixture and is the next step.

cargo clippy CPU + --features cuda both clean; 32 lib tests pass.
The ForCausalLM stub still bails on forward — wrapping
attention/MLP/decoder layer + lm_head is the next sub-stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:29:52 +03:00
a70f317729 feat(stage-8c): scaffold qwen3_5 (Qwen3.6) — dispatch + stubs + TP gate
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 38s
CI / Clippy (push) Successful in 2m14s
CI / Test (push) Successful in 4m29s
build-prerelease / Build neuron-blackwell (push) Successful in 3m39s
build-prerelease / Build cortex binary (push) Successful in 4m17s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m31s
build-prerelease / Build neuron-ampere (push) Successful in 5m13s
build-prerelease / Build neuron-ada (push) Successful in 5m1s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m6s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m50s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m44s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m14s
Lays the wiring for the top-priority TP-2 target without doing the
substantive architecture work yet. After this commit, attempting to
load a Qwen3.6 (`model_type = "qwen3_5"`) model:
- Passes config.json parse — the real upstream shape (text_config
  wrapper, layer_types, attn_output_gate, head_dim=256, etc.) round-
  trips through a typed Config (unit test included).
- Constructs a placeholder Qwen3_5ForCausalLM, attaches it to a
  ModelArch::Qwen3_5Dense variant, registers it in the loaded set.
- Fails on the first inference forward with a clear "Qwen3-Next
  forward not implemented yet (Stage 8c, TP-2 motivator)" — the
  point where the real architecture work begins.

New layout:
- `harness/arch/` for custom architectures candle-transformers doesn't
  ship. Each architecture is one module: Config + ForCausalLM + impl.
- `harness/arch/qwen3_5.rs` — the scaffold. Heavy doc comments on the
  open work: layer_types dispatch (full_attention vs linear_attention,
  the latter being the hard part with no candle precedent),
  attn_output_gate, text_config nesting, recurrent state lifecycle.
- DENSE_SUPPORTED_MODEL_TYPES adds "qwen3_5"; load_arch_dense gains a
  branch that constructs the stub.

TP-side gate:
- New `check_tp_arch_supported`: even though Llama / Qwen3 MoE pass
  the single-GPU dense check (DENSE_SUPPORTED_MODEL_TYPES), the
  worker pool's `load_dense_shard` reconstructs the config as Qwen3
  on every rank — silently misrouting a non-Qwen3 dense load through
  it would surface as a cryptic per-rank deserialise error.
- TP_SUPPORTED_MODEL_TYPES = ["qwen3"] (cuda-gated). Anything else
  bails *before* the worker pool spawns and NCCL handshake costs are
  paid, with a marker pointing at the `tp_<family>.rs` module a
  contributor would need to add. qwen3_5 specifically lands here
  until its architecture is real.

The naming choice: keep "qwen3_5" from the model's own config.json
rather than mistralrs's "qwen3_next" — the latter ages poorly the
moment Qwen ship another architecture revision.

Unit tests: 2 new for qwen3_5 (config deserialise + dispatch gate);
the previously-rejecting test for qwen3_5 swapped to a fictional
arch so it stays meaningful as the supported set grows. 26 lib tests
pass; cargo clippy CPU + --features cuda both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:58:01 +03:00