Mirror Stage 3 into the tensor-parallel Qwen3.6 model:

- TpQwen3_5Attention / DecoderLayer take (cos, sin) instead of a scalar
  offset and apply via apply_cos_sin.
- TpQwen3_5Model gains the replicated rotary + rope_delta (reset in
  clear_kv_cache, settable). forward_inner builds the cos/sin once:
  interleaved M-RoPE from explicit position_ids (vision) or plain at
  offset+rope_delta (text/decode). forward() and forward_with_positions()
  delegate; the old single-shot forward_with_vision is gone.
- prefill_with_images_chunked now computes get_rope_index over the whole
  prompt once, stores rope_delta on the base model, and slices the
  (3, prompt_len) position tensor per chunk, so every rank assigns image
  tokens their 14×14 grid coordinates and steps in lockstep (every chunk,
  text or image, carries the M-RoPE slice because the image shifts the
  surrounding text positions).

Also build the position-id tensor as f32 directly (positions are small
integers, exact in f32) to avoid an i64→f32 cast on the GPU.

The TP forward is cuda-gated; CI CUDA type-check is the compile gate.
Non-cuda build + clippy + full workspace tests green; rope math + the
plain-RoPE-reduction invariant covered by unit tests.

Completes the interleaved-M-RoPE work for the vision spatial misread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
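
Illustrative sketch of the position bookkeeping described above (not the code in this change). It assumes the Qwen2-VL-style convention, where text tokens share one running index on all three axes, image tokens keep a constant temporal index while the height and width axes count rows and columns, and the text after an image resumes at the image start plus max(h, w). `Segment`, `rope_index` and `chunk_positions` are hypothetical names invented for this example; the real get_rope_index operates on token ids and tensors. Positions are stored as f32 directly, mirroring the note that small integers are exact in f32.

```rust
/// One run of a prompt: `Text(n)` is n text tokens, `Image` is one image's
/// placeholder tokens laid out as h rows by w columns (hypothetical type).
#[derive(Clone, Copy)]
enum Segment {
    Text(usize),
    Image { h: usize, w: usize },
}

/// Builds the (3, prompt_len) position rows [t, h, w] and the rope_delta,
/// i.e. the amount to add to a plain token offset during later decode steps.
fn rope_index(segments: &[Segment]) -> ([Vec<f32>; 3], i64) {
    let mut pos: [Vec<f32>; 3] = [Vec::new(), Vec::new(), Vec::new()];
    let mut next = 0usize; // next free position index
    for seg in segments {
        match *seg {
            Segment::Text(n) => {
                // Text: all three axes carry the same running index.
                for i in 0..n {
                    let p = (next + i) as f32;
                    for axis in pos.iter_mut() {
                        axis.push(p);
                    }
                }
                next += n;
            }
            Segment::Image { h, w } => {
                // Image: constant t, row index on the h axis, column on the w axis.
                for r in 0..h {
                    for c in 0..w {
                        pos[0].push(next as f32);
                        pos[1].push((next + r) as f32);
                        pos[2].push((next + c) as f32);
                    }
                }
                // The image consumes max(h, w) positions, not h * w.
                next += h.max(w);
            }
        }
    }
    let prompt_len = pos[0].len() as i64;
    (pos, next as i64 - prompt_len)
}

/// Slices the precomputed positions into per-chunk (3, chunk_len) pieces,
/// so each prefill chunk sees positions computed over the whole prompt.
fn chunk_positions(pos: &[Vec<f32>; 3], chunk: usize) -> Vec<[Vec<f32>; 3]> {
    assert!(chunk > 0, "chunk size must be non-zero");
    let len = pos[0].len();
    (0..len)
        .step_by(chunk)
        .map(|start| {
            let end = (start + chunk).min(len);
            [
                pos[0][start..end].to_vec(),
                pos[1][start..end].to_vec(),
                pos[2][start..end].to_vec(),
            ]
        })
        .collect()
}

fn main() {
    // 4 text tokens, one 14x14 image, 3 text tokens.
    let prompt = [
        Segment::Text(4),
        Segment::Image { h: 14, w: 14 },
        Segment::Text(3),
    ];
    let (pos, rope_delta) = rope_index(&prompt);
    let chunks = chunk_positions(&pos, 64);
    println!(
        "prompt_len={} rope_delta={} chunks={}",
        pos[0].len(),
        rope_delta,
        chunks.len()
    );
}
```

In this toy model a text-only prompt yields three identical rows and a rope_delta of 0, which is the plain-RoPE reduction the unit tests are described as covering. It also shows why text chunks after an image still need the M-RoPE slice: their positions are shifted by the image that precedes them.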