cortex/doc/vision-qwen3_6-spec.md
rob thijssen 7df84fed8f
feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6
Stage A of the vision implementation plan
(doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding
that today's silent-drop failure mode (issue #3) needs — the
Qwen3.6 ViT loads from `model.visual.*`, runs forward producing
post-merger LM-side image embeddings, and routes through the
device worker via a new `Job::EncodeImage`. No LM splice yet —
that's Stage B.

Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision),
#13 (27B production deploy), #14 (dynamic resolution), #15
(numerical validation).

What landed:

- **A0 — investigation**: pulled config.json, preprocessor_config.json,
  chat_template.jinja, and safetensors index from beast's local
  Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with
  exact tensor shapes for every `model.visual.*` weight. Confirms
  27-block ViT with `hidden_size=1152`, `patch_size=16`,
  `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives
  in 2 of the 15 safetensors shards.

- **A1 — deps + scaffolding**: added `image = "0.25"` (default-
  features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to
  crates/neuron/Cargo.toml. Created `harness::preprocess` and
  `harness::arch::qwen3_5::vision` modules.

- **A2 — preprocess.rs**: `decode_data_uri` strips
  `data:image/...;base64,...` → image bytes → `image::DynamicImage`
  (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess`
  resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448),
  normalises to `[-1, 1]` per the model's mean/std=0.5, emits
  row-major `(3, H, W)` f32. 9 unit tests covering data URI parse,
  decode failure paths, grayscale-to-RGB promotion, and the
  exact-value normalisation contract.

- **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`,
  learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN +
  multi-head self-attention with fused QKV + GELU-tanh MLP +
  residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat →
  linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size).
  Includes the Conv3d→Conv2d fold trick documented at the top of
  the file — the published patch_embed.proj.weight is 5D
  `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static
  images we sum-collapse the temporal axis. Video would need real
  Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh`
  reference values from PyTorch.

- **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config`
  with optional `vision_config: Option<VisionConfig>` and
  `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision
  tower when present, exposes `has_vision()` and `vision()` so the
  HTTP layer can advertise capability and so the encode path can
  reach it.

- **A5 — device worker `Job::EncodeImage`**: new job variant carrying
  CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the
  tensor on the worker's device, calls `arch.encode_image(image)`,
  copies the result back to CPU as flat `Vec<f32>`. Keeps the
  "tensors don't escape the worker" invariant. Poisoned-worker
  drain path handles the new variant.

- **A6 — dispatch round-trip test**: `encode_image_routes_to_dispatch_
  and_errors_on_unknown_handle` proves the channel/dispatch wiring
  works end-to-end via the CPU device worker (errors on unknown
  ArchHandle, which is the expected behaviour without a loaded
  model — real-weights validation happens in Stage B when the LM
  splice path exists).

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
zero failures). New test counts: +9 in preprocess, +5 in vision,
+1 in device_worker.

Out of scope (deferred):
- LM-side splice of image embeddings at `<|image_pad|>` positions
  → Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models /
  advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution
  (#14), numerical validation (#15).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 11:40:47 +03:00


Qwen3.6-27B vision specification (Stage A0)

Sourced from beast's local cache on 2026-06-01: /archive3/llm-cache/models--Qwen--Qwen3.6-27B/snapshots/6a9e13bd6fc8f0983b9b99948120bc37f49c13e9/.

Single source of truth for Stages A–D of the vision plan in ~/.claude/plans/foamy-twirling-catmull.md. Umbrella issue: #3.


Top-level shape

The model is a unified text+vision architecture (Qwen3_5ForConditionalGeneration, model_type: qwen3_5) with three weight sections (language model, vision tower, MTP heads) plus a separate lm_head.weight, all under a single safetensors index. Tensor counts from model.safetensors.index.json:

| Prefix | Tensors | Role |
|---|---|---|
| model.language_model.* | 850 | LM (currently loaded) |
| model.visual.* | 333 | Vision tower (currently filtered out at arch/qwen3_5/mod.rs:228-230) |
| mtp.* | 15 | Multi-token-prediction heads (filtered, out of scope) |
| lm_head.weight | 1 | LM head |

Vision tensors live in shards model-00007-of-00015.safetensors and model-00008-of-00015.safetensors (2 of the 15 shards), so a vision-tower-only smoke test can load just these two files.
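
The shard list for a prefix can be derived from the index rather than hard-coded. The sketch below assumes the `weight_map` object of model.safetensors.index.json (tensor name to shard file) has already been parsed into a map; `shards_for_prefix` is a hypothetical helper, not an existing function in the crate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Collect the distinct shard files that hold tensors under `prefix`.
/// `weight_map` mirrors the "weight_map" object of the safetensors index
/// (tensor name -> shard file name); JSON parsing is left out of this sketch.
fn shards_for_prefix(weight_map: &BTreeMap<String, String>, prefix: &str) -> BTreeSet<String> {
    weight_map
        .iter()
        .filter(|(name, _)| name.starts_with(prefix))
        .map(|(_, shard)| shard.clone())
        .collect()
}
```

Calling it with the prefix `"model.visual."` on the real index should return the two shard names listed above.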

Vision tower architecture (model.visual.*)

From config.json::vision_config:

depth:                       27   (transformer blocks)
hidden_size:               1152   (vision token dim)
num_heads:                   16   (per-block self-attention)
intermediate_size:         4304   (MLP hidden)
patch_size:                  16   (16×16 spatial patches)
temporal_patch_size:          2   (video frame pairing; irrelevant for stills)
spatial_merge_size:           2   (2×2 spatial merge in the merger → 4 patches/LM token)
num_position_embeddings:   2304   (learned pos embed slots — max patch sequence length)
in_channels:                  3   (RGB)
hidden_act:    gelu_pytorch_tanh  (GELU with tanh approximation, not exact GELU)
out_hidden_size:           5120   (= LM hidden_size, merger output dim)
deepstack_visual_indexes:    []   (no deep-stack visual indexes)

Module inventory (per-block and global)

Global:

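  • model.visual.patch_embed.proj.{weight, bias} — Patch embedding that turns image patches into tokens. The published weight is 5D, (1152, 3, 2, 16, 16), i.e. a Conv3d kernel with temporal_patch_size = 2; for static images the temporal axis is sum-collapsed so it runs as a Conv2d (3 → 1152, kernel 16×16, stride 16).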
  • model.visual.pos_embed.weight — Learned position embedding, shape (2304, 1152).
  • model.visual.merger.{norm, linear_fc1, linear_fc2} — The projector that merges 2×2 patches and projects to LM hidden_size (1152 → 5120). All weights have biases.
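
The commit notes describe the published patch_embed.proj.weight as 5D, (1152, 3, 2, 16, 16), folded to a Conv2d kernel by summing over the temporal axis. A minimal sketch of that fold on a flat row-major buffer follows; `fold_temporal_axis` is a hypothetical name. The fold is exact only if a still image is presented as identical frames along the temporal axis, which is an assumption about the reference pipeline rather than something this document states.

```rust
/// Fold a Conv3d kernel stored row-major as (out_c, in_c, t, kh, kw) into a
/// Conv2d kernel (out_c, in_c, kh, kw) by summing over the temporal axis.
fn fold_temporal_axis(
    w: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(w.len(), out_c * in_c * t * kh * kw, "weight length does not match shape");
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    // `oc` walks the flattened (out_c, in_c) pairs; each pair owns `t` planes.
    for oc in 0..out_c * in_c {
        for ti in 0..t {
            let src = (oc * t + ti) * plane;
            let dst = oc * plane;
            for i in 0..plane {
                folded[dst + i] += w[src + i];
            }
        }
    }
    folded
}
```

Summing works because a convolution over identical frames applies each temporal slice of the kernel to the same pixels, so the slices can be added once up front.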

Per block (×27, named model.visual.blocks.{0..26}):

  • norm1.{weight, bias} — LayerNorm before attention (with bias, so not RmsNorm).
  • attn.qkv.{weight, bias} — Fused QKV linear (1152 → 3·1152 = 3456).
  • attn.proj.{weight, bias} — Attention output projection (1152 → 1152).
  • norm2.{weight, bias} — LayerNorm before MLP.
  • mlp.linear_fc1.{weight, bias} — MLP up-projection (1152 → 4304).
  • mlp.linear_fc2.{weight, bias} — MLP down-projection (4304 → 1152).

Pattern matches a standard ViT block with pre-norm layout (norm → attn → residual, norm → MLP → residual). Activation between fc1/fc2 is GELU-tanh-approx per hidden_act. No attention masking inside the vision tower (all patches attend to each other).
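
To make the block layout concrete, here is a dependency-free sketch of the three building blocks on a single token vector: the tanh-approximated GELU, a LayerNorm with bias, and the pre-norm residual step. The function names and the `eps` parameter are assumptions for illustration (this spec does not state the LayerNorm epsilon); the real implementation operates on candle tensors.

```rust
/// GELU with the tanh approximation (`gelu_pytorch_tanh`):
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu_tanh(x: f32) -> f32 {
    let k = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (k * (x + 0.044715 * x * x * x)).tanh())
}

/// LayerNorm over one token vector, with both scale and bias.
/// `eps` is an assumption here; the spec does not state its value.
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(weight.iter().zip(bias))
        .map(|(v, (w, b))| (v - mean) * inv_std * w + b)
        .collect()
}

/// One pre-norm residual step: x + sublayer(norm(x)).
fn pre_norm_residual(
    x: &[f32],
    norm: impl Fn(&[f32]) -> Vec<f32>,
    sublayer: impl Fn(&[f32]) -> Vec<f32>,
) -> Vec<f32> {
    let y = sublayer(&norm(x));
    x.iter().zip(&y).map(|(a, b)| a + b).collect()
}
```

A full block applies `pre_norm_residual` twice: once with norm1 and attention, then once with norm2 and the MLP (fc1, `gelu_tanh`, fc2).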

Forward signature (target)

VisionTower::forward(
    patches: Tensor [N, in_channels, patch_size, patch_size],  // CPU-preprocessed RGB float patches
    grid_thw: Option<(usize, usize, usize)>,                   // (t, h, w) patch grid for position lookup
) -> Tensor [N / (spatial_merge_size²), out_hidden_size]      // = (N/4, 5120) for static images

Note: the merger consumes 4 spatially-adjacent patches and emits 1 LM token. So an image producing 64×64 = 4096 patches yields 1024 LM-side image tokens.
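
The patch and token arithmetic can be captured in two small helpers. This is a sketch with hypothetical names (`patch_grid`, `merged_token_count`); it returns `None` instead of rounding when a dimension does not divide evenly.

```rust
/// Patch grid (rows, cols) for an image whose sides are whole multiples of
/// `patch_size`; None otherwise.
fn patch_grid(height_px: usize, width_px: usize, patch_size: usize) -> Option<(usize, usize)> {
    if patch_size == 0 || height_px % patch_size != 0 || width_px % patch_size != 0 {
        return None;
    }
    Some((height_px / patch_size, width_px / patch_size))
}

/// Number of LM-side image tokens for a patch grid, or None when the grid is
/// not a whole number of merge x merge windows.
fn merged_token_count(grid_h: usize, grid_w: usize, merge: usize) -> Option<usize> {
    if merge == 0 || grid_h % merge != 0 || grid_w % merge != 0 {
        return None;
    }
    Some((grid_h / merge) * (grid_w / merge))
}
```

With patch_size = 16 and spatial_merge_size = 2, a 64×64 patch grid gives 1024 tokens, matching the note above.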

Image preprocessor (preprocessor_config.json)

{
    "size": { "longest_edge": 16777216, "shortest_edge": 65536 },
    "patch_size": 16,
    "temporal_patch_size": 2,
    "merge_size": 2,
    "image_mean": [0.5, 0.5, 0.5],
    "image_std":  [0.5, 0.5, 0.5],
    "processor_class": "Qwen3VLProcessor",
    "image_processor_type": "Qwen2VLImageProcessorFast"
}

Reading:

  • image_mean = image_std = 0.5 → normalisation is simply (x/255 - 0.5) / 0.5 = 2*x/255 - 1, mapping [0, 255] → [-1, 1]. No ImageNet-style mean/std.
  • size.{shortest_edge, longest_edge} are pixel counts, not edge lengths. The Qwen2VLImageProcessorFast recipe picks a resolution within [65,536 = 256², 16,777,216 = 4096²] total pixels, snapping h and w to multiples of patch_size × spatial_merge_size = 32 pixels.
  • Stage A ships fixed resolution: pick a target pixel count (e.g. 448×448 = 200,704 px → 28×28 patches → 14×14 LM tokens after merger). Variable resolution deferred to issue #14.
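
As an illustration of the normalisation rule, the sketch below converts an interleaved RGB8 buffer into the planar (3, H, W) f32 layout the commit notes describe. It assumes the image has already been decoded and resized; `normalize_channel` and `rgb8_to_chw` are hypothetical names, not the crate's actual API.

```rust
/// Map one 8-bit channel value to [-1, 1] using mean = std = 0.5.
fn normalize_channel(x: u8) -> f32 {
    (x as f32 / 255.0 - 0.5) / 0.5
}

/// Convert interleaved RGB8 (H, W, 3) into normalised planar (3, H, W) f32.
fn rgb8_to_chw(rgb: &[u8], height: usize, width: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), height * width * 3, "buffer does not match H x W x 3");
    let plane = height * width;
    let mut out = vec![0.0f32; 3 * plane];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Channel c of pixel i lands in plane c at offset i.
            out[c * plane + i] = normalize_channel(px[c]);
        }
    }
    out
}
```

For a 448×448 input the output holds 3 × 448 × 448 = 602,112 floats.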

Chat template (chat_template.jinja)

Image insertion (lines 8–18 of the template):

{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
    ...
    {{- '<|vision_start|><|image_pad|><|vision_end|>' }}

Per image, the template emits one <|image_pad|> token flanked by <|vision_start|> and <|vision_end|> sentinels. The runtime must:

  1. Render the template (preserving the single <|image_pad|> per image).
  2. For each image, replace its single <|image_pad|> with N copies, where N is the number of LM tokens that image produces after the vision tower + merger (= patches / spatial_merge_size²).
  3. Tokenize the expanded string → input_ids.
  4. At forward time, locate positions where input_ids == image_token_id (248056) and splice in the vision tower's merger output.
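
Steps 2 and 4 can be sketched on plain strings and token ids. This is a minimal illustration under the assumption that placeholders appear in the rendered prompt in the same order as the images; `expand_image_pads` and `image_positions` are hypothetical helpers, and the actual splice of embeddings is left to Stage B.

```rust
const IMAGE_PAD: &str = "<|image_pad|>";

/// Step 2: replace the i-th `<|image_pad|>` with `tokens_per_image[i]` copies.
/// Fails if the placeholder count and the image count disagree.
fn expand_image_pads(rendered: &str, tokens_per_image: &[usize]) -> Result<String, String> {
    let parts: Vec<&str> = rendered.split(IMAGE_PAD).collect();
    let found = parts.len() - 1;
    if found != tokens_per_image.len() {
        return Err(format!(
            "template has {found} image placeholders but {} images were supplied",
            tokens_per_image.len()
        ));
    }
    let mut out = String::new();
    for (i, part) in parts.iter().enumerate() {
        out.push_str(part);
        // The last part has no placeholder after it, so `get` returns None.
        if let Some(&n) = tokens_per_image.get(i) {
            out.push_str(&IMAGE_PAD.repeat(n));
        }
    }
    Ok(out)
}

/// Step 4: positions in `input_ids` that hold the image token.
fn image_positions(input_ids: &[u32], image_token_id: u32) -> Vec<usize> {
    input_ids
        .iter()
        .enumerate()
        .filter(|(_, &id)| id == image_token_id)
        .map(|(pos, _)| pos)
        .collect()
}
```

The number of positions found in step 4 should equal the number of rows the vision tower returns; a mismatch is worth treating as a hard error rather than silently truncating.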

Token IDs (top of config.json):

  • vision_start_token_id: 248053
  • vision_end_token_id: 248054
  • image_token_id: 248056
  • video_token_id: 248057 (out of scope)
  • bos_token_id: 248044
  • eos_token_id: 248044, 248046 (per generation_config.json)

System messages cannot contain images (template raises). Other template-side details:

  • add_vision_id (jinja arg, default false): emits 'Picture N: ' prefixes when true.
  • preserve_thinking (jinja arg, default false): keeps <think> blocks from prior assistant turns in the rendered prompt.
  • enable_thinking (jinja arg, default true): emits <think>\n (or skips it) at the end of the generation prompt.

The existing chat-template renderer in crates/neuron/src/harness/chat_template.rs already passes MessageContent::Parts to the Jinja context as a Value::Array; the template's is iterable branch (line 6 of the template) handles them. The path is structurally in place — Stage B just needs to do the <|image_pad|> expansion + token-position-aware splice.

LM-side considerations

The LM's RoPE config uses multi-axis RoPE (MRoPE):

rope_parameters: {
    mrope_interleaved: true,
    mrope_section: [11, 11, 10],         # temporal + height + width components (text tokens use the same index on all three)
    partial_rotary_factor: 0.25,
    rope_theta: 10000000,
    rope_type: "default"
}

MRoPE encodes spatial position alongside text position so the LM attention layers can reason about image-token spatial structure. Whether the LM's existing forward path already implements this is unverified: the qwen3_5 module's doc-comment notes "numerical correctness vs the reference Python is not yet validated." Verifying MRoPE behaviour in the language model is out of Stage A scope (vision tower only) but will be required in Stage B (LM splice) and is tracked under the numerical-validation issue #15.

max_position_embeddings = 262144 (256 K context), so at 196 image tokens per 448×448 image, context length is not a practical constraint for vision.

Iteration target decision

The vision tower has its own self-contained weight tree and is small (333 tensors in 2 shards, hidden_size 1152 vs the LM's 5120). For Stage A specifically (vision-tower-only smoke tests), a smaller iteration model is unnecessary; we can:

  • Build the Rust VisionTower struct against the spec above.
  • Run unit tests with random tensor weights matching the exact shapes → assert forward produces correct output shape with finite values.
  • Optionally: a CUDA-integration test that loads just the 2 vision shards from beast's cache (or on a smaller GPU like quadbrat's Ampere) and runs encode on a real image. Doesn't require loading the 27B LM at all.

This sidesteps the "develop against a smaller VL model" question for Stage A. Stage B (LM splice → end-to-end chat with vision) is where iteration speed becomes pressing; revisit there. The default scope pick 2a (smaller iteration model) is therefore deferred to Stage B planning — issue #13 covers deployment validation regardless.

Concrete Stage A1+ inputs

  • Add deps to crates/neuron/Cargo.toml:
    • image = "0.25"
    • base64 = "0.22"
  • Stage A2 preprocessor target resolution (fixed): 448×448 → 28×28 patches → 14×14 = 196 image tokens per image. This balances minimum-patch-count for cheap tests against the model's expected input range.
  • Stage A3 module structure: one VisionTower struct holding patch_embed: Conv2d, pos_embed: Embedding, blocks: Vec<VisionBlock>, merger: Merger. VisionBlock carries norm1, norm2, attn, mlp. Hand-roll using candle primitives.
  • Stage A4 weight loading: extend Qwen3_5ForCausalLM::new() to construct Some(VisionTower::new(vb.pp("model.visual"), config)) when vision_config is present in the parsed config.
  • Stage A5 worker job: Job::EncodeImage { handle, patches: Vec<f32>, patch_shape: (usize, usize, usize, usize, usize), reply: oneshot<Result<Vec<f32>>> }. Patch shape = (N, C, T, H, W) where T=1 for static images.
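
The job shape above can be illustrated with a standard-library worker. This sketch swaps the oneshot reply for `std::sync::mpsc`, uses a plain `u64` in place of the real handle type, and echoes the input instead of running the vision tower; all of those are assumptions made so the example stays self-contained.

```rust
use std::sync::mpsc;
use std::thread;

type Reply = mpsc::Sender<Result<Vec<f32>, String>>;

enum Job {
    EncodeImage {
        handle: u64, // stand-in for the real arch handle type
        patches: Vec<f32>,
        patch_shape: (usize, usize, usize, usize, usize), // (N, C, T, H, W)
        reply: Reply,
    },
    Shutdown,
}

/// Spawn a worker that owns all "device" state and only exchanges flat
/// CPU-side buffers with callers.
fn spawn_worker(known_handle: u64) -> (mpsc::Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let join = thread::spawn(move || {
        for job in rx {
            match job {
                Job::EncodeImage { handle, patches, patch_shape, reply } => {
                    let (n, c, t, h, w) = patch_shape;
                    let result = if handle != known_handle {
                        Err(format!("unknown arch handle {handle}"))
                    } else if patches.len() != n * c * t * h * w {
                        Err("patch buffer length does not match patch_shape".to_string())
                    } else {
                        // Placeholder for the real encode step: echo the input.
                        Ok(patches)
                    };
                    // The caller may have given up waiting; ignore send errors.
                    let _ = reply.send(result);
                }
                Job::Shutdown => break,
            }
        }
    });
    (tx, join)
}
```

Validating the buffer length against `patch_shape` before touching the device keeps shape mistakes on the caller's side of the channel, where they are cheap to report.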

What this doc does NOT settle (deferred to issues)

  • Numerical correctness of VisionTower output vs Python transformers → issue #15.
  • Variable image resolution → issue #14.
  • TP-vision (multi-rank vision tower) → issue #12.
  • 27B production deployment → issue #13.