Stage A of the vision implementation plan (doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding that today's silent-drop failure mode (issue #3) needs: the Qwen3.6 ViT loads from `model.visual.*`, runs forward producing post-merger LM-side image embeddings, and routes through the device worker via a new `Job::EncodeImage`. No LM splice yet; that's Stage B.

Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision), #13 (27B production deploy), #14 (dynamic resolution), #15 (numerical validation).

What landed:

- **A0: investigation**: pulled config.json, preprocessor_config.json, chat_template.jinja, and safetensors index from beast's local Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with exact tensor shapes for every `model.visual.*` weight. Confirms 27-block ViT with `hidden_size=1152`, `patch_size=16`, `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives in 2 of the 15 safetensors shards.
- **A1: deps + scaffolding**: added `image = "0.25"` (default-features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to crates/neuron/Cargo.toml. Created `harness::preprocess` and `harness::arch::qwen3_5::vision` modules.
- **A2: preprocess.rs**: `decode_data_uri` strips `data:image/...;base64,...` → image bytes → `image::DynamicImage` (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess` resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448), normalises to `[-1, 1]` per the model's mean/std=0.5, emits row-major `(3, H, W)` f32. 9 unit tests covering data URI parse, decode failure paths, grayscale-to-RGB promotion, and the exact-value normalisation contract.
- **A3: vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`, learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN + multi-head self-attention with fused QKV + GELU-tanh MLP + residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat → linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size). Includes the Conv3d→Conv2d fold trick documented at the top of the file: the published patch_embed.proj.weight is 5D `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static images we sum-collapse the temporal axis. Video would need real Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh` reference values from PyTorch.
- **A4: wire vision into Qwen3_5ForCausalLM**: extended `Config` with optional `vision_config: Option<VisionConfig>` and `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision tower when present, exposes `has_vision()` and `vision()` so the HTTP layer can advertise capability and so the encode path can reach it.
- **A5: device worker `Job::EncodeImage`**: new job variant carrying CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the tensor on the worker's device, calls `arch.encode_image(image)`, copies the result back to CPU as flat `Vec<f32>`. Keeps the "tensors don't escape the worker" invariant. Poisoned-worker drain path handles the new variant.
- **A6: dispatch round-trip test**: `encode_image_routes_to_dispatch_and_errors_on_unknown_handle` proves the channel/dispatch wiring works end-to-end via the CPU device worker (errors on unknown ArchHandle, which is the expected behaviour without a loaded model; real-weights validation happens in Stage B when the LM splice path exists).

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, zero failures). New test counts: +9 in preprocess, +5 in vision, +1 in device_worker.

Out of scope (deferred):

- LM-side splice of image embeddings at `<|image_pad|>` positions → Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models / advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution (#14), numerical validation (#15).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qwen3.6-27B vision specification (Stage A0)
Sourced from beast's local cache on 2026-06-01:
/archive3/llm-cache/models--Qwen--Qwen3.6-27B/snapshots/6a9e13bd6fc8f0983b9b99948120bc37f49c13e9/.
Single source of truth for Stages A–D of the vision plan in
~/.claude/plans/foamy-twirling-catmull.md. Umbrella issue:
#3.
Top-level shape
The model is a unified text+vision architecture (Qwen3_5ForConditionalGeneration,
model_type: qwen3_5) with three weight sections under a single safetensors
index. Counts from model.safetensors.index.json:
| Prefix | Tensors | Role |
|---|---|---|
| `model.language_model.*` | 850 | LM (currently loaded) |
| `model.visual.*` | 333 | Vision tower (currently filtered out at arch/qwen3_5/mod.rs:228-230) |
| `mtp.*` | 15 | Multi-token-prediction heads (filtered, out of scope) |
| `lm_head.weight` | 1 | LM head |
Vision tensors live in shards model-00007-of-00015.safetensors and
model-00008-of-00015.safetensors (2 of the 15 safetensors). Loading just
these two for vision-tower-only smoke tests is feasible.
Vision tower architecture (model.visual.*)
From config.json::vision_config:
depth: 27 (transformer blocks)
hidden_size: 1152 (vision token dim)
num_heads: 16 (per-block self-attention)
intermediate_size: 4304 (MLP hidden)
patch_size: 16 (16×16 spatial patches)
temporal_patch_size: 2 (video frame pairing; irrelevant for stills)
spatial_merge_size: 2 (2×2 spatial merge in the merger → 4 patches/LM token)
num_position_embeddings: 2304 (learned pos embed slots — max patch sequence length)
in_channels: 3 (RGB)
hidden_act: gelu_pytorch_tanh (GELU with tanh approximation, not exact GELU)
out_hidden_size: 5120 (= LM hidden_size, merger output dim)
deepstack_visual_indexes: [] (no deep-stack visual indexes)
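The shape arithmetic implied by these values can be checked with a small sketch. `VisionDims` and its methods are hypothetical names for illustration, not the repository's `VisionConfig`; only the numeric values come from the config above.

```rust
/// Hypothetical mirror of the `vision_config` fields needed for shape arithmetic.
/// Not the repository's `VisionConfig`; names are illustrative only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct VisionDims {
    depth: usize,
    hidden_size: usize,
    num_heads: usize,
    intermediate_size: usize,
    patch_size: usize,
    spatial_merge_size: usize,
    out_hidden_size: usize,
}

impl VisionDims {
    /// Values copied from the `vision_config` listing above.
    fn qwen3_6_27b() -> Self {
        Self {
            depth: 27,
            hidden_size: 1152,
            num_heads: 16,
            intermediate_size: 4304,
            patch_size: 16,
            spatial_merge_size: 2,
            out_hidden_size: 5120,
        }
    }

    /// Per-head width of the self-attention.
    fn head_dim(&self) -> usize {
        self.hidden_size / self.num_heads
    }

    /// Output width of the fused QKV projection.
    fn qkv_out(&self) -> usize {
        3 * self.hidden_size
    }

    /// Width of one token after concatenating a merge×merge group of patches.
    fn merged_width(&self) -> usize {
        self.hidden_size * self.spatial_merge_size * self.spatial_merge_size
    }

    /// Pixel edge covered by one LM-side image token.
    fn pixels_per_lm_token_edge(&self) -> usize {
        self.patch_size * self.spatial_merge_size
    }
}
```

With the values above this gives a head dimension of 72, a fused QKV width of 3456, and a 32-pixel edge per LM token.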
Module inventory (per-block and global)
Global:
- `model.visual.patch_embed.proj.{weight, bias}`: Conv2d (3 → 1152, kernel 16×16, stride 16). Turns image patches into tokens.
- `model.visual.pos_embed.weight`: Learned position embedding, shape `(2304, 1152)`.
- `model.visual.merger.{norm, linear_fc1, linear_fc2}`: The projector that merges 2×2 patches and projects to LM hidden_size (1152 → 5120). All weights have biases.
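The commit notes record that the published `patch_embed.proj.weight` is 5D `(1152, 3, 2, 16, 16)` and is sum-collapsed over the temporal axis so it can be used as a Conv2d kernel. A minimal sketch of that fold on a flat row-major buffer follows. `fold_temporal` is a hypothetical helper, not the code in vision.rs, and the equivalence with the 3D convolution assumes the same still frame fills every temporal slot.

```rust
/// Collapse a Conv3d kernel stored row-major as (out, in, t, kh, kw) into a
/// Conv2d kernel (out, in, kh, kw) by summing over the temporal axis.
/// Assumption: the input repeats one still frame across all `t` slots, so
/// summing the temporal taps gives the same result as the 3D convolution.
fn fold_temporal(
    weight: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(weight.len(), out_c * in_c * t * kh * kw);
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    // `oi` is the combined (out, in) index; each pair owns `t` planes.
    for oi in 0..out_c * in_c {
        for ti in 0..t {
            let src = (oi * t + ti) * plane;
            let dst = oi * plane;
            for p in 0..plane {
                folded[dst + p] += weight[src + p];
            }
        }
    }
    folded
}
```

For the published shape this would be called as `fold_temporal(&w, 1152, 3, 2, 16, 16)`, yielding a `(1152, 3, 16, 16)` kernel. Video input, where the two frames differ, would not be covered by this shortcut.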
Per block (×27, named model.visual.blocks.{0..26}):
- `norm1.{weight, bias}`: LayerNorm before attention (with bias; not RmsNorm).
- `attn.qkv.{weight, bias}`: Fused QKV linear (1152 → 3·1152 = 3456).
- `attn.proj.{weight, bias}`: Attention output projection (1152 → 1152).
- `norm2.{weight, bias}`: LayerNorm before MLP.
- `mlp.linear_fc1.{weight, bias}`: MLP up-projection (1152 → 4304).
- `mlp.linear_fc2.{weight, bias}`: MLP down-projection (4304 → 1152).
Pattern matches a standard ViT block with pre-norm layout (norm → attn → residual, norm → MLP → residual). Activation between fc1/fc2 is GELU-tanh-approx per hidden_act. No attention masking inside the vision tower (all patches attend to each other).
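For reference, the tanh approximation named by `hidden_act` can be written as a scalar function. This is a minimal sketch of the standard formula, not the repository's implementation, and `gelu_tanh` is an illustrative name.

```rust
/// GELU with the tanh approximation (`gelu_pytorch_tanh`):
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu_tanh(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    const COEFF: f32 = 0.044_715;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + COEFF * x * x * x)).tanh())
}
```

The approximation differs slightly from erf-based GELU, so a port that silently uses the exact form would drift from the reference by small amounts in every MLP.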
Forward signature (target)
VisionTower::forward(
patches: Tensor [N, in_channels, patch_size, patch_size], # CPU-preprocessed RGB float patches
grid_thw: Option<(usize, usize, usize)>, # (t, h, w) patch grid for position lookup
) -> Tensor [N / (spatial_merge_size²), out_hidden_size] # = (N/4, 5120) for static images
Note: the merger consumes 4 spatially-adjacent patches and emits 1 LM token. So an image producing 64×64 = 4096 patches yields 1024 LM-side image tokens.
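The token count can be computed directly from the image size. The sketch below uses a hypothetical helper, `lm_tokens`, and assumes the image has already been resized so both edges are multiples of `patch_size × spatial_merge_size`.

```rust
/// Number of LM-side image tokens for an image of the given pixel size.
/// Returns `None` when an edge is not a multiple of `patch_size * merge`,
/// since the merger needs whole merge×merge groups of patches.
fn lm_tokens(height_px: usize, width_px: usize, patch_size: usize, merge: usize) -> Option<usize> {
    let unit = patch_size * merge;
    if unit == 0 || height_px % unit != 0 || width_px % unit != 0 {
        return None;
    }
    let patches = (height_px / patch_size) * (width_px / patch_size);
    Some(patches / (merge * merge))
}
```

A 1024×1024 image gives 64×64 = 4096 patches and therefore 1024 LM tokens, matching the note above.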
Image preprocessor (preprocessor_config.json)
{
"size": { "longest_edge": 16777216, "shortest_edge": 65536 },
"patch_size": 16,
"temporal_patch_size": 2,
"merge_size": 2,
"image_mean": [0.5, 0.5, 0.5],
"image_std": [0.5, 0.5, 0.5],
"processor_class": "Qwen3VLProcessor",
"image_processor_type": "Qwen2VLImageProcessorFast"
}
Reading:
- `image_mean = image_std = 0.5` → normalisation is simply `(x/255 - 0.5) / 0.5 = 2*x/255 - 1`, mapping `[0, 255]` → `[-1, 1]`. No imagenet-style mean/std.
- `size.{shortest_edge, longest_edge}` are pixel counts, not edge lengths. The `Qwen2VLImageProcessorFast` recipe picks a resolution within `[65,536 = 256², 16,777,216 = 4096²]` total pixels, snapping `h` and `w` to multiples of `patch_size × spatial_merge_size = 32` pixels.
- Stage A ships fixed resolution: pick a target pixel count (e.g. 448×448 = 200,704 px → 28×28 patches → 14×14 LM tokens after merger). Variable resolution deferred to issue #14.
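The normalisation and the row-major `(3, H, W)` layout can be sketched with the standard library alone. Both function names are hypothetical, and the input is assumed to be interleaved 8-bit RGB in row-major order (the layout an already-decoded, already-resized RGB buffer would typically have).

```rust
/// Map an 8-bit channel value to [-1, 1] using mean = std = 0.5.
fn normalise(x: u8) -> f32 {
    (x as f32 / 255.0 - 0.5) / 0.5
}

/// Convert interleaved RGB bytes (H, W, 3) into a planar, normalised
/// (3, H, W) f32 buffer. Assumption: `rgb` is row-major and already resized.
fn hwc_to_chw_normalised(rgb: &[u8], h: usize, w: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), h * w * 3);
    let mut out = vec![0.0f32; 3 * h * w];
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                // Source index walks pixels; destination index walks planes.
                out[c * h * w + y * w + x] = normalise(rgb[(y * w + x) * 3 + c]);
            }
        }
    }
    out
}
```

Pure black maps to -1.0 and pure white to 1.0 on every channel, which is the contract a unit test can pin down exactly.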
Chat template (chat_template.jinja)
Image insertion (lines 8–18 of the template):
{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
...
{{- '<|vision_start|><|image_pad|><|vision_end|>' }}
Per image, the template emits one `<|image_pad|>` token flanked by `<|vision_start|>` and `<|vision_end|>` sentinels. The runtime must:
- Render the template (preserving the single `<|image_pad|>` per image).
- For each image, replace its single `<|image_pad|>` with N copies, where N is the number of LM tokens that image produces after the vision tower + merger (= patches / spatial_merge_size²).
- Tokenize the expanded string → `input_ids`.
- At forward time, locate positions where `input_ids == image_token_id` (248056) and splice in the vision tower's merger output.
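The expansion and position-lookup steps above can be sketched as follows. `expand_image_pads` and `image_positions` are hypothetical helpers, not existing functions in the crate; the error handling is one possible choice for mismatched image counts.

```rust
const IMAGE_PAD: &str = "<|image_pad|>";
const IMAGE_TOKEN_ID: u32 = 248056;

/// Replace the i-th `<|image_pad|>` in the rendered prompt with
/// `tokens_per_image[i]` copies. Errors if the counts disagree.
fn expand_image_pads(rendered: &str, tokens_per_image: &[usize]) -> Result<String, String> {
    let mut out = String::with_capacity(rendered.len());
    let mut rest = rendered;
    let mut counts = tokens_per_image.iter();
    while let Some(pos) = rest.find(IMAGE_PAD) {
        let n = *counts
            .next()
            .ok_or_else(|| "more <|image_pad|> placeholders than images".to_string())?;
        out.push_str(&rest[..pos]);
        for _ in 0..n {
            out.push_str(IMAGE_PAD);
        }
        // Skip past the original placeholder so it is not matched again.
        rest = &rest[pos + IMAGE_PAD.len()..];
    }
    if counts.next().is_some() {
        return Err("more images than <|image_pad|> placeholders".to_string());
    }
    out.push_str(rest);
    Ok(out)
}

/// Positions in `input_ids` where image embeddings should be spliced in.
fn image_positions(input_ids: &[u32], image_token_id: u32) -> Vec<usize> {
    input_ids
        .iter()
        .enumerate()
        .filter_map(|(i, &id)| (id == image_token_id).then_some(i))
        .collect()
}
```

For the fixed 448×448 profile each image would expand to 196 placeholders, and `image_positions(&input_ids, IMAGE_TOKEN_ID)` should then return 196 indices per image, one per row of the merger output.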
Token IDs (top of config.json):
- `vision_start_token_id`: 248053
- `vision_end_token_id`: 248054
- `image_token_id`: 248056
- `video_token_id`: 248057 (out of scope)
- `bos_token_id`: 248044
- `eos_token_id`: 248044, 248046 (per `generation_config.json`)
System messages cannot contain images (template raises). Other template-side details:
- `add_vision_id` (jinja arg, default false): emits `'Picture N: '` prefixes when true.
- `preserve_thinking` (jinja arg, default false): keeps `<think>` blocks from prior assistant turns in the rendered prompt.
- `enable_thinking` (jinja arg, default true): emits `<think>\n` (or skips it) at the end of the generation prompt.
The existing chat-template renderer in crates/neuron/src/harness/chat_template.rs already passes MessageContent::Parts to the Jinja context as a Value::Array; the template's is iterable branch (line 6 of the template) handles them. The path is structurally in place — Stage B just needs to do the <|image_pad|> expansion + token-position-aware splice.
LM-side considerations
The LM's RoPE config uses multi-axis RoPE (MRoPE):
rope_parameters: {
mrope_interleaved: true,
mrope_section: [11, 11, 10], # text + height + width components
partial_rotary_factor: 0.25,
rope_theta: 10000000,
rope_type: "default"
}
MRoPE encodes spatial position alongside text position so the LM attention layers can reason about image-token spatial structure. The LM's existing forward path may or may not already implement this — the qwen3_5 module's doc-comment notes "numerical correctness vs the reference Python is not yet validated." Verifying MRoPE behaviour in the language model is out of Stage A scope (vision tower only) but will be required in Stage B (LM splice) and is tracked under the numerical-validation issue #15.
max_position_embeddings = 262144 (256 K context), so context-length limits are not a constraint for vision.
Iteration target decision
The vision tower has its own self-contained weight tree and is small (~333 tensors in 2 shards, hidden_size 1152 vs LM's 5120). For Stage A specifically (vision-tower-only smoke), we don't need a smaller iteration model — we can:
- Build the Rust `VisionTower` struct against the spec above.
- Run unit tests with random tensor weights matching the exact shapes → assert forward produces correct output shape with finite values.
- Optionally: a CUDA-integration test that loads just the 2 vision shards from beast's cache (or on a smaller GPU like quadbrat's Ampere) and runs encode on a real image. Doesn't require loading the 27B LM at all.
This sidesteps the "develop against a smaller VL model" question for Stage A. Stage B (LM splice → end-to-end chat with vision) is where iteration speed becomes pressing; revisit there. The default scope pick 2a (smaller iteration model) is therefore deferred to Stage B planning — issue #13 covers deployment validation regardless.
Concrete Stage A1+ inputs
- Add deps to `crates/neuron/Cargo.toml`:
  - `image = "0.25"`
  - `base64 = "0.22"`
- Stage A2 preprocessor target resolution (fixed): 448×448 → 28×28 patches → 14×14 = 196 image tokens per image. This balances minimum-patch-count for cheap tests against the model's expected input range.
- Stage A3 module structure: one `VisionTower` struct holding `patch_embed: Conv2d`, `pos_embed: Embedding`, `blocks: Vec<VisionBlock>`, `merger: Merger`. `VisionBlock` carries `norm1`, `norm2`, `attn`, `mlp`. Hand-roll using candle primitives.
- Stage A4 weight loading: extend `Qwen3_5ForCausalLM::new()` to construct `Some(VisionTower::new(vb.pp("model.visual"), config))` when `vision_config` is present in the parsed config.
- Stage A5 worker job: `Job::EncodeImage { handle, patches: Vec<f32>, patch_shape: (usize, usize, usize, usize, usize), reply: oneshot<Result<Vec<f32>>> }`. Patch shape = `(N, C, T, H, W)` where T=1 for static images.
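The shape of the Stage A5 job and its reply path can be sketched with standard-library channels. This is a stand-in under stated assumptions: `std::sync::mpsc` replaces the oneshot channel, `u64` replaces the real handle type, `spawn_worker` and `encode_image` are hypothetical names, and the encode itself is a zero-filled placeholder sized `(N/4, 5120)` rather than a real vision-tower forward.

```rust
use std::sync::mpsc;
use std::thread;

type Reply = mpsc::Sender<Result<Vec<f32>, String>>;

enum Job {
    EncodeImage {
        handle: u64,
        patches: Vec<f32>,
        /// (N, C, T, H, W); T = 1 for static images.
        patch_shape: (usize, usize, usize, usize, usize),
        reply: Reply,
    },
}

/// Spawn a worker that owns all "device" state; only flat Vec<f32> crosses back.
fn spawn_worker(known_handle: u64) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        for job in rx {
            match job {
                Job::EncodeImage { handle, patches, patch_shape, reply } => {
                    let (n, c, t, h, w) = patch_shape;
                    let result = if handle != known_handle {
                        Err(format!("unknown arch handle {handle}"))
                    } else if patches.len() != n * c * t * h * w {
                        Err("patch buffer length does not match patch_shape".to_string())
                    } else {
                        // Placeholder for the real forward: (N / 4) tokens of width 5120.
                        Ok(vec![0.0f32; (n / 4) * 5120])
                    };
                    // The caller may have gone away; ignore a failed send.
                    let _ = reply.send(result);
                }
            }
        }
    });
    tx
}

/// Submit an encode job and block until the worker replies.
fn encode_image(
    worker: &mpsc::Sender<Job>,
    handle: u64,
    patches: Vec<f32>,
    patch_shape: (usize, usize, usize, usize, usize),
) -> Result<Vec<f32>, String> {
    let (reply, rx) = mpsc::channel();
    worker
        .send(Job::EncodeImage { handle, patches, patch_shape, reply })
        .map_err(|_| "worker is gone".to_string())?;
    rx.recv().map_err(|_| "worker dropped the reply".to_string())?
}
```

Calling `encode_image` with a handle the worker does not know returns an error without touching the patch data, which is the behaviour the dispatch round-trip test relies on when no model is loaded.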