All checks were successful
CI / CUDA type-check (push) Successful in 32s
build-prerelease / Resolve version stamps (push) Successful in 30s
CI / Format (push) Successful in 28s
CI / Clippy (push) Successful in 2m35s
build-prerelease / Build cortex binary (push) Successful in 5m13s
build-prerelease / Build neuron-blackwell (push) Successful in 6m23s
build-prerelease / Build neuron-ampere (push) Successful in 7m56s
CI / Test (push) Successful in 7m11s
CI / Build cortex SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Build neuron-ada (push) Successful in 5m30s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 2m56s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m45s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 4m25s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m1s
Stage A of the vision implementation plan (doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding that today's silent-drop failure mode (issue #3) needs — the Qwen3.6 ViT loads from `model.visual.*`, runs forward producing post-merger LM-side image embeddings, and routes through the device worker via a new `Job::EncodeImage`. No LM splice yet — that's Stage B. Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision), #13 (27B production deploy), #14 (dynamic resolution), #15 (numerical validation). What landed: - **A0 — investigation**: pulled config.json, preprocessor_config.json, chat_template.jinja, and safetensors index from beast's local Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with exact tensor shapes for every `model.visual.*` weight. Confirms 27-block ViT with `hidden_size=1152`, `patch_size=16`, `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives in 2 of the 15 safetensors shards. - **A1 — deps + scaffolding**: added `image = "0.25"` (default- features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to crates/neuron/Cargo.toml. Created `harness::preprocess` and `harness::arch::qwen3_5::vision` modules. - **A2 — preprocess.rs**: `decode_data_uri` strips `data:image/...;base64,...` → image bytes → `image::DynamicImage` (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess` resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448), normalises to `[-1, 1]` per the model's mean/std=0.5, emits row-major `(3, H, W)` f32. 9 unit tests covering data URI parse, decode failure paths, grayscale-to-RGB promotion, and the exact-value normalisation contract. - **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`, learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN + multi-head self-attention with fused QKV + GELU-tanh MLP + residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat → linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size). Includes the Conv3d→Conv2d fold trick documented at the top of the file — the published patch_embed.proj.weight is 5D `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static images we sum-collapse the temporal axis. Video would need real Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh` reference values from PyTorch. - **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config` with optional `vision_config: Option<VisionConfig>` and `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision tower when present, exposes `has_vision()` and `vision()` so the HTTP layer can advertise capability and so the encode path can reach it. - **A5 — device worker `Job::EncodeImage`**: new job variant carrying CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the tensor on the worker's device, calls `arch.encode_image(image)`, copies the result back to CPU as flat `Vec<f32>`. Keeps the "tensors don't escape the worker" invariant. Poisoned-worker drain path handles the new variant. - **A6 — dispatch round-trip test**: `encode_image_routes_to_dispatch_ and_errors_on_unknown_handle` proves the channel/dispatch wiring works end-to-end via the CPU device worker (errors on unknown ArchHandle, which is the expected behaviour without a loaded model — real-weights validation happens in Stage B when the LM splice path exists). CI gate: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace (all 28 test groups ok, zero failures). New test counts: +9 in preprocess, +5 in vision, +1 in device_worker. Out of scope (deferred): - LM-side splice of image embeddings at `<|image_pad|>` positions → Stage B. - Streaming SSE for vision-bearing chat completions → Stage C. - Reject `image_url` with HTTP 400 for non-vision models / advertise `capabilities` in /v1/models → Stage C. - TP-vision (#12), 27B production deploy (#13), dynamic resolution (#14), numerical validation (#15). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
177 lines
10 KiB
Markdown
177 lines
10 KiB
Markdown
# Qwen3.6-27B vision specification (Stage A0)
|
||
|
||
Sourced from beast's local cache on 2026-06-01:
|
||
`/archive3/llm-cache/models--Qwen--Qwen3.6-27B/snapshots/6a9e13bd6fc8f0983b9b99948120bc37f49c13e9/`.
|
||
|
||
Single source of truth for Stages A–D of the vision plan in
|
||
`~/.claude/plans/foamy-twirling-catmull.md`. Umbrella issue:
|
||
[#3](https://git.lair.cafe/helexa/cortex/issues/3).
|
||
|
||
---
|
||
|
||
## Top-level shape
|
||
|
||
The model is a unified text+vision architecture (`Qwen3_5ForConditionalGeneration`,
|
||
`model_type: qwen3_5`) with three weight sections under a single safetensors
|
||
index. Counts from `model.safetensors.index.json`:
|
||
|
||
| Prefix | Tensors | Role |
|
||
|---|---|---|
|
||
| `model.language_model.*` | 850 | LM (currently loaded) |
|
||
| `model.visual.*` | 333 | Vision tower (currently filtered out at `arch/qwen3_5/mod.rs:228-230`) |
|
||
| `mtp.*` | 15 | Multi-token-prediction heads (filtered, out of scope) |
|
||
| `lm_head.weight` | 1 | LM head |
|
||
|
||
Vision tensors live in shards `model-00007-of-00015.safetensors` and
|
||
`model-00008-of-00015.safetensors` (2 of the 15 safetensors). Loading just
|
||
these two for vision-tower-only smoke tests is feasible.
|
||
|
||
## Vision tower architecture (`model.visual.*`)
|
||
|
||
From `config.json::vision_config`:
|
||
|
||
```
|
||
depth: 27 (transformer blocks)
|
||
hidden_size: 1152 (vision token dim)
|
||
num_heads: 16 (per-block self-attention)
|
||
intermediate_size: 4304 (MLP hidden)
|
||
patch_size: 16 (16×16 spatial patches)
|
||
temporal_patch_size: 2 (video frame pairing; irrelevant for stills)
|
||
spatial_merge_size: 2 (2×2 spatial merge in the merger → 4 patches/LM token)
|
||
num_position_embeddings: 2304 (learned pos embed slots — max patch sequence length)
|
||
in_channels: 3 (RGB)
|
||
hidden_act: gelu_pytorch_tanh (GELU with tanh approximation, not exact GELU)
|
||
out_hidden_size: 5120 (= LM hidden_size, merger output dim)
|
||
deepstack_visual_indexes: [] (no deep-stack visual indexes)
|
||
```
|
||
|
||
### Module inventory (per-block and global)
|
||
|
||
Global:
|
||
- `model.visual.patch_embed.proj.{weight, bias}` — Conv2d (3 → 1152, kernel 16×16, stride 16). Turns image patches into tokens.
|
||
- `model.visual.pos_embed.weight` — Learned position embedding, shape `(2304, 1152)`.
|
||
- `model.visual.merger.{norm, linear_fc1, linear_fc2}` — The projector that merges 2×2 patches and projects to LM hidden_size (1152 → 5120). All weights have biases.
|
||
|
||
Per block (×27, named `model.visual.blocks.{0..26}`):
|
||
- `norm1.{weight, bias}` — **LayerNorm** before attention (with bias — not RmsNorm).
|
||
- `attn.qkv.{weight, bias}` — Fused QKV linear (1152 → 3·1152 = 3456).
|
||
- `attn.proj.{weight, bias}` — Attention output projection (1152 → 1152).
|
||
- `norm2.{weight, bias}` — LayerNorm before MLP.
|
||
- `mlp.linear_fc1.{weight, bias}` — MLP up-projection (1152 → 4304).
|
||
- `mlp.linear_fc2.{weight, bias}` — MLP down-projection (4304 → 1152).
|
||
|
||
Pattern matches a standard ViT block with **pre-norm** layout (norm → attn → residual, norm → MLP → residual). Activation between fc1/fc2 is GELU-tanh-approx per `hidden_act`. No attention masking inside the vision tower (all patches attend to each other).
|
||
|
||
### Forward signature (target)
|
||
|
||
```
|
||
VisionTower::forward(
|
||
patches: Tensor [N, in_channels, patch_size, patch_size], # CPU-preprocessed RGB float patches
|
||
grid_thw: Option<(usize, usize, usize)>, # (t, h, w) patch grid for position lookup
|
||
) -> Tensor [N / (spatial_merge_size²), out_hidden_size] # = (N/4, 5120) for static images
|
||
```
|
||
|
||
Note: the merger consumes 4 spatially-adjacent patches and emits 1 LM token. So an image producing 64×64 = 4096 patches yields 1024 LM-side image tokens.
|
||
|
||
## Image preprocessor (`preprocessor_config.json`)
|
||
|
||
```json
|
||
{
|
||
"size": { "longest_edge": 16777216, "shortest_edge": 65536 },
|
||
"patch_size": 16,
|
||
"temporal_patch_size": 2,
|
||
"merge_size": 2,
|
||
"image_mean": [0.5, 0.5, 0.5],
|
||
"image_std": [0.5, 0.5, 0.5],
|
||
"processor_class": "Qwen3VLProcessor",
|
||
"image_processor_type": "Qwen2VLImageProcessorFast"
|
||
}
|
||
```
|
||
|
||
Reading:
|
||
|
||
- `image_mean = image_std = 0.5` → normalisation is simply `(x/255 - 0.5) / 0.5 = 2*x/255 - 1`, mapping `[0,255]` → `[-1, 1]`. No imagenet-style mean/std.
|
||
- `size.{shortest_edge, longest_edge}` are **pixel counts**, not edge lengths. The `Qwen2VLImageProcessorFast` recipe picks a resolution within `[65,536 = 256², 16,777,216 = 4096²]` total pixels, snapping `h` and `w` to multiples of `patch_size × spatial_merge_size = 32` pixels.
|
||
- Stage A ships **fixed resolution**: pick a target pixel count (e.g. 448×448 = 200,704 px → 28×28 patches → 14×14 LM tokens after merger). Variable resolution deferred to issue [#14](https://git.lair.cafe/helexa/cortex/issues/14).
|
||
|
||
## Chat template (`chat_template.jinja`)
|
||
|
||
Image insertion (lines 8–18 of the template):
|
||
|
||
```jinja
|
||
{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
|
||
...
|
||
{{- '<|vision_start|><|image_pad|><|vision_end|>' }}
|
||
```
|
||
|
||
Per image, the template emits **one `<|image_pad|>` token** flanked by `<|vision_start|>` and `<|vision_end|>` sentinels. The runtime must:
|
||
|
||
1. Render the template (preserving the single `<|image_pad|>` per image).
|
||
2. For each image, replace its single `<|image_pad|>` with N copies, where N is the number of LM tokens that image produces after the vision tower + merger (= `patches / spatial_merge_size²`).
|
||
3. Tokenize the expanded string → `input_ids`.
|
||
4. At forward time, locate positions where `input_ids == image_token_id` (248056) and splice in the vision tower's merger output.
|
||
|
||
Token IDs (top of `config.json`):
|
||
- `vision_start_token_id`: 248053
|
||
- `vision_end_token_id`: 248054
|
||
- `image_token_id`: 248056
|
||
- `video_token_id`: 248057 (out of scope)
|
||
- `bos_token_id`: 248044
|
||
- `eos_token_id`: 248044, 248046 (per `generation_config.json`)
|
||
|
||
System messages cannot contain images (template raises). Other template-side details:
|
||
- `add_vision_id` (jinja arg, default false): emits `'Picture N: '` prefixes when true.
|
||
- `preserve_thinking` (jinja arg, default false): keeps `<think>` blocks from prior assistant turns in the rendered prompt.
|
||
- `enable_thinking` (jinja arg, default true): emits `<think>\n` (or skips it) at the end of the generation prompt.
|
||
|
||
The existing chat-template renderer in `crates/neuron/src/harness/chat_template.rs` already passes `MessageContent::Parts` to the Jinja context as a `Value::Array`; the template's `is iterable` branch (line 6 of the template) handles them. **The path is structurally in place** — Stage B just needs to do the `<|image_pad|>` expansion + token-position-aware splice.
|
||
|
||
## LM-side considerations
|
||
|
||
The LM's RoPE config uses **multi-axis RoPE (MRoPE)**:
|
||
|
||
```
|
||
rope_parameters: {
|
||
mrope_interleaved: true,
|
||
mrope_section: [11, 11, 10], # text + height + width components
|
||
partial_rotary_factor: 0.25,
|
||
rope_theta: 10000000,
|
||
rope_type: "default"
|
||
}
|
||
```
|
||
|
||
MRoPE encodes spatial position alongside text position so the LM attention layers can reason about image-token spatial structure. The LM's existing forward path *may or may not* already implement this — the qwen3_5 module's doc-comment notes "numerical correctness vs the reference Python is not yet validated." Verifying MRoPE behaviour in the language model is out of Stage A scope (vision tower only) but will be required in Stage B (LM splice) and is tracked under the numerical-validation issue [#15](https://git.lair.cafe/helexa/cortex/issues/15).
|
||
|
||
`max_position_embeddings = 262144` (256 K context), so context-length limits are not a constraint for vision.
|
||
|
||
## Iteration target decision
|
||
|
||
The vision tower has its own self-contained weight tree and is small (~333 tensors in 2 shards, hidden_size 1152 vs LM's 5120). For Stage A specifically (vision-tower-only smoke), we **don't need a smaller iteration model** — we can:
|
||
|
||
- Build the Rust `VisionTower` struct against the spec above.
|
||
- Run unit tests with random tensor weights matching the exact shapes → assert forward produces correct output shape with finite values.
|
||
- Optionally: a CUDA-integration test that loads just the 2 vision shards from beast's cache (or on a smaller GPU like quadbrat's Ampere) and runs encode on a real image. Doesn't require loading the 27B LM at all.
|
||
|
||
This sidesteps the "develop against a smaller VL model" question for Stage A. Stage B (LM splice → end-to-end chat with vision) is where iteration speed becomes pressing; revisit there. The default scope pick 2a (smaller iteration model) is therefore deferred to Stage B planning — issue [#13](https://git.lair.cafe/helexa/cortex/issues/13) covers deployment validation regardless.
|
||
|
||
## Concrete Stage A1+ inputs
|
||
|
||
- Add deps to `crates/neuron/Cargo.toml`:
|
||
- `image = "0.25"`
|
||
- `base64 = "0.22"`
|
||
- Stage A2 preprocessor target resolution (fixed): **448×448 → 28×28 patches → 14×14 = 196 image tokens per image**. This balances minimum-patch-count for cheap tests against the model's expected input range.
|
||
- Stage A3 module structure: one `VisionTower` struct holding `patch_embed: Conv2d`, `pos_embed: Embedding`, `blocks: Vec<VisionBlock>`, `merger: Merger`. `VisionBlock` carries `norm1`, `norm2`, `attn`, `mlp`. Hand-roll using candle primitives.
|
||
- Stage A4 weight loading: extend `Qwen3_5ForCausalLM::new()` to construct `Some(VisionTower::new(vb.pp("model.visual"), config))` when `vision_config` is present in the parsed config.
|
||
- Stage A5 worker job: `Job::EncodeImage { handle, patches: Vec<f32>, patch_shape: (usize, usize, usize, usize, usize), reply: oneshot<Result<Vec<f32>>> }`. Patch shape = `(N, C, T, H, W)` where T=1 for static images.
|
||
|
||
## What this doc does NOT settle (deferred to issues)
|
||
|
||
- Numerical correctness of `VisionTower` output vs Python transformers
|
||
→ issue [#15](https://git.lair.cafe/helexa/cortex/issues/15).
|
||
- Variable image resolution
|
||
→ issue [#14](https://git.lair.cafe/helexa/cortex/issues/14).
|
||
- TP-vision (multi-rank vision tower)
|
||
→ issue [#12](https://git.lair.cafe/helexa/cortex/issues/12).
|
||
- 27B production deployment
|
||
→ issue [#13](https://git.lair.cafe/helexa/cortex/issues/13).
|