cortex/doc/vision-qwen3_6-spec.md
rob thijssen 7df84fed8f
feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6
Stage A of the vision implementation plan
(doc/vision-qwen3_6-spec.md). Builds the vision tower scaffolding
that today's silent-drop failure mode (issue #3) needs — the
Qwen3.6 ViT loads from `model.visual.*`, runs forward producing
post-merger LM-side image embeddings, and routes through the
device worker via a new `Job::EncodeImage`. No LM splice yet —
that's Stage B.

Refs #3 (umbrella). Deferred sub-stages tracked as #12 (TP-vision),
#13 (27B production deploy), #14 (dynamic resolution), #15
(numerical validation).

What landed:

- **A0 — investigation**: pulled config.json, preprocessor_config.json,
  chat_template.jinja, and safetensors index from beast's local
  Qwen3.6-27B cache. Documented in doc/vision-qwen3_6-spec.md with
  exact tensor shapes for every `model.visual.*` weight. Confirms
  27-block ViT with `hidden_size=1152`, `patch_size=16`,
  `spatial_merge_size=2`, `out_hidden_size=5120`. Vision tower lives
  in 2 of the 15 safetensors shards.

- **A1 — deps + scaffolding**: added `image = "0.25"` (default-
  features off, PNG/JPEG/WebP/BMP/GIF) and `base64 = "0.22"` to
  crates/neuron/Cargo.toml. Created `harness::preprocess` and
  `harness::arch::qwen3_5::vision` modules.

- **A2 — preprocess.rs**: `decode_data_uri` strips
  `data:image/...;base64,...` → image bytes → `image::DynamicImage`
  (rejecting `http(s)://` URLs to avoid SSRF/recursion); `preprocess`
  resizes to a fixed `PreprocessProfile::qwen3_6()` (448×448),
  normalises to `[-1, 1]` per the model's mean/std=0.5, emits
  row-major `(3, H, W)` f32. 9 unit tests covering data URI parse,
  decode failure paths, grayscale-to-RGB promotion, and the
  exact-value normalisation contract.

- **A3 — vision.rs**: `VisionTower` struct with `patch_embed: Conv2d`,
  learned `pos_embed: Embedding`, 27 `VisionBlock`s (pre-LN +
  multi-head self-attention with fused QKV + GELU-tanh MLP +
  residuals), and `VisionMerger` (LayerNorm → 2×2 spatial concat →
  linear_fc1 → GELU-tanh → linear_fc2 to LM hidden_size).
  Includes the Conv3d→Conv2d fold trick documented at the top of
  the file — the published patch_embed.proj.weight is 5D
  `(1152, 3, 2, 16, 16)` but candle 0.10 has no Conv3d; for static
  images we sum-collapse the temporal axis. Video would need real
  Conv3d. 5 unit tests including the exact `gelu_pytorch_tanh`
  reference values from PyTorch.

- **A4 — wire vision into Qwen3_5ForCausalLM**: extended `Config`
  with optional `vision_config: Option<VisionConfig>` and
  `image_token_id`; `Qwen3_5ForCausalLM::new` now loads the vision
  tower when present, exposes `has_vision()` and `vision()` so the
  HTTP layer can advertise capability and so the encode path can
  reach it.

- **A5 — device worker `Job::EncodeImage`**: new job variant carrying
  CPU-side `(C, H, W)` pixels. Dispatch handler reconstructs the
  tensor on the worker's device, calls `arch.encode_image(image)`,
  copies the result back to CPU as flat `Vec<f32>`. Keeps the
  "tensors don't escape the worker" invariant. Poisoned-worker
  drain path handles the new variant.

- **A6 — dispatch round-trip test**:
  `encode_image_routes_to_dispatch_and_errors_on_unknown_handle`
  proves the channel/dispatch wiring
  works end-to-end via the CPU device worker (errors on unknown
  ArchHandle, which is the expected behaviour without a loaded
  model — real-weights validation happens in Stage B when the LM
  splice path exists).

CI gate: cargo fmt --check, cargo clippy --workspace --all-targets
-- -D warnings, cargo test --workspace (all 28 test groups ok,
zero failures). New test counts: +9 in preprocess, +5 in vision,
+1 in device_worker.

Out of scope (deferred):
- LM-side splice of image embeddings at `<|image_pad|>` positions
  → Stage B.
- Streaming SSE for vision-bearing chat completions → Stage C.
- Reject `image_url` with HTTP 400 for non-vision models /
  advertise `capabilities` in /v1/models → Stage C.
- TP-vision (#12), 27B production deploy (#13), dynamic resolution
  (#14), numerical validation (#15).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 11:40:47 +03:00

# Qwen3.6-27B vision specification (Stage A0)
Sourced from beast's local cache on 2026-06-01:
`/archive3/llm-cache/models--Qwen--Qwen3.6-27B/snapshots/6a9e13bd6fc8f0983b9b99948120bc37f49c13e9/`.
Single source of truth for Stages A–D of the vision plan in
`~/.claude/plans/foamy-twirling-catmull.md`. Umbrella issue:
[#3](https://git.lair.cafe/helexa/cortex/issues/3).
---
## Top-level shape
The model is a unified text+vision architecture (`Qwen3_5ForConditionalGeneration`,
`model_type: qwen3_5`) with three weight sections under a single safetensors
index. Counts from `model.safetensors.index.json`:
| Prefix | Tensors | Role |
|---|---|---|
| `model.language_model.*` | 850 | LM (currently loaded) |
| `model.visual.*` | 333 | Vision tower (currently filtered out at `arch/qwen3_5/mod.rs:228-230`) |
| `mtp.*` | 15 | Multi-token-prediction heads (filtered, out of scope) |
| `lm_head.weight` | 1 | LM head |
Vision tensors live in shards `model-00007-of-00015.safetensors` and
`model-00008-of-00015.safetensors` (2 of the 15 shards), so a
vision-tower-only smoke test can load just these two files.
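A minimal sketch of how the per-prefix shard lookup could be reproduced. It assumes the `weight_map` object of `model.safetensors.index.json` has already been parsed into `(tensor, shard)` pairs; the JSON parsing itself is omitted, and `shards_for_prefix` is a hypothetical helper name, not something in the crate.

```rust
use std::collections::BTreeMap;

/// Count tensors per shard for one weight prefix.
/// `weight_map` stands in for the parsed `weight_map` object of the
/// safetensors index (assumption: parsing happens elsewhere).
fn shards_for_prefix<'a>(
    weight_map: &[(&'a str, &'a str)],
    prefix: &str,
) -> BTreeMap<&'a str, usize> {
    let mut per_shard = BTreeMap::new();
    for (tensor, shard) in weight_map {
        if tensor.starts_with(prefix) {
            // One more tensor with this prefix lives in `shard`.
            *per_shard.entry(*shard).or_insert(0) += 1;
        }
    }
    per_shard
}
```

Calling it with the prefix `model.visual.` on the full index should return two keys whose counts sum to 333, if the table above is right.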
## Vision tower architecture (`model.visual.*`)
From `config.json::vision_config`:
```
depth:                    27                 (transformer blocks)
hidden_size:              1152               (vision token dim)
num_heads:                16                 (per-block self-attention)
intermediate_size:        4304               (MLP hidden)
patch_size:               16                 (16×16 spatial patches)
temporal_patch_size:      2                  (video frame pairing; irrelevant for stills)
spatial_merge_size:       2                  (2×2 spatial merge in the merger → 4 patches/LM token)
num_position_embeddings:  2304               (learned pos embed slots — max patch sequence length)
in_channels:              3                  (RGB)
hidden_act:               gelu_pytorch_tanh  (GELU with tanh approximation, not exact GELU)
out_hidden_size:          5120               (= LM hidden_size, merger output dim)
deepstack_visual_indexes: []                 (no deep-stack visual indexes)
```
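For illustration, the values above could be carried in a plain Rust struct with a few derived quantities. This is a sketch only: the field names mirror the JSON keys and are not necessarily those of the crate's real `VisionConfig`, and `qwen3_6_27b`, `head_dim`, `qkv_out` and `patches_per_token` are hypothetical helpers.

```rust
/// Hypothetical mirror of `config.json::vision_config`.
#[derive(Debug, Clone, PartialEq)]
struct VisionConfig {
    depth: usize,
    hidden_size: usize,
    num_heads: usize,
    intermediate_size: usize,
    patch_size: usize,
    temporal_patch_size: usize,
    spatial_merge_size: usize,
    num_position_embeddings: usize,
    in_channels: usize,
    out_hidden_size: usize,
}

impl VisionConfig {
    /// Values copied from the config listing above.
    fn qwen3_6_27b() -> Self {
        Self {
            depth: 27,
            hidden_size: 1152,
            num_heads: 16,
            intermediate_size: 4304,
            patch_size: 16,
            temporal_patch_size: 2,
            spatial_merge_size: 2,
            num_position_embeddings: 2304,
            in_channels: 3,
            out_hidden_size: 5120,
        }
    }

    /// Per-head width of the vision self-attention.
    fn head_dim(&self) -> usize {
        self.hidden_size / self.num_heads
    }

    /// Output width of the fused QKV projection.
    fn qkv_out(&self) -> usize {
        3 * self.hidden_size
    }

    /// Number of patches the merger folds into one LM token.
    fn patches_per_token(&self) -> usize {
        self.spatial_merge_size * self.spatial_merge_size
    }
}
```

With these values `head_dim()` is 72, `qkv_out()` is 3456, and `patches_per_token()` is 4.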
### Module inventory (per-block and global)
Global:
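- `model.visual.patch_embed.proj.{weight, bias}` — Patch embedding (3 → 1152, kernel 16×16, stride 16). Turns image patches into tokens. The published weight is 5D, `(1152, 3, 2, 16, 16)` (a Conv3d kernel with `temporal_patch_size = 2`); for static images Stage A sum-collapses the temporal axis and applies it as a Conv2d.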
- `model.visual.pos_embed.weight` — Learned position embedding, shape `(2304, 1152)`.
- `model.visual.merger.{norm, linear_fc1, linear_fc2}` — The projector that merges 2×2 patches and projects to LM hidden_size (1152 → 5120). All weights have biases.
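The Stage A commit message notes that the published `patch_embed.proj.weight` is 5D, `(1152, 3, 2, 16, 16)`, and that the temporal axis is sum-collapsed for static images. A minimal sketch of that fold on a flat row-major buffer follows. It assumes the still frame would otherwise be repeated across both temporal slots, which is what makes summing the two kernel slices equivalent; `fold_temporal_axis` is a hypothetical name, and the real code works on candle tensors rather than `Vec<f32>`.

```rust
/// Sum a Conv3d kernel laid out row-major as (out, in, t, kh, kw) over its
/// temporal axis, giving a Conv2d kernel laid out as (out, in, kh, kw).
fn fold_temporal_axis(
    weight: &[f32],
    out_c: usize,
    in_c: usize,
    t: usize,
    kh: usize,
    kw: usize,
) -> Vec<f32> {
    assert_eq!(
        weight.len(),
        out_c * in_c * t * kh * kw,
        "weight length does not match the 5D shape"
    );
    let plane = kh * kw;
    let mut folded = vec![0.0f32; out_c * in_c * plane];
    // (out, in) pairs are contiguous, so they can be walked as one index.
    for pair in 0..(out_c * in_c) {
        for ti in 0..t {
            let src = (pair * t + ti) * plane;
            let dst = pair * plane;
            for p in 0..plane {
                folded[dst + p] += weight[src + p];
            }
        }
    }
    folded
}
```

For the shapes above this turns 1152·3·2·16·16 values into 1152·3·16·16. Video input would still need a real Conv3d, as the commit message says.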
Per block (×27, named `model.visual.blocks.{0..26}`):
- `norm1.{weight, bias}` — **LayerNorm** before attention (with bias — not RmsNorm).
- `attn.qkv.{weight, bias}` — Fused QKV linear (1152 → 3·1152 = 3456).
- `attn.proj.{weight, bias}` — Attention output projection (1152 → 1152).
- `norm2.{weight, bias}` — LayerNorm before MLP.
- `mlp.linear_fc1.{weight, bias}` — MLP up-projection (1152 → 4304).
- `mlp.linear_fc2.{weight, bias}` — MLP down-projection (4304 → 1152).
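As a cross-check, the inventory above can be counted and compared with the 333 `model.visual.*` tensors reported by the index. The sketch assumes every listed module carries exactly one weight and one bias, except `pos_embed`, which has a weight only; `vision_tensor_count` is a hypothetical helper.

```rust
/// Number of `model.visual.*` tensors implied by the module inventory.
fn vision_tensor_count(depth: usize) -> usize {
    // norm1, attn.qkv, attn.proj, norm2, mlp.linear_fc1, mlp.linear_fc2,
    // each with weight + bias.
    let per_block = 6 * 2;
    let patch_embed = 2; // proj.weight + proj.bias
    let pos_embed = 1; // weight only
    let merger = 3 * 2; // norm, linear_fc1, linear_fc2, each weight + bias
    depth * per_block + patch_embed + pos_embed + merger
}
```

For `depth = 27` this gives 27·12 + 9 = 333, which agrees with the index count.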
Pattern matches a standard ViT block with **pre-norm** layout (norm → attn → residual, norm → MLP → residual). Activation between fc1/fc2 is GELU-tanh-approx per `hidden_act`. No attention masking inside the vision tower (all patches attend to each other).
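The two non-obvious primitives in this block are LayerNorm with bias and the tanh-approximated GELU. A scalar sketch of both, in plain Rust rather than candle; the function names are hypothetical, and `eps` is a free parameter here because the document does not state the model's value.

```rust
/// GELU with the tanh approximation (`gelu_pytorch_tanh`):
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu_tanh(x: f32) -> f32 {
    let k = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (k * (x + 0.044_715 * x * x * x)).tanh())
}

/// LayerNorm over one token vector: subtract the mean, divide by the
/// standard deviation, then apply the learned scale and bias.
/// (RmsNorm, by contrast, skips the mean subtraction and has no bias.)
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(weight)
        .zip(bias)
        .map(|((v, w), b)| (v - mean) * inv_std * w + b)
        .collect()
}
```

A pre-norm block then computes `x + attn(layer_norm(x))` followed by `x + mlp(layer_norm(x))`, with `gelu_tanh` applied between `linear_fc1` and `linear_fc2`.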
### Forward signature (target)
```
VisionTower::forward(
    patches: Tensor [N, in_channels, patch_size, patch_size],  # CPU-preprocessed RGB float patches
    grid_thw: Option<(usize, usize, usize)>,                    # (t, h, w) patch grid for position lookup
) -> Tensor [N / (spatial_merge_size²), out_hidden_size]       # = (N/4, 5120) for static images
```
Note: the merger consumes 4 spatially-adjacent patches and emits 1 LM token. So an image producing 64×64 = 4096 patches yields 1024 LM-side image tokens.
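The patch and token arithmetic can be sketched as a small helper. `image_token_count` is a hypothetical name, and it assumes the image has already been resized so that both sides are multiples of `patch_size × spatial_merge_size`.

```rust
/// Patch grid and LM-side token count for an image of `height` x `width`
/// pixels. Returns `(grid_h, grid_w, lm_tokens)`, or `None` when a side is
/// zero or not a multiple of `patch_size * merge_size`.
fn image_token_count(
    height: usize,
    width: usize,
    patch_size: usize,
    merge_size: usize,
) -> Option<(usize, usize, usize)> {
    let unit = patch_size * merge_size;
    if unit == 0 || height == 0 || width == 0 {
        return None;
    }
    if height % unit != 0 || width % unit != 0 {
        return None;
    }
    let (grid_h, grid_w) = (height / patch_size, width / patch_size);
    // Each merge_size x merge_size block of patches becomes one LM token.
    let lm_tokens = (grid_h / merge_size) * (grid_w / merge_size);
    Some((grid_h, grid_w, lm_tokens))
}
```

With `patch_size = 16` and `merge_size = 2`, a 1024×1024 input gives a 64×64 grid and 1024 tokens, and a 448×448 input gives a 28×28 grid and 196 tokens.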
## Image preprocessor (`preprocessor_config.json`)
```json
{
  "size": { "longest_edge": 16777216, "shortest_edge": 65536 },
  "patch_size": 16,
  "temporal_patch_size": 2,
  "merge_size": 2,
  "image_mean": [0.5, 0.5, 0.5],
  "image_std": [0.5, 0.5, 0.5],
  "processor_class": "Qwen3VLProcessor",
  "image_processor_type": "Qwen2VLImageProcessorFast"
}
```
Reading:
- `image_mean = image_std = 0.5` → normalisation is simply `(x/255 - 0.5) / 0.5 = 2*x/255 - 1`, mapping `[0, 255]` → `[-1, 1]`. No ImageNet-style mean/std.
- `size.{shortest_edge, longest_edge}` are **pixel counts**, not edge lengths. The `Qwen2VLImageProcessorFast` recipe picks a resolution within `[65,536 = 256², 16,777,216 = 4096²]` total pixels, snapping `h` and `w` to multiples of `patch_size × spatial_merge_size = 32` pixels.
- Stage A ships **fixed resolution**: pick a target pixel count (e.g. 448×448 = 200,704 px → 28×28 patches → 14×14 LM tokens after merger). Variable resolution deferred to issue [#14](https://git.lair.cafe/helexa/cortex/issues/14).
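A minimal sketch of the normalisation step, assuming the decoded and resized image arrives as packed RGB8 in `(H, W, 3)` order and must leave as planar `(3, H, W)` f32. `normalize_to_chw` is a hypothetical name, not the crate's `preprocess` function.

```rust
/// Convert packed RGB8 pixels (H, W, 3) into planar (3, H, W) f32 values
/// in [-1, 1], using mean = std = 0.5.
fn normalize_to_chw(rgb: &[u8], height: usize, width: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), height * width * 3, "expected packed RGB8 input");
    let plane = height * width;
    let mut out = vec![0.0f32; 3 * plane];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // (x/255 - 0.5) / 0.5 simplifies to 2*x/255 - 1.
            out[c * plane + i] = 2.0 * f32::from(px[c]) / 255.0 - 1.0;
        }
    }
    out
}
```

A black pixel maps to -1.0 in every channel and a white pixel to 1.0, matching the `[-1, 1]` range described above.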
## Chat template (`chat_template.jinja`)
Image insertion (lines 8–18 of the template):
```jinja
{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
...
{{- '<|vision_start|><|image_pad|><|vision_end|>' }}
```
Per image, the template emits **one `<|image_pad|>` token** flanked by `<|vision_start|>` and `<|vision_end|>` sentinels. The runtime must:
1. Render the template (preserving the single `<|image_pad|>` per image).
2. For each image, replace its single `<|image_pad|>` with N copies, where N is the number of LM tokens that image produces after the vision tower + merger (= `patches / spatial_merge_size²`).
3. Tokenize the expanded string → `input_ids`.
4. At forward time, locate positions where `input_ids == image_token_id` (248056) and splice in the vision tower's merger output.
Token IDs (top of `config.json`):
- `vision_start_token_id`: 248053
- `vision_end_token_id`: 248054
- `image_token_id`: 248056
- `video_token_id`: 248057 (out of scope)
- `bos_token_id`: 248044
- `eos_token_id`: 248044, 248046 (per `generation_config.json`)
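Steps 2 and 4 of the runtime procedure can be sketched with plain string and slice operations. This is an illustration under assumptions, not the Stage B implementation: `expand_image_pads` and `image_token_positions` are hypothetical names, and the real path also has to tokenize the expanded string in between.

```rust
const IMAGE_PAD: &str = "<|image_pad|>";
const IMAGE_TOKEN_ID: u32 = 248056;

/// Step 2: replace the i-th `<|image_pad|>` in the rendered prompt with
/// `tokens_per_image[i]` copies. Errors if the counts do not line up.
fn expand_image_pads(rendered: &str, tokens_per_image: &[usize]) -> Result<String, String> {
    let pieces: Vec<&str> = rendered.split(IMAGE_PAD).collect();
    let found = pieces.len() - 1; // split always yields at least one piece
    if found != tokens_per_image.len() {
        return Err(format!(
            "template has {found} image placeholders but {} images were supplied",
            tokens_per_image.len()
        ));
    }
    let mut out = String::with_capacity(rendered.len());
    for (i, piece) in pieces.iter().enumerate() {
        out.push_str(piece);
        // Every piece except the last is followed by one placeholder.
        if let Some(&n) = tokens_per_image.get(i) {
            out.push_str(&IMAGE_PAD.repeat(n));
        }
    }
    Ok(out)
}

/// Step 4: positions in `input_ids` where image embeddings get spliced in.
fn image_token_positions(input_ids: &[u32]) -> Vec<usize> {
    input_ids
        .iter()
        .enumerate()
        .filter_map(|(i, &id)| (id == IMAGE_TOKEN_ID).then_some(i))
        .collect()
}
```

For a 448×448 image, `tokens_per_image` would hold 196, and the number of positions returned by `image_token_positions` should equal the number of rows in the merger output.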
System messages cannot contain images (template raises). Other template-side details:
- `add_vision_id` (jinja arg, default false): emits `'Picture N: '` prefixes when true.
- `preserve_thinking` (jinja arg, default false): keeps `<think>` blocks from prior assistant turns in the rendered prompt.
- `enable_thinking` (jinja arg, default true): emits `<think>\n` (or skips it) at the end of the generation prompt.
The existing chat-template renderer in `crates/neuron/src/harness/chat_template.rs` already passes `MessageContent::Parts` to the Jinja context as a `Value::Array`; the template's `is iterable` branch (line 6 of the template) handles them. **The path is structurally in place** — Stage B just needs to do the `<|image_pad|>` expansion + token-position-aware splice.
## LM-side considerations
The LM's RoPE config uses **multi-axis RoPE (MRoPE)**:
```
rope_parameters: {
  mrope_interleaved: true,
  mrope_section: [11, 11, 10], # temporal (text) + height + width components
  partial_rotary_factor: 0.25,
  rope_theta: 10000000,
  rope_type: "default"
}
```
MRoPE encodes spatial position alongside text position so the LM attention layers can reason about image-token spatial structure. The LM's existing forward path *may or may not* already implement this — the qwen3_5 module's doc-comment notes "numerical correctness vs the reference Python is not yet validated." Verifying MRoPE behaviour in the language model is out of Stage A scope (vision tower only) but will be required in Stage B (LM splice) and is tracked under the numerical-validation issue [#15](https://git.lair.cafe/helexa/cortex/issues/15).
`max_position_embeddings = 262144` (256 K context), so context-length limits are not a constraint for vision.
## Iteration target decision
The vision tower has its own self-contained weight tree and is small (333 tensors in 2 shards, hidden_size 1152 vs the LM's 5120). For Stage A specifically (vision-tower-only smoke), we **don't need a smaller iteration model**. We can:
- Build the Rust `VisionTower` struct against the spec above.
- Run unit tests with random tensor weights matching the exact shapes → assert forward produces correct output shape with finite values.
- Optionally: a CUDA-integration test that loads just the 2 vision shards from beast's cache (or on a smaller GPU like quadbrat's Ampere) and runs encode on a real image. Doesn't require loading the 27B LM at all.
This sidesteps the "develop against a smaller VL model" question for Stage A. Stage B (LM splice → end-to-end chat with vision) is where iteration speed becomes pressing; revisit there. The default scope pick 2a (smaller iteration model) is therefore deferred to Stage B planning — issue [#13](https://git.lair.cafe/helexa/cortex/issues/13) covers deployment validation regardless.
## Concrete Stage A1+ inputs
- Add deps to `crates/neuron/Cargo.toml`:
  - `image = "0.25"`
  - `base64 = "0.22"`
- Stage A2 preprocessor target resolution (fixed): **448×448 → 28×28 patches → 14×14 = 196 image tokens per image**. This keeps the patch count low for cheap tests while staying above the preprocessor's 65,536-pixel minimum (448×448 = 200,704 px).
- Stage A3 module structure: one `VisionTower` struct holding `patch_embed: Conv2d`, `pos_embed: Embedding`, `blocks: Vec<VisionBlock>`, `merger: Merger`. `VisionBlock` carries `norm1`, `norm2`, `attn`, `mlp`. Hand-roll using candle primitives.
- Stage A4 weight loading: extend `Qwen3_5ForCausalLM::new()` to construct `Some(VisionTower::new(vb.pp("model.visual"), config))` when `vision_config` is present in the parsed config.
- Stage A5 worker job: `Job::EncodeImage { handle, patches: Vec<f32>, patch_shape: (usize, usize, usize, usize, usize), reply: oneshot<Result<Vec<f32>>> }`. Patch shape = `(N, C, T, H, W)` where T=1 for static images.
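The Stage A5 job shape can be sketched with the standard library alone. Several things here are assumptions made for illustration: `std::sync::mpsc` stands in for the oneshot reply channel, `ArchHandle` is a plain `u64`, errors are `String`s, and the encode step is a placeholder that returns zeros of the expected length instead of running a vision tower.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the crate's arch handle type.
type ArchHandle = u64;

enum Job {
    EncodeImage {
        handle: ArchHandle,
        patches: Vec<f32>,
        /// (N, C, T, H, W); T = 1 for static images.
        patch_shape: (usize, usize, usize, usize, usize),
        reply: mpsc::Sender<Result<Vec<f32>, String>>,
    },
    Shutdown,
}

/// Spawn a worker that owns the "loaded" handles and answers jobs.
fn spawn_worker(loaded: Vec<ArchHandle>) -> (mpsc::Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let join = thread::spawn(move || {
        for job in rx {
            match job {
                Job::EncodeImage { handle, patches, patch_shape, reply } => {
                    let (n, c, t, h, w) = patch_shape;
                    let result = if !loaded.contains(&handle) {
                        Err(format!("unknown arch handle {handle}"))
                    } else if patches.len() != n * c * t * h * w {
                        Err("patch buffer does not match patch_shape".to_string())
                    } else {
                        // Placeholder encode: (N / 4) rows of 5120 zeros.
                        Ok(vec![0.0; (n / 4) * 5120])
                    };
                    // The caller may have given up waiting; ignore send errors.
                    let _ = reply.send(result);
                }
                Job::Shutdown => break,
            }
        }
    });
    (tx, join)
}

/// Submit one encode job and block until the worker replies.
fn encode_image(
    tx: &mpsc::Sender<Job>,
    handle: ArchHandle,
    patches: Vec<f32>,
    patch_shape: (usize, usize, usize, usize, usize),
) -> Result<Vec<f32>, String> {
    let (reply, rx) = mpsc::channel();
    tx.send(Job::EncodeImage { handle, patches, patch_shape, reply })
        .map_err(|_| "worker is gone".to_string())?;
    rx.recv().map_err(|_| "worker dropped the reply".to_string())?
}
```

Only flat `Vec<f32>` buffers cross the channel in either direction, which is one way to keep device tensors inside the worker thread.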
## What this doc does NOT settle (deferred to issues)
- Numerical correctness of `VisionTower` output vs Python transformers
  → issue [#15](https://git.lair.cafe/helexa/cortex/issues/15).
- Variable image resolution
  → issue [#14](https://git.lair.cafe/helexa/cortex/issues/14).
- TP-vision (multi-rank vision tower)
  → issue [#12](https://git.lair.cafe/helexa/cortex/issues/12).
- 27B production deployment
  → issue [#13](https://git.lair.cafe/helexa/cortex/issues/13).