Vision: dynamic image resolution (Qwen-VL min/max pixels) #14


grenade · 2026-06-01T13:18:32Z

Context

Deferred during planning of the initial vision capability (umbrella:
#3). Stages A–C ship fixed-resolution preprocessing; this issue
covers Qwen-VL's native variable-resolution behaviour. Refs:
`~/.claude/plans/foamy-twirling-catmull.md`.

Problem

Qwen2-VL / Qwen3-VL natively support variable image sizes via
`min_pixels` and `max_pixels` bounds. The reference Python
`Qwen2VLImageProcessor` resizes each image to patch-aligned dimensions
that keep its aspect ratio and fit the pixel count within those bounds,
so the patch count varies per image. This matters for quality:

- Landscape images get more horizontal patches, portrait images get
  more vertical, preserving aspect-ratio-relevant detail.
- Documents and OCR-style content benefit from higher pixel counts;
  thumbnails downsample sensibly.

Stage A ships fixed resolution (e.g. 448×448 → 256 patches) to keep
the preprocessor and patch-count math simple. This issue tracks the
upgrade to dynamic resolution to match the reference and avoid
quality regressions on non-square input.
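
A minimal sketch of how the variable resolution could be computed, modelled on my reading of the reference `smart_resize`. The function names, the `Option` return, and the use of `f64::round` (Python's `round` breaks ties to even, so exact half-way cases can differ) are assumptions for illustration, not the harness's actual API. The caller supplies `factor` as patch size times merge size, which is 28 for Qwen2-VL and may differ for Qwen3.6.

```rust
/// Hypothetical port of the reference resize rule (not the harness's real API).
/// Returns `(height, width)`, each a multiple of `factor`, with the area
/// pushed into `[min_pixels, max_pixels]` while roughly keeping aspect ratio.
/// Returns `None` for empty images or aspect ratios above 200:1, which the
/// reference rejects with an error.
pub fn smart_resize(
    height: u32,
    width: u32,
    factor: u32,
    min_pixels: u64,
    max_pixels: u64,
) -> Option<(u32, u32)> {
    if height == 0 || width == 0 || factor == 0 {
        return None;
    }
    let (h, w, f) = (height as f64, width as f64, factor as f64);
    if h.max(w) / h.min(w) > 200.0 {
        return None;
    }
    // Snap each side to the nearest multiple of `factor` (at least one cell).
    let mut h_bar = ((h / f).round() as u32).max(1) * factor;
    let mut w_bar = ((w / f).round() as u32).max(1) * factor;
    let area = h_bar as u64 * w_bar as u64;
    if area > max_pixels {
        // Too large: shrink both sides by the same ratio, rounding down.
        let beta = (h * w / max_pixels as f64).sqrt();
        h_bar = ((h / beta / f).floor() as u32).max(1) * factor;
        w_bar = ((w / beta / f).floor() as u32).max(1) * factor;
    } else if area < min_pixels {
        // Too small: grow both sides by the same ratio, rounding up.
        let beta = (min_pixels as f64 / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() as u32 * factor;
        w_bar = (w * beta / f).ceil() as u32 * factor;
    }
    Some((h_bar, w_bar))
}

/// Number of merged visual tokens (one per `factor` x `factor` cell) for an
/// image that has already been resized by `smart_resize`.
pub fn merged_token_count(height: u32, width: u32, factor: u32) -> u64 {
    (height / factor) as u64 * (width / factor) as u64
}
```

With `factor = 28`, `min_pixels = 56 * 56` and `max_pixels = 14 * 14 * 4 * 1280` (which I believe are the reference function's defaults), a 1280×720 landscape frame resizes to 1288×728 and its portrait counterpart to 728×1288. Both yield 1196 merged tokens but with transposed grids (26×46 versus 46×26 in height×width), whereas the fixed 448×448 path yields 256.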

Scope

- Read `min_pixels` / `max_pixels` (or the equivalent keys in Qwen3.6,
  confirmed at Stage A0) from `preprocessor_config.json`.
- Port the resolution-selection algorithm from Python
  `Qwen2VLImageProcessor` (or equivalent for Qwen3.6) into
  `crates/neuron/src/harness/preprocess.rs`.
- Update prompt-side patch accounting in `build_prompt_for_request`
  so the per-image `<|image_pad|>` expansion uses the actual patch
  count for that image rather than a fixed constant.
- Update the `chat_template.rs` invocation so the template receives the
  computed `grid_thw` (temporal-height-width tuple) per image, which
  is what the Qwen-VL templates branch on.
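
As a rough illustration of the per-image patch accounting, the sketch below expands each `<|image_pad|>` placeholder to the merged token count implied by that image's `grid_thw`. It assumes one placeholder per image in prompt order and a merge size of 2 (the Qwen2-VL default, as far as I know). `expand_image_pads` and its error strings are hypothetical and do not reflect the real `build_prompt_for_request` signature.

```rust
const IMAGE_PAD: &str = "<|image_pad|>";

/// Hypothetical helper: replace the i-th `<|image_pad|>` in `prompt` with
/// `t * h * w / merge_size^2` copies, where `(t, h, w)` is the i-th entry of
/// `grids`, measured in patches before merging.
pub fn expand_image_pads(
    prompt: &str,
    grids: &[(u32, u32, u32)],
    merge_size: u32,
) -> Result<String, String> {
    if merge_size == 0 {
        return Err("merge_size must be non-zero".to_string());
    }
    let merge_len = merge_size as usize * merge_size as usize;
    let mut out = String::with_capacity(prompt.len());
    let mut rest = prompt;
    let mut remaining = grids.iter();

    while let Some(pos) = rest.find(IMAGE_PAD) {
        let &(t, h, w) = remaining
            .next()
            .ok_or_else(|| "more placeholders than images".to_string())?;
        // Each merged token covers a merge_size x merge_size block of patches.
        let tokens = t as usize * h as usize * w as usize / merge_len;
        out.push_str(&rest[..pos]);
        out.push_str(&IMAGE_PAD.repeat(tokens));
        rest = &rest[pos + IMAGE_PAD.len()..];
    }
    if remaining.next().is_some() {
        return Err("more images than placeholders".to_string());
    }
    out.push_str(rest);
    Ok(out)
}
```

Returning an error on a placeholder/image mismatch, rather than silently padding, is a design choice made here so that accounting bugs would surface before they skew `prompt_tokens`.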

Acceptance

- A landscape and a portrait image of the same nominal subject
  produce different patch counts, both reflected in `prompt_tokens`.
- Quality benchmark from Stage D measurably improves over the
  fixed-resolution baseline on documents / OCR-style content.

Blocked by

Stage B of the vision plan must ship first; this is a refinement
that lives on top of the working fixed-resolution path.

References

- Plan: `~/.claude/plans/foamy-twirling-catmull.md`
- Umbrella: Image content (`image_url`) is dropped; multimodal chat requests are processed as text-only (#3)
- Reference impl:
  `transformers/models/qwen2_vl/image_processing_qwen2_vl.py` in the
  Hugging Face `transformers` repository (`smart_resize`).
- Critical files:
  `crates/neuron/src/harness/preprocess.rs`,
  `crates/neuron/src/harness/chat_template.rs`,
  `crates/neuron/src/harness/candle.rs::build_prompt_for_request`.

grenade referenced this issue from a commit

2026-06-02 08:40:50 +00:00

feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6

grenade referenced this issue from a commit

2026-06-02 12:33:04 +00:00

feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6

grenade referenced this issue

2026-06-04 10:36:01 +00:00

Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16

grenade referenced this issue from a commit

2026-06-04 19:47:31 +00:00

feat(neuron): dynamic-resolution images via Qwen smart_resize (#14)