Vision: dynamic image resolution (Qwen-VL min/max pixels) #14

Open
opened 2026-06-01 13:18:32 +00:00 by grenade · 0 comments

Context

Deferred during planning of the initial vision capability (umbrella:
#3). Stages A–C ship fixed-resolution preprocessing; this issue
covers Qwen-VL's native variable-resolution behaviour. Refs:
~/.claude/plans/foamy-twirling-catmull.md.

Problem

Qwen2-VL / Qwen3-VL natively support variable image sizes via
min_pixels and max_pixels bounds. The reference Python
Qwen2VLImageProcessor picks a bucket within those bounds based on
the image's aspect ratio and produces a variable patch count per
image. This matters for quality:

  • Landscape images get more horizontal patches, portrait images get
    more vertical, preserving aspect-ratio-relevant detail.
  • Documents and OCR-style content benefit from higher pixel counts;
    thumbnails downsample sensibly.

Stage A ships fixed resolution (e.g. 448×448 → 256 patches) to keep
the preprocessor and patch-count math simple. This issue tracks the
upgrade to dynamic resolution to match the reference and avoid
quality regressions on non-square input.
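
The resolution selection described above can be sketched roughly as follows. This is a minimal sketch modelled on the reference smart_resize as generally understood, not a verified port: the names PixelBounds and merged_token_count are hypothetical, the factor of 28 (patch size 14 × merge size 2) is an assumed Qwen2-VL default, and Rust's round() rounds halves away from zero where Python's round() rounds halves to even, so exact ties may differ from the reference.

```rust
/// Hypothetical container for the `min_pixels` / `max_pixels` bounds.
#[derive(Debug, Clone, Copy)]
pub struct PixelBounds {
    pub min_pixels: u64,
    pub max_pixels: u64,
}

/// Pick a resize target (height, width) whose sides are multiples of
/// `factor` and whose area is pulled inside `bounds`.
/// Returns None for degenerate input or extreme aspect ratios
/// (the 200:1 cut-off is an assumption taken from the reference).
pub fn smart_resize(
    height: u32,
    width: u32,
    factor: u32,
    bounds: PixelBounds,
) -> Option<(u32, u32)> {
    if height == 0 || width == 0 || factor == 0 {
        return None;
    }
    let (h, w, f) = (height as f64, width as f64, factor as f64);
    if h.max(w) / h.min(w) > 200.0 {
        return None;
    }
    let min_px = bounds.min_pixels as f64;
    let max_px = bounds.max_pixels as f64;

    // Snap each side to the nearest multiple of `factor` (at least one).
    let mut h_bar = ((h / f).round() * f).max(f);
    let mut w_bar = ((w / f).round() * f).max(f);

    if h_bar * w_bar > max_px {
        // Too large: shrink both sides by the same ratio, rounding down.
        let beta = (h * w / max_px).sqrt();
        h_bar = ((h / beta / f).floor() * f).max(f);
        w_bar = ((w / beta / f).floor() * f).max(f);
    } else if h_bar * w_bar < min_px {
        // Too small: grow both sides by the same ratio, rounding up.
        let beta = (min_px / (h * w)).sqrt();
        h_bar = (h * beta / f).ceil() * f;
        w_bar = (w * beta / f).ceil() * f;
    }
    Some((h_bar as u32, w_bar as u32))
}

/// Image tokens after patch merging, given a resized size that is a
/// multiple of `factor` on both sides.
pub fn merged_token_count(height: u32, width: u32, factor: u32) -> u32 {
    (height / factor) * (width / factor)
}
```

Under these assumptions, with bounds of 56·56 to 1280·28·28 pixels, a 448×448 input stays at 448×448 (16 × 16 = 256 merged tokens, matching the Stage A figure), while a 1080×1920 (height × width) input is scaled down to 728×1316 (26 × 47 = 1222 merged tokens). The transposed input yields the transposed size.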

Scope

  • Read min_pixels / max_pixels (or the equivalent keys in Qwen3.6
    — confirmed at Stage A0) from preprocessor_config.json.
  • Port the bucket-selection algorithm from Python
    Qwen2VLImageProcessor (or equivalent for Qwen3.6) into
    crates/neuron/src/harness/preprocess.rs.
  • Update prompt-side patch accounting in build_prompt_for_request
    so the per-image <|image_pad|> expansion uses the actual patch
    count for that image rather than a fixed constant.
  • Update chat_template.rs invocation so the template receives the
    computed grid_thw (temporal-height-width tuple) per image, which
    is what the Qwen-VL templates branch on.
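
The prompt-side accounting in the last two items could look something like the sketch below. It assumes one grid_thw per image with t = 1 for still images, a patch size of 14 and a merge size of 2 (Qwen2-VL defaults as generally understood), and that each image contributes t·h·w / merge_size² pad tokens. GridThw and expand_image_pads are hypothetical names, not existing items in preprocess.rs or candle.rs.

```rust
/// Hypothetical (temporal, height, width) patch grid for one image.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GridThw {
    pub t: usize,
    pub h: usize,
    pub w: usize,
}

impl GridThw {
    /// Grid for a still image already resized to multiples of `patch_size`.
    /// Returns None if the size is not patch-aligned.
    pub fn for_image(
        resized_height: usize,
        resized_width: usize,
        patch_size: usize,
    ) -> Option<Self> {
        if patch_size == 0
            || resized_height % patch_size != 0
            || resized_width % patch_size != 0
        {
            return None;
        }
        Some(Self {
            t: 1,
            h: resized_height / patch_size,
            w: resized_width / patch_size,
        })
    }

    /// Pad tokens this image occupies in the prompt after merging
    /// `merge_size` x `merge_size` patches. `merge_size` must be non-zero.
    pub fn pad_tokens(&self, merge_size: usize) -> usize {
        self.t * self.h * self.w / (merge_size * merge_size)
    }
}

pub const IMAGE_PAD: &str = "<|image_pad|>";

/// Replace the i-th single `<|image_pad|>` placeholder in `prompt` with
/// as many pad tokens as the i-th grid requires.
pub fn expand_image_pads(
    prompt: &str,
    grids: &[GridThw],
    merge_size: usize,
) -> Result<String, String> {
    if merge_size == 0 {
        return Err("merge_size must be non-zero".to_string());
    }
    let parts: Vec<&str> = prompt.split(IMAGE_PAD).collect();
    let placeholders = parts.len() - 1;
    if placeholders != grids.len() {
        return Err(format!(
            "prompt has {} image placeholders but {} grids were supplied",
            placeholders,
            grids.len()
        ));
    }
    let mut out = String::with_capacity(prompt.len());
    for (i, part) in parts.iter().enumerate() {
        out.push_str(part);
        // Every part except the last is followed by one image.
        if let Some(grid) = grids.get(i) {
            out.push_str(&IMAGE_PAD.repeat(grid.pad_tokens(merge_size)));
        }
    }
    Ok(out)
}
```

Returning an error on a placeholder/grid mismatch, rather than silently padding, is a design choice in this sketch: a wrong count would otherwise surface later as a mismatch between prompt_tokens and the vision embeddings.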

Acceptance

  • A landscape and a portrait image of the same nominal subject
    produce different patch counts, both reflected in prompt_tokens.
  • Quality benchmark from Stage D measurably improves over the
    fixed-resolution baseline on documents / OCR-style content.

Blocked by

Stage B of the vision plan must ship first; this is a refinement
that lives on top of the working fixed-resolution path.

References

  • Plan: ~/.claude/plans/foamy-twirling-catmull.md
  • Umbrella: #3
  • Reference impl:
    transformers/models/qwen2_vl/image_processing_qwen2_vl.py in the
    Python HF repo (smart_resize + select_best_resolution).
  • Critical files: crates/neuron/src/harness/preprocess.rs,
    crates/neuron/src/harness/chat_template.rs,
    crates/neuron/src/harness/candle.rs::build_prompt_for_request.

Reference: helexa/cortex#14