Vision: numerical validation against transformers reference #15

grenade · 2026-06-01T13:18:45Z

Context

This work was deferred during planning of the initial vision capability
(umbrella: #3). Stages A–D ship with only "loose" validation (coherent
image-grounded responses and a per-image quality benchmark) and no
rigorous numerical-correctness check against the reference Python
implementation. Refs: ~/.claude/plans/foamy-twirling-catmull.md.

Problem

The qwen3_5 arch module's own doc-comment already notes "numerical
correctness vs the reference Python is not yet validated." Adding a
vision tower stacks more hand-rolled tensor math on top of that.
Without a rigorous comparison fixture, a subtle numerical bug
(a wrong RoPE base for vision attention, an off-by-one in the patch
position embedding, a missing projector bias, etc.) could go
unnoticed and surface as gradual quality degradation rather than a
crash, which is exactly the failure mode that is hardest to debug.
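
To illustrate why a wrong RoPE base tends to degrade quality quietly rather than break outright, here is a minimal sketch using the standard RoPE inverse-frequency formula, base^(-2i/dim). The function names are hypothetical, and the two bases in the usage note are arbitrary example values, not the model's actual configuration.

```rust
/// Standard RoPE inverse frequencies: base^(-2i/dim) for each channel pair i.
fn rope_inv_freq(base: f64, dim: usize) -> Vec<f64> {
    (0..dim / 2)
        .map(|i| base.powf(-(2.0 * i as f64) / dim as f64))
        .collect()
}

/// Largest rotation-angle difference (radians) across channel pairs at
/// `position` when `wrong_base` is used in place of `right_base`.
fn max_angle_error(right_base: f64, wrong_base: f64, dim: usize, position: f64) -> f64 {
    rope_inv_freq(right_base, dim)
        .iter()
        .zip(rope_inv_freq(wrong_base, dim))
        .map(|(r, w)| (position * (r - w)).abs())
        .fold(0.0, f64::max)
}
```

With example bases of 10_000 and 1_000_000, channel pair 0 has inverse frequency 1 under both, and at position 0 no pair is rotated at all, so the angle error is zero there and grows linearly with position. Under these assumptions, a wrong base would leave some channels and positions looking correct while others drift, which is the kind of partial error a tolerance check against reference dumps is meant to expose.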

Scope

- Small Python harness in script/ that loads the same model via
  transformers, encodes a known image, and dumps:
  - Vision-tower output (pre-projector) for the first patch.
  - Projector output for the first patch.
  - LM-side input embeddings at image-token positions.
  - Final logits at the last text-token position for a fixed prompt.
- Companion Rust test in crates/neuron/tests/vision_numerical.rs
  that loads the same model, replays the same image+prompt, and
  asserts the produced tensors match within 1e-3 (tunable).
- Capture the reference dumps once and check them into a
  crates/neuron/tests/fixtures/vision/ directory rather than
  running Python in CI; document how to regenerate.
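
The comparison step of the Rust test could look roughly like the sketch below. It assumes the reference dumps are stored as raw little-endian f32 values and that 1e-3 is an absolute tolerance; the issue fixes neither, and `decode_f32_le` and `check_close` are hypothetical names, not existing helpers in the crate.

```rust
/// Decode a tensor dump stored as raw little-endian f32 values.
/// The on-disk layout is an assumption; the issue does not fix a fixture format.
fn decode_f32_le(bytes: &[u8]) -> Result<Vec<f32>, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!("dump length {} is not a multiple of 4", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Element-wise comparison with an absolute tolerance.
/// On failure, the message names the tensor and the first offending element.
fn check_close(name: &str, got: &[f32], want: &[f32], atol: f32) -> Result<(), String> {
    if got.len() != want.len() {
        return Err(format!(
            "{name}: length {} != reference length {}",
            got.len(),
            want.len()
        ));
    }
    for (i, (g, w)) in got.iter().zip(want).enumerate() {
        let diff = (g - w).abs();
        // NaN must fail explicitly: `diff > atol` is false for NaN.
        if diff.is_nan() || diff > atol {
            return Err(format!("{name}[{i}]: got {g}, want {w}, |diff| = {diff} > {atol}"));
        }
    }
    Ok(())
}
```

A test would decode each checked-in dump, run the Rust pipeline on the same image+prompt, and call `check_close` once per captured tensor (vision-tower output, projector output, input embeddings, final logits). Returning the tensor name and element index in the error keeps a failure traceable to a specific stage.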

Acceptance

- A reference fixture exists for the iteration model and for
  Qwen3.6-27B (the latter once deployment lands).
- The Rust numerical test passes against the iteration model and
  surfaces a clear failure when a deliberately-mutated weight or
  off-by-one is introduced.
- Documented procedure for capturing new reference dumps when the
  model is retrained or the iteration target changes.
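
The second criterion amounts to a negative control: the check should pass on the unmodified path and fail on a deliberate mutation. A minimal sketch on synthetic data follows; the tensor shape, the values, and all function names are illustrative assumptions, not the real model or test code.

```rust
/// Largest absolute element-wise difference; infinity on a length mismatch or NaN.
fn max_abs_diff(got: &[f32], want: &[f32]) -> f32 {
    if got.len() != want.len() {
        return f32::INFINITY;
    }
    got.iter().zip(want).fold(0.0f32, |worst, (g, w)| {
        let d = (g - w).abs();
        if d.is_nan() { f32::INFINITY } else { worst.max(d) }
    })
}

/// Row `patch` of a row-major [num_patches, dim] tensor.
fn patch_row(tensor: &[f32], dim: usize, patch: usize) -> &[f32] {
    &tensor[patch * dim..(patch + 1) * dim]
}

/// Synthetic stand-in for a vision-tower output: 3 patches x 4 channels.
fn synthetic_tower_output() -> Vec<f32> {
    (0..12).map(|i| (i / 4) as f32 * 10.0 + (i % 4) as f32 * 0.1).collect()
}

/// Returns (label, max |diff|) for the unmodified path and two deliberate mutations.
fn control_deviations() -> Vec<(&'static str, f32)> {
    let dim = 4;
    let tower = synthetic_tower_output();
    let reference = patch_row(&tower, dim, 0).to_vec();

    // Correct path: first patch, untouched.
    let correct = patch_row(&tower, dim, 0);
    // Mutation 1: off-by-one in the patch index.
    let shifted = patch_row(&tower, dim, 1);
    // Mutation 2: one perturbed value, standing in for a mutated weight.
    let mut perturbed = reference.clone();
    perturbed[2] += 0.01;

    vec![
        ("correct", max_abs_diff(correct, &reference)),
        ("off-by-one patch index", max_abs_diff(shifted, &reference)),
        ("perturbed value", max_abs_diff(&perturbed, &reference)),
    ]
}
```

In a real test, the "correct" deviation would be asserted to stay within the tolerance and each mutated deviation to exceed it, so that a tolerance loosened too far shows up as a failing negative control.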

Blocked by

Stages A–C of the vision plan. The Stage D quality benchmark is a
coarser substitute; this issue tracks the rigorous version.

References

- Plan: ~/.claude/plans/foamy-twirling-catmull.md
- Umbrella: Image content (`image_url`) is dropped — multimodal chat requests are processed as text-only (#3)
- Existing arch doc-comment note:
  crates/neuron/src/harness/arch/qwen3_5/mod.rs:1-65

grenade referenced this issue from a commit

2026-06-02 08:40:50 +00:00

feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6

grenade referenced this issue from a commit

2026-06-02 12:33:04 +00:00

feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6

grenade referenced this issue

2026-06-04 10:36:01 +00:00

Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16

grenade referenced this issue from a commit

2026-06-04 19:50:07 +00:00

feat(neuron): operator pixel-budget env override + doc cleanup (#14 C5)