Vision: numerical validation against transformers reference #15

Open
opened 2026-06-01 13:18:45 +00:00 by grenade · 0 comments
Owner

Context

Deferred during planning of the initial vision capability (umbrella:
#3). Stages A–D ship with "loose" validation (coherent
image-grounded responses and a per-image quality benchmark) but no
rigorous numerical-correctness check against the reference Python
implementation. Refs: ~/.claude/plans/foamy-twirling-catmull.md.

Problem

The qwen3_5 arch module's own doc-comment already notes "numerical
correctness vs the reference Python is not yet validated." Adding a
vision tower stacks more hand-rolled tensor math on top of that.
Without a rigorous comparison fixture, a subtle numerical bug
(wrong RoPE base for vision attention, off-by-one in patch
position embedding, missing projector bias, etc.) could go
unnoticed and surface as gradual quality degradation rather than a
crash, which is exactly the failure mode that is hardest to debug.
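
To illustrate why this class of bug stays silent, the sketch below evaluates rotary angles under two different RoPE bases. It assumes the common formulation theta = position * base^(-2i/d). The head dimension, position count, and both base values are hypothetical and are not taken from the qwen3_5 module or any model config. Both bases yield finite, well-formed values, so nothing panics; only a comparison against a reference exposes the difference.

```rust
/// Rotary angle for `position` and frequency-pair index `pair` (0-based),
/// using the common RoPE formulation theta = position * base^(-2 * pair / head_dim).
/// This formulation is an assumption; the real vision tower may differ.
fn rope_angle(position: usize, pair: usize, head_dim: usize, base: f64) -> f64 {
    let exponent = -2.0 * pair as f64 / head_dim as f64;
    position as f64 * base.powf(exponent)
}

/// Largest absolute difference in cos(theta) between two RoPE bases,
/// scanned over `positions` positions and all `head_dim / 2` frequency pairs.
fn max_cos_drift(head_dim: usize, positions: usize, base_a: f64, base_b: f64) -> f64 {
    let mut worst = 0.0f64;
    for pos in 0..positions {
        for pair in 0..head_dim / 2 {
            let a = rope_angle(pos, pair, head_dim, base_a).cos();
            let b = rope_angle(pos, pair, head_dim, base_b).cos();
            worst = worst.max((a - b).abs());
        }
    }
    worst
}
```

Calling `max_cos_drift(64, 16, 10_000.0, 1_000_000.0)` returns a finite drift far above a 1e-3 tolerance, while identical bases give exactly zero. The wrong base produces plausible numbers rather than an error, which is why a reference comparison is needed to catch it.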

Scope

  • Small Python harness in script/ that loads the same model via
    transformers, encodes a known image, and dumps:
    • Vision-tower output (pre-projector) for the first patch.
    • Projector output for the first patch.
    • LM-side input embeddings at image-token positions.
    • Final logits at the last text-token position for a fixed prompt.
  • Companion Rust test in crates/neuron/tests/vision_numerical.rs
    that loads the same model, replays the same image+prompt, and
    asserts the produced tensors match within 1e-3 (tunable).
  • Capture the reference dumps once and check them into a
    crates/neuron/tests/fixtures/vision/ directory rather than
    running Python in CI; document how to regenerate.
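
A minimal sketch of the comparison side of the Rust test, under stated assumptions: the fixture format (a raw little-endian f32 dump per tensor), the helper names `decode_f32_le` and `check_close`, and the choice of an absolute rather than relative tolerance are all hypothetical. The issue only fixes the 1e-3 default and the fixture directory.

```rust
/// Decode a raw little-endian f32 dump into a flat vector.
/// The on-disk format is an assumption; the Python harness may choose another.
fn decode_f32_le(bytes: &[u8]) -> Result<Vec<f32>, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!("fixture length {} is not a multiple of 4", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Compare a produced tensor with its reference using an absolute tolerance.
/// On failure, report the worst element so the mismatch is easy to locate.
fn check_close(name: &str, got: &[f32], want: &[f32], atol: f32) -> Result<(), String> {
    if got.len() != want.len() {
        return Err(format!(
            "{name}: length {} != reference length {}",
            got.len(),
            want.len()
        ));
    }
    let mut worst = (0usize, 0.0f32);
    for (i, (g, w)) in got.iter().zip(want).enumerate() {
        let diff = (g - w).abs();
        // NaN never compares as "greater than", so reject it explicitly.
        if diff.is_nan() {
            return Err(format!("{name}: NaN at index {i} (got {g}, want {w})"));
        }
        if diff > worst.1 {
            worst = (i, diff);
        }
    }
    if worst.1 > atol {
        let (i, d) = worst;
        return Err(format!(
            "{name}: max |diff| {d:e} at index {i} (got {}, want {}) exceeds {atol:e}",
            got[i], want[i]
        ));
    }
    Ok(())
}
```

In a test, each dumped tensor would be read with `std::fs::read`, decoded, and passed through `check_close` with its own label (vision-tower output, projector output, image-token embeddings, final logits), so a failure names the first stage that diverges. Logits may need a looser tolerance than earlier stages because error accumulates through the LM; that is a judgment call the issue leaves open.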

Acceptance

  • A reference fixture exists for the iteration model and for
    Qwen3.6-27B (the latter once deployment lands).
  • The Rust numerical test passes against the iteration model and
    surfaces a clear failure when a deliberately-mutated weight or
    off-by-one is introduced.
  • Documented procedure for capturing new reference dumps when the
    model is retrained or the iteration target changes.
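
The mutation criterion above can be sketched on a toy scale. The example below uses a hypothetical 2x3 linear projector (y = W x + b) as a stand-in for the real one; all weights, inputs, and function names are made up for illustration. It captures a reference output, then measures how far the output moves when one weight is nudged or the bias is dropped.

```rust
/// Toy stand-in for a projector: y = W x + b, with W stored row-major.
/// The number of rows is taken from `bias`, the number of columns from `x`.
fn project(weight: &[f32], bias: &[f32], x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    bias.iter()
        .enumerate()
        .map(|(r, b)| {
            let row = &weight[r * cols..(r + 1) * cols];
            row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b
        })
        .collect()
}

/// Largest element-wise absolute difference between two equal-length tensors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "tensor lengths differ");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Returns (unmodified diff, diff with one mutated weight, diff with bias dropped),
/// each measured against a reference captured from the unmodified projector.
fn mutation_report() -> (f32, f32, f32) {
    let weight = vec![0.5, -0.25, 0.125, 1.0, 0.75, -0.5]; // 2 rows x 3 cols
    let bias = vec![0.01, -0.02];
    let x = vec![1.0, 2.0, 3.0];
    let reference = project(&weight, &bias, &x);

    // Replaying the unmodified computation reproduces the reference exactly.
    let clean = max_abs_diff(&project(&weight, &bias, &x), &reference);

    // Deliberately nudge a single weight by 0.01.
    let mut mutated = weight.clone();
    mutated[4] += 0.01;
    let weight_diff = max_abs_diff(&project(&mutated, &bias, &x), &reference);

    // Simulate the "projector bias missing" bug.
    let no_bias = vec![0.0; bias.len()];
    let bias_diff = max_abs_diff(&project(&weight, &no_bias, &x), &reference);

    (clean, weight_diff, bias_diff)
}
```

Here both mutations shift the output by roughly 0.02, well above a 1e-3 tolerance, while the unmodified replay matches exactly. For the real model, the equivalent check would mutate a loaded weight before running the numerical test and confirm that the test fails with a message pointing at the affected stage.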

Blocked by

Stages A–C of the vision plan. The Stage D quality benchmark is a
coarser substitute; this issue tracks the rigorous version.

References

  • Plan: ~/.claude/plans/foamy-twirling-catmull.md
  • Umbrella: #3
  • Existing arch doc-comment note:
    crates/neuron/src/harness/arch/qwen3_5/mod.rs:1-65

Reference: helexa/cortex#15