Vision: tensor-parallel implementation (Stage E) #12

Open
opened 2026-06-01 13:18:04 +00:00 by grenade · 0 comments

Context

Deferred during planning of the initial vision capability for
Qwen3.6-27B (umbrella: #3). Stages A–C of that plan ship single-GPU
vision only. Refs: ~/.claude/plans/foamy-twirling-catmull.md.

Problem

Qwen3.6-27B in BF16 is ~54 GB and does not fit on a single 5090
(32 GB). Without TP-vision, the helexa stack can only serve vision
requests against the 27B by quantising heavily (Q4/Q5 ISQ on one
GPU) — which sacrifices the quality that motivates running 27B in
the first place. This issue tracks bringing vision support to the
TP path so beast/benjy (2×5090, 32 GB each) can serve full-quality
multimodal Qwen3.6-27B.
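
The sizing argument above can be checked with a little arithmetic. The following is a minimal sketch, not code from the repository: the function names are hypothetical, the LM is assumed to shard evenly across ranks, and KV cache, activations and the replicated embedding layer are ignored.

```rust
/// Bytes per parameter for BF16 weights.
const BF16_BYTES_PER_PARAM: f64 = 2.0;

/// Approximate weight footprint in GB (10^9 bytes) for a model with
/// `params_billion` billion parameters.
fn weights_gb(params_billion: f64, bytes_per_param: f64) -> f64 {
    params_billion * bytes_per_param
}

/// Approximate per-rank weight footprint in GB when the LM is split
/// evenly over `tp` ranks and a vision tower of `tower_gb` is
/// replicated on every rank. This is an assumption for illustration:
/// real sharding is not perfectly even, and runtime buffers are not
/// counted here.
fn per_rank_gb(lm_gb: f64, tower_gb: f64, tp: u32) -> f64 {
    lm_gb / f64::from(tp) + tower_gb
}

/// True when the estimated per-rank weights fit in `vram_gb`.
fn fits(per_rank: f64, vram_gb: f64) -> bool {
    per_rank <= vram_gb
}
```

Calling `weights_gb(27.0, BF16_BYTES_PER_PARAM)` gives the ~54 GB figure, which `fits` rejects against 32 GB. With `tp = 2` and the upper tower estimate of 2 GB, `per_rank_gb(54.0, 2.0, 2)` gives 29 GB of weights per rank. That fits a 32 GB card on weights alone, with the remaining headroom shared by KV cache and activations.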

Scope

  • Mirror the VisionTower struct from
    crates/neuron/src/harness/arch/qwen3_5/vision.rs (introduced by
    Stage A) into the TP path at
    crates/neuron/src/harness/tp/tp_qwen3_5.rs.
  • Replicate the vision tower across ranks (not sharded). The tower
    is small relative to the LM (~1-2 GB versus ~54 GB for the LM in
    BF16); sharding ViT layers across ranks would add NCCL latency
    without meaningful VRAM benefit, and the embedding layer is
    already replicated per tp_qwen3_5.rs:30-31.
  • Decide image-tensor distribution: either each rank decodes from
    the same source bytes (broadcast pre-decode by the leader) or
    rank 0 encodes and broadcasts the resulting patch embeddings via
    NCCL Broadcast. Net data volume favours the latter for large
    images; the former is simpler. Pick one with a justifying note.
  • Add new Job variants on the TP worker pool for image encoding and
    image-aware forward, mirroring the single-GPU variants from
    Stage A/B.
  • Add a TP integration test on a CUDA host.
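
For the image-tensor distribution decision, the justifying note could start from a payload estimate. The sketch below compares the bytes each option would broadcast. The `Distribution` enum, the function names, and any token count or hidden width passed in are hypothetical, not taken from the codebase. Payload size is only one input to the decision; redundant tower compute and implementation simplicity also matter.

```rust
/// Which payload the leader broadcasts to the other ranks.
#[derive(Debug, PartialEq, Eq)]
enum Distribution {
    /// The leader broadcasts the input bytes and every rank runs its
    /// own replica of the vision tower.
    BroadcastInput,
    /// Rank 0 runs the vision tower and broadcasts the resulting
    /// patch embeddings.
    BroadcastEmbeddings,
}

/// Size in bytes of BF16 patch embeddings for `tokens` vision tokens
/// of width `hidden` (2 bytes per element).
fn embedding_bytes(tokens: u64, hidden: u64) -> u64 {
    tokens * hidden * 2
}

/// Picks the option with the smaller broadcast payload. Ties go to
/// the input broadcast, which the issue describes as the simpler
/// option.
fn smaller_payload(input_bytes: u64, tokens: u64, hidden: u64) -> Distribution {
    if embedding_bytes(tokens, hidden) < input_bytes {
        Distribution::BroadcastEmbeddings
    } else {
        Distribution::BroadcastInput
    }
}
```

`input_bytes` is whatever the leader would actually send in the first option (compressed source bytes or a preprocessed pixel tensor), so the comparison should be run with measured sizes for representative images rather than guessed ones.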

Acceptance

  • Loading the 27B with tensor_parallel = 2 and a vision-capable
    config produces a working multimodal model on beast.
  • A request matching the issue #3 repro returns image-grounded
    text via the TP path; prompt_tokens includes patch tokens.
  • cargo test -p neuron --features cuda-integration passes on a CUDA
    host with NCCL.
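
The prompt_tokens criterion could be checked in the integration test with a helper along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the image is assumed to be already resized to a multiple of the patch grid, and the patch and merge sizes are parameters because the issue does not state the model's values.

```rust
/// Number of vision tokens for a `height` x `width` image, assuming
/// square patches of `patch` pixels that are merged in
/// `merge` x `merge` groups before reaching the LM. Any remainder
/// pixels are dropped by the integer division.
fn vision_tokens(height: u32, width: u32, patch: u32, merge: u32) -> u32 {
    let cell = patch * merge;
    (height / cell) * (width / cell)
}

/// True when the reported `prompt_tokens` is large enough to include
/// both the text tokens and the expected vision tokens. The check is
/// `>=` rather than `==` because a chat template may add special
/// tokens around the image.
fn includes_patch_tokens(prompt_tokens: u32, text_tokens: u32, vision: u32) -> bool {
    prompt_tokens >= text_tokens + vision
}
```

A response whose `prompt_tokens` equals the text-only count fails this check, which is the symptom of the image being silently dropped on the TP path.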

Blocked by

Stages A–C of the vision plan must land first. This issue gates
production use of vision on Qwen3.6-27B at full quality; without it,
beast/benjy can serve vision only from a heavily quantised
single-GPU model.

References

  • Plan: ~/.claude/plans/foamy-twirling-catmull.md
  • Umbrella: #3
  • TP entry point: crates/neuron/src/harness/tp/tp_qwen3_5.rs
  • TP worker pool: crates/neuron/src/harness/tp/mod.rs
  • Embedding-replication note:
    crates/neuron/src/harness/tp/tp_qwen3_5.rs:30-31

Reference: helexa/cortex#12