Vision: tensor-parallel implementation (Stage E) #12

grenade · 2026-06-01T13:18:04Z

Context

Deferred during planning of the initial vision capability for
Qwen3.6-27B (umbrella: #3). Stages A–C of that plan ship single-GPU
vision only. Plan: `~/.claude/plans/foamy-twirling-catmull.md`.

Problem

Qwen3.6-27B in BF16 is ~54 GB and does not fit on a single 5090
(32 GB). Without TP vision, the helexa stack can serve vision
requests against the 27B only by quantising heavily (Q4/Q5 ISQ on
one GPU), which sacrifices the quality that motivates running the
27B in the first place. This issue tracks bringing vision support to
the TP path so beast/benjy (2×5090, 32 GB each) can serve
full-quality multimodal Qwen3.6-27B.
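
A back-of-envelope check of the per-rank budget, using only the figures quoted in this issue (~54 GB of LM weights, a ~1-2 GB vision tower, 32 GB per 5090). It is a rough sketch, not a measurement: it assumes the LM shards evenly across ranks, treats the tower as extra weight replicated in full on every rank, and ignores the replicated embedding layer, KV cache and activations. The function names are made up for illustration.

```rust
/// Approximate weight footprint per rank, in GB.
/// Assumption: LM weights split evenly across `tp` ranks, while the
/// vision tower is replicated in full on each rank.
fn per_rank_weights_gb(lm_gb: f64, tower_gb: f64, tp: u32) -> f64 {
    lm_gb / f64::from(tp) + tower_gb
}

/// VRAM left over for KV cache, activations and runtime overhead, in GB.
/// A negative result means the weights alone do not fit.
fn headroom_gb(vram_gb: f64, lm_gb: f64, tower_gb: f64, tp: u32) -> f64 {
    vram_gb - per_rank_weights_gb(lm_gb, tower_gb, tp)
}
```

With the upper tower estimate of 2 GB, `headroom_gb(32.0, 54.0, 2.0, 1)` is negative (single GPU does not fit), while `tensor_parallel = 2` gives roughly 29 GB of weights per rank and about 3 GB of headroom. That margin is tight, so the KV cache budget probably needs checking on real hardware.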

Scope

- Mirror the `VisionTower` struct from
  `crates/neuron/src/harness/arch/qwen3_5/vision.rs` (introduced by
  Stage A) into the TP path at
  `crates/neuron/src/harness/tp/tp_qwen3_5.rs`.
- Replicate the vision tower on every rank rather than sharding it.
  The tower is small relative to the LM (~1-2 GB against ~54 GB of LM
  weights in BF16), so sharding ViT layers across NCCL would add
  latency without meaningful VRAM benefit. The embedding layer is
  already replicated, per `tp_qwen3_5.rs:30-31`.
- Decide image-tensor distribution: either the leader broadcasts the
  source bytes and each rank decodes and encodes them locally, or
  rank 0 encodes and broadcasts the resulting patch embeddings via
  NCCL `Broadcast`. Net data volume favours the latter for large
  images; the former is simpler. Pick one with a justifying note.
- Add new `Job` variants on the TP worker pool for image encoding and
  image-aware forward, mirroring the single-GPU variants from
  Stage A/B.
- Add a TP integration test that runs on a CUDA host.
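
To make the distribution decision and the new job variants concrete, here is a minimal sketch. Everything in it is hypothetical: the variant names, their fields and the helper functions are illustrations, not the existing `Job` enum in `crates/neuron/src/harness/tp/mod.rs`. It also assumes patch embeddings are BF16 (2 bytes per element) and takes the LM hidden size as a parameter rather than guessing it.

```rust
/// The two distribution options listed in the scope above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImageDistribution {
    /// Leader broadcasts the source bytes; every rank decodes and encodes.
    BroadcastSourceBytes,
    /// Rank 0 runs the vision tower and broadcasts the patch embeddings.
    BroadcastPatchEmbeddings,
}

/// Hypothetical job messages for the TP worker pool.
#[derive(Debug)]
enum Job {
    EncodeImage { request_id: u64, image_bytes: Vec<u8> },
    ForwardWithImage { request_id: u64, token_ids: Vec<u32>, num_patch_tokens: usize },
}

impl Job {
    /// True for jobs that need the vision tower on the executing rank.
    fn runs_vision_tower(&self) -> bool {
        matches!(self, Job::EncodeImage { .. })
    }
}

/// Assumed element size of a BF16 patch embedding.
const BF16_BYTES: usize = 2;

/// Payload the leader broadcasts for one image under a given strategy.
fn broadcast_bytes(
    strategy: ImageDistribution,
    source_len: usize,
    num_patch_tokens: usize,
    hidden_size: usize,
) -> usize {
    match strategy {
        ImageDistribution::BroadcastSourceBytes => source_len,
        ImageDistribution::BroadcastPatchEmbeddings => {
            num_patch_tokens * hidden_size * BF16_BYTES
        }
    }
}

/// How many ranks execute the vision tower for one image.
fn ranks_encoding(strategy: ImageDistribution, tp: usize) -> usize {
    match strategy {
        ImageDistribution::BroadcastSourceBytes => tp,
        ImageDistribution::BroadcastPatchEmbeddings => 1,
    }
}
```

Running `broadcast_bytes` and `ranks_encoding` with the real hidden size and representative image sizes would give the numbers for the justifying note: one option trades broadcast payload against redundant encoder work on each rank.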

Acceptance

- Loading the 27B with `tensor_parallel = 2` and a vision-capable
  config produces a working multimodal model on beast.
- A request matching the issue #3 repro returns image-grounded
  text via the TP path; `prompt_tokens` includes patch tokens.
- `cargo test -p neuron --features cuda-integration` passes on a CUDA
  host with NCCL.
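
One way the `prompt_tokens` criterion could be checked in the integration test is to compare the reported count against text tokens plus the expected number of patch tokens. The sketch below assumes a ViT-style tower that cuts the (already resized) image into square patches and merges `merge × merge` neighbouring patches into one LM token. The function names, the patch size and the merge factor are placeholders and are not taken from the Qwen3.6 config; any special tokens wrapping the image are assumed to be counted as text tokens.

```rust
/// LM tokens contributed by one image.
/// Assumption: `width` and `height` are post-resize dimensions that are
/// multiples of `patch * merge`, so the integer divisions are exact.
fn patch_tokens(width: u32, height: u32, patch: u32, merge: u32) -> u32 {
    let grid_w = width / patch;
    let grid_h = height / patch;
    (grid_w * grid_h) / (merge * merge)
}

/// Expected `prompt_tokens` for a request with text plus images.
fn expected_prompt_tokens(
    text_tokens: u32,
    images: &[(u32, u32)],
    patch: u32,
    merge: u32,
) -> u32 {
    let image_tokens: u32 = images
        .iter()
        .map(|&(w, h)| patch_tokens(w, h, patch, merge))
        .sum();
    text_tokens + image_tokens
}
```

For example, with an illustrative patch size of 16 and merge factor of 2, a 448×448 image yields a 28×28 patch grid and 196 tokens, so a 12-token text prompt with that image should report 208 prompt tokens. A TP response reporting only 12 would indicate the image was dropped, which is the failure described in #3.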

Blocked by

Stages A–C of the vision plan must land first. This issue is the
gate to production use of vision on Qwen3.6-27B; without it,
beast/benjy cannot host full-quality (BF16) vision.

References

Plan: ~/.claude/plans/foamy-twirling-catmull.md
Umbrella: Image content (`image_url`) is dropped — multimodal chat requests are processed as text-only (#3)
TP entry point: crates/neuron/src/harness/tp/tp_qwen3_5.rs
TP worker pool: crates/neuron/src/harness/tp/mod.rs
Embedding-replication note:
crates/neuron/src/harness/tp/tp_qwen3_5.rs:30-31


grenade referenced this issue from a commit

2026-06-02 08:40:50 +00:00

feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6

grenade referenced this issue from a commit

2026-06-02 12:33:04 +00:00

feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6

grenade referenced this issue

2026-06-04 10:36:01 +00:00

Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16

grenade referenced this issue from a commit

2026-06-04 12:15:15 +00:00

feat(neuron): TP-vision Stage 3 — wire TP chat + stream vision prefill