When config.vision_config is present, load the full, unsharded
model.visual.* vision tower on every TP rank: the leader and each
subprocess worker mmap the same local safetensors files.
VisionTower::load already takes a ShardedVarBuilder whose plain .get()
returns the full replicated tensor, so the tower loads identically
regardless of world_size, with no sharding and no NCCL broadcast.
- TpQwen3_5ForCausalLM gains vision: Option<VisionTower> + image_token_id,
  plus has_vision/image_token_id/encode_image/forward_with_vision,
  mirroring the single-GPU Qwen3_5ForCausalLM wrapper.
- TpQwen3_5Model::forward_with_vision mirrors the single-GPU
  forward_inner splice: embed locally, replace rows at image_token_id
  positions, then run the sharded decoder stack. Because every rank
  encodes the same pixels through its replicated tower, the spliced
  input embeddings are identical across ranks, which preserves the TP
  replicated-hidden-state invariant that the row-parallel AllReduce
  relies on.
- splice_runs is now pub(crate) and shared with the TP model.
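Illustrative sketch of the splice invariant described above. This is not
the code in this commit: splice_image_rows, IMAGE_TOKEN_ID = 9, and the
Vec<Vec<f32>> rows standing in for tensors are all hypothetical. It only
shows why identical inputs on every rank give identical spliced
embeddings.

```rust
/// Hypothetical stand-in for the splice step: every row whose token equals
/// `image_token_id` is overwritten, in order, with the next image row.
fn splice_image_rows(
    token_ids: &[u32],
    text_embeds: &[Vec<f32>],
    image_embeds: &[Vec<f32>],
    image_token_id: u32,
) -> Result<Vec<Vec<f32>>, String> {
    if token_ids.len() != text_embeds.len() {
        return Err("one embedding row is required per token".to_string());
    }
    // The number of placeholder tokens must match the number of image rows.
    let slots = token_ids.iter().filter(|&&t| t == image_token_id).count();
    if slots != image_embeds.len() {
        return Err(format!(
            "{} image slots but {} image rows",
            slots,
            image_embeds.len()
        ));
    }
    let mut out = text_embeds.to_vec();
    let mut next_image_row = image_embeds.iter();
    for (row, &tok) in out.iter_mut().zip(token_ids) {
        if tok == image_token_id {
            // The count check above guarantees a row is available here.
            *row = next_image_row.next().expect("counted above").clone();
        }
    }
    Ok(out)
}

fn main() {
    const IMAGE_TOKEN_ID: u32 = 9; // assumed value, for illustration only
    let token_ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2];
    let text_embeds = vec![vec![0.5_f32, 0.5]; 4];
    let image_embeds = vec![vec![1.0_f32, 1.0], vec![2.0, 2.0]];

    // Two simulated ranks run the same pure function on the same inputs,
    // so their spliced embeddings match without any communication.
    let rank0 = splice_image_rows(&token_ids, &text_embeds, &image_embeds, IMAGE_TOKEN_ID)
        .unwrap();
    let rank1 = splice_image_rows(&token_ids, &text_embeds, &image_embeds, IMAGE_TOKEN_ID)
        .unwrap();
    assert_eq!(rank0, rank1);
}
```

The sketch rejects a mismatch between placeholder count and image rows
up front; whether the real splice_runs does the same is not stated here.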
Nothing calls these entry points yet; Stage 2 wires the RPC/worker path
that invokes encode_image + forward_with_vision per rank. Most of this
compiles on the non-CUDA build (only the tower line in the CUDA load
variant is gated); CI's CUDA type-check covers the rest.
Refs TP-vision plan Stage 1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>