Chunk the single-GPU vision prefill (parity with TP) #18
Context
The TP vision prefill now chunks the prompt (fa01350):
TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the
image(s) once, then walks the pre-expanded prompt in
prefill_chunk_tokens() windows, splicing the patch-embedding rows
into whichever chunk(s) carry <|image_pad|> positions. This bounds
activation memory and stopped a large-context vision request from
single-shot OOMing.
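For readers who have not looked at fa01350, the walk described above can be sketched roughly as follows. This is a toy model, not the TP implementation: embeddings are plain Vec<f32> rows instead of tensors, and IMAGE_PAD, chunk_windows, embed_text, embed_window and embed_chunked are hypothetical names invented for the illustration. The point it shows is that a patch cursor carried across windows lets an image straddle a chunk boundary and still be spliced in order.

```rust
use std::ops::Range;

/// Hypothetical token id standing in for `<|image_pad|>`.
const IMAGE_PAD: u32 = 9999;

/// Split `0..len` into consecutive windows of at most `chunk` tokens.
fn chunk_windows(len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}

/// Toy text embedding: one row per token (stand-in for the embedding table).
fn embed_text(token: u32) -> Vec<f32> {
    vec![token as f32, 0.0]
}

/// Embed one window, replacing each image-pad position with the next unused
/// patch row. `next_patch` persists between windows. Panics if the prompt
/// holds more pad positions than there are patch rows.
fn embed_window(
    tokens: &[u32],
    window: Range<usize>,
    patch_rows: &[Vec<f32>],
    next_patch: &mut usize,
) -> Vec<Vec<f32>> {
    tokens[window]
        .iter()
        .map(|&t| {
            if t == IMAGE_PAD {
                let row = patch_rows[*next_patch].clone();
                *next_patch += 1;
                row
            } else {
                embed_text(t)
            }
        })
        .collect()
}

/// Images are "encoded once" (here: `patch_rows` is precomputed), then the
/// prompt is walked window by window. Returns the per-chunk embeddings.
fn embed_chunked(tokens: &[u32], patch_rows: &[Vec<f32>], chunk: usize) -> Vec<Vec<Vec<f32>>> {
    let mut next_patch = 0;
    chunk_windows(tokens.len(), chunk)
        .into_iter()
        .map(|w| embed_window(tokens, w, patch_rows, &mut next_patch))
        .collect()
}
```

Concatenating the chunks from embed_chunked gives the same rows as a single window covering the whole prompt, which is the parity property the real chunked path needs to preserve.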
The single-GPU vision path was not changed and is still
single-shot: run_inference_with_images_via_worker (harness/candle.rs)
→ Job::ForwardLogitsWithImages → forward_logits_with_images
(harness/device_worker/dispatch.rs) builds the full (1, L) input and
calls ModelArch::forward_with_vision once.
So a large-context image request to a single-GPU-loaded vision
model has the same single-shot OOM exposure the TP path just shed. For
now the new pre-flight guard (validate_vision_prefill, applied to the
single-GPU paths too) rejects oversized requests cleanly instead of
OOMing, but the path can't actually serve a long vision context.
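To make the guard's role concrete, here is a minimal sketch of a pre-flight length check. It is not the signature of validate_vision_prefill, which this issue does not show; check_vision_prefill, PrefillTooLarge and the limit parameter are hypothetical, and the only assumption is that the guard compares the expanded prompt length against some bound before any GPU work starts.

```rust
/// Hypothetical error type; the real guard's return type is not shown here.
#[derive(Debug, PartialEq)]
struct PrefillTooLarge {
    prompt_tokens: usize,
    limit: usize,
}

/// Reject a request up front if the patch-expanded prompt exceeds `limit`,
/// so the caller gets a clean error instead of an out-of-memory failure.
fn check_vision_prefill(
    expanded_prompt_tokens: usize,
    limit: usize,
) -> Result<(), PrefillTooLarge> {
    if expanded_prompt_tokens > limit {
        Err(PrefillTooLarge {
            prompt_tokens: expanded_prompt_tokens,
            limit,
        })
    } else {
        Ok(())
    }
}
```

With a single-shot path the limit has to be small enough that one forward pass fits in memory; once the path chunks, the bound can in principle follow whatever the text path allows.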
This is lower priority than the TP work because beast runs the 27B as
TP=2; single-GPU vision isn't in production today. Track it so the two
paths stay at parity.
Goal
Single-GPU vision prefill chunks like the TP path, so a long
vision-bearing prompt is bounded by the chunk and actually serves
(rather than only being guard-rejected).
Scope
- Chunked vision prefill for the single-GPU model (Qwen3_5ForCausalLM),
  mirroring TpQwen3_5ForCausalLM::prefill_with_images_chunked: encode once,
  loop prefill_chunk_tokens() windows, splice the per-chunk <|image_pad|>
  slice via the shared splice_runs, keep the last chunk's logits.
- Worker plumbing (Job::ForwardLogitsWithImages → carry chunk_size, or chunk
  inside the handler) and run_inference_with_images_via_worker.
- Adjust the validate_vision_prefill bound on the single-GPU paths to match
  the text bound (it's currently the only protection).
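One way the "carry chunk_size" option could look is sketched below. The Job enum here is a simplified, hypothetical stand-in with a single variant, and handle plus its forward closure are invented for the illustration; in the real handler the forward pass would advance the KV cache between windows, which the toy closure does not model.

```rust
/// Hypothetical, simplified job type; the real `Job` enum has more variants
/// and fields than shown here.
enum Job {
    ForwardLogitsWithImages {
        tokens: Vec<u32>,
        /// `None` keeps the current single-shot behaviour.
        chunk_size: Option<usize>,
    },
}

/// Run `forward` once per window and keep only the last window's logits.
/// Returns `None` for an empty prompt.
fn handle(job: Job, mut forward: impl FnMut(&[u32]) -> Vec<f32>) -> Option<Vec<f32>> {
    match job {
        Job::ForwardLogitsWithImages { tokens, chunk_size } => {
            // Fall back to one window covering the whole prompt; `max(1)`
            // avoids the panic `chunks(0)` would raise.
            let chunk = chunk_size.unwrap_or(tokens.len()).max(1);
            let mut last = None;
            for window in tokens.chunks(chunk) {
                // Earlier windows only matter for their side effects on the
                // model state; their logits are overwritten.
                last = Some(forward(window));
            }
            last
        }
    }
}
```

Making chunk_size an Option keeps existing callers on the single-shot path until they opt in; chunking inside the handler instead would hide the knob from the job type entirely.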
Verification
- cargo fmt --check --all && cargo clippy --workspace --all-targets -- -D warnings && cargo test --workspace, plus the CUDA type-check.
- A large-context single-GPU vision request returns image-grounded text with
  prompt_tokens reflecting the patch expansion, instead of being rejected by
  the guard.
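As a reminder of what "reflecting the patch expansion" means for the token count, here is a small sketch. It assumes the unexpanded prompt carries one pad placeholder per image and that expansion repeats it once per patch row; expand_image_pads and its parameters are hypothetical and not taken from the codebase.

```rust
/// Replace each image placeholder with as many pad tokens as that image has
/// patch rows. `patches_per_image[i]` is the row count for the i-th image.
/// Panics if there are more placeholders than patch counts.
fn expand_image_pads(tokens: &[u32], pad: u32, patches_per_image: &[usize]) -> Vec<u32> {
    let mut images = patches_per_image.iter();
    let mut out = Vec::with_capacity(tokens.len());
    for &t in tokens {
        if t == pad {
            let n = *images
                .next()
                .expect("one patch count per image placeholder");
            out.extend(std::iter::repeat(pad).take(n));
        } else {
            out.push(t);
        }
    }
    out
}
```

Under these assumptions a three-token prompt with one image of four patches expands to six tokens, and that expanded length is the number the reported prompt_tokens should match.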
References
- fa01350.
- splice_runs in crates/neuron/src/harness/arch/qwen3_5/mod.rs.

🤖 Generated with Claude Code