Chunk the single-GPU vision prefill (parity with TP) #18

grenade · 2026-06-04T14:33:59Z

Context

The TP vision prefill now chunks the prompt (fa01350):
TpQwen3_5ForCausalLM::prefill_with_images_chunked encodes the
image(s) once, then walks the pre-expanded prompt in
prefill_chunk_tokens() windows, splicing the patch-embedding rows
into whichever chunk(s) carry <|image_pad|> positions. This bounds
activation memory and stopped a large-context vision request from
single-shot OOMing.
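
The windowing described above can be sketched as plain index math. This is a minimal illustration, not the code in fa01350: `IMAGE_PAD`, `chunk_windows`, and `patch_rows_per_chunk` are hypothetical names, and the sketch assumes the patch-embedding rows are consumed in prompt order, one row per `<|image_pad|>` position.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the `<|image_pad|>` token id; the real id
/// comes from the tokenizer.
const IMAGE_PAD: u32 = 9999;

/// Split `len` prompt positions into consecutive windows of at most
/// `chunk` tokens. The last window may be shorter.
fn chunk_windows(len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be non-zero");
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}

/// For each window, the range of patch-embedding rows it consumes.
/// Rows are handed out in prompt order, so a run of pads that straddles
/// a window boundary is split between the two chunks.
fn patch_rows_per_chunk(tokens: &[u32], chunk: usize) -> Vec<Range<usize>> {
    let mut next_row = 0;
    chunk_windows(tokens.len(), chunk)
        .into_iter()
        .map(|w| {
            let pads = tokens[w].iter().filter(|&&t| t == IMAGE_PAD).count();
            let rows = next_row..next_row + pads;
            next_row += pads;
            rows
        })
        .collect()
}
```

For example, with a chunk size of 3 and the tokens `[1, pad, pad, pad, 2, 3, pad, 4]`, the windows are `0..3`, `3..6`, `6..8` and the row ranges are `0..2`, `2..3`, `3..4`: the three-pad run is split across the first two chunks.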

The single-GPU vision path was not changed and is still
single-shot:

- run_inference_with_images_via_worker (harness/candle.rs) →
- Job::ForwardLogitsWithImages →
- forward_logits_with_images (harness/device_worker/dispatch.rs)
  builds the full (1, L) input and calls
  ModelArch::forward_with_vision once.

So a large-context image request to a single-GPU-loaded vision
model has the same single-shot OOM exposure the TP path just shed. For
now the new pre-flight guard (validate_vision_prefill, applied to the
single-GPU paths too) rejects oversized requests cleanly instead of
OOMing — but the path can't actually serve a long vision context.

This is lower priority than the TP work because beast runs the 27B as
TP=2; single-GPU vision isn't in production today. Track it so the two
paths stay at parity.

Goal

Single-GPU vision prefill chunks like the TP path, so the activation
memory of a long vision-bearing prompt is bounded by the chunk size
and the request is actually served (rather than only being
guard-rejected).

Scope

- Add a chunked image prefill to the single-GPU model
  (Qwen3_5ForCausalLM), mirroring
  TpQwen3_5ForCausalLM::prefill_with_images_chunked: encode once,
  loop prefill_chunk_tokens() windows, splice the per-chunk
  <|image_pad|> slice via the shared splice_runs, keep the last
  chunk's logits.
- Wire it through the device-worker job
  (Job::ForwardLogitsWithImages → carry chunk_size, or chunk
  inside the handler) and run_inference_with_images_via_worker.
- Once chunked, the single-GPU vision pre-flight guard can be relaxed
  to match the text bound (it's currently the only protection).
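
A rough sketch of the loop shape the first scope item asks for, under stated assumptions: the `ChunkForward` trait, `ToyModel`, `Row`, `IMAGE_PAD`, and `prefill_chunked_sketch` are all hypothetical and stand in for the real model, tensor types, and `splice_runs`. Embeddings are plain `Vec<f32>` rows so the example runs without a GPU or a tensor library; the image encoding is assumed to have already happened once and is passed in as `patch_rows`.

```rust
/// Hypothetical stand-in for the `<|image_pad|>` token id.
const IMAGE_PAD: u32 = 9999;

/// One embedding row per token (toy replacement for a tensor row).
type Row = Vec<f32>;

/// Toy model interface; hypothetical, not the crate's `ModelArch`.
trait ChunkForward {
    /// Embed token ids into one row per token.
    fn embed(&self, tokens: &[u32]) -> Vec<Row>;
    /// Run one chunk starting at absolute position `pos` and return the
    /// logits for the chunk's final position.
    fn forward_chunk(&mut self, embeds: &[Row], pos: usize) -> Vec<f32>;
}

/// Walk the pre-expanded prompt in `chunk`-sized windows, overwrite the
/// rows at image-pad positions with the next patch rows, and keep only
/// the last chunk's logits.
fn prefill_chunked_sketch<M: ChunkForward>(
    model: &mut M,
    tokens: &[u32],
    patch_rows: &[Row],
    chunk: usize,
) -> Result<Vec<f32>, String> {
    if chunk == 0 || tokens.is_empty() {
        return Err("empty prompt or zero chunk size".to_string());
    }
    let pads = tokens.iter().filter(|&&t| t == IMAGE_PAD).count();
    if pads != patch_rows.len() {
        return Err(format!("{pads} pad tokens but {} patch rows", patch_rows.len()));
    }
    let mut next_row = 0; // cursor into patch_rows, shared across chunks
    let mut last_logits = Vec::new();
    for (i, window) in tokens.chunks(chunk).enumerate() {
        let mut embeds = model.embed(window);
        for (row, &tok) in embeds.iter_mut().zip(window) {
            if tok == IMAGE_PAD {
                *row = patch_rows[next_row].clone();
                next_row += 1;
            }
        }
        last_logits = model.forward_chunk(&embeds, i * chunk);
    }
    Ok(last_logits)
}

/// Toy model that records each call as (start position, chunk length)
/// and returns the chunk's last row as its "logits".
struct ToyModel {
    seen: Vec<(usize, usize)>,
}

impl ChunkForward for ToyModel {
    fn embed(&self, tokens: &[u32]) -> Vec<Row> {
        tokens.iter().map(|&t| vec![t as f32]).collect()
    }
    fn forward_chunk(&mut self, embeds: &[Row], pos: usize) -> Vec<f32> {
        self.seen.push((pos, embeds.len()));
        embeds.last().cloned().unwrap_or_default()
    }
}
```

Checking the pad count against the number of patch rows before the loop means a mismatched request fails before any chunk touches the model state. Whether the real path should chunk inside the handler or carry `chunk_size` on the job is left open, as in the scope item above.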

Verification

- cargo fmt --check --all && cargo clippy --workspace --all-targets -- -D warnings && cargo test --workspace, plus the CUDA type-check.
- (When a single-GPU vision deployment exists) a long-context image
  request returns image-grounded text with prompt_tokens reflecting
  the patch expansion, instead of being rejected by the guard.

References

- TP chunked prefill that this mirrors: fa01350.
- Shared splice helper: splice_runs in
  crates/neuron/src/harness/arch/qwen3_5/mod.rs.
- TP-vision umbrella #16 / #12.

🤖 Generated with Claude Code


Reference: helexa/cortex#18