Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16

Open
opened 2026-06-04 10:36:01 +00:00 by grenade · 1 comment
Owner

Context

Stage A (7df84fe) shipped the vision tower load + preprocessor.
Stage B (24968e9, fixup 577781d) shipped end-to-end non-streaming
chat completion with the <|image_pad|> splice, the
vision_unsupported rejection (InferenceError::VisionUnsupported →
HTTP 400), and capabilities in neuron's /v1/models. That closes the
silent-drop pattern of #3 for non-streaming clients.

Stage C completes the operator-visible feature so vision works for
every path the gateway exposes — streaming chat, the Responses
API, and capability propagation through cortex-gateway. Until this
lands, a client that requests stream=true with an image-bearing
prompt will silently degrade (the streaming forward path doesn't
splice image embeddings yet), and clients consuming
cortex-gateway's /v1/models won't see capabilities even though
neuron advertises them.

Plan reference: ~/.claude/plans/foamy-twirling-catmull.md — Stage C.

Scope

Three independent sub-tasks; each is a small commit.

C1 — Streaming chat completion with vision

File: crates/neuron/src/harness/candle.rs::stream_inference_via_worker
(around line 4051). Uses chunked_prefill_via_worker + the
route_token! macro for decode.

Today the streaming path follows the text-only contract:

  1. clear KV cache
  2. chunked prefill (chunked_prefill_via_worker)
  3. decode-loop with route_token!

Vision-bearing requests need a single-shot prefill that splices
image embeddings (same pattern as Stage B's
run_inference_with_images_via_worker), then the existing decode
loop. Image embeddings are prefill-only; decode tokens have no
images so the splice never recurs.

Concrete work:

  • Detect vision content in stream_inference_via_worker via the
    same request_has_images helper Stage B5 introduced.
  • When images are present:
    • preprocess via harness::preprocess::preprocess_data_uri (already
      landed in Stage A2)
    • expand <|image_pad|> per expand_image_pad_tokens (Stage B4)
    • call worker.forward_logits_with_images(handle, expanded_tokens, 0, images, image_token_id) for prefill instead of
      chunked_prefill_via_worker
    • sample first token from the returned logits (same shape contract)
    • enter the standard decode loop (no changes there — KV cache holds
      image-conditioned hidden states from prefill)
  • Reject image_url against non-vision models with the same
    VisionUnsupported error variant Stage B6 plumbed through.
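
The expansion step above can be sketched as follows. This is a minimal illustration, not the real expand_image_pad_tokens: the function names, signatures, and the idea of returning splice positions are assumptions made for the example, and the 196 figure is simply the per-image patch count quoted in the verification note below.

```rust
/// Hypothetical stand-in for `expand_image_pad_tokens` (Stage B4); the real
/// signature in candle.rs may differ.
/// Replaces each single image placeholder token with `patches_per_image`
/// copies, so the prefill has one token slot per image embedding row.
fn expand_image_pad(tokens: &[u32], image_token_id: u32, patches_per_image: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    for &t in tokens {
        if t == image_token_id {
            out.extend(std::iter::repeat(image_token_id).take(patches_per_image));
        } else {
            out.push(t);
        }
    }
    out
}

/// Hypothetical helper: indices in the expanded prompt whose text embeddings
/// would be overwritten by image embeddings during the single-shot prefill.
fn splice_positions(expanded: &[u32], image_token_id: u32) -> Vec<usize> {
    expanded
        .iter()
        .enumerate()
        .filter_map(|(i, &t)| (t == image_token_id).then_some(i))
        .collect()
}
```

With one placeholder and 196 patches per image, a three-token prompt grows to 198 tokens, which is why usage.prompt_tokens should land near "text + 196" for a single image.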

Manual verification (after RPMs deploy on beast):

IMG_B64=$(base64 -w0 /path/to/test.jpg)
curl -sS http://hanzalova.internal:31313/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\":\"Qwen/Qwen3.6-27B\",
    \"stream\": true,
    \"messages\":[{\"role\":\"user\",\"content\":[
      {\"type\":\"text\",\"text\":\"What's in this image?\"},
      {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,${IMG_B64}\"}}
    ]}],
    \"max_tokens\":200
  }"

Expect a coherent SSE stream where the assistant references actual
image content, and usage.prompt_tokens in the final chunk matches
the Stage B non-streaming baseline (~text + 196 patch tokens at
fixed resolution).
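
To check usage.prompt_tokens in a test rather than by eye, the final chunk has to be pulled out of the SSE body. A minimal sketch, assuming the stream uses OpenAI-style `data:` lines terminated by a `[DONE]` sentinel (whether neuron emits the sentinel is an assumption here); both function names are hypothetical:

```rust
/// Collects the `data:` payloads from an SSE response body, stopping at the
/// `[DONE]` sentinel. Assumes one payload per `data:` line.
fn sse_data_payloads(body: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Skip blank separator lines and any non-data fields.
        let Some(rest) = line.strip_prefix("data:") else {
            continue;
        };
        let payload = rest.trim_start();
        if payload == "[DONE]" {
            break;
        }
        payloads.push(payload);
    }
    payloads
}

/// The chunk to inspect for `usage` is the last JSON payload before `[DONE]`.
fn final_chunk(body: &str) -> Option<&str> {
    let payloads = sse_data_payloads(body);
    payloads.last().copied()
}
```

The returned payload is still raw JSON; a real test would deserialize it and compare prompt_tokens against the Stage B non-streaming number for the same image.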

C2 — Responses API with vision

Files:

  • crates/neuron/src/wire/openai_responses.rs — request translator
    (Responses → ChatCompletionRequest)
  • crates/neuron/src/api.rs::responses handler

The Responses API path translates a ResponsesRequest into a
ChatCompletionRequest and then drives chat_completion. Stage B5's
vision routing lives inside chat_completion, so vision should work
through Responses provided the translator preserves image content
parts.

Verify and (if needed) extend:

  • ResponsesContentPart::InputImage { image_url } already exists in
    cortex-core::responses (Stage 2 of the Responses work). Confirm
    the request translator emits a MessageContent::Parts array
    containing {"type": "image_url", "image_url": {"url": "..."}}
    shaped entries when the input carries InputImage parts.
  • Add a unit test in wire::openai_responses::tests that round-trips
    an InputImage through the translator and asserts the resulting
    ChatCompletionRequest has the right Parts shape.
  • Re-emit VisionUnsupported cleanly on the Responses path —
    inherits from the shared chat_completion error handling but
    worth a test.
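
The translation the unit test needs to pin down looks roughly like this. The enums are simplified stand-ins for the real cortex-core types: only InputImage { image_url } is confirmed by this issue, and every other variant, field, and function name below is an assumption for illustration.

```rust
// Simplified stand-in for `ResponsesContentPart` in cortex-core::responses.
#[derive(Debug, Clone, PartialEq)]
enum ResponsesContentPart {
    InputText { text: String },
    InputImage { image_url: String },
}

// Simplified stand-in for one entry of `MessageContent::Parts`
// (hypothetical name and shape).
#[derive(Debug, Clone, PartialEq)]
enum ChatContentPart {
    Text { text: String },
    ImageUrl { url: String },
}

/// Hypothetical translator core: maps each Responses part to its chat
/// completion equivalent, preserving order and keeping the image URL intact.
fn translate_parts(parts: &[ResponsesContentPart]) -> Vec<ChatContentPart> {
    parts
        .iter()
        .map(|part| match part {
            ResponsesContentPart::InputText { text } => {
                ChatContentPart::Text { text: text.clone() }
            }
            ResponsesContentPart::InputImage { image_url } => {
                ChatContentPart::ImageUrl { url: image_url.clone() }
            }
        })
        .collect()
}
```

The property worth asserting is that an InputImage is never dropped or flattened to text, since that would reintroduce the silent-drop behaviour on the Responses path.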

Manual verification:

curl -sS http://hanzalova.internal:31313/v1/responses \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\":\"Qwen/Qwen3.6-27B\",
    \"input\":[{\"type\":\"message\",\"role\":\"user\",\"content\":[
      {\"type\":\"input_text\",\"text\":\"Describe this.\"},
      {\"type\":\"input_image\",\"image_url\":\"data:image/jpeg;base64,...\"}
    ]}],
    \"max_output_tokens\": 200
  }"

C3 — cortex-gateway forwards capabilities through /v1/models

Files:

  • crates/cortex-core/src/node.rs (ModelEntry, CortexModelEntry)
  • crates/cortex-gateway/src/poller.rs — reads neuron's
    Vec<ModelInfo> (which already carries capabilities per Stage B7)
  • crates/cortex-gateway/src/handlers.rs::list_models — builds the
    /v1/models response from the poller-cached node.models + the
    catalogue × topology join

Today ModelEntry (the per-node cache) doesn't carry capabilities,
so the field is dropped at the poller→handlers boundary. Add it
end-to-end:

  • ModelEntry { ..., capabilities: Vec<String> } (default empty).
  • Poller writes it from ModelInfo.capabilities.
  • CortexModelEntry { ..., capabilities: Vec<String> }.
  • list_models handler computes the union of capabilities across
    every node where the model is loaded. For catalogue-only entries
    with no loaded location, the catalogue profile doesn't currently
    declare capabilities; default to empty (or look it up from the
    catalogue once #3-adjacent catalogue-capability declaration lands —
    out of scope for C3).
  • Update the response JSON shape; document the field in
    models.example.toml if the catalogue gains a capabilities key
    (optional follow-up).
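
The union step in list_models can be sketched as below. ModelEntry here is a cut-down stand-in carrying only the two fields that matter for the example, and union_capabilities is a hypothetical helper name; the real handler also performs the catalogue × topology join, which is omitted.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Cut-down stand-in for the per-node cache entry (assumed shape).
struct ModelEntry {
    id: String,
    capabilities: Vec<String>,
}

/// Hypothetical helper: for each model id, the union of capabilities reported
/// by every node that has it loaded. BTreeSet gives a deduplicated, sorted
/// result so the JSON output is stable across polls.
fn union_capabilities(nodes: &[Vec<ModelEntry>]) -> BTreeMap<String, Vec<String>> {
    let mut sets: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for entry in nodes.iter().flatten() {
        sets.entry(entry.id.clone())
            .or_default()
            .extend(entry.capabilities.iter().cloned());
    }
    sets.into_iter()
        .map(|(id, caps)| (id, caps.into_iter().collect()))
        .collect()
}
```

A model with no loaded location never appears in the map, which matches the "default to empty" behaviour for catalogue-only entries if the caller falls back to an empty Vec.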

Acceptance: curl http://hanzalova.internal:31313/v1/models | jq
shows capabilities: ["text", "vision"] for Qwen3.6-27B once it's
loaded somewhere.

Tests

  • C1: streaming integration test in crates/neuron/tests/ that
    drives the SSE path with a mock vision-capable model (or gated on
    cuda-integration for the real-weights smoke).
  • C2: unit test in wire::openai_responses::tests for the
    InputImage → MessageContent::Parts translation round-trip.
  • C3: unit test in cortex-gateway that mocks two nodes with
    different capabilities arrays in their /models responses and
    asserts the gateway's /v1/models reflects the union.

Out of scope

  • TP-vision (#12) — Stage C still ships single-GPU only.
  • Dynamic resolution (#14) — fixed 448×448 across all paths.
  • Numerical validation (#15).
  • Streaming chunked prefill for very long vision prompts — Stage D
    follow-up. Stage C reuses Stage B's single-shot-prefill bound,
    which is well under the activation-memory threshold at 448×448.

References

  • Umbrella: #3
  • Stage A commit: 7df84fe
  • Stage B commit: 24968e9 (cuda fixup: 577781d)
  • Plan doc: ~/.claude/plans/foamy-twirling-catmull.md Stage C
  • Spec doc: doc/vision-qwen3_6-spec.md
Author
Owner

Implementation notes for a fresh session

Two clarifications that compress the "where do I start?" loop:

C1: sibling function, not in-place extension

stream_inference_via_worker (around candle.rs:4051) currently
takes prompt_tokens: &[u32] plus sampling params — no request, no
images. The Stage B precedent is to add a sibling function
rather than thread image params through the existing signature:

  • Stage B added run_inference_with_images_via_worker next to
    run_inference_via_worker (both in candle.rs, both cuda-gated).
  • Stage C should add stream_inference_with_images_via_worker next
    to stream_inference_via_worker. Same shape: takes
    expanded_tokens + Vec<ImageInput> + image_token_id plus the
    existing streaming params, does a single-shot
    worker.forward_logits_with_images(...) prefill, then enters the
    same route_token! decode loop the text-only streamer uses.

The dispatch happens at the caller, chat_completion_stream_with
(grep for it; it's the streaming analogue of chat_completion,
around candle.rs:1879). Use the same vision_route detection pattern
as Stage B5: when images are present, call the new
stream_inference_with_images_via_worker; otherwise call the existing
stream_inference_via_worker. The text-only path stays untouched.
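
A rough sketch of that dispatch, under stated assumptions: VisionRoute is modelled as a two-variant enum, ImageInput has a made-up field, and the two streamer functions are stubs that only report which prefill they would run. None of these shapes are taken from candle.rs.

```rust
// Hypothetical stand-in; the real ImageInput carries preprocessed pixel data.
#[derive(Debug, Clone, PartialEq)]
struct ImageInput {
    pixels: Vec<f32>,
}

// Hypothetical shape for the Stage B5 routing decision.
enum VisionRoute {
    TextOnly,
    WithImages { images: Vec<ImageInput>, image_token_id: u32 },
}

// What each stub reports, so the dispatch can be checked without a worker.
#[derive(Debug, PartialEq)]
enum Prefill {
    Chunked,
    SingleShotWithImages { image_count: usize },
}

// Stub for the existing text-only streamer (chunked prefill, then decode).
fn stream_text_only(_prompt_tokens: &[u32]) -> Prefill {
    Prefill::Chunked
}

// Stub for the new sibling: single-shot prefill with images, then decode.
fn stream_with_images(_expanded: &[u32], images: Vec<ImageInput>, _image_token_id: u32) -> Prefill {
    Prefill::SingleShotWithImages { image_count: images.len() }
}

/// Matching on `&route` only yields references, so the images are cloned
/// before being handed to the sibling by value. That is why ImageInput
/// needs to derive Clone.
fn dispatch_stream(tokens: &[u32], route: &VisionRoute) -> Prefill {
    match route {
        VisionRoute::TextOnly => stream_text_only(tokens),
        VisionRoute::WithImages { images, image_token_id } => {
            stream_with_images(tokens, images.clone(), *image_token_id)
        }
    }
}
```

Taking the route by value instead of by reference would avoid the clone, at the cost of consuming the route at the call site.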

Canonical references in-repo

The issue body links ~/.claude/plans/foamy-twirling-catmull.md,
which is in the original author's home directory and may not be
readable from a different invocation. The canonical in-repo source
is doc/vision-qwen3_6-spec.md, which covers architecture, tensor
shapes, chat-template image-token insertion, the MRoPE gap, and the
fixed-resolution preprocess choice. Read that first.

The Stage A and Stage B commits are the implementation precedent:

  • Stage A0–A7: 7df84fe (vision tower load + preprocessor)
  • Stage B1–B8: 24968e9 (LM splice + non-streaming chat)
  • Stage B fixup: 577781d (Clone derive on ImageInput for the cuda
    branch — same gotcha exists for the streaming path if you reuse
    the &vision_route pattern, so derive Clone or move out of the
    ref by value)

git log --oneline -20 crates/neuron/src/harness/ is the fastest
way to orient.

Order I'd tackle them

  1. C2 first: the Responses translator change is probably tiny
    (or already correct — Stage B5 routes via chat_completion and
    the Responses handler calls chat_completion, so vision likely
    works through Responses today; the work is mostly proving it
    with a test). Cheap warm-up that exercises the codebase.
  2. C3 second: pure data plumbing through ModelEntry →
    CortexModelEntry → list_models handler. No tricky tensor
    work. Touches cortex-gateway tests; mock-friendly.
  3. C1 last: the real work. Mirror Stage B's pattern, then run
    the manual SSE curl on beast.

CI gate

Same as every prior stage:

cargo fmt --check --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace

Plus the cuda type-check job in Gitea Actions will catch
feature-gated breakage that the local CPU build misses (as it did
for 24968e9 → fixup 577781d). If you add a &vision_route
match in the streaming dispatch, double-check ImageInput.clone()
works — Clone is derived now, so it should, but the cuda branch
is the one CI catches you on.

Reference: helexa/cortex#16