Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16
Context
Stage A (7df84fe) shipped the vision tower load + preprocessor. Stage B
(24968e9, fixup 577781d) shipped end-to-end non-streaming chat completion
with the <|image_pad|> splice, the vision_unsupported rejection
(InferenceError::VisionUnsupported → HTTP 400), and capabilities in
neuron's /v1/models. Closes the silent-drop pattern of #3 for
non-streaming clients.
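A minimal sketch of the rejection contract described above, for orientation. Only InferenceError::VisionUnsupported and its HTTP 400 mapping come from this issue; the Other variant, its 500 status, the model field, and the ensure_vision_supported helper are hypothetical stand-ins and may not match the real code.

```rust
/// Stand-in error type. Only `VisionUnsupported` is named in the issue;
/// the `Other` variant and the `model` field are illustrative assumptions.
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    VisionUnsupported { model: String },
    Other(String),
}

/// Map an error to an HTTP status. The 400 for `VisionUnsupported` is from
/// the issue; the 500 fallback is an assumption.
pub fn http_status(err: &InferenceError) -> u16 {
    match err {
        InferenceError::VisionUnsupported { .. } => 400,
        InferenceError::Other(_) => 500,
    }
}

/// Hypothetical guard: reject image-bearing requests when the model does not
/// advertise the "vision" capability, instead of silently dropping the images.
pub fn ensure_vision_supported(
    model: &str,
    capabilities: &[&str],
    request_has_images: bool,
) -> Result<(), InferenceError> {
    if request_has_images && !capabilities.contains(&"vision") {
        return Err(InferenceError::VisionUnsupported {
            model: model.to_string(),
        });
    }
    Ok(())
}
```

The point of the guard is that a text-only model fails loudly on image input, while text-only requests pass through unchanged.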
Stage C completes the operator-visible feature so vision works for
every path the gateway exposes — streaming chat, the Responses
API, and capability propagation through cortex-gateway. Until this
lands, a client that requests
stream=true with an image-bearing prompt will silently degrade (the
streaming forward path doesn't splice image embeddings yet), and clients
consuming cortex-gateway's /v1/models won't see capabilities even though
neuron advertises them.
Plan reference: ~/.claude/plans/foamy-twirling-catmull.md, Stage C.
Scope
Three independent sub-tasks; each is a small commit.
C1 — Streaming chat completion with vision
File: crates/neuron/src/harness/candle.rs::stream_inference_via_worker
(around line 4051). Uses chunked_prefill_via_worker + the route_token!
macro for decode.
Today the streaming path follows the text-only contract:
1. Chunked prefill (chunked_prefill_via_worker)
2. Decode via route_token!
Vision-bearing requests need a single-shot prefill that splices
image embeddings (same pattern as Stage B's
run_inference_with_images_via_worker), then the existing decode
loop. Image embeddings are prefill-only; decode tokens have no
images so the splice never recurs.
Concrete work:
- Detect image-bearing requests on the stream_inference_via_worker path via the same request_has_images helper Stage B5 introduced.
- Preprocess via harness::preprocess::preprocess_data_uri (already Stage A2)
- Expand <|image_pad|> per expand_image_pad_tokens (Stage B4)
- Call worker.forward_logits_with_images(handle, expanded_tokens, 0, images, image_token_id) for prefill instead of chunked_prefill_via_worker
- Leave the decode loop unchanged (it continues from the image-conditioned hidden states from prefill)
- Reject image_url against non-vision models with the same VisionUnsupported error variant Stage B6 plumbed through.
Manual verification (after RPMs deploy on beast):
Expect a coherent SSE stream where the assistant references actual
image content, and
usage.prompt_tokens in the final chunk matches the Stage B non-streaming baseline (~text + 196 patch tokens at
fixed resolution).
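The C1 flow (one image-aware prefill, then the ordinary text-only decode loop) can be sketched as below. This is an illustration under assumptions, not the code in candle.rs: the Worker trait, ScriptedWorker, and stream_with_images are hypothetical, the real forward_logits_with_images also takes a handle, and the route_token! macro is replaced by a plain callback. For brevity each worker call returns the next sampled token rather than logits.

```rust
/// Hypothetical stand-in for a preprocessed image.
#[derive(Clone, Debug)]
pub struct ImageInput {
    pub patch_tokens: usize,
}

/// Simplified worker interface (an assumption, not the real worker API).
pub trait Worker {
    /// Single-shot prefill that splices image embeddings at `image_token_id`.
    fn forward_logits_with_images(
        &mut self,
        tokens: &[u32],
        pos: usize,
        images: &[ImageInput],
        image_token_id: u32,
    ) -> u32;
    /// One text-only decode step.
    fn forward_logits(&mut self, token: u32, pos: usize) -> u32;
}

/// Prefill once with images, then decode text-only until EOS or the budget
/// is reached. Returns the number of tokens emitted.
pub fn stream_with_images<W: Worker>(
    worker: &mut W,
    expanded_tokens: &[u32],
    images: &[ImageInput],
    image_token_id: u32,
    eos: u32,
    max_new_tokens: usize,
    mut emit: impl FnMut(u32),
) -> usize {
    // Images are consumed here and nowhere else.
    let mut next = worker.forward_logits_with_images(expanded_tokens, 0, images, image_token_id);
    let mut pos = expanded_tokens.len();
    let mut produced = 0;
    while produced < max_new_tokens && next != eos {
        emit(next);
        produced += 1;
        next = worker.forward_logits(next, pos);
        pos += 1;
    }
    produced
}

/// Scripted mock that replays fixed tokens, so the loop runs without weights.
pub struct ScriptedWorker {
    pub script: Vec<u32>,
    pub cursor: usize,
    pub eos: u32,
    pub image_prefills: usize,
}

impl ScriptedWorker {
    fn next_token(&mut self) -> u32 {
        let token = self.script.get(self.cursor).copied().unwrap_or(self.eos);
        self.cursor += 1;
        token
    }
}

impl Worker for ScriptedWorker {
    fn forward_logits_with_images(
        &mut self,
        _tokens: &[u32],
        _pos: usize,
        _images: &[ImageInput],
        _image_token_id: u32,
    ) -> u32 {
        self.image_prefills += 1;
        self.next_token()
    }

    fn forward_logits(&mut self, _token: u32, _pos: usize) -> u32 {
        self.next_token()
    }
}
```

A mock along these lines is also one way to drive the SSE path in a test without a GPU: assert that the image-aware prefill ran exactly once and that every later step went through the text-only call.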
C2 — Responses API with vision
Files:
- crates/neuron/src/wire/openai_responses.rs: request translator (Responses → ChatCompletionRequest)
- crates/neuron/src/api.rs::responses handler
The Responses API path translates a ResponsesRequest into a
ChatCompletionRequest and then drives chat_completion. Stage B5's
vision routing lives inside chat_completion, so vision should work
through Responses provided the translator preserves image content
parts.
Verify and (if needed) extend:
- ResponsesContentPart::InputImage { image_url } already exists in cortex-core::responses (Stage 2 of the Responses work). Confirm the request translator emits a MessageContent::Parts array containing {"type": "image_url", "image_url": {"url": "..."}} shaped entries when the input carries InputImage parts.
- Add a test in wire::openai_responses::tests that round-trips an InputImage through the translator and asserts the resulting ChatCompletionRequest has the right Parts shape.
- Confirm VisionUnsupported surfaces cleanly on the Responses path (it inherits from the shared chat_completion error handling, but is worth a test).
Manual verification:
C3 — cortex-gateway forwards capabilities through /v1/models
Files:
- crates/cortex-core/src/node.rs: ModelEntry, CortexModelEntry
- crates/cortex-gateway/src/poller.rs: reads neuron's Vec<ModelInfo> (which already carries capabilities per Stage B7)
- crates/cortex-gateway/src/handlers.rs::list_models: builds the /v1/models response from the poller-cached node.models + the catalogue × topology join
Today ModelEntry (the per-node cache) doesn't carry capabilities,
so the field is dropped at the poller→handlers boundary. Add it
end-to-end:
- ModelEntry { ..., capabilities: Vec<String> } (default empty).
- The poller populates it from ModelInfo.capabilities.
- CortexModelEntry { ..., capabilities: Vec<String> }.
- The list_models handler computes the union of capabilities across every node where the model is loaded. For catalogue-only entries with no loaded location, the catalogue profile doesn't currently declare capabilities; default to empty (or look it up from the catalogue once #3-adjacent catalogue-capability declaration lands, which is out of scope for C3).
- Update models.example.toml if the catalogue gains a capabilities key (optional follow-up).
Acceptance:
curl http://hanzalova.internal:31313/v1/models | jq shows
capabilities: ["text", "vision"] for Qwen3.6-27B once it's loaded somewhere.
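The union step in list_models could look roughly like this. It is a sketch under assumptions: ModelEntry here carries only the two fields relevant to the example, Node and union_capabilities are hypothetical names, and sorting the result alphabetically is a choice made here for deterministic output rather than something the issue specifies.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Trimmed stand-in for the per-node cache entry.
pub struct ModelEntry {
    pub id: String,
    /// Empty for nodes that do not report capabilities.
    pub capabilities: Vec<String>,
}

/// Hypothetical view of one polled node.
pub struct Node {
    pub models: Vec<ModelEntry>,
}

/// For each model id, collect the union of capabilities across all nodes
/// where it is loaded. Output is sorted and de-duplicated.
pub fn union_capabilities(nodes: &[Node]) -> BTreeMap<String, Vec<String>> {
    let mut acc: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for node in nodes {
        for model in &node.models {
            acc.entry(model.id.clone())
                .or_default()
                .extend(model.capabilities.iter().cloned());
        }
    }
    acc.into_iter()
        .map(|(id, caps)| (id, caps.into_iter().collect()))
        .collect()
}
```

Using a set per model means two nodes reporting ["text"] and ["vision", "text"] yield ["text", "vision"] once, and a model seen only on nodes with no reported capabilities still appears with an empty list.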
Tests
- Test under crates/neuron/tests/ that drives the SSE path with a mock vision-capable model (or gated on cuda-integration for the real-weights smoke).
- Test in wire::openai_responses::tests for the InputImage → MessageContent::Parts translation round-trip.
- Test in cortex-gateway that mocks two nodes with different capabilities arrays in their /models responses and asserts the gateway's /v1/models reflects the union.
Out of scope
follow-up. Stage C reuses Stage B's single-shot-prefill bound,
which is well under the activation-memory threshold at 448×448.
References
- Stage A: 7df84fe
- Stage B: 24968e9 (cuda fixup: 577781d)
- Plan: ~/.claude/plans/foamy-twirling-catmull.md Stage C
- Spec: doc/vision-qwen3_6-spec.md
Implementation notes for a fresh session
Two clarifications that compress the "where do I start?" loop:
C1: sibling function, not in-place extension
stream_inference_via_worker (around candle.rs:4051) currently
takes prompt_tokens: &[u32] plus sampling params: no request, no
images. The Stage B precedent is to add a sibling function
rather than thread image params through the existing signature:
- Stage B added run_inference_with_images_via_worker next to run_inference_via_worker (both in candle.rs, both cuda-gated).
- Stage C should add stream_inference_with_images_via_worker next to stream_inference_via_worker. Same shape: takes expanded_tokens + Vec<ImageInput> + image_token_id plus the existing streaming params, does a single-shot worker.forward_logits_with_images(...) prefill, then enters the same route_token! decode loop the text-only streamer uses.
The dispatch happens at the caller, chat_completion_stream_with
(grep for it; it's the streaming analogue of chat_completion,
around candle.rs:1879). Same vision_route detection pattern as
Stage B5: when images are present, call the new
stream_inference_with_images_via_worker; otherwise call the existing
stream_inference_via_worker. The text-only path stays untouched.
Canonical references in-repo
The issue body links ~/.claude/plans/foamy-twirling-catmull.md,
which is in the original author's home directory and may not be
readable from a different invocation. The canonical in-repo source
is doc/vision-qwen3_6-spec.md; it covers architecture, tensor
shapes, chat-template image-token insertion, MRoPE gap, and the
fixed-resolution preprocess choice. Read that first.
The Stage A and Stage B commits are the implementation precedent:
- 7df84fe (vision tower load + preprocessor)
- 24968e9 (LM splice + non-streaming chat)
- 577781d (Clone derive on ImageInput for the cuda branch; the same gotcha exists for the streaming path if you reuse the &vision_route pattern, so derive Clone or move out of the ref by value)
git log --oneline -20 crates/neuron/src/harness/ is the fastest
way to orient.
Order I'd tackle them
1. C2: possibly already correct. Stage B5 routes via chat_completion and the Responses handler calls chat_completion, so vision likely works through Responses today; the work is mostly proving it with a test. Cheap warm-up that exercises the codebase.
2. C3: ModelEntry → CortexModelEntry → list_models handler. No tricky tensor work. Touches cortex-gateway tests; mock-friendly.
3. C1, ending with the manual SSE curl on beast.
CI gate
Same as every prior stage:
Plus the cuda type-check job in Gitea Actions will catch
feature-gated breakage that the local CPU build misses (as it did
for 24968e9 → fixup 577781d). If you add a &vision_route
match in the streaming dispatch, double-check ImageInput.clone()
works. Clone is derived now, so it should, but the cuda branch
is the one CI catches you on.
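To make the borrow concrete, here is a small sketch of a caller-side dispatch that matches on a borrowed vision route. Everything in it is a hypothetical stand-in: VisionRoute's fields, the StreamCall enum, and plan_stream_call are invented for illustration, and the real chat_completion_stream_with will look different. The part that carries over is that matching through a reference yields borrowed images, so handing an owned Vec<ImageInput> to the streamer needs Clone.

```rust
/// Stand-in image type. Without `Clone`, the `.clone()` below fails to compile.
#[derive(Clone, Debug, PartialEq)]
pub struct ImageInput {
    pub data_uri: String,
}

/// Hypothetical shape of the detected vision route.
#[derive(Debug)]
pub struct VisionRoute {
    pub images: Vec<ImageInput>,
    pub image_token_id: u32,
}

/// Which streamer the caller would invoke, with the arguments it would pass.
#[derive(Debug, PartialEq)]
pub enum StreamCall {
    TextOnly {
        prompt_len: usize,
    },
    WithImages {
        prompt_len: usize,
        images: Vec<ImageInput>,
        image_token_id: u32,
    },
}

/// Choose the streaming path. The text-only arm is untouched by vision.
pub fn plan_stream_call(prompt_tokens: &[u32], vision_route: &Option<VisionRoute>) -> StreamCall {
    match vision_route {
        Some(route) => StreamCall::WithImages {
            prompt_len: prompt_tokens.len(),
            // `route` is `&VisionRoute` here, so owned images require a clone.
            images: route.images.clone(),
            image_token_id: route.image_token_id,
        },
        None => StreamCall::TextOnly {
            prompt_len: prompt_tokens.len(),
        },
    }
}
```

The alternative mentioned earlier, taking the route by value and moving the images out, avoids the copy but consumes the route, which only works if nothing after the dispatch still needs it.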