Image content (image_url) is dropped — multimodal chat requests are processed as text-only
#3
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
When sending an OpenAI-compatible
chat/completionsrequest that includes an image as acontentpart (type: "image_url"with adata:image/jpeg;base64,...URL), the image is not ingested. The request is processed as if it were text-only: the model never receives the pixels, and the prompt token count reflects only the text. Downstream vision consumers therefore get descriptions of "nothing" or refusals like "I cannot see an image."Endpoint / model
http://hanzalova.internal:31313/v1/chat/completionsQwen/Qwen3.6-27B(reported by/v1/modelsasloadedon nodebeast)Reproduction
POST a standard multimodal request with one small JPEG (here a 320×240 frame, ~6.9 KB → ~9 KB base64):
Observed
Response (abridged):
Two tells that the image was discarded:
prompt_tokens: 62. A genuine vision ingest of even a small frame adds hundreds-to-thousands of tokens (image patches). 62 tokens accounts for the text messages alone — theimage_urlpart contributed nothing.The same behaviour shows up through a real consumer: a request per scene returns no usable description because the model is reasoning about an absent image.
Expected
The
image_urlcontent part should be forwarded to (and ingested by) the backend so that:usage.prompt_tokensreflects the image tokens, andIf
Qwen/Qwen3.6-27Bas served is genuinely not multimodal, the router should ideally surface that (e.g. reject image content for non-vision models, or advertise vision capability in/v1/models) rather than silently dropping the image and returning a confident text-only answer.Where it might live
Unclear whether
cortex/helexa strips the structuredcontentarray when forwarding to the backend, or whether the backend engine for this model ignores image parts. Theprompt_tokens: 62figure suggests the image never reached a tokenizer that understands it. Determining the layer is part of triage.Impact
Any vision consumer of the unified endpoint (e.g. a video scene-describer sending sampled frames) cannot function — it receives text-only hallucinations instead of image-grounded output, with no error to signal the dropped modality.
Aside (not a cortex bug, noted for context)
The served model is a reasoning model: it emits a
<think>…</think>block and, atmax_tokens: 400, exhausted the budget thinking before producing any answer (finish_reason: length). Consumers will want to disable thinking (/no_think/chat_template_kwargs: {enable_thinking: false}) and raisemax_tokens— but that is orthogonal to the image-ingest problem above.Versions observed against
Pinning down the deployed versions when this was first reported, so future debugging can establish whether intervening changes affect the repro:
0.1.16-0.1.20260527185748.git249b2e5.fc43onhanzalova0.1.16-0.1.20260529094300.gitdf0abfe.fc43onbeast(commitdf0abfe)0.1.16-0.1.20260529094300.gitdf0abfe.fc43onbenjy0.1.16-0.1.20260527185748.git249b2e5.fc43onquadbrat(note: older build, since upgraded togitdf0abfe)For helexa-acp callers: this bug predates the InferenceEvent refactor on
main(commit302ccfb, pushed 2026-05-29) and is unrelated to it — the refactor only touched the streaming output path, not the request-parsing / image-ingest path. The next deploy off a build that includes302ccfb(or later) should reproduce identically until we actually wire image ingest through to the candle harness.Relevant code paths to inspect during triage:
crates/cortex-gateway/src/handlers.rs::chat_completions— does the gateway preservecontentarrays verbatim when proxying, or flatten them?crates/neuron/src/harness/candle.rs::format_qwen3_prompt— currently doesMessageContent::Parts(parts) => parts.iter().filter(text-only).join(...), which would silently dropimage_urlparts.Verified fixed from the consumer side after the neuron-harness vision work. Same endpoint/model (
http://hanzalova.internal:31313/v1,Qwen/Qwen3.6-27B):prompt_tokens: 225(was 62) — theimage_urlpart is ingested.testsrcframe).A full downstream run (a video scene-describer sending sampled frames) now produces correct per-scene, image-grounded descriptions end-to-end. 🎉
Re the aside (reasoning model emitting
<think>and exhaustingmax_tokens): handled consumer-side by sendingchat_template_kwargs: {enable_thinking: false}and raisingmax_tokens— the harness honoursenable_thinking: falsecorrectly.Thanks for the quick turnaround. Closing is at your discretion.