Image content (image_url) is dropped — multimodal chat requests are processed as text-only #3


grenade · 2026-05-31T06:30:38Z

grenade commented

2026-05-31 06:30:38 +00:00

Summary

When sending an OpenAI-compatible chat/completions request that includes an image as a content part (type: "image_url" with a data:image/jpeg;base64,... URL), the image is not ingested. The request is processed as if it were text-only: the model never receives the pixels, and the prompt token count reflects only the text. Downstream vision consumers therefore get descriptions of "nothing" or refusals like "I cannot see an image."

Endpoint / model

Endpoint: http://hanzalova.internal:31313/v1/chat/completions
Model: Qwen/Qwen3.6-27B (reported by /v1/models as loaded on node beast)

Reproduction

POST a standard multimodal request with one small JPEG (here a 320×240 frame, ~6.9 KB → ~9 KB base64):

{
  "model": "Qwen/Qwen3.6-27B",
  "messages": [
    {"role": "system", "content": "Respond with ONLY a JSON object: {\"characters\":[string],\"activity\":string,\"description\":string}."},
    {"role": "user", "content": [
      {"type": "text", "text": "Analyze this frame and return the JSON object."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<~9KB of base64>"}}
    ]}
  ],
  "max_tokens": 400,
  "response_format": {"type": "json_object"}
}
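For anyone wanting to rebuild the repro from a frame on disk, the payload above could be assembled roughly like this. It is a minimal sketch: `build_vision_request` is a hypothetical helper name, not part of any client library, and the placeholder bytes in the usage line stand in for a real JPEG.

```python
import base64


def build_vision_request(jpeg_bytes: bytes, model: str = "Qwen/Qwen3.6-27B") -> dict:
    """Build a chat/completions payload with one text part and one image_url part.

    Hypothetical helper for illustration; it mirrors the request shown in this issue.
    """
    # Base64 grows the payload by about 4/3, which is why ~6.9 KB becomes ~9 KB.
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    data_url = f"data:image/jpeg;base64,{b64}"
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": (
                    "Respond with ONLY a JSON object: "
                    '{"characters":[string],"activity":string,"description":string}.'
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this frame and return the JSON object."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            },
        ],
        "max_tokens": 400,
        "response_format": {"type": "json_object"},
    }


# Usage: placeholder bytes only; read a real frame with open(path, "rb").read().
payload = build_vision_request(b"\xff\xd8\xff\xe0 placeholder, not a real JPEG")
```

The resulting dict can be POSTed as JSON to the endpoint listed above with any HTTP client.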

Observed

Response (abridged):

{
  "model": "Qwen/Qwen3.6-27B",
  "choices": [{"message": {"role": "assistant",
    "content": "<think>\n...I cannot see an image here... I don't have access to the visual input here...</think>"},
    "finish_reason": "length"}],
  "usage": {"prompt_tokens": 62, "completion_tokens": 400, "total_tokens": 462}
}

Two tells that the image was discarded:

1. prompt_tokens: 62. A genuine vision ingest of even a small frame adds hundreds to thousands of tokens (image patches). 62 tokens accounts for the text messages alone; the image_url part contributed nothing.
2. The model explicitly states it has no visual input ("I cannot see an image", "I don't have access to the visual input").
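A consumer could turn these two tells into a cheap guard. The sketch below is a heuristic only: `image_probably_dropped` and `NO_IMAGE_PHRASES` are hypothetical names, the phrase list contains just the two refusals quoted in this issue, and the text-only baseline is assumed to come from sending the same request without the image part.

```python
# Refusal phrases quoted in this issue; a real consumer would likely need more.
NO_IMAGE_PHRASES = (
    "cannot see an image",
    "don't have access to the visual input",
)


def image_probably_dropped(response: dict, text_only_prompt_tokens: int) -> bool:
    """Heuristically flag a response whose image part was never ingested.

    text_only_prompt_tokens is the prompt token count measured for the same
    request with the image_url part removed (an assumed, caller-supplied baseline).
    """
    usage = response.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    content = response["choices"][0]["message"].get("content") or ""

    # Tell 1: the image contributed no prompt tokens beyond the text baseline.
    token_tell = prompt_tokens <= text_only_prompt_tokens
    # Tell 2: the model says outright that it has no visual input.
    phrase_tell = any(phrase in content.lower() for phrase in NO_IMAGE_PHRASES)
    return token_tell or phrase_tell
```

Either tell alone trips the check, since a model may describe "nothing" without using one of the listed phrases.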

The same behaviour shows up through a real consumer: a request per scene returns no usable description because the model is reasoning about an absent image.

Expected

The image_url content part should be forwarded to (and ingested by) the backend so that:

usage.prompt_tokens reflects the image tokens, and
the model actually conditions on the image.

If Qwen/Qwen3.6-27B as served is genuinely not multimodal, the router should ideally surface that (e.g. reject image content for non-vision models, or advertise vision capability in /v1/models) rather than silently dropping the image and returning a confident text-only answer.
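To make the "reject rather than silently drop" option concrete, here is a small Python model of the check. It is not the router's code: `has_image_parts`, `check_request`, the `model_supports_vision` flag, and the error shape (borrowed from the usual OpenAI-style error body) are all assumptions for illustration.

```python
from typing import Optional


def has_image_parts(messages: list) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        # Plain string content cannot contain image parts; only arrays can.
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def check_request(request: dict, model_supports_vision: bool) -> Optional[dict]:
    """Return None if the request is acceptable, else an error body.

    model_supports_vision is an assumed per-model capability flag.
    """
    if has_image_parts(request.get("messages", [])) and not model_supports_vision:
        return {
            "error": {
                "message": (
                    f"model {request.get('model')!r} does not accept "
                    "image_url content parts"
                ),
                "type": "invalid_request_error",
            }
        }
    return None
```

With a check like this in front of the backend, a vision consumer would get an explicit error instead of a confident text-only answer.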

Where it might live

Unclear whether cortex/helexa strips the structured content array when forwarding to the backend, or whether the backend engine for this model ignores image parts. The prompt_tokens: 62 figure suggests the image never reached a tokenizer that understands it. Determining the layer is part of triage.

Impact

Any vision consumer of the unified endpoint (e.g. a video scene-describer sending sampled frames) cannot function — it receives text-only hallucinations instead of image-grounded output, with no error to signal the dropped modality.

Aside (not a cortex bug, noted for context)

The served model is a reasoning model: it emits a <think>…</think> block and, at max_tokens: 400, exhausted the budget thinking before producing any answer (finish_reason: length). Consumers will want to disable thinking (/no_think / chat_template_kwargs: {enable_thinking: false}) and raise max_tokens — but that is orthogonal to the image-ingest problem above.
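On the consumer side, that adjustment might look like the sketch below. `without_thinking` is a hypothetical helper, and the default of 1024 for `max_tokens` is an arbitrary illustrative value, not a recommendation from this issue.

```python
import copy


def without_thinking(payload: dict, max_tokens: int = 1024) -> dict:
    """Return a copy of the payload with thinking disabled and a larger budget.

    The default max_tokens is an assumed value chosen only for illustration.
    """
    patched = copy.deepcopy(payload)  # leave the caller's payload untouched
    patched["chat_template_kwargs"] = {"enable_thinking": False}
    # Never lower a budget the caller already set higher.
    patched["max_tokens"] = max(max_tokens, payload.get("max_tokens", 0))
    return patched
```

This only affects the answer budget; it does nothing about whether the image is ingested.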


grenade commented

2026-05-31 06:49:41 +00:00

Versions observed against

Pinning down the deployed versions when this was first reported, so future debugging can establish whether intervening changes affect the repro:

cortex 0.1.16-0.1.20260527185748.git249b2e5.fc43 on hanzalova
helexa-neuron-blackwell 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on beast (commit df0abfe)
helexa-neuron-ada 0.1.16-0.1.20260529094300.gitdf0abfe.fc43 on benjy
helexa-neuron-ampere 0.1.16-0.1.20260527185748.git249b2e5.fc43 on quadbrat (note: older build, since upgraded to gitdf0abfe)

For helexa-acp callers: this bug predates the InferenceEvent refactor on main (commit 302ccfb, pushed 2026-05-29) and is unrelated to it — the refactor only touched the streaming output path, not the request-parsing / image-ingest path. The next deploy off a build that includes 302ccfb (or later) should reproduce identically until we actually wire image ingest through to the candle harness.

Relevant code paths to inspect during triage:

crates/cortex-gateway/src/handlers.rs::chat_completions — does the gateway preserve content arrays verbatim when proxying, or flatten them?
crates/neuron/src/harness/candle.rs::format_qwen3_prompt — currently does MessageContent::Parts(parts) => parts.iter().filter(text-only).join(...), which would silently drop image_url parts.
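To show why a text-only filter would produce exactly the symptoms above, here is a Python model of that flattening step. It is not the Rust code in `candle.rs`: the function names are hypothetical, and the join separator is elided in the snippet quoted above, so the newline default here is purely an assumption.

```python
def flatten_text_only(content, separator="\n"):
    """Model of a prompt formatter that keeps only text parts.

    The separator is an assumption; the quoted Rust snippet elides it.
    """
    if isinstance(content, str):
        return content
    # Every non-text part (including image_url) is discarded without error.
    return separator.join(p["text"] for p in content if p.get("type") == "text")


def dropped_part_types(content):
    """List the part types that a text-only flatten would silently lose."""
    if isinstance(content, str):
        return []
    return [p.get("type") for p in content if p.get("type") != "text"]


# Usage with the user message from the reproduction request.
user_content = [
    {"type": "text", "text": "Analyze this frame and return the JSON object."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
]
prompt_text = flatten_text_only(user_content)
lost = dropped_part_types(user_content)
```

A helper like `dropped_part_types` could also back a warning log or a hard error during triage, so the drop is at least visible.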


grenade referenced this issue

2026-06-01 13:18:04 +00:00

Vision: tensor-parallel implementation (Stage E) #12

grenade referenced this issue

2026-06-01 13:18:18 +00:00

Vision: deploy on Qwen3.6-27B (production validation) #13

grenade referenced this issue

2026-06-01 13:18:32 +00:00

Vision: dynamic image resolution (Qwen-VL min/max pixels) #14

grenade referenced this issue

2026-06-01 13:18:45 +00:00

Vision: numerical validation against transformers reference #15

grenade referenced this issue from a commit

2026-06-02 08:40:50 +00:00

feat(neuron): Stage A — vision tower load + preprocessor for Qwen3.6

grenade referenced this issue from a commit

2026-06-02 12:33:04 +00:00

feat(neuron): Stage B — end-to-end text+image chat for Qwen3.6

grenade referenced this issue

2026-06-04 10:36:01 +00:00

Vision Stage C: streaming SSE + Responses API + cortex-gateway capability propagation #16

grenade commented

2026-06-04 19:33:10 +00:00

Verified fixed from the consumer side after the neuron-harness vision work. Same endpoint/model (http://hanzalova.internal:31313/v1, Qwen/Qwen3.6-27B):

The identical reproduction request now returns prompt_tokens: 225 (was 62) — the image_url part is ingested.
The model accurately describes the actual frame contents (e.g. "standard television color bar test pattern… cyan, magenta, yellow, green, blue, red… white digital '0'" for an ffmpeg testsrc frame).

A full downstream run (a video scene-describer sending sampled frames) now produces correct per-scene, image-grounded descriptions end-to-end. 🎉

Re the aside (reasoning model emitting <think> and exhausting max_tokens): handled consumer-side by sending chat_template_kwargs: {enable_thinking: false} and raising max_tokens — the harness honours enable_thinking: false correctly.

Thanks for the quick turnaround. Closing is at your discretion.



Reference: helexa/cortex#3
