feat(helexa-acp): image input for vision-capable models

Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.

- New MessageContent::MultiPart variant + MessagePart (Text|Image) +
  ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to Text
  when every block is text (some upstreams treat array-form as
  vision-only and refuse on text-only models), otherwise produces
  MultiPart preserving block order.
- OpenAI encoder emits
  `[{type:"text",text:…}, {type:"image_url", image_url:{url:"data:{mime};base64,{data}"}}]`
  for MultiPart user messages. Data URIs are used over remote `uri`
  because they round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually sends
  image blocks.
- compaction estimates ~512 tokens per image (the middle of the
  Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
  pretend images are free.
- session/load replays image-bearing user turns by surfacing the text
  parts verbatim and rendering each image as a
  "[image: {mime} ({n} bytes)]" placeholder chunk; Zed can show the
  prior text context even though re-uploading the bytes through ACP
  isn't meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
  prompts still flatten to MultiPart, encoder emits the correct array
  shape, text-only encoding stays as the string form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
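For readers skimming the message, the collapse rule and the data-URI shape can be sketched roughly as follows. This is an illustration, not the code in this commit: the type and field names follow the commit message, but `collapse_parts`, `data_url`, and the newline separator used when joining text blocks are assumptions made for the example.

```rust
// Hypothetical stand-ins for the crate's types. Field names follow the
// commit message; everything else is assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct ImageData {
    pub mime_type: String,
    pub data: String, // base64-encoded image bytes
    pub uri: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum MessagePart {
    Text { text: String },
    Image(ImageData),
}

#[derive(Debug, Clone, PartialEq)]
pub enum MessageContent {
    Text { text: String },
    MultiPart { parts: Vec<MessagePart> },
}

/// Collapse to `Text` when every part is text; otherwise keep the parts
/// in their original order. The "\n" separator is an assumption.
pub fn collapse_parts(parts: Vec<MessagePart>) -> MessageContent {
    let all_text = parts
        .iter()
        .all(|p| matches!(p, MessagePart::Text { .. }));
    if all_text {
        let text = parts
            .into_iter()
            .filter_map(|p| match p {
                MessagePart::Text { text } => Some(text),
                MessagePart::Image(_) => None,
            })
            .collect::<Vec<_>>()
            .join("\n");
        MessageContent::Text { text }
    } else {
        MessageContent::MultiPart { parts }
    }
}

/// Build the `data:{mime};base64,{data}` URL placed in `image_url.url`.
pub fn data_url(img: &ImageData) -> String {
    format!("data:{};base64,{}", img.mime_type, img.data)
}
```

Under these assumptions a prompt made only of text blocks stays in the plain string form, while a prompt containing at least one image (including an image-only prompt) keeps the array form.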
2026-05-29 09:43:00 +03:00
parent b9016571f6
commit df0abfe4d4
4 changed files with 328 additions and 31 deletions
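A note on the per-image estimate arithmetic: in Rust, `as` binds tighter than `*`, so where the cast sits changes the character-equivalent a 512-token image contributes. The sketch below reuses the two constant values from compaction.rs; the function names are hypothetical, and `chars_to_tokens` assumes the estimator converts characters back to tokens by dividing by `CHARS_PER_TOKEN`, which this diff does not show.

```rust
const CHARS_PER_TOKEN: f32 = 3.5;
const IMAGE_TOKENS_APPROX: usize = 512;

/// Cast applied to the ratio first: 3.5 truncates to 3 before multiplying.
pub fn image_chars_truncating() -> usize {
    IMAGE_TOKENS_APPROX * CHARS_PER_TOKEN as usize
}

/// Multiply in floating point, truncate once at the end.
pub fn image_chars_float() -> usize {
    (IMAGE_TOKENS_APPROX as f32 * CHARS_PER_TOKEN) as usize
}

/// Assumed conversion back to tokens: chars / ratio, rounded up.
pub fn chars_to_tokens(chars: usize) -> usize {
    (chars as f32 / CHARS_PER_TOKEN).ceil() as usize
}
```

With these definitions the truncating form yields 1536 characters (about 439 tokens after conversion), while the floating-point form yields 1792 characters (512 tokens).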
--- a/crates/helexa-acp/src/compaction.rs
+++ b/crates/helexa-acp/src/compaction.rs
@@ -32,7 +32,7 @@
 //! (over-estimates tokens slightly) so we compact a touch early
 //! rather than a touch late.

-use crate::provider::{Message, MessageContent, Role};
+use crate::provider::{Message, MessageContent, MessagePart, Role};

 /// Most-recent N messages that are never elided. Roughly "the
 /// current tool round in flight" — assistant turn that called the
@@ -54,6 +54,13 @@ const CHARS_PER_TOKEN: f32 = 3.5;
 /// to a few tokens; tiny but it adds up across long histories.
 const ENVELOPE_TOKENS: usize = 8;

+/// Rough per-image token cost used by the budget estimator. Real
+/// vision tokenizers vary widely (256–1024 tokens for typical
+/// resolutions on Qwen3-VL, OpenAI's `low`/`high` detail toggles
+/// pick between ~85 and ~1000+). 512 is a defensible middle that
+/// keeps compaction from treating images as free.
+const IMAGE_TOKENS_APPROX: usize = 512;
+
 /// Stats reported back from [`compact_to_budget`] for the caller to
 /// log. The numbers are estimates (see [`estimate_tokens`]), so
 /// don't compare them to upstream-reported token counts as if they
@@ -87,6 +94,19 @@ impl CompactionStats {
 pub fn estimate_tokens(msg: &Message) -> usize {
    let chars = match &msg.content {
        MessageContent::Text { text } => text.len(),
+        MessageContent::MultiPart { parts } => parts
+            .iter()
+            .map(|p| match p {
+                MessagePart::Text { text } => text.len(),
+                // Each image is one block in the context window; the
+                // upstream tokenizer handles the real cost (and it
+                // varies wildly by model — Qwen3-VL uses ~256-1024
+                // tokens per image depending on size). Take a
+                // middle estimate so the budget tracker doesn't
+                // pretend images are free.
+                MessagePart::Image(_) => (IMAGE_TOKENS_APPROX as f32 * CHARS_PER_TOKEN) as usize,
+            })
+            .sum(),
        MessageContent::ToolCalls { text, calls } => {
            let txt = text.as_deref().map(|s| s.len()).unwrap_or(0);
            let calls_size: usize = calls
@@ -206,6 +226,15 @@ fn elide_in_place(msg: &mut Message) -> bool {
            *text = format!("(elided: {} bytes of assistant prose)", text.len());
            true
        }
+        MessageContent::MultiPart { .. } => {
+            // MultiPart messages today only exist as User turns,
+            // and User turns are protected by the role check in
+            // `compact_to_budget` — so this branch is unreachable
+            // for current call sites. Returning false keeps the
+            // unreachable path benign if a future stage starts
+            // emitting MultiPart on other roles.
+            false
+        }
    }
 }