feat(helexa-acp): image input for vision-capable models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m2s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m27s
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s

Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.

- New MessageContent::MultiPart variant + MessagePart (Text|Image)
  + ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to
  Text when every block is text (some upstreams treat array-form
  as vision-only and refuse on text-only models), otherwise
  produces MultiPart preserving block order.
- OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url",
  image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user
  messages. Data URIs are used over remote `uri` because they
  round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually
  sends image blocks.
- compaction estimates ~512 tokens per image (the middle of the
  Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
  pretend images are free.
- session/load replays image-bearing user turns by surfacing the
  text parts verbatim and rendering each image as a "[image: {mime}
  ({n} bytes)]" placeholder chunk — Zed can show the prior text
  context even though re-uploading the bytes through ACP isn't
  meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
  prompts still flatten to MultiPart, encoder emits the correct
  array shape, text-only encoding stays as the string form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-29 09:43:00 +03:00
parent b9016571f6
commit df0abfe4d4
4 changed files with 328 additions and 31 deletions

View File

@@ -32,7 +32,7 @@
//! (over-estimates tokens slightly) so we compact a touch early
//! rather than a touch late.
use crate::provider::{Message, MessageContent, Role};
use crate::provider::{Message, MessageContent, MessagePart, Role};
/// Most-recent N messages that are never elided. Roughly "the
/// current tool round in flight" — assistant turn that called the
@@ -54,6 +54,13 @@ const CHARS_PER_TOKEN: f32 = 3.5;
/// to a few tokens; tiny but it adds up across long histories.
const ENVELOPE_TOKENS: usize = 8;
/// Rough per-image token cost used by the budget estimator. Real
/// vision tokenizers vary widely (2561024 tokens for typical
/// resolutions on Qwen3-VL, OpenAI's `low`/`high` detail toggles
/// pick between ~85 and ~1000+). 512 is a defensible middle that
/// keeps compaction from treating images as free.
const IMAGE_TOKENS_APPROX: usize = 512;
/// Stats reported back from [`compact_to_budget`] for the caller to
/// log. The numbers are estimates (see [`estimate_tokens`]), so
/// don't compare them to upstream-reported token counts as if they
@@ -87,6 +94,19 @@ impl CompactionStats {
pub fn estimate_tokens(msg: &Message) -> usize {
let chars = match &msg.content {
MessageContent::Text { text } => text.len(),
MessageContent::MultiPart { parts } => parts
.iter()
.map(|p| match p {
MessagePart::Text { text } => text.len(),
// Each image is one block in the context window; the
// upstream tokenizer handles the real cost (and it
// varies wildly by model — Qwen3-VL uses ~256-1024
// tokens per image depending on size). Take a
// middle estimate so the budget tracker doesn't
// pretend images are free.
MessagePart::Image(_) => IMAGE_TOKENS_APPROX * CHARS_PER_TOKEN as usize,
})
.sum(),
MessageContent::ToolCalls { text, calls } => {
let txt = text.as_deref().map(|s| s.len()).unwrap_or(0);
let calls_size: usize = calls
@@ -206,6 +226,15 @@ fn elide_in_place(msg: &mut Message) -> bool {
*text = format!("(elided: {} bytes of assistant prose)", text.len());
true
}
MessageContent::MultiPart { .. } => {
// MultiPart messages today only exist as User turns,
// and User turns are protected by the role check in
// `compact_to_budget` — so this branch is unreachable
// for current call sites. Returning false keeps the
// unreachable path benign if a future stage starts
// emitting MultiPart on other roles.
false
}
}
}