feat(helexa-acp): image input for vision-capable models
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m2s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m27s
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
All checks were successful
build-prerelease / Resolve version stamps (push) Successful in 34s
CI / Format (push) Successful in 37s
CI / Clippy (push) Successful in 2m33s
CI / Test (push) Successful in 5m4s
CI / Build cortex SRPM (push) Has been skipped
CI / Build neuron SRPM (push) Has been skipped
CI / Publish cortex to COPR (push) Has been skipped
CI / Publish neuron to COPR (push) Has been skipped
CI / Bump version in source (push) Has been skipped
build-prerelease / Build neuron-blackwell (push) Successful in 6m2s
build-prerelease / Build neuron-ampere (push) Successful in 7m49s
build-prerelease / Build neuron-ada (push) Successful in 5m27s
build-prerelease / Build cortex binary (push) Successful in 4m16s
build-prerelease / Package cortex RPM (push) Successful in 1m19s
build-prerelease / Package helexa-neuron-ampere RPM (push) Successful in 3m2s
build-prerelease / Package helexa-neuron-ada RPM (push) Successful in 3m10s
build-prerelease / Package helexa-neuron-blackwell RPM (push) Successful in 3m47s
build-prerelease / Publish to rpm.lair.cafe (unstable) (push) Successful in 1m2s
Stage 5. Zed clipboard/DnD images get forwarded as OpenAI
content-array messages on user turns.
- New MessageContent::MultiPart variant + MessagePart (Text|Image)
+ ImageData struct (mime_type, base64 data, optional uri).
- flatten_prompt now produces structured content: collapses to
Text when every block is text (some upstreams treat array-form
as vision-only and refuse on text-only models), otherwise
produces MultiPart preserving block order.
- OpenAI encoder emits `[{type:"text",text:…}, {type:"image_url",
image_url:{url:"data:{mime};base64,{data}"}}]` for MultiPart user
messages. Data URIs are used over remote `uri` because they
round-trip through every upstream we care about.
- prompt_capabilities.image = true at initialize so Zed actually
sends image blocks.
- compaction estimates ~512 tokens per image (the middle of the
Qwen3-VL / OpenAI detail range) so the budget tracker doesn't
pretend images are free.
- session/load replays image-bearing user turns by surfacing the
text parts verbatim and rendering each image as a "[image: {mime}
({n} bytes)]" placeholder chunk — Zed can show the prior text
context even though re-uploading the bytes through ACP isn't
meaningful for resume.
- 4 new tests: flatten produces MultiPart in block order, image-only
prompts still flatten to MultiPart, encoder emits the correct
array shape, text-only encoding stays as the string form.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -12,7 +12,8 @@
|
|||||||
//! | `session/set_model` | switch the session's active model (endpoint:model selector) |
|
//! | `session/set_model` | switch the session's active model (endpoint:model selector) |
|
||||||
//! | (anything else) | "not implemented yet" error |
|
//! | (anything else) | "not implemented yet" error |
|
||||||
//!
|
//!
|
||||||
//! Stage 5 flips on image content.
|
//! Stage 5 flipped on image content. Stage 6 starts adding new wire
|
||||||
|
//! protocols (Anthropic /v1/messages first).
|
||||||
|
|
||||||
use std::path::PathBuf;
|
use std::path::PathBuf;
|
||||||
use std::sync::Arc;
|
use std::sync::Arc;
|
||||||
@@ -288,9 +289,9 @@ impl Agent {
|
|||||||
}
|
}
|
||||||
|
|
||||||
fn initialize_response(req: &InitializeRequest) -> InitializeResponse {
|
fn initialize_response(req: &InitializeRequest) -> InitializeResponse {
|
||||||
// Stage 2: text-only prompts. Image / audio / embedded resources
|
// Stage 5: image is on (Zed clipboard / drag-drop). Audio and
|
||||||
// flip on in later stages.
|
// embedded resources flip on in later stages.
|
||||||
let prompt_caps = PromptCapabilities::default();
|
let prompt_caps = PromptCapabilities::default().image(true);
|
||||||
// Stage 3b: advertise both the top-level `load_session` flag and
|
// Stage 3b: advertise both the top-level `load_session` flag and
|
||||||
// the `session/list` sub-capability. Zed (and other ACP clients)
|
// the `session/list` sub-capability. Zed (and other ACP clients)
|
||||||
// uses `session/list` to discover the session id that belongs to
|
// uses `session/list` to discover the session id that belongs to
|
||||||
@@ -488,6 +489,31 @@ fn replay_history(cx: &ConnectionTo<Client>, session_id: &SessionId, history: &[
|
|||||||
send(SessionUpdate::UserMessageChunk(text_chunk(text.clone())));
|
send(SessionUpdate::UserMessageChunk(text_chunk(text.clone())));
|
||||||
total_events += 1;
|
total_events += 1;
|
||||||
}
|
}
|
||||||
|
(Role::User, MessageContent::MultiPart { parts }) => {
|
||||||
|
// We can re-emit text parts as UserMessageChunks.
|
||||||
|
// Images get a placeholder line so the user sees
|
||||||
|
// *that* an image was attached; re-replaying the
|
||||||
|
// image bytes themselves through ACP would require
|
||||||
|
// a path round-trip we don't currently keep.
|
||||||
|
for part in parts {
|
||||||
|
match part {
|
||||||
|
crate::provider::MessagePart::Text { text } => {
|
||||||
|
send(SessionUpdate::UserMessageChunk(text_chunk(text.clone())));
|
||||||
|
total_events += 1;
|
||||||
|
}
|
||||||
|
crate::provider::MessagePart::Image(img) => {
|
||||||
|
let label = match &img.uri {
|
||||||
|
Some(u) => format!("[image: {u}]"),
|
||||||
|
None => {
|
||||||
|
format!("[image: {} ({} bytes)]", img.mime_type, img.data.len())
|
||||||
|
}
|
||||||
|
};
|
||||||
|
send(SessionUpdate::UserMessageChunk(text_chunk(label)));
|
||||||
|
total_events += 1;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
(Role::Assistant, MessageContent::Text { text }) => {
|
(Role::Assistant, MessageContent::Text { text }) => {
|
||||||
send(SessionUpdate::AgentMessageChunk(text_chunk(text.clone())));
|
send(SessionUpdate::AgentMessageChunk(text_chunk(text.clone())));
|
||||||
total_events += 1;
|
total_events += 1;
|
||||||
@@ -586,10 +612,18 @@ fn handle_list_sessions(req: ListSessionsRequest) -> anyhow::Result<ListSessions
|
|||||||
/// first user turn's text (truncated to ~60 chars). Empty string
|
/// first user turn's text (truncated to ~60 chars). Empty string
|
||||||
/// becomes `None` so Zed can fall back to its own placeholder.
|
/// becomes `None` so Zed can fall back to its own placeholder.
|
||||||
fn derive_session_title(history: &[Message]) -> Option<String> {
|
fn derive_session_title(history: &[Message]) -> Option<String> {
|
||||||
|
use crate::provider::MessagePart;
|
||||||
history
|
history
|
||||||
.iter()
|
.iter()
|
||||||
.find_map(|msg| match (msg.role, &msg.content) {
|
.find_map(|msg| match (msg.role, &msg.content) {
|
||||||
(Role::User, MessageContent::Text { text }) => Some(text.as_str()),
|
(Role::User, MessageContent::Text { text }) => Some(text.clone()),
|
||||||
|
(Role::User, MessageContent::MultiPart { parts }) => parts.iter().find_map(|p| {
|
||||||
|
if let MessagePart::Text { text } = p {
|
||||||
|
Some(text.clone())
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
}
|
||||||
|
}),
|
||||||
_ => None,
|
_ => None,
|
||||||
})
|
})
|
||||||
.map(|s| {
|
.map(|s| {
|
||||||
@@ -823,10 +857,10 @@ async fn drive_prompt(
|
|||||||
state.cancel.cancel();
|
state.cancel.cancel();
|
||||||
let cancel = CancellationToken::new();
|
let cancel = CancellationToken::new();
|
||||||
state.cancel = cancel.clone();
|
state.cancel = cancel.clone();
|
||||||
let user_text = flatten_prompt(&req.prompt);
|
let user_content = flatten_prompt(&req.prompt);
|
||||||
state.history.push(Message {
|
state.history.push(Message {
|
||||||
role: Role::User,
|
role: Role::User,
|
||||||
content: MessageContent::Text { text: user_text },
|
content: user_content,
|
||||||
});
|
});
|
||||||
(
|
(
|
||||||
state.history.clone(),
|
state.history.clone(),
|
||||||
@@ -1422,31 +1456,76 @@ fn map_finish_reason(reason: Option<&str>) -> StopReason {
|
|||||||
}
|
}
|
||||||
|
|
||||||
/// Pure helper — turn a prompt's ContentBlocks into the user-message
|
/// Pure helper — turn a prompt's ContentBlocks into the user-message
|
||||||
/// text that goes into history. Lifted out so unit tests don't need a
|
/// content that goes into history.
|
||||||
/// running runtime.
|
///
|
||||||
fn flatten_prompt(blocks: &[ContentBlock]) -> String {
|
/// - All-text prompts collapse to [`MessageContent::Text`] (cheaper
|
||||||
let mut out = String::new();
|
/// to encode upstream — many OpenAI-compatible servers prefer the
|
||||||
|
/// string form when there's no reason to use the array form).
|
||||||
|
/// - Anything with at least one image becomes
|
||||||
|
/// [`MessageContent::MultiPart`], preserving block order so the
|
||||||
|
/// user's "this image, then this text" pacing reaches the model.
|
||||||
|
/// - `ResourceLink` is rendered as inline text so the model knows
|
||||||
|
/// it was referenced. Audio and embedded resources aren't
|
||||||
|
/// advertised as supported in [`PromptCapabilities`]; drop with a
|
||||||
|
/// warning if a non-conformant client sends one.
|
||||||
|
fn flatten_prompt(blocks: &[ContentBlock]) -> MessageContent {
|
||||||
|
use crate::provider::{ImageData, MessagePart};
|
||||||
|
|
||||||
|
let mut parts: Vec<MessagePart> = Vec::new();
|
||||||
|
let mut text_buf = String::new();
|
||||||
|
let flush_text = |buf: &mut String, parts: &mut Vec<MessagePart>| {
|
||||||
|
if !buf.is_empty() {
|
||||||
|
parts.push(MessagePart::Text {
|
||||||
|
text: std::mem::take(buf),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
for block in blocks {
|
for block in blocks {
|
||||||
if !out.is_empty() {
|
|
||||||
out.push_str("\n\n");
|
|
||||||
}
|
|
||||||
match block {
|
match block {
|
||||||
ContentBlock::Text(t) => out.push_str(&t.text),
|
ContentBlock::Text(t) => {
|
||||||
|
if !text_buf.is_empty() {
|
||||||
|
text_buf.push_str("\n\n");
|
||||||
|
}
|
||||||
|
text_buf.push_str(&t.text);
|
||||||
|
}
|
||||||
ContentBlock::ResourceLink(link) => {
|
ContentBlock::ResourceLink(link) => {
|
||||||
// Stage 2 has no fs access; surface the link as a
|
if !text_buf.is_empty() {
|
||||||
// textual reference so the model at least knows it
|
text_buf.push_str("\n\n");
|
||||||
// was asked about something.
|
}
|
||||||
out.push_str(&format!("[resource link: {}]", link.uri));
|
text_buf.push_str(&format!("[resource link: {}]", link.uri));
|
||||||
|
}
|
||||||
|
ContentBlock::Image(img) => {
|
||||||
|
flush_text(&mut text_buf, &mut parts);
|
||||||
|
parts.push(MessagePart::Image(ImageData {
|
||||||
|
mime_type: img.mime_type.clone(),
|
||||||
|
data: img.data.clone(),
|
||||||
|
uri: img.uri.clone(),
|
||||||
|
}));
|
||||||
}
|
}
|
||||||
// Image / Audio / Resource: not advertised in
|
|
||||||
// PromptCapabilities for Stage 2; a well-behaved client
|
|
||||||
// shouldn't send these. If one does, drop and warn.
|
|
||||||
other => {
|
other => {
|
||||||
tracing::warn!(?other, "ignoring unsupported content block in Stage 2");
|
tracing::warn!(?other, "ignoring unsupported content block");
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
out
|
flush_text(&mut text_buf, &mut parts);
|
||||||
|
|
||||||
|
// Collapse to plain Text when there's no image part — the
|
||||||
|
// OpenAI string-form is friendlier to non-vision endpoints
|
||||||
|
// (some treat the array form as a vision-only path).
|
||||||
|
let has_image = parts.iter().any(|p| matches!(p, MessagePart::Image(_)));
|
||||||
|
if !has_image {
|
||||||
|
let text = parts
|
||||||
|
.into_iter()
|
||||||
|
.filter_map(|p| match p {
|
||||||
|
MessagePart::Text { text } => Some(text),
|
||||||
|
MessagePart::Image(_) => None,
|
||||||
|
})
|
||||||
|
.collect::<Vec<_>>()
|
||||||
|
.join("\n\n");
|
||||||
|
return MessageContent::Text { text };
|
||||||
|
}
|
||||||
|
MessageContent::MultiPart { parts }
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Pure helper — pick which provider handles a session's `model_id`.
|
/// Pure helper — pick which provider handles a session's `model_id`.
|
||||||
@@ -1475,9 +1554,16 @@ mod tests {
|
|||||||
|
|
||||||
// ── flatten_prompt ──────────────────────────────────────────────
|
// ── flatten_prompt ──────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn expect_text(content: &MessageContent) -> &str {
|
||||||
|
match content {
|
||||||
|
MessageContent::Text { text } => text.as_str(),
|
||||||
|
other => panic!("expected MessageContent::Text, got {other:?}"),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
fn flatten_empty_prompt_is_empty() {
|
fn flatten_empty_prompt_is_empty() {
|
||||||
assert_eq!(flatten_prompt(&[]), "");
|
assert_eq!(expect_text(&flatten_prompt(&[])), "");
|
||||||
}
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
@@ -1486,7 +1572,7 @@ mod tests {
|
|||||||
ContentBlock::Text(TextContent::new("first")),
|
ContentBlock::Text(TextContent::new("first")),
|
||||||
ContentBlock::Text(TextContent::new("second")),
|
ContentBlock::Text(TextContent::new("second")),
|
||||||
];
|
];
|
||||||
assert_eq!(flatten_prompt(&blocks), "first\n\nsecond");
|
assert_eq!(expect_text(&flatten_prompt(&blocks)), "first\n\nsecond");
|
||||||
}
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
@@ -1495,7 +1581,50 @@ mod tests {
|
|||||||
"readme",
|
"readme",
|
||||||
"file:///tmp/x",
|
"file:///tmp/x",
|
||||||
))];
|
))];
|
||||||
assert_eq!(flatten_prompt(&blocks), "[resource link: file:///tmp/x]");
|
assert_eq!(
|
||||||
|
expect_text(&flatten_prompt(&blocks)),
|
||||||
|
"[resource link: file:///tmp/x]"
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn flatten_text_and_image_produces_multipart_in_order() {
|
||||||
|
use crate::provider::MessagePart;
|
||||||
|
let blocks = vec![
|
||||||
|
ContentBlock::Text(TextContent::new("describe:")),
|
||||||
|
ContentBlock::Image(agent_client_protocol::schema::ImageContent::new(
|
||||||
|
"iVBORw0KGgo=",
|
||||||
|
"image/png",
|
||||||
|
)),
|
||||||
|
ContentBlock::Text(TextContent::new("…in detail.")),
|
||||||
|
];
|
||||||
|
let content = flatten_prompt(&blocks);
|
||||||
|
match content {
|
||||||
|
MessageContent::MultiPart { parts } => {
|
||||||
|
assert_eq!(parts.len(), 3);
|
||||||
|
assert!(matches!(&parts[0], MessagePart::Text { text } if text == "describe:"));
|
||||||
|
assert!(matches!(&parts[1], MessagePart::Image(img)
|
||||||
|
if img.mime_type == "image/png" && img.data == "iVBORw0KGgo="));
|
||||||
|
assert!(matches!(&parts[2], MessagePart::Text { text } if text == "…in detail."));
|
||||||
|
}
|
||||||
|
other => panic!("expected MultiPart, got {other:?}"),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn flatten_image_only_still_produces_multipart() {
|
||||||
|
use crate::provider::MessagePart;
|
||||||
|
let blocks = vec![ContentBlock::Image(
|
||||||
|
agent_client_protocol::schema::ImageContent::new("AAAA", "image/jpeg"),
|
||||||
|
)];
|
||||||
|
match flatten_prompt(&blocks) {
|
||||||
|
MessageContent::MultiPart { parts } => {
|
||||||
|
assert_eq!(parts.len(), 1);
|
||||||
|
assert!(matches!(&parts[0], MessagePart::Image(img)
|
||||||
|
if img.mime_type == "image/jpeg"));
|
||||||
|
}
|
||||||
|
other => panic!("expected MultiPart, got {other:?}"),
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── resolve_provider ────────────────────────────────────────────
|
// ── resolve_provider ────────────────────────────────────────────
|
||||||
|
|||||||
@@ -32,7 +32,7 @@
|
|||||||
//! (over-estimates tokens slightly) so we compact a touch early
|
//! (over-estimates tokens slightly) so we compact a touch early
|
||||||
//! rather than a touch late.
|
//! rather than a touch late.
|
||||||
|
|
||||||
use crate::provider::{Message, MessageContent, Role};
|
use crate::provider::{Message, MessageContent, MessagePart, Role};
|
||||||
|
|
||||||
/// Most-recent N messages that are never elided. Roughly "the
|
/// Most-recent N messages that are never elided. Roughly "the
|
||||||
/// current tool round in flight" — assistant turn that called the
|
/// current tool round in flight" — assistant turn that called the
|
||||||
@@ -54,6 +54,13 @@ const CHARS_PER_TOKEN: f32 = 3.5;
|
|||||||
/// to a few tokens; tiny but it adds up across long histories.
|
/// to a few tokens; tiny but it adds up across long histories.
|
||||||
const ENVELOPE_TOKENS: usize = 8;
|
const ENVELOPE_TOKENS: usize = 8;
|
||||||
|
|
||||||
|
/// Rough per-image token cost used by the budget estimator. Real
|
||||||
|
/// vision tokenizers vary widely (256–1024 tokens for typical
|
||||||
|
/// resolutions on Qwen3-VL, OpenAI's `low`/`high` detail toggles
|
||||||
|
/// pick between ~85 and ~1000+). 512 is a defensible middle that
|
||||||
|
/// keeps compaction from treating images as free.
|
||||||
|
const IMAGE_TOKENS_APPROX: usize = 512;
|
||||||
|
|
||||||
/// Stats reported back from [`compact_to_budget`] for the caller to
|
/// Stats reported back from [`compact_to_budget`] for the caller to
|
||||||
/// log. The numbers are estimates (see [`estimate_tokens`]), so
|
/// log. The numbers are estimates (see [`estimate_tokens`]), so
|
||||||
/// don't compare them to upstream-reported token counts as if they
|
/// don't compare them to upstream-reported token counts as if they
|
||||||
@@ -87,6 +94,19 @@ impl CompactionStats {
|
|||||||
pub fn estimate_tokens(msg: &Message) -> usize {
|
pub fn estimate_tokens(msg: &Message) -> usize {
|
||||||
let chars = match &msg.content {
|
let chars = match &msg.content {
|
||||||
MessageContent::Text { text } => text.len(),
|
MessageContent::Text { text } => text.len(),
|
||||||
|
MessageContent::MultiPart { parts } => parts
|
||||||
|
.iter()
|
||||||
|
.map(|p| match p {
|
||||||
|
MessagePart::Text { text } => text.len(),
|
||||||
|
// Each image is one block in the context window; the
|
||||||
|
// upstream tokenizer handles the real cost (and it
|
||||||
|
// varies wildly by model — Qwen3-VL uses ~256-1024
|
||||||
|
// tokens per image depending on size). Take a
|
||||||
|
// middle estimate so the budget tracker doesn't
|
||||||
|
// pretend images are free.
|
||||||
|
MessagePart::Image(_) => IMAGE_TOKENS_APPROX * CHARS_PER_TOKEN as usize,
|
||||||
|
})
|
||||||
|
.sum(),
|
||||||
MessageContent::ToolCalls { text, calls } => {
|
MessageContent::ToolCalls { text, calls } => {
|
||||||
let txt = text.as_deref().map(|s| s.len()).unwrap_or(0);
|
let txt = text.as_deref().map(|s| s.len()).unwrap_or(0);
|
||||||
let calls_size: usize = calls
|
let calls_size: usize = calls
|
||||||
@@ -206,6 +226,15 @@ fn elide_in_place(msg: &mut Message) -> bool {
|
|||||||
*text = format!("(elided: {} bytes of assistant prose)", text.len());
|
*text = format!("(elided: {} bytes of assistant prose)", text.len());
|
||||||
true
|
true
|
||||||
}
|
}
|
||||||
|
MessageContent::MultiPart { .. } => {
|
||||||
|
// MultiPart messages today only exist as User turns,
|
||||||
|
// and User turns are protected by the role check in
|
||||||
|
// `compact_to_budget` — so this branch is unreachable
|
||||||
|
// for current call sites. Returning false keeps the
|
||||||
|
// unreachable path benign if a future stage starts
|
||||||
|
// emitting MultiPart on other roles.
|
||||||
|
false
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -101,6 +101,11 @@ pub enum MessageContent {
|
|||||||
/// enum, which is incompatible with newtype-of-primitive
|
/// enum, which is incompatible with newtype-of-primitive
|
||||||
/// variants.
|
/// variants.
|
||||||
Text { text: String },
|
Text { text: String },
|
||||||
|
/// Mixed text + image user turn. Stage 5 introduces this when
|
||||||
|
/// Zed sends an `ImageContent` block alongside the user's prompt.
|
||||||
|
/// Providers that don't support vision should down-convert by
|
||||||
|
/// dropping image parts and concatenating text parts.
|
||||||
|
MultiPart { parts: Vec<MessagePart> },
|
||||||
/// Assistant turn that called one or more tools. Stage 3 starts
|
/// Assistant turn that called one or more tools. Stage 3 starts
|
||||||
/// constructing this when the provider stream yields a
|
/// constructing this when the provider stream yields a
|
||||||
/// `ToolCallStart` / `ToolCallArgsDelta` sequence.
|
/// `ToolCallStart` / `ToolCallArgsDelta` sequence.
|
||||||
@@ -117,6 +122,31 @@ pub enum MessageContent {
|
|||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// One part of a [`MessageContent::MultiPart`] message.
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
#[serde(tag = "type", rename_all = "snake_case")]
|
||||||
|
pub enum MessagePart {
|
||||||
|
Text { text: String },
|
||||||
|
Image(ImageData),
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Inline image attachment. `data` is base64-encoded raw image
|
||||||
|
/// bytes; the encoder constructs an `image_url` data URI from it
|
||||||
|
/// at request time. `uri` carries any pointer the client supplied
|
||||||
|
/// (e.g. `file:///tmp/x.png`) — we keep it on the message for
|
||||||
|
/// debugging / future providers but the OpenAI encoder ignores it
|
||||||
|
/// when `data` is present (data wins, since it round-trips through
|
||||||
|
/// every wire format).
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
pub struct ImageData {
|
||||||
|
pub mime_type: String,
|
||||||
|
/// Base64-encoded image bytes (no `data:` prefix, no padding
|
||||||
|
/// stripped — exactly what `ImageContent.data` carried).
|
||||||
|
pub data: String,
|
||||||
|
#[serde(default, skip_serializing_if = "Option::is_none")]
|
||||||
|
pub uri: Option<String>,
|
||||||
|
}
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
pub struct ToolCall {
|
pub struct ToolCall {
|
||||||
/// Provider-assigned id that ties the call to its result. The
|
/// Provider-assigned id that ties the call to its result. The
|
||||||
|
|||||||
@@ -14,8 +14,8 @@ use serde_json::{Value, json};
|
|||||||
use tokio_util::sync::CancellationToken;
|
use tokio_util::sync::CancellationToken;
|
||||||
|
|
||||||
use super::{
|
use super::{
|
||||||
CompletionEvent, CompletionRequest, Message, MessageContent, ModelInfo, Provider, Role,
|
CompletionEvent, CompletionRequest, ImageData, Message, MessageContent, MessagePart, ModelInfo,
|
||||||
ToolSpec, UsageStats,
|
Provider, Role, ToolSpec, UsageStats,
|
||||||
};
|
};
|
||||||
use crate::config::EndpointConfig;
|
use crate::config::EndpointConfig;
|
||||||
|
|
||||||
@@ -189,6 +189,71 @@ mod tests {
|
|||||||
assert_eq!(body["stream_options"]["include_usage"], true);
|
assert_eq!(body["stream_options"]["include_usage"], true);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn encodes_user_multipart_with_image_as_content_array() {
|
||||||
|
let req = CompletionRequest {
|
||||||
|
model: "vl".into(),
|
||||||
|
messages: vec![Message {
|
||||||
|
role: Role::User,
|
||||||
|
content: MessageContent::MultiPart {
|
||||||
|
parts: vec![
|
||||||
|
MessagePart::Text {
|
||||||
|
text: "what's in this?".into(),
|
||||||
|
},
|
||||||
|
MessagePart::Image(ImageData {
|
||||||
|
mime_type: "image/png".into(),
|
||||||
|
data: "iVBORw0KGgo=".into(),
|
||||||
|
uri: None,
|
||||||
|
}),
|
||||||
|
],
|
||||||
|
},
|
||||||
|
}],
|
||||||
|
tools: vec![],
|
||||||
|
temperature: None,
|
||||||
|
top_p: None,
|
||||||
|
max_tokens: None,
|
||||||
|
};
|
||||||
|
let body = encode_request(&req);
|
||||||
|
let msg = &body["messages"][0];
|
||||||
|
assert_eq!(msg["role"], "user");
|
||||||
|
let content = msg["content"].as_array().expect("content array");
|
||||||
|
assert_eq!(content.len(), 2);
|
||||||
|
assert_eq!(content[0]["type"], "text");
|
||||||
|
assert_eq!(content[0]["text"], "what's in this?");
|
||||||
|
assert_eq!(content[1]["type"], "image_url");
|
||||||
|
assert_eq!(
|
||||||
|
content[1]["image_url"]["url"],
|
||||||
|
"data:image/png;base64,iVBORw0KGgo="
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn encodes_text_only_user_message_as_string() {
|
||||||
|
// Regression: even though we accept MultiPart now, the
|
||||||
|
// string form must stay the encoded shape for plain text
|
||||||
|
// messages — some upstreams treat array-form as a vision
|
||||||
|
// request and refuse on text-only models.
|
||||||
|
let req = CompletionRequest {
|
||||||
|
model: "m".into(),
|
||||||
|
messages: vec![Message {
|
||||||
|
role: Role::User,
|
||||||
|
content: MessageContent::Text {
|
||||||
|
text: "plain".into(),
|
||||||
|
},
|
||||||
|
}],
|
||||||
|
tools: vec![],
|
||||||
|
temperature: None,
|
||||||
|
top_p: None,
|
||||||
|
max_tokens: None,
|
||||||
|
};
|
||||||
|
let body = encode_request(&req);
|
||||||
|
assert_eq!(body["messages"][0]["content"], "plain");
|
||||||
|
assert!(
|
||||||
|
body["messages"][0]["content"].as_array().is_none(),
|
||||||
|
"text-only must not become array form"
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
#[test]
|
#[test]
|
||||||
fn encodes_tool_call_round_trip() {
|
fn encodes_tool_call_round_trip() {
|
||||||
let req = CompletionRequest {
|
let req = CompletionRequest {
|
||||||
@@ -510,6 +575,10 @@ fn encode_message(m: &Message) -> Value {
|
|||||||
json!({"role": "system", "content": text})
|
json!({"role": "system", "content": text})
|
||||||
}
|
}
|
||||||
(Role::User, MessageContent::Text { text }) => json!({"role": "user", "content": text}),
|
(Role::User, MessageContent::Text { text }) => json!({"role": "user", "content": text}),
|
||||||
|
(Role::User, MessageContent::MultiPart { parts }) => json!({
|
||||||
|
"role": "user",
|
||||||
|
"content": encode_user_parts(parts),
|
||||||
|
}),
|
||||||
(Role::Assistant, MessageContent::Text { text }) => {
|
(Role::Assistant, MessageContent::Text { text }) => {
|
||||||
json!({"role": "assistant", "content": text})
|
json!({"role": "assistant", "content": text})
|
||||||
}
|
}
|
||||||
@@ -549,6 +618,38 @@ fn encode_message(m: &Message) -> Value {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Encode a [`MessageContent::MultiPart`] user message as the OpenAI
|
||||||
|
/// chat content-array form:
|
||||||
|
///
|
||||||
|
/// ```jsonc
|
||||||
|
/// [
|
||||||
|
/// {"type": "text", "text": "describe this:"},
|
||||||
|
/// {"type": "image_url", "image_url": {"url": "data:image/png;base64,…"}}
|
||||||
|
/// ]
|
||||||
|
/// ```
|
||||||
|
///
|
||||||
|
/// Images use a `data:` URI built from `mime_type` + base64 `data`.
|
||||||
|
/// `uri` is intentionally ignored here — the inline data form
|
||||||
|
/// round-trips through every upstream we care about (cortex's
|
||||||
|
/// model loaders read the bytes directly, OpenAI accepts both
|
||||||
|
/// data and remote URLs but data is portable). Sticking to one
|
||||||
|
/// shape keeps wire-level surprises down.
|
||||||
|
fn encode_user_parts(parts: &[MessagePart]) -> Value {
|
||||||
|
let items: Vec<Value> = parts
|
||||||
|
.iter()
|
||||||
|
.map(|p| match p {
|
||||||
|
MessagePart::Text { text } => json!({ "type": "text", "text": text }),
|
||||||
|
MessagePart::Image(ImageData {
|
||||||
|
mime_type, data, ..
|
||||||
|
}) => json!({
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": { "url": format!("data:{mime_type};base64,{data}") },
|
||||||
|
}),
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
Value::Array(items)
|
||||||
|
}
|
||||||
|
|
||||||
fn role_str(r: Role) -> &'static str {
|
fn role_str(r: Role) -> &'static str {
|
||||||
match r {
|
match r {
|
||||||
Role::System => "system",
|
Role::System => "system",
|
||||||
@@ -561,6 +662,14 @@ fn role_str(r: Role) -> &'static str {
|
|||||||
fn content_as_text(c: &MessageContent) -> String {
|
fn content_as_text(c: &MessageContent) -> String {
|
||||||
match c {
|
match c {
|
||||||
MessageContent::Text { text } => text.clone(),
|
MessageContent::Text { text } => text.clone(),
|
||||||
|
MessageContent::MultiPart { parts } => parts
|
||||||
|
.iter()
|
||||||
|
.filter_map(|p| match p {
|
||||||
|
MessagePart::Text { text } => Some(text.as_str()),
|
||||||
|
MessagePart::Image(_) => None,
|
||||||
|
})
|
||||||
|
.collect::<Vec<_>>()
|
||||||
|
.join("\n\n"),
|
||||||
MessageContent::ToolCalls { text, .. } => text.clone().unwrap_or_default(),
|
MessageContent::ToolCalls { text, .. } => text.clone().unwrap_or_default(),
|
||||||
MessageContent::ToolResult { content, .. } => content.clone(),
|
MessageContent::ToolResult { content, .. } => content.clone(),
|
||||||
}
|
}
|
||||||
|
|||||||
Reference in New Issue
Block a user