Responses API: synthesise function_call output items from tool-call deltas #6

grenade · 2026-05-31T08:19:16Z

Scope cut from Step 2 (commit [`957f704`](https://git.lair.cafe/helexa/cortex/commit/957f704))

The Responses projector currently emits exactly one output item per response: a single `message` containing text. `function_call` items are defined on `ResponsesOutputItem` (`crates/cortex-core/src/responses.rs`) but the projector never synthesises them.

Two reasons:

1. The candle harness doesn't extract tool calls today. Qwen3 emits `<tool_call>{json}</tool_call>` blocks inline in its content stream; the parsing happens in helexa-acp (`crates/helexa-acp/src/qwen3.rs::ToolCallParser`) rather than in neuron.
2. `InferenceEvent::ToolCallStart` and `InferenceEvent::ToolCallArgsDelta` are *defined* on the event enum (`crates/neuron/src/wire/event.rs`) but never produced.

Compare to how the OpenAI chat projector handles this: it also doesn't split tool calls from content. Today both projectors expose the raw `<tool_call>` text and rely on every consumer to parse it.

Why it was cut

Tool-call extraction touches the inference inner loop in three sites (CPU, CUDA single-GPU, CUDA TP) the same way reasoning extraction does. We didn't want to bundle that with Step 1's refactor or Step 2's new surface.

What implementation looks like

1. **Tool-call tag parser**: same shape as the proposed `<think>` parser in #5, but for `<tool_call>{json}</tool_call>` blocks. It emits `InferenceEvent::ToolCallStart` on the open tag, `ToolCallArgsDelta` for the body, and finalises the call when the close tag arrives.
2. **JSON repair**: the parser must handle the same Qwen3 misemissions helexa-acp already deals with (trailing braces, name-nested-in-arguments, shape inference). Lift those repairs from `helexa-acp/src/qwen3.rs` and `helexa-acp/src/tools.rs::infer_tool_name` into `neuron::wire`.
3. **Chat projector extension**: emit OpenAI's `delta: { tool_calls: [{ index, id, function: { name, arguments } }] }` shape on each `ToolCall*` event, plus the final `finish_reason: "tool_calls"` chunk.
4. **Responses projector extension**: emit
   - `response.output_item.added` with a new `function_call` item (after the message item or instead of it, depending on order).
   - `response.function_call_arguments.delta` for each args delta.
   - `response.function_call_arguments.done` when the call closes.
   - `response.output_item.done` for the `function_call` item.
5. **`finish_reason` mapping**: `FinishReason::ToolCalls` is already defined; it threads through to `tool_calls` on chat and to the appropriate Responses status.
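
To make step 1 concrete, here is a minimal sketch of a streaming tag parser. It is not the neuron implementation: the `Event` enum is a hypothetical stand-in for the relevant `InferenceEvent` variants (the real fields in `crates/neuron/src/wire/event.rs` may differ), `ToolCallEnd` is an invented name for "finalises", and the sketch does no JSON repair or tool-name extraction. The one thing it tries to show is holding back a partial tag when `<tool_call>` is split across two decoded chunks.

```rust
/// Hypothetical stand-in for the tool-call variants of `InferenceEvent`.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Text(String),
    ToolCallStart { index: usize },
    ToolCallArgsDelta { index: usize, delta: String },
    ToolCallEnd { index: usize },
}

const OPEN: &str = "<tool_call>";
const CLOSE: &str = "</tool_call>";

#[derive(Default)]
pub struct ToolCallTagParser {
    buf: String,       // unprocessed tail; may end in a partial tag
    in_call: bool,     // true between an open tag and its close tag
    next_index: usize, // index assigned to the next tool call
}

impl ToolCallTagParser {
    /// Feed one decoded chunk; returns the events it completes.
    pub fn push(&mut self, chunk: &str) -> Vec<Event> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        loop {
            let tag = if self.in_call { CLOSE } else { OPEN };
            if let Some(pos) = self.buf.find(tag) {
                let before = self.buf[..pos].to_string();
                self.emit(&mut out, before);
                self.buf.replace_range(..pos + tag.len(), "");
                if self.in_call {
                    out.push(Event::ToolCallEnd { index: self.next_index - 1 });
                } else {
                    out.push(Event::ToolCallStart { index: self.next_index });
                    self.next_index += 1;
                }
                self.in_call = !self.in_call;
            } else {
                // Hold back the longest suffix that could still grow into `tag`.
                let keep = partial_suffix_len(&self.buf, tag);
                let split = self.buf.len() - keep;
                let ready: String = self.buf.drain(..split).collect();
                self.emit(&mut out, ready);
                return out;
            }
        }
    }

    /// Flush whatever is still buffered once the stream ends.
    pub fn finish(&mut self) -> Vec<Event> {
        let mut out = Vec::new();
        let rest = std::mem::take(&mut self.buf);
        self.emit(&mut out, rest);
        out
    }

    fn emit(&self, out: &mut Vec<Event>, s: String) {
        if s.is_empty() {
            return;
        }
        if self.in_call {
            out.push(Event::ToolCallArgsDelta { index: self.next_index - 1, delta: s });
        } else {
            out.push(Event::Text(s));
        }
    }
}

/// Length of the longest proper prefix of `tag` that `buf` ends with.
fn partial_suffix_len(buf: &str, tag: &str) -> usize {
    let max = (tag.len() - 1).min(buf.len());
    (1..=max).rev().find(|&n| buf.ends_with(&tag[..n])).unwrap_or(0)
}
```

Whitespace around the JSON body (Qwen3 usually puts newlines after the open tag and before the close tag) passes through as part of the args deltas in this sketch; a real parser would probably trim it. Calling `finish()` at end of stream matters because an unterminated `<tool_ca` fragment would otherwise be swallowed.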

Acceptance

- A prompt that triggers Qwen3 tool use produces structured `ToolCallStart` + `ToolCallArgsDelta` events on the `InferenceEvent` stream, with the model's `<tool_call>{json}</tool_call>` markers consumed (not surfaced as text).
- Chat completions clients see the OpenAI `tool_calls` delta shape.
- Responses clients see `response.function_call_arguments.*` events.
- helexa-acp's `ToolCallParser` becomes redundant for cortex-backed sessions.
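
For the Responses criterion, the expected event ordering can be sketched as a small projector. Everything here is illustrative: `ToolEvent`, `ResponsesEvent`, `FunctionCallProjector` and `first_output_index` are hypothetical names, and the real projector presumably serialises full payloads. Only the four event type strings are taken from the issue text above.

```rust
use std::collections::HashMap;

/// Hypothetical harness-side events (stand-ins for `InferenceEvent` variants).
#[derive(Debug, Clone, PartialEq)]
pub enum ToolEvent {
    Start { index: usize, name: String },
    ArgsDelta { index: usize, delta: String },
    End { index: usize },
}

/// Hypothetical typed view of the Responses stream events for one function call.
#[derive(Debug, Clone, PartialEq)]
pub enum ResponsesEvent {
    OutputItemAdded { output_index: usize, name: String },
    ArgumentsDelta { output_index: usize, delta: String },
    ArgumentsDone { output_index: usize, arguments: String },
    OutputItemDone { output_index: usize, name: String, arguments: String },
}

impl ResponsesEvent {
    /// The event `type` string a client would see on the wire.
    pub fn event_type(&self) -> &'static str {
        match self {
            ResponsesEvent::OutputItemAdded { .. } => "response.output_item.added",
            ResponsesEvent::ArgumentsDelta { .. } => "response.function_call_arguments.delta",
            ResponsesEvent::ArgumentsDone { .. } => "response.function_call_arguments.done",
            ResponsesEvent::OutputItemDone { .. } => "response.output_item.done",
        }
    }
}

pub struct FunctionCallProjector {
    /// Output index of the first function_call item (1 if a message item comes first).
    first_output_index: usize,
    /// Open calls: tool-call index -> (name, arguments accumulated so far).
    open: HashMap<usize, (String, String)>,
}

impl FunctionCallProjector {
    pub fn new(first_output_index: usize) -> Self {
        Self { first_output_index, open: HashMap::new() }
    }

    pub fn project(&mut self, ev: ToolEvent) -> Vec<ResponsesEvent> {
        match ev {
            ToolEvent::Start { index, name } => {
                self.open.insert(index, (name.clone(), String::new()));
                let output_index = self.first_output_index + index;
                vec![ResponsesEvent::OutputItemAdded { output_index, name }]
            }
            ToolEvent::ArgsDelta { index, delta } => match self.open.get_mut(&index) {
                Some((_, args)) => {
                    args.push_str(&delta);
                    let output_index = self.first_output_index + index;
                    vec![ResponsesEvent::ArgumentsDelta { output_index, delta }]
                }
                // Deltas for a call that was never opened are dropped in this sketch.
                None => Vec::new(),
            },
            ToolEvent::End { index } => match self.open.remove(&index) {
                Some((name, arguments)) => {
                    let output_index = self.first_output_index + index;
                    vec![
                        ResponsesEvent::ArgumentsDone { output_index, arguments: arguments.clone() },
                        ResponsesEvent::OutputItemDone { output_index, name, arguments },
                    ]
                }
                None => Vec::new(),
            },
        }
    }
}
```

Accumulating the arguments per call is what lets the `done` events carry the complete string, so a client that ignores deltas can still read the final call from `response.output_item.done`.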

Tracking

Blocks: Responses-API-driven tool use against neuron (helexa-acp's openai-responses provider can't get function-call output items today). Coupled with #5 — both are the same harness-level extraction work, just for different tags. May want to land them together.


grenade referenced this issue

2026-05-31 08:19:37 +00:00

Responses API: emit `response.in_progress` and built-in-tool event families #7

grenade referenced this issue from a commit

2026-05-31 08:30:28 +00:00

feat(helexa-acp): openai-responses provider

grenade referenced this issue

2026-05-31 14:43:38 +00:00

Pass through `chat_template_kwargs` to the chat template at tokenization #9

grenade referenced this issue from a commit

2026-05-31 20:26:34 +00:00

feat(neuron): extract `<tool_call>` blocks to structured tool_calls deltas