Support multimodal custom tool outputs (#12948)

## Summary

This changes `custom_tool_call_output` to use the same output payload
shape as `function_call_output`, so freeform tools can return either
plain text or structured content items.

The main goal is to let `js_repl` return image content from nested
`view_image` calls in its own `custom_tool_call_output`, instead of
relying on a separate injected message.

## What changed

- Changed `custom_tool_call_output.output` from `string` to
`FunctionCallOutputPayload`
- Updated freeform tool plumbing to preserve structured output bodies
- Updated `js_repl` to aggregate nested tool content items and attach
them to the outer `js_repl` result
- Removed the old `js_repl` special case that injected `view_image`
results as a separate pending user image message
- Updated normalization/history/truncation paths to handle multimodal
`custom_tool_call_output`
- Regenerated app-server protocol schema artifacts

## Behavior

Direct `view_image` calls still return a `function_call_output` with
image content.

When `view_image` is called inside `js_repl`, the outer `js_repl`
`custom_tool_call_output` now carries:
- an `input_text` item if the JS produced text output
- one or more `input_image` items from nested tool results

So the nested image result now stays inside the `js_repl` tool output
instead of being injected as a separate message.

## Compatibility

This is intended to be backward-compatible for resumed conversations.

Older histories that stored `custom_tool_call_output.output` as a plain
string still deserialize correctly, and older histories that used the
previous injected-image-message flow also continue to resume.

Added regression coverage for resuming a pre-change rollout containing:
- string-valued `custom_tool_call_output`
- legacy injected image message history


#### [git stack](https://github.com/magus/git-stack-cli)
- 👉 `1` https://github.com/openai/codex/pull/12948
This commit is contained in:
Curtis 'Fjord' Hawthorne
2026-02-26 18:17:46 -08:00
committed by GitHub
parent f90e97e414
commit 7e980d7db6
20 changed files with 688 additions and 177 deletions

View File

@@ -161,7 +161,7 @@ pub enum ResponseInputItem {
},
CustomToolCallOutput {
call_id: String,
output: String,
output: FunctionCallOutputPayload,
},
}
@@ -261,9 +261,12 @@ pub enum ResponseItem {
name: String,
input: String,
},
// `custom_tool_call_output.output` uses the same wire encoding as
// `function_call_output.output` so freeform tools can return either plain
// text or structured content items.
CustomToolCallOutput {
call_id: String,
output: String,
output: FunctionCallOutputPayload,
},
// Emitted by the Responses API when the agent triggers a web search.
// Example payload (from SSE `response.output_item.done`):
@@ -1538,6 +1541,26 @@ mod tests {
Ok(())
}
#[test]
fn serializes_custom_tool_image_outputs_as_array() -> Result<()> {
let item = ResponseInputItem::CustomToolCallOutput {
call_id: "call1".into(),
output: FunctionCallOutputPayload::from_content_items(vec![
FunctionCallOutputContentItem::InputImage {
image_url: "data:image/png;base64,BASE64".into(),
},
]),
};
let json = serde_json::to_string(&item)?;
let v: serde_json::Value = serde_json::from_str(&json)?;
let output = v.get("output").expect("output field");
assert!(output.is_array(), "expected array output");
Ok(())
}
#[test]
fn preserves_existing_image_data_urls() -> Result<()> {
let call_tool_result = CallToolResult {