- make ThreadStore::update_thread_metadata accept a broad range of
metadata patches
- keep ThreadStore::append_items as raw canonical history append (no
metadata side effects)
- in the local store, write these metadata updates to a combination of
sqlite and rollout jsonl files for backwards-compat. It special cases
which fields need to go into jsonl vs sqlite vs whatever, confining the
awkwardness to just this implementation
- in remote stores we can simply persist the metadata directly to a
database, no special casing required.
- move the "implicit metadata updates triggered by appending rollout
items" from the RolloutRecorder (which is local-threadstore-specific) to
the LiveThread layer above the ThreadStore, inside of a private helper
utility called ThreadMetadataSync. LiveThread calls ThreadStore
append_items and update_metadata separately.
- Add a generic update metadata method to ThreadManager that works on
both live threads and "cold" threads
- Call that ThreadManager method from app server code, so app server
doesn't need to worry about whether the thread is live or not
## Why
`tool_search` already had solid end-to-end coverage for discovery and
follow-up execution, but it did not prove that distinct pieces of
indexed search text actually work in integration. In particular, we were
not exercising whether unique tool names, descriptions, namespaces,
underscore-expanded dynamic names, and schema-property terms were
sufficient to surface the expected deferred tools.
This change adds focused integration coverage for those term sources so
regressions in search text construction are caught by a real `TestCodex`
flow instead of only by lower-level unit tests.
## What changed
- added a small helper in `core/tests/suite/search_tool.rs` to assert
that a `tool_search_output` contains an expected namespace child tool
- added an MCP integration test that issues several `tool_search_call`s
and verifies distinct query terms match the expected app tools:
- exact tool name: `calendar_timezone_option_99`
- tool description phrase: `uploaded document`
- top-level schema property: `starts_at`
- added a dynamic-tool integration test that verifies distinct query
terms match the expected deferred dynamic tool:
- exact name: `quasar_ping_beacon`
- underscore-expanded name: `quasar ping beacon`
- description phrase: `saffron metronome`
- namespace: `orbit_ops`
- schema property: `chrono_spec`
## Validation
- `cargo test -p codex-core tool_search_matches_`
## Docs
No documentation update needed.
## Why
This is the base PR in the split stack for the permissions migration. It
isolates stack-safety work that had been mixed into the larger
permissions PR, so reviewers can evaluate the async-future changes
separately from the permissions model changes in #22267.
The main risk this addresses is large or recursive multi-agent futures
overflowing smaller runner stacks. A follow-up review also called out
that `shutdown_live_agent` must remain quiescent: callers should not
remove a live agent from tracking or release its spawn slot until the
worker loop has actually terminated.
## What Changed
- Boxes the large async futures in the multi-agent spawn, resume, and
close tool handlers.
- Boxes the `AgentControl` spawn and recursive close/shutdown paths that
can otherwise build very deep futures.
- Keeps `shutdown_live_agent` waiting for thread termination before
removing/releasing the live agent, preserving the previous shutdown
ordering while still boxing the recursive close path.
## Verification Strategy
The focused local coverage was `cargo test -p codex-core multi_agents`,
which exercises the multi-agent spawn/resume/close handlers, cascade
close/resume behavior, and the shutdown path touched by this PR.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/openai/codex/pull/22266).
* #22330
* #22329
* #22328
* #22327
* __->__ #22266
Introduce execute_to_pending and wait_to_pending APIs that freeze
pending-mode runtimes until an explicit resume, while preserving the
existing continuously-running execute path. Add runtime and service
coverage for pending, resume, completion, and freeze behavior.
## Summary
This refactor makes tool handlers the owner of the specs they can
publish, so registry construction can register handlers once and
separately publish only the specs that should be model-visible.
The main motivation is deferred tools: MCP and dynamic tools still need
handlers registered up front, but deferred tools should be discoverable
through `tool_search` rather than emitted in the initial tool spec list.
## What changed
- `McpHandler` and `DynamicToolHandler` can return their own `ToolSpec`.
- `build_tool_registry_builder` now collects handlers, registers them
through the no-spec path, and publishes only non-deferred handler specs.
- Deferred MCP and dynamic tool names are combined into one
`all_deferred_tools` set that drives spec filtering, code-mode
deferred-tool signaling, and `tool_search` registration.
- `tool_search` registration now requires both deferred tools and
`namespace_tools`.
- Namespace specs are merged in `spec_plan`, preserving top-level spec
order, sorting tools within each namespace, and backfilling empty
namespace descriptions.
- Hosted web search and image-generation specs are included in the
collected spec vector before namespace merge/publication, and tool-name
tests that should not care about hosted relative order now compare sets.
## Testing
- `cargo test -p codex-core tools::spec::tests:: -- --nocapture`
- `cargo test -p codex-core tools::spec_plan::tests:: -- --nocapture`
- `cargo test -p codex-core
tools::router::tests::specs_filter_deferred_dynamic_tools --
--nocapture`
- `cargo test -p codex-core
suite::prompt_caching::prompt_tools_are_consistent_across_requests --
--nocapture`
- `just fmt`
- `just fix -p codex-core`
- `cargo test -p codex-core -- --skip
tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`
passed the library suite after skipping the known stack-overflowing unit
test.
Full `cargo test -p codex-core` currently hits a stack overflow in
`tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`;
the same focused test reproduces on `origin/main`.
## Summary
- make workspace owner nudge handling unconditional in the TUI now that
it is fully rolled out
- keep `workspace_owner_usage_nudge` as a removed no-op compatibility
flag so old configs/app overrides remain accepted during rollout
- remove flag-disabled test setup
## Companion PR
- https://github.com/openai/openai/pull/876351 removes the Codex Apps
Statsig rollout gate override after this change is available to the
app/runtime path
## Validation
- `just write-config-schema`
- `just fmt`
- `cargo test -p codex-features`
- `cargo test -p codex-tui status_and_layout`
## Why
Remote exec-server now needs one executor websocket to serve multiple
harness JSON-RPC sessions. Rendezvous routes by `stream_id`, and the
exec-server side needs to use the same stable relay frame contract
instead of a hand-rolled JSON shape.
The relay protocol also needs to make ownership boundaries clear:
harness and executor endpoints own sequencing, acks, retries, duplicate
suppression, segmentation, and reassembly; rendezvous only routes
frames.
## What Changed
- Add the checked-in `codex.exec_server.relay.v1.RelayMessageFrame`
proto plus generated prost bindings for `codex-exec-server`.
- Encode remote harness/executor relay traffic as binary protobuf
websocket frames while keeping local websocket JSON-RPC unchanged.
- Demux executor-side relay streams into independent
`ConnectionProcessor` sessions keyed by `stream_id`.
- Add a programmatic `RemoteExecutorConfig::with_bearer_token(...)`
constructor for non-CLI callers and integration tests.
- Add an integration test that starts the remote executor against a fake
registry/rendezvous websocket and verifies two virtual streams share one
executor websocket without cross-talk, including per-stream reset
behavior.
- Document the remote relay envelope, sequence ranges, `ack`/`ack_bits`,
and endpoint responsibilities in `exec-server/README.md`.
## Verification
- `cargo test -p codex-exec-server --test relay
multiplexed_remote_executor_routes_independent_virtual_streams --
--exact`
- `cargo test -p codex-exec-server --test relay`
- `cargo test -p codex-exec-server` passed outside the sandbox. The
sandboxed run hit macOS `sandbox-exec: sandbox_apply: Operation not
permitted` in filesystem sandbox tests.
## Why
Windows CI has been timing out in
`configured_pet_load_is_deferred_until_after_construction` while waiting
for the deferred configured-pet load event.
The test still needs to prove construction returns before the pet image
is available, but the background load slices the built-in pet
spritesheet into frame cache files. That work can exceed the old 2
second deadline on slower or more contended CI machines.
## What Changed
- Increased the test wait for `ConfiguredPetLoaded` from 2 seconds to 30
seconds.
- Kept the post-construction assertion intact so the test still verifies
that the pet is not loaded synchronously during `ChatWidget`
construction.
## How to Test
Targeted tests:
- `cargo test -p codex-tui
configured_pet_load_is_deferred_until_after_construction`
- `just argument-comment-lint`
Additional check:
- `cargo test -p codex-tui` was run, but the broader crate suite did not
complete successfully due to unrelated existing failures:
-
`status::tests::status_permissions_full_disk_managed_without_network_is_external_sandbox`
-
`status::tests::status_permissions_full_disk_managed_with_network_is_danger_full_access`
- later abort in
`tests::fork_last_filters_latest_session_by_cwd_unless_show_all` from
stack overflow
## Why
Code mode only used nested spec lookup at execution time to rediscover
whether a nested tool should be invoked as a function tool or a freeform
tool.
That information is already present in the enabled tool metadata that
code mode builds to expose `tools.*` and `ALL_TOOLS`, so re-looking it
up from the router was redundant and kept execution coupled to a
separate spec lookup path.
## What Changed
- thread `CodeModeToolKind` through the code-mode runtime `ToolCall`
event and `CodeModeNestedToolCall`
- emit the nested tool kind directly from the V8 callback using the
already-enabled tool metadata
- build nested tool payloads from the propagated kind instead of calling
`find_spec`
- remove the now-unused `find_spec` plumbing from the router and
parallel runtime helpers
- add unit coverage for function vs freeform payload shaping and update
affected router tests
## Testing
- `cargo test -p codex-code-mode`
- `cargo test -p codex-core code_mode::tests`
- `cargo test -p codex-core
extension_tool_bundles_are_model_visible_and_dispatchable`
- `cargo test -p codex-core
model_visible_specs_filter_deferred_dynamic_tools`
## Summary
Adds include_collaboration_mode_instructions, which is a config
equivalent to include_permissions_instructions for collaboration modes.
Desired for situations where we want to disable this instruction from
entering the context
## Testing
- [x] Added unit test
## Why
Tool dispatch had two serialization mechanisms:
- `supports_parallel_tool_calls` decides whether a tool participates in
the shared parallel-execution lock.
- `is_mutating` separately gated some calls inside dispatch.
That second hook no longer carried its weight. The remaining
parallel-support flag is already the per-tool concurrency policy, so
keeping a second mutating gate made dispatch harder to follow and left
behind extra session plumbing that only existed for that path.
## What changed
- Removed `is_mutating` from tool handlers and deleted the
`tool_call_gate` path that existed only to support it.
- Simplified dispatch and routing to rely on the existing per-tool
`supports_parallel_tool_calls` boolean.
- Dropped the now-unused handler overrides and related session/test
scaffolding.
- Kept the router/parallel tests focused on the surviving per-tool
behavior.
- Removed the unused `codex-utils-readiness` dependency from
`codex-core` as a follow-up fix for `cargo shear`.
## Testing
- `cargo test -p codex-core
parallel_support_does_not_match_namespaced_local_tool_names`
- `cargo test -p codex-core mcp_parallel_support_uses_handler_data`
- `cargo test -p codex-core
tools_without_handlers_do_not_support_parallel`
## Summary
- tighten unified exec sandbox initialization
- preserve the requested process workdir independently from sandbox
setup
- add regression coverage for the updated invariant
## Validation
- Ran `/tmp/cargo-tools/bin/just fmt`.
- Ran the targeted `codex-core` regression test successfully.
- Ran `cargo test -p codex-core`; it did not complete cleanly because
unrelated existing agent/config-loader tests failed and the run later
aborted on a stack overflow in
`tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.
Co-authored-by: Codex <noreply@openai.com>