Summary
- move `core/src/hooks` implementation into a new `codex-hooks` crate
with its own manifest
- update `codex-rs` workspace and `codex-core` crate to depend on the
extracted `hooks` crate and wire up the shared APIs
- ensure references, modules, and lockfile reflect the new crate layout
Testing
- Not run (not requested)
As of this PR, `SessionServices` retains a
`Option<StartedNetworkProxy>`, if appropriate.
Now the `network` field on `Config` is `Option<NetworkProxySpec>`
instead of `Option<NetworkProxy>`.
Over in `Session::new()`, we invoke `NetworkProxySpec::start_proxy()` to
create the `StartedNetworkProxy`, which is a new struct that retains the
`NetworkProxy` as well as the `NetworkProxyHandle`. (Note that `Drop` is
implemented for `NetworkProxyHandle` to ensure the proxies are shutdown
when it is dropped.)
The `NetworkProxy` from the `StartedNetworkProxy` is threaded through to
the appropriate places.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/openai/codex/pull/11207).
* #11285
* __->__ #11207
# External (non-OpenAI) Pull Request Requirements
Before opening this Pull Request, please read the dedicated
"Contributing" markdown file or your PR may be closed:
https://github.com/openai/codex/blob/main/docs/contributing.md
If your PR conforms to our contribution guidelines, replace this text
with a detailed and high quality description of your changes.
Include a link to a bug report or enhancement request.
**Why We Did This**
- The goal is to reduce MCP tool context pollution by not exposing the
full MCP tool list up front
- It forces an explicit discovery step (`search_tool_bm25`) so the model
narrows tool scope before making MCP calls, which helps relevance and
lowers prompt/tool clutter.
**What It Changed**
- Added a new experimental feature flag `search_tool` in
`core/src/features.rs:90` and `core/src/features.rs:430`.
- Added config/schema support for that flag in
`core/config.schema.json:214` and `core/config.schema.json:1235`.
- Added BM25 dependency (`bm25`) in `Cargo.toml:129` and
`core/Cargo.toml:23`.
- Added new tool handler `search_tool_bm25` in
`core/src/tools/handlers/search_tool_bm25.rs:18`.
- Registered the handler and tool spec in
`core/src/tools/handlers/mod.rs:11` and `core/src/tools/spec.rs:780` and
`core/src/tools/spec.rs:1344`.
- Extended `ToolsConfig` to carry `search_tool` enablement in
`core/src/tools/spec.rs:32` and `core/src/tools/spec.rs:56`.
- Injected dedicated developer instructions for tool-discovery workflow
in `core/src/codex.rs:483` and `core/src/codex.rs:1976`, using
`core/templates/search_tool/developer_instructions.md:1`.
- Added session state to store one-shot selected MCP tools in
`core/src/state/session.rs:27` and `core/src/state/session.rs:131`.
- Added filtering so when feature is enabled, only selected MCP tools
are exposed on the next request (then consumed) in
`core/src/codex.rs:3800` and `core/src/codex.rs:3843`.
- Added E2E suite coverage for
enablement/instructions/hide-until-search/one-turn-selection in
`core/tests/suite/search_tool.rs:72`,
`core/tests/suite/search_tool.rs:109`,
`core/tests/suite/search_tool.rs:147`, and
`core/tests/suite/search_tool.rs:218`.
- Refactored test helper utilities to support config-driven tool
collection in `core/tests/suite/tools.rs:281`.
**Net Behavioral Effect**
- With `search_tool` **off**: existing MCP behavior (tools exposed
normally).
- With `search_tool` **on**: MCP tools start hidden, model must call
`search_tool_bm25`, and only returned `selected_tools` are available for
the next model call.
## Summary
- add targeted remote-compaction failure diagnostics in compact_remote
logging
- log the specific values needed to explain overflow timing:
- last_api_response_total_tokens
- estimated_tokens_of_items_added_since_last_successful_api_response
- estimated_bytes_of_items_added_since_last_successful_api_response
- failing_compaction_request_body_bytes
- simplify breakdown naming and remove
last_api_response_total_bytes_estimate (it was an approximation and not
useful for debugging)
## Why
When compaction fails with context_length_exceeded, we need concrete,
low-ambiguity numbers that map directly to:
1) what the API most recently reported, and
2) what local history added since then.
This keeps the failure logs actionable without adding broad, noisy
metrics.
## Testing
- just fmt
- cargo test -p codex-core
Instead of storing a special connection on the client level make the
regular task responsible for establishing a normal client session and
open a connection on it.
Then when the turn is started we pass in a pre-established session.
With this PR we do not close the unified exec processes (i.e. background
terminals) at the end of a turn unless:
* The user interrupt the turn
* The user decide to clean the processes through `app-server` or
`/clean`
I made sure that `codex exec` correctly kill all the processes
This PR adds the following field to `Config`:
```rust
pub network: Option<NetworkProxy>,
```
Though for the moment, it will always be initialized as `None` (this
will be addressed in a subsequent PR).
This PR does the work to thread `network` through to `execute_exec_env()`, `process_exec_tool_call()`, and `UnifiedExecRuntime.run()` to ensure it is available whenever we span a process.
If `NetworkConstraints` is set, then include the relevant settings on `<environment_context>`. Example:
```xml
<environment_context>
<cwd>/repo</cwd>
<shell>bash</shell>
<network enabled="true">
<allowed>api.example.com</allowed>
<allowed>*.openai.com</allowed>
<denied>blocked.example.com</denied>
</network>
</environment_context>
```
- Defer rollout persistence for fresh threads (`InitialHistory::New`):
keep rollout events in memory and only materialize rollout file + state
DB row on first `EventMsg::UserMessage`.
- Keep precomputed rollout path available before materialization.
- Change `thread/start` to build thread response from live config
snapshot and optional precomputed path.
- Improve pre-materialization behavior in app-server/TUI: clearer
invalid-request errors for file-backed ops and a friendlier `/fork` “not
ready yet” UX.
- Update tests to match deferred semantics across
start/read/archive/unarchive/fork/resume/review flows.
- Improved resilience of user_shell test, which should be unrelated to
this change but must be affected by timing changes
For Reviewers:
* The primary change is in recorder.rs
* Most of the other changes were to fix up broken assumptions in
existing tests
Testing:
* Manually tested CLI
* Exercised app server paths by manually running IDE Extension with
rebuilt CLI binary
* Only user-visible change is that `/fork` in TUI generates visible
error if used prior to first turn
This PR makes it possible to disable live web search via an enterprise
config even if the user is running in `--yolo` mode (though cached web
search will still be available). To do this, create
`/etc/codex/requirements.toml` as follows:
```toml
# "live" is not allowed; "disabled" is allowed even though not listed explicitly.
allowed_web_search_modes = ["cached"]
```
Or set `requirements_toml_base64` MDM as explained on
https://developers.openai.com/codex/security/#locations.
### Why
- Enforce admin/MDM/`requirements.toml` constraints on web-search
behavior, independent of user config and per-turn sandbox defaults.
- Ensure per-turn config resolution and review-mode overrides never
crash when constraints are present.
### What
- Add `allowed_web_search_modes` to requirements parsing and surface it
in app-server v2 `ConfigRequirements` (`allowedWebSearchModes`), with
fixtures updated.
- Define a requirements allowlist type (`WebSearchModeRequirement`) and
normalize semantics:
- `disabled` is always implicitly allowed (even if not listed).
- An empty list is treated as `["disabled"]`.
- Make `Config.web_search_mode` a `Constrained<WebSearchMode>` and apply
requirements via `ConstrainedWithSource<WebSearchMode>`.
- Update per-turn resolution (`resolve_web_search_mode_for_turn`) to:
- Prefer `Live → Cached → Disabled` when
`SandboxPolicy::DangerFullAccess` is active (subject to requirements),
unless the user preference is explicitly `Disabled`.
- Otherwise, honor the user’s preferred mode, falling back to an allowed
mode when necessary.
- Update TUI `/debug-config` and app-server mapping to display
normalized `allowed_web_search_modes` (including implicit `disabled`).
- Fix web-search integration tests to assert cached behavior under
`SandboxPolicy::ReadOnly` (since `DangerFullAccess` legitimately prefers
`live` when allowed).
TLDR: use new message phase field emitted by preamble-supported models
to determine whether an AgentMessage is mid-turn commentary. if so,
restore the status indicator afterwards to indicate the turn has not
completed.
### Problem
`commit_tick` hides the status indicator while streaming assistant text.
For preamble-capable models, that text can be commentary mid-turn, so
hiding was correct during streaming but restore timing mattered:
- restoring too aggressively caused jitter/flashing
- not restoring caused indicator to stay hidden before subsequent work
(tool calls, web search, etc.)
### Fix
- Add optional `phase` to `AgentMessageItem` and propagate it from
`ResponseItem::Message`
- Keep indicator hidden during streamed commit ticks, restore only when:
- assistant item completes as `phase=commentary`, and
- stream queues are idle + task is still running.
- Treat `phase=None` as final-answer behavior (no restore) to keep
existing behavior for non-preamble models
### Tests
Add/update tests for:
- no idle-tick restore without commentary completion
- commentary completion restoring status before tool begin
- snapshot coverage for preamble/status behavior
---------
Co-authored-by: Josh McKinney <joshka@openai.com>
## Summary
When replaying compacted history (especially `replacement_history` from
remote compaction), we should not keep stale developer messages from
older session state. This PR trims developer-
role messages from compacted replacement history and reinjects fresh
developer instructions derived from current turn/session state.
This aligns compaction replay behavior with the intended "fresh
instructions after summary" model.
## Problem
Compaction replay had two paths:
- `Compacted { replacement_history: None }`: rebuilt with fresh initial
context
- `Compacted { replacement_history: Some(...) }`: previously used raw
replacement history as-is
The second path could carry stale developer instructions
(permissions/personality/collab-mode guidance) across session changes.
## What Changed
### 1) Added helper to refresh compacted developer instructions
- **File:** `codex-rs/core/src/compact.rs`
- **Function:** `refresh_compacted_developer_instructions(...)`
Behavior:
- remove all `ResponseItem::Message { role: "developer", .. }` from
compacted history
- append fresh developer messages from current
`build_initial_context(...)`
### 2) Applied helper in remote compaction flow
- **File:** `codex-rs/core/src/compact_remote.rs`
- After receiving compact endpoint output, refresh developer
instructions before replacing history and persisting
`replacement_history`.
### 3) Applied helper while reconstructing history from rollout
- **File:** `codex-rs/core/src/codex.rs`
- In `reconstruct_history_from_rollout(...)`, when processing
`Compacted` entries with `replacement_history`, refresh developer
instructions instead of directly replacing with raw history.
## Non-Goals / Follow-up
This PR does **not** address the existing first-turn-after-resume
double-injection behavior.
A follow-up PR will handle resume-time dedup/idempotence separately.
If you want, I can also give you a shorter “squash-merge friendly”
version of the description.
## Codex author
`codex fork 019c25e6-706e-75d1-9198-688ec00a8256`
## Problem
The first user turn can pay websocket handshake latency even when a
session has already started. We want to reduce that initial delay while
preserving turn semantics and avoiding any prompt send during startup.
Reviewer feedback also called out duplicated connect/setup paths and
unnecessary preconnect state complexity.
## Mental model
`ModelClient` owns session-scoped transport state. During session
startup, it can opportunistically warm one websocket handshake slot. A
turn-scoped `ModelClientSession` adopts that slot once if available,
restores captured sticky turn-state, and otherwise opens a websocket
through the same shared connect path.
If startup preconnect is still in flight, first turn setup awaits that
task and treats it as the first connection attempt for the turn.
Preconnect is handshake-only. The first `response.create` is still sent
only when a turn starts.
## Non-goals
This change does not make preconnect required for correctness and does
not change prompt/turn payload semantics. It also does not expand
fallback behavior beyond clearing preconnect state when fallback
activates.
## Tradeoffs
The implementation prioritizes simpler ownership and shared connection
code over header-match gating for reuse. The single-slot cache keeps
lifecycle straightforward but only benefits the immediate next turn.
Awaiting in-flight preconnect has the same app-level connect-timeout
semantics as existing websocket connect behavior (no new timeout class
introduced by this PR).
## Architecture
`core/src/client.rs`:
- Added session-level preconnect lifecycle state (`Idle` / `InFlight` /
`Ready`) carrying one warmed websocket plus optional captured
turn-state.
- Added `pre_establish_connection()` startup warmup and `preconnect()`
handshake-only setup.
- Deduped auth/provider resolution into `current_client_setup()` and
websocket handshake wiring into `connect_websocket()` /
`build_websocket_headers()`.
- Updated turn websocket path to adopt preconnect first, await in-flight
preconnect when present, then create a new websocket only when needed.
- Ensured fallback activation clears warmed preconnect state.
- Added documentation for lifecycle, ownership, sticky-routing
invariants, and timeout semantics.
`core/src/codex.rs`:
- Session startup invokes `model_client.pre_establish_connection(...)`.
- Turn metadata resolution uses the shared timeout helper.
`core/src/turn_metadata.rs`:
- Centralized shared timeout helper used by both turn-time metadata
resolution and startup preconnect metadata building.
`core/tests/common/responses.rs` + websocket test suites:
- Added deterministic handshake waiting helper (`wait_for_handshakes`)
with bounded polling.
- Added startup preconnect and in-flight preconnect reuse coverage.
- Fallback expectations now assert exactly two websocket attempts in
covered scenarios (startup preconnect + turn attempt before fallback
sticks).
## Observability
Preconnect remains best-effort and non-fatal. Existing
websocket/fallback telemetry remains in place, and debug logs now make
preconnect-await behavior and preconnect failures easier to reason
about.
## Tests
Validated with:
1. `just fmt`
2. `cargo test -p codex-core websocket_preconnect -- --nocapture`
3. `cargo test -p codex-core websocket_fallback -- --nocapture`
4. `cargo test -p codex-core
websocket_first_turn_waits_for_inflight_preconnect -- --nocapture`
Summary
- add a `required` flag for MCP servers everywhere config/CLI data is
touched so mandatory helpers can be round-tripped
- have `codex exec` and `codex app-server` thread start/resume fail fast
when required MCPs fail to initialize
<img width="1019" height="284" alt="Screenshot 2026-02-05 at 23 34 08"
src="https://github.com/user-attachments/assets/19ec3ce1-3c3b-40f5-b251-a31d964bf3bb"
/>
Currently, if a config value is set that fails the requirements, we exit
Codex.
Now, instead of this, we print a warning and default to a
requirements-permitting value.
This PR adds a dedicated `turn/steer` API for appending user input to an
in-flight turn.
## Motivation
Currently, steering in the app is implemented by just calling
`turn/start` while a turn is running. This has some really weird quirks:
- Client gets back a new `turn.id`, even though streamed
events/approvals remained tied to the original active turn ID.
- All the various turn-level override params on `turn/start` do not
apply to the "steer", and would only apply to the next real turn.
- There can also be a race condition where the client thinks the turn is
active but the server has already completed it, so there might be bugs
if the client has baked in some client-specific behavior thinking it's a
steer when in fact the server kicked off a new turn. This is
particularly possible when running a client against a remote app-server.
Having a dedicated `turn/steer` API eliminates all those quirks.
`turn/steer` behavior:
- Requires an active turn on threadId. Returns a JSON-RPC error if there
is no active turn.
- If expectedTurnId is provided, it must match the active turn (more
useful when connecting to a remote app-server).
- Does not emit `turn/started`.
- Does not accept turn overrides (`cwd`, `model`, `sandbox`, etc.) or
`outputSchema` to accurately reflect that these are not applied when
steering.
###### What
Remove special-casing that prevented auto-enabling `web_search` for
Azure model provider users. Addresses #10071, #10257.
###### Why
Azure fixed their responsesapi implementation; `web_search` is now
supported on models it wasn't before (like `gpt-5.1-codex-max`).
This request now works:
```
curl "$AZURE_API_ENDPOINT" -H "Content-Type: application/json" -H "Authorization: Bearer $AZURE_API_KEY" -d '{
"model": "gpt-5.1-codex-max",
"tools": [
{ "type": "web_search" }
],
"tool_choice": "auto",
"input": "Find the sunrise time in Paris today and cite the source."
}'
```
###### Tests
Tested with above curl, removed Azure-specific tests.
default-enablement of web_search is now client-side, no need to send
eligibility headers to backend.
Tested locally, headers no longer sent.
will wait for corresponding backend change to deploy before merging
So that the rest of the codebase (like TUI) don't need to be concerned
whether ChatGPT auth was handled by Codex itself or passed in via
app-server's external auth mode.
This introduces a `Hooks` service. It registers hooks from config and
dispatches hook events at runtime.
N.B. The hook config is not wired up to this yet. But for legacy
reasons, we wire up `notify` from config and power it using hooks now.
Nothing about the `notify` interface has changed.
I'd start by reviewing `hooks/types.rs`
Some things to note:
- hook names subject to change
- no hook result yet
- stopping semantics yet to be introduced
- additional hooks yet to be introduced
Summary
- refactor user shell command execution into a shared helper and add
modes for standalone vs active-turn execution
- run user shell commands asynchronously when a turn is already active
so they don’t replace or abort the current turn
- extend the tests to cover the new behavior and add the generated Codex
environment manifest
Testing
- Not run (not requested)
## Summary
When resuming with a different model, we should also append a developer
message with the model instructions
## Testing
- [x] Added unit tests
## Summary
This PR fixes a deterministic mismatch in remote compaction where
pre-trim estimation and the `/v1/responses/compact` payload could use
different base instructions.
Before this change:
- pre-trim estimation used model-derived instructions
(`model_info.get_model_instructions(...)`)
- compact payload used session base instructions
(`sess.get_base_instructions()`)
After this change:
- remote pre-trim estimation and compact payload both use the same
`BaseInstructions` instance from session state.
## Changes
- Added a shared estimator entry point in `ContextManager`:
- `estimate_token_count_with_base_instructions(&self, base_instructions:
&BaseInstructions) -> Option<i64>`
- Kept `estimate_token_count(&TurnContext)` as a thin wrapper that
resolves model/personality instructions and delegates to the new helper.
- Updated remote compaction flow to fetch base instructions once and
reuse it for both:
- trim preflight estimation
- compact request payload construction
- Added regression coverage for parity and behavior:
- unit test verifying explicit-base estimator behavior
- integration test proving remote compaction uses session override
instructions and trims accordingly
## Why this matters
This removes a deterministic divergence source where pre-trim could
think the request fits while the actual compact request exceeded context
because its instructions were longer/different.
## Scope
In scope:
- estimator/payload base-instructions parity in remote compaction
Out of scope:
- retry-on-`context_length_exceeded`
- compaction threshold/headroom policy changes
- broader trimming policy changes
## Codex author:
`codex fork 019c2b24-c2df-7b31-a482-fb8cf7a28559`
## Summary
When switching models, we should append the instructions of the new
model to the conversation as a developer message.
## Test
- [x] Adds a unit test
Make ModelClient a session-scoped object.
Move state that is session level onto the client, and make state that is
per-turn explicit on corresponding methods.
Stop taking a huge Config object, instead only pass in values that are
actually needed.
---------
Co-authored-by: Josh McKinney <joshka@openai.com>
Took over the work that @aaronl-openai started here:
https://github.com/openai/codex/pull/10397
Now that app-server clients are able to set up custom tools (called
`dynamic_tools` in app-server), we should expose a way for clients to
pass in not just text, but also image outputs. This is something the
Responses API already supports for function call outputs, where you can
pass in either a string or an array of content outputs (text, image,
file):
https://platform.openai.com/docs/api-reference/responses/create#responses_create-input-input_item_list-item-function_tool_call_output-output-array-input_image
So let's just plumb it through in Codex (with the caveat that we only
support text and image for now). This is implemented end-to-end across
app-server v2 protocol types and core tool handling.
## Breaking API change
NOTE: This introduces a breaking change with dynamic tools, but I think
it's ok since this concept was only recently introduced
(https://github.com/openai/codex/pull/9539) and it's better to get the
API contract correct. I don't think there are any real consumers of this
yet (not even the Codex App).
Old shape:
`{ "output": "dynamic-ok", "success": true }`
New shape:
```
{
"contentItems": [
{ "type": "inputText", "text": "dynamic-ok" },
{ "type": "inputImage", "imageUrl": "data:image/png;base64,AAA" }
]
"success": true
}
```
Add a centralized FileWatcher in codex-core (using notify) that watches
skill roots from the config layer stack (recursive)
Send `SkillsChanged` events when relevant file system changes are
detected
On `SkillsChanged`:
* Invalidate the skills cache immediately in ThreadManager
* Emit EventMsg::SkillsUpdateAvailable to active sessions
~~* Broadcast a new app-server notification:
SkillsListUpdatedNotification~~
This change does not inject new items into the event stream. That means
the agent will not know about new skills, so it won't be able to
implicitly invoke new skills. It also won't know about changes to
existing skills, so if it has already read the contents of a modified
skill, it will not honor the new behavior.
This change also does not detect modifications to AGENTS.md.
I plan to address these limitations in a follow-on PR modeled after
#9985. Injection of new skills and AGENTS was deemed to risky, hence the
need to split the feature into two stages. The changes in this PR were
designed to easily accommodate the second stage once we have some other
foundational changes in place.
Testing: In addition to automated tests, I did manual testing to confirm
that newly-created skills, deleted skills, and renamed skills are
reflected in the TUI skill picker menu. Also confirmed that
modifications to behaviors for explicitly-invoked skills are honored.
---------
Co-authored-by: Xin Lin <xl@openai.com>
## Summary
This PR introduces a gated Bubblewrap (bwrap) Linux sandbox path. The
curent Linux sandbox path relies on in-process restrictions (including
Landlock). Bubblewrap gives us a more uniform filesystem isolation
model, especially explicit writable roots with the option to make some
directories read-only and granular network controls.
This is behind a feature flag so we can validate behavior safely before
making it the default.
- Added temporary rollout flag:
- `features.use_linux_sandbox_bwrap`
- Preserved existing default path when the flag is off.
- In Bubblewrap mode:
- Added internal retry without /proc when /proc mount is not permitted
by the host/container.
## Summary
This PR simplifies collaboration modes to the visible set `default |
plan`, while preserving backward compatibility for older partners that
may still send legacy mode
names.
Specifically:
- Renames the old Code behavior to **Default**.
- Keeps **Plan** as-is.
- Removes **Custom** mode behavior (fallbacks now resolve to Default).
- Keeps `PairProgramming` and `Execute` internally for compatibility
plumbing, while removing them from schema/API and UI visibility.
- Adds legacy input aliasing so older clients can still send old mode
names.
## What Changed
1. Mode enum and compatibility
- `ModeKind` now uses `Plan` + `Default` as active/public modes.
- `ModeKind::Default` deserialization accepts legacy values:
- `code`
- `pair_programming`
- `execute`
- `custom`
- `PairProgramming` and `Execute` variants remain in code but are hidden
from protocol/schema generation.
- `Custom` variant is removed; previous custom fallbacks now map to
`Default`.
2. Collaboration presets and templates
- Built-in presets now return only:
- `Plan`
- `Default`
- Template rename:
- `core/templates/collaboration_mode/code.md` -> `default.md`
- `execute.md` and `pair_programming.md` remain on disk but are not
surfaced in visible preset lists.
3. TUI updates
- Updated user-facing naming and prompts from “Code” to “Default”.
- Updated mode-cycle and indicator behavior to reflect only visible
`Plan` and `Default`.
- Updated corresponding tests and snapshots.
4. request_user_input behavior
- `request_user_input` remains allowed only in `Plan` mode.
- Rejection messaging now consistently treats non-plan modes as
`Default`.
5. Schemas
- Regenerated config and app-server schemas.
- Public schema types now advertise mode values as:
- `plan`
- `default`
## Backward Compatibility Notes
- Incoming legacy mode names (`code`, `pair_programming`, `execute`,
`custom`) are accepted and coerced to `default`.
- Outgoing/public schema surfaces intentionally expose only `plan |
default`.
- This allows tolerant ingestion of older partner payloads while
standardizing new integrations on the reduced mode set.
## Codex author
`codex fork 019c1fae-693b-7840-b16e-9ad38ea0bd00`