**Why We Did This**
- The goal is to reduce MCP tool context pollution by not exposing the
full MCP tool list up front
- It forces an explicit discovery step (`search_tool_bm25`) so the model
narrows tool scope before making MCP calls, which helps relevance and
lowers prompt/tool clutter.
**What It Changed**
- Added a new experimental feature flag `search_tool` in
`core/src/features.rs:90` and `core/src/features.rs:430`.
- Added config/schema support for that flag in
`core/config.schema.json:214` and `core/config.schema.json:1235`.
- Added BM25 dependency (`bm25`) in `Cargo.toml:129` and
`core/Cargo.toml:23`.
- Added new tool handler `search_tool_bm25` in
`core/src/tools/handlers/search_tool_bm25.rs:18`.
- Registered the handler and tool spec in
`core/src/tools/handlers/mod.rs:11` and `core/src/tools/spec.rs:780` and
`core/src/tools/spec.rs:1344`.
- Extended `ToolsConfig` to carry `search_tool` enablement in
`core/src/tools/spec.rs:32` and `core/src/tools/spec.rs:56`.
- Injected dedicated developer instructions for tool-discovery workflow
in `core/src/codex.rs:483` and `core/src/codex.rs:1976`, using
`core/templates/search_tool/developer_instructions.md:1`.
- Added session state to store one-shot selected MCP tools in
`core/src/state/session.rs:27` and `core/src/state/session.rs:131`.
- Added filtering so when feature is enabled, only selected MCP tools
are exposed on the next request (then consumed) in
`core/src/codex.rs:3800` and `core/src/codex.rs:3843`.
- Added E2E suite coverage for
enablement/instructions/hide-until-search/one-turn-selection in
`core/tests/suite/search_tool.rs:72`,
`core/tests/suite/search_tool.rs:109`,
`core/tests/suite/search_tool.rs:147`, and
`core/tests/suite/search_tool.rs:218`.
- Refactored test helper utilities to support config-driven tool
collection in `core/tests/suite/tools.rs:281`.
**Net Behavioral Effect**
- With `search_tool` **off**: existing MCP behavior (tools exposed
normally).
- With `search_tool` **on**: MCP tools start hidden, model must call
`search_tool_bm25`, and only returned `selected_tools` are available for
the next model call.
Instead of storing a special connection on the client level make the
regular task responsible for establishing a normal client session and
open a connection on it.
Then when the turn is started we pass in a pre-established session.
This introduces a `Hooks` service. It registers hooks from config and
dispatches hook events at runtime.
N.B. The hook config is not wired up to this yet. But for legacy
reasons, we wire up `notify` from config and power it using hooks now.
Nothing about the `notify` interface has changed.
I'd start by reviewing `hooks/types.rs`
Some things to note:
- hook names subject to change
- no hook result yet
- stopping semantics yet to be introduced
- additional hooks yet to be introduced
## Summary
When resuming with a different model, we should also append a developer
message with the model instructions
## Testing
- [x] Added unit tests
Make ModelClient a session-scoped object.
Move state that is session level onto the client, and make state that is
per-turn explicit on corresponding methods.
Stop taking a huge Config object, instead only pass in values that are
actually needed.
---------
Co-authored-by: Josh McKinney <joshka@openai.com>
Add a centralized FileWatcher in codex-core (using notify) that watches
skill roots from the config layer stack (recursive)
Send `SkillsChanged` events when relevant file system changes are
detected
On `SkillsChanged`:
* Invalidate the skills cache immediately in ThreadManager
* Emit EventMsg::SkillsUpdateAvailable to active sessions
~~* Broadcast a new app-server notification:
SkillsListUpdatedNotification~~
This change does not inject new items into the event stream. That means
the agent will not know about new skills, so it won't be able to
implicitly invoke new skills. It also won't know about changes to
existing skills, so if it has already read the contents of a modified
skill, it will not honor the new behavior.
This change also does not detect modifications to AGENTS.md.
I plan to address these limitations in a follow-on PR modeled after
#9985. Injection of new skills and AGENTS was deemed to risky, hence the
need to split the feature into two stages. The changes in this PR were
designed to easily accommodate the second stage once we have some other
foundational changes in place.
Testing: In addition to automated tests, I did manual testing to confirm
that newly-created skills, deleted skills, and renamed skills are
reflected in the TUI skill picker menu. Also confirmed that
modifications to behaviors for explicitly-invoked skills are honored.
---------
Co-authored-by: Xin Lin <xl@openai.com>
An experimental flow for env var skill dependencies. Skills can now
declare required env vars in SKILL.md; if missing, the CLI prompts the
user to get the value, and Core will store it in memory (eventually to a
local persistent store)
<img width="790" height="169" alt="image"
src="https://github.com/user-attachments/assets/cd928918-9403-43cb-a7e7-b8d59bcccd9a"
/>
Add a `.sqlite` database to be used to store rollout metatdata (and
later logs)
This PR is phase 1:
* Add the database and the required infrastructure
* Add a backfill of the database
* Persist the newly created rollout both in files and in the DB
* When we need to get metadata or a rollout, consider the `JSONL` as the
source of truth but compare the results with the DB and show any errors
### Overview
Currently calling `thread/resume` will always bump the thread's
`updated_at` timestamp. This PR makes it the `updated_at` timestamp
changes only if a turn is triggered.
### Additonal context
What we typically do on resuming a thread is **always** writing “initial
context” to the rollout file immediately. This initial context includes:
- Developer instructions derived from sandbox/approval policy + cwd
- Optional developer instructions (if provided)
- Optional collaboration-mode instructions
- Optional user instructions (if provided)
- Environment context (cwd, shell, etc.)
This PR defers writing the “initial context” to the rollout file until
the first `turn/start`, so we don't inadvertently bump the thread's
`updated_at` timestamp until a turn is actually triggered.
This works even though both `thread/resume` and `turn/start` accept
overrides (such as `model`, `cwd`, etc.) because the initial context is
seeded from the effective `TurnContext` in memory, computed at
`turn/start` time, after both sets of overrides have been applied.
**NOTE**: This is a very short-lived solution until we introduce sqlite.
Then we can remove this.
## Summary
Add dynamic tool injection to thread startup in API v2, wire dynamic
tool calls through the app server to clients, and plumb responses back
into the model tool pipeline.
### Flow (high level)
- Thread start injects `dynamic_tools` into the model tool list for that
thread (validation is done here).
- When the model emits a tool call for one of those names, core raises a
`DynamicToolCallRequest` event.
- The app server forwards it to the client as `item/tool/call`, waits
for the client’s response, then submits a `DynamicToolResponse` back to
core.
- Core turns that into a `function_call_output` in the next model
request so the model can continue.
### What changed
- Added dynamic tool specs to v2 thread start params and protocol types;
introduced `item/tool/call` (request/response) for dynamic tool
execution.
- Core now registers dynamic tool specs at request time and routes those
calls via a new dynamic tool handler.
- App server validates tool names/schemas, forwards dynamic tool call
requests to clients, and publishes tool outputs back into the session.
- Integration tests
### Summary
* Added `mcpServer/refresh` command to inform app servers and active
threads to refresh mcpServer on next turn event.
* Added `pending_mcp_server_refresh_config` to codex core so that if the
value is populated, we reinitialize the mcp server manager on the thread
level.
* The config is updated on `mcpServer/refresh` command which we iterate
through threads and provide with the latest config value after last
write.
Add metrics capabilities to Codex. The `README.md` is up to date.
This will not be merged with the metrics before this PR of course:
https://github.com/openai/codex/pull/8350
Added an agent control plane that lets sessions spawn or message other
conversations via `AgentControl`.
`AgentBus` (core/src/agent/bus.rs) keeps track of the last known status
of a conversation.
ConversationManager now holds shared state behind an Arc so AgentControl
keeps only a weak back-reference, the goal is just to avoid explicit
cycle reference.
Follow-ups:
* Build a small tool in the TUI to be able to see every agent and send
manual message to each of them
* Handle approval requests in this TUI
* Add tools to spawn/communicate between agents (see related design)
* Define agent types
# External (non-OpenAI) Pull Request Requirements
Before opening this Pull Request, please read the dedicated
"Contributing" markdown file or your PR may be closed:
https://github.com/openai/codex/blob/main/docs/contributing.md
If your PR conforms to our contribution guidelines, replace this text
with a detailed and high quality description of your changes.
Include a link to a bug report or enhancement request.
refactor the way we load and manage skills:
1. Move skill discovery/caching into SkillsManager and reuse it across
sessions.
2. Add the skills/list API (Op::ListSkills/SkillsListResponse) to fetch
skills for one or more cwds. Also update app-server for VSCE/App;
3. Trigger skills/list during session startup so UIs preload skills and
handle errors immediately.
1. Skills load once in core at session start; the cached outcome is
reused across core and surfaced to TUI via SessionConfigured.
2. TUI detects explicit skill selections, and core injects the matching
SKILL.md content into the turn when a selected skill is present.
This PR moves `ModelsFamily` to `openai_models`. It also propagates
`ModelsManager` to session services and use it to drive model family. We
also make `derive_default_model_family` private because it's a step
towards what we want: one place that gives model configuration.
This is a second step at having one source of truth for models
information and config: `ModelsManager`.
Next steps would be to remove `ModelsFamily` from config. That's massive
because it's being used in 41 occasions mostly pre launching `codex`.
Also, we need to make `find_family_for_model` private. It's also big
because it's being used in 21 occasions ~ all tests.
- The total token used returned from the api doesn't account for the
reasoning items before the assistant message
- Account for those for auto compaction
- Add the encrypted reasoning effort in the common tests utils
- Add a test to make sure it works as expected
- This PR is to make it on path for truncating by tokens. This path will
be initially used by unified exec and context manager (responsible for
MCP calls mainly).
- We are exposing new config `calls_output_max_tokens`
- Use `tokens` as the main budget unit but truncate based on the model
family by Introducing `TruncationPolicy`.
- Introduce `truncate_text` as a router for truncation based on the
mode.
In next PRs:
- remove truncate_with_line_bytes_budget
- Add the ability to the model to override the token budget.
This is just a refactor of `conversation_history` file by breaking it up
into multiple smaller ones with helper. This refactor will help us move
more functionality related to context management here. in a clean way.
In this PR, I am exploring migrating task kind to an invocation of
Codex. The main reason would be getting rid off multiple
`ConversationHistory` state and streamlining our context/history
management.
This approach depends on opening a channel between the sub-codex and
codex. This channel is responsible for forwarding `interactive`
(`approvals`) and `non-interactive` events. The `task` is responsible
for handling those events.
This opens the door for implementing `codex as a tool`, replacing
`compact` and `review`, and potentially subagents.
One consideration is this code is very similar to `app-server` specially
in the approval part. If in the future we wanted an interactive
`sub-codex` we should consider using `codex-mcp`
I want to centralize input processing and management to
`ConversationHistory`. This would need `ConversationHistory` to have
access to `token_info` (i.e. preventing adding a big input to the
history). Besides, it makes more sense to have it on
`ConversationHistory` than `state`.
We are doing some ad-hoc logic while dealing with conversation history.
Ideally, we shouldn't mutate `vec[responseitem]` manually at all and
should depend on `ConversationHistory` for those changes.
Those changes are:
- Adding input to the history
- Removing items from the history
- Correcting history
I am also adding some `error` logs for cases we shouldn't ideally face.
For example, we shouldn't be missing `toolcalls` or `outputs`. We
shouldn't hit `ContextWindowExceeded` while performing `compact`
This refactor will give us granular control over our context management.
1. Adds AgentMessage, Reasoning, WebSearch items.
2. Switches the ResponseItem parsing to use new items and then also emit
3. Removes user-item kind and filters out "special" (environment) user
items when returning to clients.
Today `sub_id` is an ID of a single incoming Codex Op submition. We then
associate all events triggered by this operation using the same
`sub_id`.
At the same time we are also creating a TurnContext per submission and
we'd like to start associating some events (item added/item completed)
with an entire turn instead of just the operation that started it.
Using turn context when sending events give us flexibility to change
notification scheme.
Adds a new ItemStarted event and delivers UserMessage as the first item
type (more to come).
Renames `InputItem` to `UserInput` considering we're using the `Item`
suffix for actual items.
The goal of this change:
1. Unify user input and user turn implementation.
2. Have a single place where turn/session setting overrides are applied.
3. Have a single place where turn context is created.
4. Create TurnContext only for actual turn and have a separate structure
for current session settings (reuse ConfigureSession)
In the past, we were treating `input exceeded context window` as a
streaming error and retrying on it. Retrying on it has no point because
it won't change the behavior. In this PR, we surface the error to the
client without retry and also send a token count event to indicate that
the context window is full.
<img width="650" height="125" alt="image"
src="https://github.com/user-attachments/assets/c26b1213-4c27-4bfc-90f4-51a270a3efd5"
/>