Commit Graph

4 Commits

Author SHA1 Message Date
Felipe Coury
9798eb377a feat(cli): add codex doctor diagnostics (#22336)
## Why

Users and support need a single command that captures the local Codex
runtime, configuration, auth, terminal, network, and state shape without
asking the user to know which diagnostic depth to choose first. `codex
doctor` now runs the useful checks by default and makes the detailed
human output the default because the command is usually run when someone
already needs context.

The command also targets concrete support failure modes we have seen
while iterating on the design:

- update-target mismatches like #21956, where the installed package
manager target can differ from the running executable
- terminal and multiplexer issues that depend on `TERM`, tmux/zellij
state, color handling, and TTY metadata
- provider-specific HTTP/WebSocket connectivity, including ChatGPT
WebSocket handshakes and API-key/provider endpoint reachability
- local state/log SQLite integrity problems and large rollout
directories
- feedback reports that need an attached, redacted diagnostic snapshot
without asking the user to run a second command

## What Changed

- Adds `codex doctor` as a grouped CLI diagnostic report with default
detailed output and `--summary` for the compact view.
- Adds stable report sections for Environment, Configuration, Updates,
Connectivity, and Background Server, plus a top Notes block that
promotes anomalies such as available updates, large rollout directories,
optional MCP issues, and mixed auth signals.
- Adds runtime provenance, install consistency, bundled/system search
readiness, terminal/multiplexer metadata, `config.toml` parse status,
auth mode details, sandbox details, feature flag summaries, update
cache/latest-version state, app-server daemon state, SQLite integrity
checks, rollout statistics, and provider-aware network diagnostics.
- Adds ChatGPT WebSocket diagnostics that report the negotiated HTTP
upgrade as `HTTP 101 Switching Protocols` and include timeout, DNS,
auth, and provider context in detailed output.
- Makes reachability provider-aware: API-key OpenAI setups check the API
endpoint, ChatGPT auth checks the ChatGPT path, and custom/AWS/local
providers check configured HTTP endpoints when available.
- Adds structured, redacted JSON output where `checks` is keyed by check
id and `details` is a key/value object for support tooling.
- Integrates doctor with feedback uploads by attaching a best-effort
`codex-doctor-report.json` report and adding derived Sentry tags for
overall status and failing/warning checks.
- Updates the TUI feedback consent copy so users can see that the doctor
report is included when logs/diagnostics are uploaded.
- Updates the CLI bug issue template to ask reporters for `codex doctor
--json` and render pasted reports as JSON.

## Example Output

The examples below are sanitized from local smoke runs with `--no-color`
so the structure is reviewable in plain text.

### `codex doctor`

```text
Codex Doctor v0.0.0 · macos-aarch64

Notes
   ↑ updates      0.130.0 available (current 0.0.0, dismissed 0.128.0)
   ⚠ rollouts     1,526 active files · 2.53 GB on disk
   ⚠ mcp          MCP configuration has optional issues
   ⚠ auth         mixed auth signals: ChatGPT login plus API key env var; HTTP reachability uses API-key mode
─────────────────────────────────────────────────────────────

Environment
  ✓ runtime      local debug build
      version                  0.0.0
      install method           other
      commit                   unknown
      executable               ~/code/codex.fcoury-doct…x-rs/target/debug/codex
  ✓ install      consistent
      context                  other
      managed by               npm: no · bun: no · package root —
      PATH entries (2)         ~/.local/share/mise/installs/node/24/bin/codex
                               ~/.local/share/mise/shims/codex
  ✓ search       ripgrep 15.1.0 (system, `rg`)
  ✓ terminal     Ghostty 1.3.2-main-+b0f827665 · tmux 3.6a · TERM=xterm-256color
      terminal                 Ghostty
      TERM_PROGRAM             ghostty
      terminal version         1.3.2-main-+b0f827665
      TERM                     xterm-256color
      multiplexer              tmux 3.6a
      tmux extended-keys       on
      tmux allow-passthrough   on
      tmux set-clipboard       on
  ✓ state        databases healthy
      CODEX_HOME               ~/.codex (dir)
      state DB                 ~/.codex/state_5.sqlite (file) · integrity ok
      log DB                   ~/.codex/logs_2.sqlite (file) · integrity ok
      active rollouts          1,526 files · 2.53 GB (avg 1.70 MB)
      archived rollouts        8 files · 3.84 MB (avg 491.11 KB)

Configuration
  ✓ config       loaded
      model                    gpt-5.5 · openai
      cwd                      ~/code/codex.fcoury-doctor/codex-rs
      config.toml              ~/.codex/config.toml
      config.toml parse        ok
      MCP servers              1
      feature flags            36 enabled · 7 overridden (full list with --all)
      overrides                code_mode, code_mode_only, memories, chronicle, goals, remote_control, prevent_idle_sleep
  ✓ auth         auth is configured
      auth storage mode        File
      auth file                ~/.codex/auth.json
      auth env vars present    OPENAI_API_KEY
      stored auth mode         chatgpt
      stored API key           false
      stored ChatGPT tokens    true
      stored agent identity    false
  ⚠ mcp          MCP configuration has optional issues — Set the missing MCP env vars or disable the affected server.
      configured servers       1
      disabled servers         0
      streamable_http servers  1
      optional reachability    openaiDeveloperDocs: https://developers.openai.com/mcp (HEAD connect failed; GET connect failed)
  ✓ sandbox      restricted fs + restricted network · approval OnRequest
      approval policy          OnRequest
      filesystem sandbox       restricted
      network sandbox          restricted

Connectivity
  ✓ network      network-related environment looks readable
  ✓ websocket    connected (HTTP 101 Switching Protocols) · 15s timeout
      model provider           openai
      provider name            OpenAI
      wire API                 responses
      supports websockets      true
      connect timeout          15000 ms
      auth mode                chatgpt
      endpoint                 wss://chatgpt.com/backend-api/<redacted>
      DNS                      2 IPv4, 2 IPv6, first IPv6
      handshake result         HTTP 101 Switching Protocols
  ✗ reachability one or more required provider endpoints are unreachable over HTTP — Check proxy, VPN, firewall, DNS, and custom CA configuration.
      reachability mode        API key auth
      openai API               https://api.openai.com/v1 connect failed (required)

Background Server
  ○ app-server   not running (ephemeral mode)

─────────────────────────────────────────────────────────────
11 ok · 1 idle · 4 notes · 1 warn · 1 fail failed

--summary compact output           --all expand truncated lists
--json redacted report
```

### `codex doctor --summary`

```text
Codex Doctor v0.0.0 · macos-aarch64

Notes
   ↑ updates      0.130.0 available (current 0.0.0, dismissed 0.128.0)
   ⚠ rollouts     1,526 active files · 2.53 GB on disk
   ⚠ mcp          MCP configuration has optional issues
   ⚠ auth         mixed auth signals: ChatGPT login plus API key env var; HTTP reachability uses API-key mode
─────────────────────────────────────────────────────────────

Environment
  ✓ runtime      local debug build
  ✓ install      consistent
  ✓ search       ripgrep 15.1.0 (system, `rg`)
  ✓ terminal     Ghostty 1.3.2-main-+b0f827665 · tmux 3.6a · TERM=xterm-256color
  ✓ state        databases healthy

Configuration
  ✓ config       loaded
  ✓ auth         auth is configured
  ⚠ mcp          MCP configuration has optional issues — Set the missing MCP env vars or disable the affected server.
  ✓ sandbox      restricted fs + restricted network · approval OnRequest

Updates
  ✓ updates      update configuration is locally consistent

Connectivity
  ✓ network      network-related environment looks readable
  ✓ websocket    connected (HTTP 101 Switching Protocols) · 15s timeout
  ✗ reachability one or more required provider endpoints are unreachable over HTTP — Check proxy, VPN, firewall, DNS, and custom CA configuration.

Background Server
  ○ app-server   not running (ephemeral mode)

─────────────────────────────────────────────────────────────
11 ok · 1 idle · 4 notes · 1 warn · 1 fail failed

Run codex doctor without --summary for detailed diagnostics.
--all expand truncated lists       --json redacted report
```

### `codex doctor --json` shape

```json
{
  "schema_version": 1,
  "overall_status": "fail",
  "checks": {
    "runtime.provenance": {
      "id": "runtime.provenance",
      "category": "Environment",
      "status": "ok",
      "summary": "local debug build",
      "details": {
        "version": "0.0.0",
        "install method": "other",
        "commit": "unknown"
      }
    },
    "sandbox.helpers": {
      "id": "sandbox.helpers",
      "category": "Configuration",
      "status": "ok",
      "summary": "restricted fs + restricted network · approval OnRequest",
      "details": {
        "approval policy": "OnRequest",
        "filesystem sandbox": "restricted",
        "network sandbox": "restricted"
      }
    }
  }
}
```

### `/feedback` new sentry attachment

<img width="938" height="798" alt="CleanShot 2026-05-13 at 15 36 14"
src="https://github.com/user-attachments/assets/715e62e0-d7b4-4fea-a35a-fd5d5d33c4c0"
/>

### New section in CLI issue template

<img width="1164" height="435" alt="CleanShot 2026-05-13 at 15 47 24"
src="https://github.com/user-attachments/assets/9081dc25-a28c-4afa-8ba1-e299c2b4031d"
/>

## How to Test

1. Run `cargo run --bin codex -- doctor --no-color`.
2. Confirm the detailed report is the default and includes promoted
Notes, grouped sections, terminal details, state DB integrity, rollout
stats, provider reachability, WebSocket diagnostics, and app-server
status.
3. Run `cargo run --bin codex -- doctor --summary --no-color`.
4. Confirm the compact view keeps the same sections and summary counts
but omits detailed key/value rows.
5. Run `cargo run --bin codex -- doctor --json`.
6. Confirm the output is redacted JSON, `checks` is an object keyed by
check id, and each check's `details` is a key/value object.
7. Preview the CLI bug issue template and confirm the `Codex doctor
report` field appears after the terminal field, asks for `codex doctor
--json`, and renders pasted output as JSON.
8. Start a feedback flow that includes logs.
9. Confirm the upload consent copy lists `codex-doctor-report.json`
alongside the log attachments.

Targeted tests:

- `cargo test -p codex-cli doctor`
- `cargo test -p codex-app-server
doctor_report_tags_summarize_status_counts`
- `cargo test -p codex-feedback`
- `cargo test -p codex-tui feedback_view`
- `just argument-comment-lint`
- `git diff --check`
2026-05-13 21:23:19 +00:00
pakrym-oai
acac786d91 [codex] add account id to feedback uploads (#21498)
## Why

Feedback uploads already carry auth-derived context like
`chatgpt_user_id`, but they do not include the authenticated
workspace/account id. Adding `account_id` makes feedback triage easier
when a user can operate across multiple ChatGPT workspaces.

## What changed

- emit auth-derived `account_id` into feedback tags in `app-server`
before the feedback snapshot is uploaded
- preserve that tag through `codex-feedback` upload tag assembly
alongside the existing merge behavior for other tags
- extend `codex-feedback` coverage to assert that snapshot-derived
`account_id` is present in uploaded tags

## Verification

- `cargo test -p codex-feedback
upload_tags_include_client_tags_and_preserve_reserved_fields`
- `cargo test -p codex-app-server --lib feedback_processor`
2026-05-07 08:45:16 -07:00
Ruslan Nigmatullin
4d201e340e state: pass state db handles through consumers (#20561)
## Why

SQLite state was still being opened from consumer paths, including lazy
`OnceCell`-backed thread-store call sites. That let one process
construct multiple state DB connections for the same Codex home, which
makes SQLite lock contention and `database is locked` failures much
easier to hit.

State DB lifetime should be chosen by main-like entrypoints and tests,
then passed through explicitly. Consumers should use the supplied
`Option<StateDbHandle>` or `StateDbHandle` and keep their existing
filesystem fallback or error behavior when no handle is available.

The startup path also needs to keep the rollout crate in charge of
SQLite state initialization. Opening `codex_state::StateRuntime`
directly bypasses rollout metadata backfill, so entrypoints should
initialize through `codex_rollout::state_db` and receive a handle only
after required rollout backfills have completed.

## What Changed

- Initialize the state DB in main-like entrypoints for CLI, TUI,
app-server, exec, MCP server, and the thread-manager sample.
- Pass `Option<StateDbHandle>` through `ThreadManager`,
`LocalThreadStore`, app-server processors, TUI app wiring, rollout
listing/recording, personality migration, shell snapshot cleanup,
session-name lookup, and memory/device-key consumers.
- Remove the lazy local state DB wrapper from the thread store so
non-test consumers use only the supplied handle or their existing
fallback path.
- Make `codex_rollout::state_db::init` the local state startup path: it
opens/migrates SQLite, runs rollout metadata backfill when needed, waits
for concurrent backfill workers up to a bounded timeout, verifies
completion, and then returns the initialized handle.
- Keep optional/non-owning SQLite helpers, such as remote TUI local
reads, as open-only paths that do not run startup backfill.
- Switch app-server startup from direct
`codex_state::StateRuntime::init` to the rollout state initializer so
app-server cannot skip rollout backfill.
- Collapse split rollout lookup/list APIs so callers use the normal
methods with an optional state handle instead of `_with_state_db`
variants.
- Restore `getConversationSummary(ThreadId)` to delegate through
`ThreadStore::read_thread` instead of a LocalThreadStore-specific
rollout path special case.
- Keep DB-backed rollout path lookup keyed on the DB row and file
existence, without imposing the filesystem filename convention on
existing DB rows.
- Verify readable DB-backed rollout paths against `session_meta.id`
before returning them, so a stale SQLite row that points at another
thread's JSONL falls back to filesystem search and read-repairs the DB
row.
- Keep `debug prompt-input` filesystem-only so a one-off debug command
does not initialize or backfill SQLite state just to print prompt input.
- Keep goal-session test Codex homes alive only in the goal-specific
helper, rather than leaking tempdirs from the shared session test
helper.
- Update tests and call sites to pass explicit state handles where DB
behavior is expected and explicit `None` where filesystem-only behavior
is intended.

## Validation

- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo check -p
codex-rollout -p codex-thread-store -p codex-app-server -p codex-core -p
codex-tui -p codex-exec -p codex-cli --tests`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-rollout state_db_`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-rollout find_thread_path`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-rollout find_thread_path -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-rollout try_init_ -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-rollout`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo clippy -p
codex-rollout --lib -- -D warnings`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-thread-store
read_thread_falls_back_when_sqlite_path_points_to_another_thread --
--nocapture`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-thread-store`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
shell_snapshot`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
--test all personality_migration`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
--test all rollout_list_find`
- `RUST_MIN_STACK=8388608 CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
--test all rollout_list_find::find_prefers_sqlite_path_by_id --
--nocapture`
- `RUST_MIN_STACK=8388608 CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
--test all rollout_list_find -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p codex-core
interrupt_accounts_active_goal_before_pausing`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-app-server get_auth_status -- --test-threads=1`
- `CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo test -p
codex-app-server --lib`
- `CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db cargo check -p codex-rollout
-p codex-app-server --tests`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db just fix -p codex-rollout
-p codex-thread-store -p codex-core -p codex-app-server -p codex-tui -p
codex-exec -p codex-cli`
- `CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db just fix -p codex-rollout -p
codex-app-server`
- `CARGO_TARGET_DIR=/tmp/codex-target-state-db just fix -p
codex-rollout`
- `CODEX_SKIP_VENDORED_BWRAP=1
CARGO_TARGET_DIR=/tmp/codex-target-state-db just fix -p codex-core`
- `just argument-comment-lint -p codex-core`
- `just argument-comment-lint -p codex-rollout`

Focused coverage added in `codex-rollout`:

- `recorder::tests::state_db_init_backfills_before_returning` verifies
the rollout metadata row exists before startup init returns.
- `state_db::tests::try_init_waits_for_concurrent_startup_backfill`
verifies startup waits for another worker to finish backfill instead of
disabling the handle for the process.
-
`state_db::tests::try_init_times_out_waiting_for_stuck_startup_backfill`
verifies startup does not hang indefinitely on a stuck backfill lease.
-
`tests::find_thread_path_accepts_existing_state_db_path_without_canonical_filename`
verifies DB-backed lookup accepts valid existing rollout paths even when
the filename does not include the thread UUID.
-
`tests::find_thread_path_falls_back_when_db_path_points_to_another_thread`
verifies DB-backed lookup ignores a stale row whose existing path
belongs to another thread and read-repairs the row after filesystem
fallback.

Focused coverage updated in `codex-core`:

- `rollout_list_find::find_prefers_sqlite_path_by_id` now uses a
DB-preferred rollout file with matching `session_meta.id`, so it still
verifies that valid SQLite paths win without depending on stale/empty
rollout contents.

`cargo test -p codex-app-server thread_list_respects_search_term_filter
-- --test-threads=1 --nocapture` was attempted locally but timed out
waiting for the app-server test harness `initialize` response before
reaching the changed thread-list code path.

`bazel test //codex-rs/thread-store:thread-store-unit-tests
--test_output=errors` was attempted locally after the thread-store fix,
but this container failed before target analysis while fetching `v8+`
through BuildBuddy/direct GitHub. The equivalent local crate coverage,
including `cargo test -p codex-thread-store`, passes.

A plain local `cargo check -p codex-rollout -p codex-app-server --tests`
also requires system `libcap.pc` for `codex-linux-sandbox`; the
follow-up app-server check above used `CODEX_SKIP_VENDORED_BWRAP=1` in
this container.
2026-05-04 11:46:03 -07:00
pakrym-oai
33b19bcfde [codex] Split app-server request processors (#20940)
## Why

The app-server request path had grown around a large
`CodexMessageProcessor` plus separate API wrapper/helper modules. That
made the dependency graph hard to see and forced unrelated request
families to share broad processor state.

This PR makes the split mechanical and command-prefix oriented so
request families own only the dependencies they use.

## What changed

- Replaced `CodexMessageProcessor` with command-prefix request
processors under `app-server/src/request_processors/`.
- Removed the old config, device-key, external-agent-config, and fs API
wrapper files by moving their API handling into processors.
- Split apps, plugins, marketplace, catalog, account, MCP, command exec,
fs, git, feedback, thread, turn, thread goals, and Windows sandbox
handling into dedicated processors.
- Kept shared lifecycle, summary conversion, token usage replay, and
shared error mapping only where multiple processors use them; single-use
helpers were inlined into their owning processor.
- Removed the fallback processor path and moved processor tests to
`_tests` files.

## Validation

- `cargo test -p codex-app-server`
- `cargo check -p codex-app-server`
- `just fix -p codex-app-server`
2026-05-04 09:34:11 -07:00