Compare commits

...

3 Commits

Author SHA1 Message Date
Ed Bayes
f4c61633e6 core: customize collab config deprecation warning 2026-02-22 23:14:30 -05:00
Michael Bolin
e8949f4507 test: vendor zsh fork via DotSlash and stabilize zsh-fork tests (#12518)
## Why

The zsh integration tests were still brittle in two ways:

- they relied on `CODEX_TEST_ZSH_PATH` / environment-specific setup, so
they often did not exercise the patched zsh fork that `shell-tool-mcp`
ships
- once the tests consistently used the vendored zsh fork, they exposed
real Linux-specific zsh-fork issues in CI

In particular, the Linux failures were not just test noise:

- the zsh-fork launch path was dropping `ExecRequest.arg0`, so Linux
`codex-linux-sandbox` arg0 dispatch did not run and zsh wrapper-mode
could receive malformed arguments
- the
`turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2`
test uses the zsh exec bridge (which talks to the parent over a Unix
socket), but Linux restricted sandbox seccomp denies `connect(2)`,
causing timeouts on `ubuntu-24.04` x86/arm

This PR makes the zsh tests consistently run against the intended
vendored zsh fork and fixes/hardens the zsh-fork path so the Linux CI
signal is meaningful.

## What Changed

- Added a single shared test-only DotSlash file for the patched zsh fork
at `codex-rs/exec-server/tests/suite/zsh` (analogous to the existing
`bash` test resource).
- Updated both app-server and exec-server zsh tests to use that shared
DotSlash zsh (no duplicate zsh DotSlash file, no `CODEX_TEST_ZSH_PATH`
dependency).
- Updated the app-server zsh-fork test helper to resolve the shared
DotSlash zsh and avoid silently falling back to host zsh.
- Kept the app-server zsh-fork tests configured via `config.toml`, using
a test wrapper path where needed to force `zsh -df` (and rewrite `-lc`
to `-c`) for the subcommand-decline test.
- Hardened the app-server subcommand-decline zsh-fork test for CI
variability:
  - tolerate an extra `/responses` POST with a no-op mock response
- tolerate non-target approval ordering while remaining strict on the
two `/usr/bin/true` approvals and decline behavior
- use `DangerFullAccess` on Linux for this one test because it validates
zsh approval flow, not Linux sandbox socket restrictions
- Fixed zsh-fork process launching on Linux by preserving `req.arg0` in
`ZshExecBridge::execute_shell_request(...)` so `codex-linux-sandbox`
arg0 dispatch continues to work.
- Moved `maybe_run_zsh_exec_wrapper_mode()` under
`arg0_dispatch_or_else(...)` in `app-server` and `cli` so wrapper-mode
handling coexists correctly with arg0-dispatched helper modes.
- Consolidated duplicated `dotslash -- fetch` resolution logic into
shared test support (`core/tests/common/lib.rs`).
- Updated `codex-rs/exec-server/tests/suite/accept_elicitation.rs` to
use DotSlash zsh and hardened the zsh elicitation test for Bazel/zsh
differences by:
  - resolving an absolute `git` path
  - running `git init --quiet .`
- asserting success / `.git` creation instead of relying on banner text

## Verification

- `cargo test -p codex-app-server turn_start_zsh_fork -- --nocapture`
- `cargo test -p codex-exec-server accept_elicitation -- --nocapture`
- `bazel test //codex-rs/exec-server:exec-server-all-test
--test_output=streamed --test_arg=--nocapture
--test_arg=accept_elicitation_for_prompt_rule_with_zsh`
- CI (`rust-ci`) on the final cleaned commit: `Tests — ubuntu-24.04 -
x86_64-unknown-linux-gnu` and `Tests — ubuntu-24.04-arm -
aarch64-unknown-linux-gnu` passed in [run
22291424358](https://github.com/openai/codex/actions/runs/22291424358)
2026-02-22 19:39:56 -08:00
Eric Traut
7e569f1162 Add PR babysitting skill for this repo (#12513)
## PR Notes

This PR adds a project-scoped `babysit-pr` skill for ongoing PR
monitoring (CI, reviews, mergeability).

Simply invoke this skill after creating a PR, and codex will do its best
to get it to a mergeable state:

### What the skill does
* Fixes CI failures related to the PR
* Retries CI failures due to flaky tests
* Addresses code review comments if it agrees with them
* Addresses merge conflicts on main branch

### How the skill works
- Polls PR status on a loop (CI checks, workflow runs, review activity,
mergeability, and review decision).
- Detects new review feedback (including inline comments and automated
Codex review comments) and prompts/handles follow-up work.
- Distinguishes pending vs failed vs passed CI and identifies likely
flaky failures.
- Can retry failed checks/workflows when appropriate.
- Prioritizes actionable code review feedback over flaky CI retries (to
avoid rerunning CI on a SHA that is about to be replaced).
- Continues monitoring after fixes are applied and pushed, rather than
stopping after a progress update.
- Uses a slower backoff polling cadence once CI is green, while still
watching for new review feedback or state changes.
- Treats required review/approval as a blocking condition and keeps
watching until the PR is actually merge-ready (or merged/closed, or
human intervention is needed).

### Intended outcome

Keep the PR moving with minimal manual babysitting by continuously
watching for CI failures, reviewer feedback, and merge blockers, and
responding in the right order until the PR is ready to merge.
2026-02-22 15:36:28 -08:00
15 changed files with 1460 additions and 104 deletions

View File

@@ -0,0 +1,185 @@
---
name: babysit-pr
description: Babysit a GitHub pull request after creation by continuously polling CI checks/workflow runs, new review comments, and mergeability state until the PR is ready to merge (or merged/closed). Diagnose failures, retry likely flaky failures up to 3 times, auto-fix/push branch-related issues when appropriate, and stop only when user help is required (for example CI infrastructure issues, exhausted flaky retries, or ambiguous/blocking situations). Use when the user asks Codex to monitor a PR, watch CI, handle review comments, or keep an eye on failures and feedback on an open PR.
---
# PR Babysitter
## Objective
Babysit a PR persistently until one of these terminal outcomes occurs:
- The PR is merged or closed.
- CI is successful, there are no unaddressed review comments surfaced by the watcher, required review approval is not blocking merge, and there are no potential merge conflicts (PR is mergeable / not reporting conflict risk).
- A situation requires user help (for example CI infrastructure issues, repeated flaky failures after retry budget is exhausted, permission problems, or ambiguity that cannot be resolved safely).
Do not stop merely because a single snapshot returns `idle` while checks are still pending.
## Inputs
Accept any of the following:
- No PR argument: infer the PR from the current branch (`--pr auto`)
- PR number
- PR URL
## Core Workflow
1. When the user asks to "monitor"/"watch"/"babysit" a PR, start with the watcher's continuous mode (`--watch`) unless you are intentionally doing a one-shot diagnostic snapshot.
2. Run the watcher script to snapshot PR/CI/review state (or consume each streamed snapshot from `--watch`).
3. Inspect the `actions` list in the JSON response.
4. If `diagnose_ci_failure` is present, inspect failed run logs and classify the failure.
5. If the failure is likely caused by the current branch, patch code locally, commit, and push.
6. If `process_review_comment` is present, inspect surfaced review items and decide whether to address them.
7. If a review item is actionable and correct, patch code locally, commit, and push.
8. If the failure is likely flaky/unrelated and `retry_failed_checks` is present, rerun failed jobs with `--retry-failed-now`.
9. If both actionable review feedback and `retry_failed_checks` are present, prioritize review feedback first; a new commit will retrigger CI, so avoid rerunning flaky checks on the old SHA unless you intentionally defer the review change.
10. On every loop, verify mergeability / merge-conflict status (for example via `gh pr view`) in addition to CI and review state.
11. After any push or rerun action, immediately return to step 1 and continue polling on the updated SHA/state.
12. If you had been using `--watch` before pausing to patch/commit/push, relaunch `--watch` yourself in the same turn immediately after the push (do not wait for the user to re-invoke the skill).
13. Repeat polling until the PR is green + review-clean + mergeable, `stop_pr_closed` appears, or a user-help-required blocker is reached.
14. Maintain terminal/session ownership: while babysitting is active, keep consuming watcher output in the same turn; do not leave a detached `--watch` process running and then end the turn as if monitoring were complete.
## Commands
### One-shot snapshot
```bash
python3 .codex/skills/babysit-pr/scripts/gh_pr_watch.py --pr auto --once
```
### Continuous watch (JSONL)
```bash
python3 .codex/skills/babysit-pr/scripts/gh_pr_watch.py --pr auto --watch
```
### Trigger flaky retry cycle (only when watcher indicates)
```bash
python3 .codex/skills/babysit-pr/scripts/gh_pr_watch.py --pr auto --retry-failed-now
```
### Explicit PR target
```bash
python3 .codex/skills/babysit-pr/scripts/gh_pr_watch.py --pr <number-or-url> --once
```
## CI Failure Classification
Use `gh` commands to inspect failed runs before deciding to rerun.
- `gh run view <run-id> --json jobs,name,workflowName,conclusion,status,url,headSha`
- `gh run view <run-id> --log-failed`
Prefer treating failures as branch-related when logs point to changed code (compile/test/lint/typecheck/snapshots/static analysis in touched areas).
Prefer treating failures as flaky/unrelated when logs show transient infra/external issues (timeouts, runner provisioning failures, registry/network outages, GitHub Actions infra errors).
If classification is ambiguous, perform one manual diagnosis attempt before choosing rerun.
Read `.codex/skills/babysit-pr/references/heuristics.md` for a concise checklist.
## Review Comment Handling
The watcher surfaces review items from:
- PR issue comments
- Inline review comments
- Review submissions (COMMENT / APPROVED / CHANGES_REQUESTED)
It intentionally surfaces Codex reviewer bot feedback (for example comments/reviews from `chatgpt-codex-connector[bot]`) in addition to human reviewer feedback. Most unrelated bot noise should still be ignored.
For safety, the watcher only auto-surfaces trusted human review authors (for example repo OWNER/MEMBER/COLLABORATOR, plus the authenticated operator) and approved review bots such as Codex.
On a fresh watcher state file, existing pending review feedback may be surfaced immediately (not only comments that arrive after monitoring starts). This is intentional so already-open review comments are not missed.
When you agree with a comment and it is actionable:
1. Patch code locally.
2. Commit with `codex: address PR review feedback (#<n>)`.
3. Push to the PR head branch.
4. Resume watching on the new SHA immediately (do not stop after reporting the push).
5. If monitoring was running in `--watch` mode, restart `--watch` immediately after the push in the same turn; do not wait for the user to ask again.
If you disagree or the comment is non-actionable/already addressed, record it as handled by continuing the watcher loop (the script de-duplicates surfaced items via state after surfacing them).
If a code review comment/thread is already marked as resolved in GitHub, treat it as non-actionable and safely ignore it unless new unresolved follow-up feedback appears.
## Git Safety Rules
- Work only on the PR head branch.
- Avoid destructive git commands.
- Do not switch branches unless necessary to recover context.
- Before editing, check for unrelated uncommitted changes. If present, stop and ask the user.
- After each successful fix, commit and `git push`, then re-run the watcher.
- If you interrupted a live `--watch` session to make the fix, restart `--watch` immediately after the push in the same turn.
- Do not run multiple concurrent `--watch` processes for the same PR/state file; keep one watcher session active and reuse it until it stops or you intentionally restart it.
- A push is not a terminal outcome; continue the monitoring loop unless a strict stop condition is met.
Commit message defaults:
- `codex: fix CI failure on PR #<n>`
- `codex: address PR review feedback (#<n>)`
## Monitoring Loop Pattern
Use this loop in a live Codex session:
1. Run `--once`.
2. Read `actions`.
3. First check whether the PR is now merged or otherwise closed; if so, report that terminal state and stop polling immediately.
4. Check CI summary, new review items, and mergeability/conflict status.
5. Diagnose CI failures and classify branch-related vs flaky/unrelated.
6. Process actionable review comments before flaky reruns when both are present; if a review fix requires a commit, push it and skip rerunning failed checks on the old SHA.
7. Retry failed checks only when `retry_failed_checks` is present and you are not about to replace the current SHA with a review/CI fix commit.
8. If you pushed a commit or triggered a rerun, report the action briefly and continue polling (do not stop).
9. After a review-fix push, proactively restart continuous monitoring (`--watch`) in the same turn unless a strict stop condition has already been reached.
10. If everything is passing, mergeable, not blocked on required review approval, and there are no unaddressed review items, report success and stop.
11. If blocked on a user-help-required issue (infra outage, exhausted flaky retries, unclear reviewer request, permissions), report the blocker and stop.
12. Otherwise sleep according to the polling cadence below and repeat.
When the user explicitly asks to monitor/watch/babysit a PR, prefer `--watch` so polling continues autonomously in one command. Use repeated `--once` snapshots only for debugging, local testing, or when the user explicitly asks for a one-shot check.
Do not stop to ask the user whether to continue polling; continue autonomously until a strict stop condition is met or the user explicitly interrupts.
Do not hand control back to the user after a review-fix push just because a new SHA was created; restarting the watcher and re-entering the poll loop is part of the same babysitting task.
If a `--watch` process is still running and no strict stop condition has been reached, the babysitting task is still in progress; keep streaming/consuming watcher output instead of ending the turn.
## Polling Cadence
Use adaptive polling and continue monitoring even after CI turns green:
- While CI is not green (pending/running/queued or failing): poll every 1 minute.
- After CI turns green: start at every 1 minute, then back off exponentially when there is no change (for example 1m, 2m, 4m, 8m, 16m, 32m), capping at every 1 hour.
- Reset the green-state polling interval back to 1 minute whenever anything changes (new commit/SHA, check status changes, new review comments, mergeability changes, review decision changes).
- If CI stops being green again (new commit, rerun, or regression): return to 1-minute polling.
- If any poll shows the PR is merged or otherwise closed: stop polling immediately and report the terminal state.
## Stop Conditions (Strict)
Stop only when one of the following is true:
- PR merged or closed (stop as soon as a poll/snapshot confirms this).
- PR is ready to merge: CI succeeded, no surfaced unaddressed review comments, not blocked on required review approval, and no merge conflict risk.
- User intervention is required and Codex cannot safely proceed alone.
Keep polling when:
- `actions` contains only `idle` but checks are still pending.
- CI is still running/queued.
- Review state is quiet but CI is not terminal.
- CI is green but mergeability is unknown/pending.
- CI is green and mergeable, but the PR is still open and you are waiting for possible new review comments or merge-conflict changes per the green-state cadence.
- The PR is green but blocked on review approval (`REVIEW_REQUIRED` / similar); continue polling on the green-state cadence and surface any new review comments without asking for confirmation to keep watching.
## Output Expectations
Provide concise progress updates while monitoring and a final summary that includes:
- During long unchanged monitoring periods, avoid emitting a full update on every poll; summarize only status changes plus occasional heartbeat updates.
- Treat push confirmations, intermediate CI snapshots, and review-action updates as progress updates only; do not emit the final summary or end the babysitting session unless a strict stop condition is met.
- A user request to "monitor" is not satisfied by a couple of sample polls; remain in the loop until a strict stop condition or an explicit user interruption.
- A review-fix commit + push is not a completion event; immediately resume live monitoring (`--watch`) in the same turn and continue reporting progress updates.
- When CI first transitions to all green for the current SHA, emit a one-time celebratory progress update (do not repeat it on every green poll). Preferred style: `🚀 CI is all green! 33/33 passed. Still on watch for review approval.`
- Do not send the final summary while a watcher terminal is still running unless the watcher has emitted/confirmed a strict stop condition; otherwise continue with progress updates.
- Final PR SHA
- CI status summary
- Mergeability / conflict status
- Fixes pushed
- Flaky retry cycles used
- Remaining unresolved failures or review comments
## References
- Heuristics and decision tree: `.codex/skills/babysit-pr/references/heuristics.md`
- GitHub CLI/API details used by the watcher: `.codex/skills/babysit-pr/references/github-api-notes.md`

View File

@@ -0,0 +1,4 @@
interface:
display_name: "PR Babysitter"
short_description: "Watch PR CI, reviews, and merge conflicts"
default_prompt: "Babysit the current PR: monitor CI, reviewer comments, and merge-conflict status (prefer the watchers --watch mode for live monitoring); fix valid issues, push updates, and rerun flaky failures up to 3 times. Keep exactly one watcher session active for the PR (do not leave duplicate --watch terminals running). If you pause monitoring to patch review/CI feedback, restart --watch yourself immediately after the push in the same turn. If a watcher is still running and no strict stop condition has been reached, the task is still in progress: keep consuming watcher output and sending progress updates instead of ending the turn. Continue polling autonomously after any push/rerun until a strict terminal stop condition is reached or the user interrupts."

View File

@@ -0,0 +1,72 @@
# GitHub CLI / API Notes For `babysit-pr`
## Primary commands used
### PR metadata
- `gh pr view --json number,url,state,mergedAt,closedAt,headRefName,headRefOid,headRepository,headRepositoryOwner`
Used to resolve PR number, URL, branch, head SHA, and closed/merged state.
### PR checks summary
- `gh pr checks --json name,state,bucket,link,workflow,event,startedAt,completedAt`
Used to compute pending/failed/passed counts and whether the current CI round is terminal.
### Workflow runs for head SHA
- `gh api repos/{owner}/{repo}/actions/runs -X GET -f head_sha=<sha> -f per_page=100`
Used to discover failed workflow runs and rerunnable run IDs.
### Failed log inspection
- `gh run view <run-id> --json jobs,name,workflowName,conclusion,status,url,headSha`
- `gh run view <run-id> --log-failed`
Used by Codex to classify branch-related vs flaky/unrelated failures.
### Retry failed jobs only
- `gh run rerun <run-id> --failed`
Reruns only failed jobs (and dependencies) for a workflow run.
## Review-related endpoints
- Issue comments on PR:
- `gh api repos/{owner}/{repo}/issues/<pr_number>/comments?per_page=100`
- Inline PR review comments:
- `gh api repos/{owner}/{repo}/pulls/<pr_number>/comments?per_page=100`
- Review submissions:
- `gh api repos/{owner}/{repo}/pulls/<pr_number>/reviews?per_page=100`
## JSON fields consumed by the watcher
### `gh pr view`
- `number`
- `url`
- `state`
- `mergedAt`
- `closedAt`
- `headRefName`
- `headRefOid`
### `gh pr checks`
- `bucket` (`pass`, `fail`, `pending`, `skipping`)
- `state`
- `name`
- `workflow`
- `link`
### Actions runs API (`workflow_runs[]`)
- `id`
- `name`
- `status`
- `conclusion`
- `html_url`
- `head_sha`

View File

@@ -0,0 +1,58 @@
# CI / Review Heuristics
## CI classification checklist
Treat as **branch-related** when logs clearly indicate a regression caused by the PR branch:
- Compile/typecheck/lint failures in files or modules touched by the branch
- Deterministic unit/integration test failures in changed areas
- Snapshot output changes caused by UI/text changes in the branch
- Static analysis violations introduced by the latest push
- Build script/config changes in the PR causing a deterministic failure
Treat as **likely flaky or unrelated** when evidence points to transient or external issues:
- DNS/network/registry timeout errors while fetching dependencies
- Runner image provisioning or startup failures
- GitHub Actions infrastructure/service outages
- Cloud/service rate limits or transient API outages
- Non-deterministic failures in unrelated integration tests with known flake patterns
If uncertain, inspect failed logs once before choosing rerun.
## Decision tree (fix vs rerun vs stop)
1. If PR is merged/closed: stop.
2. If there are failed checks:
- Diagnose first.
- If branch-related: fix locally, commit, push.
- If likely flaky/unrelated and all checks for the current SHA are terminal: rerun failed jobs.
- If checks are still pending: wait.
3. If flaky reruns for the same SHA reach the configured limit (default 3): stop and report persistent failure.
4. Independently, process any new human review comments.
## Review comment agreement criteria
Address the comment when:
- The comment is technically correct.
- The change is actionable in the current branch.
- The requested change does not conflict with the users intent or recent guidance.
- The change can be made safely without unrelated refactors.
Do not auto-fix when:
- The comment is ambiguous and needs clarification.
- The request conflicts with explicit user instructions.
- The proposed change requires product/design decisions the user has not made.
- The codebase is in a dirty/unrelated state that makes safe editing uncertain.
## Stop-and-ask conditions
Stop and ask the user instead of continuing automatically when:
- The local worktree has unrelated uncommitted changes.
- `gh` auth/permissions fail.
- The PR branch cannot be pushed.
- CI failures persist after the flaky retry budget.
- Reviewer feedback requires a product decision or cross-team coordination.

View File

@@ -0,0 +1,805 @@
#!/usr/bin/env python3
"""Watch GitHub PR CI and review activity for Codex PR babysitting workflows."""
import argparse
import json
import os
import re
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from urllib.parse import urlparse
FAILED_RUN_CONCLUSIONS = {
"failure",
"timed_out",
"cancelled",
"action_required",
"startup_failure",
"stale",
}
PENDING_CHECK_STATES = {
"QUEUED",
"IN_PROGRESS",
"PENDING",
"WAITING",
"REQUESTED",
}
REVIEW_BOT_LOGIN_KEYWORDS = {
"codex",
}
TRUSTED_AUTHOR_ASSOCIATIONS = {
"OWNER",
"MEMBER",
"COLLABORATOR",
}
MERGE_BLOCKING_REVIEW_DECISIONS = {
"REVIEW_REQUIRED",
"CHANGES_REQUESTED",
}
MERGE_CONFLICT_OR_BLOCKING_STATES = {
"BLOCKED",
"DIRTY",
"DRAFT",
"UNKNOWN",
}
GREEN_STATE_MAX_POLL_SECONDS = 60 * 60
class GhCommandError(RuntimeError):
pass
def parse_args():
parser = argparse.ArgumentParser(
description=(
"Normalize PR/CI/review state for Codex PR babysitting and optionally "
"trigger flaky reruns."
)
)
parser.add_argument("--pr", default="auto", help="auto, PR number, or PR URL")
parser.add_argument("--repo", help="Optional OWNER/REPO override")
parser.add_argument("--poll-seconds", type=int, default=30, help="Watch poll interval")
parser.add_argument(
"--max-flaky-retries",
type=int,
default=3,
help="Max rerun cycles per head SHA before stop recommendation",
)
parser.add_argument("--state-file", help="Path to state JSON file")
parser.add_argument("--once", action="store_true", help="Emit one snapshot and exit")
parser.add_argument("--watch", action="store_true", help="Continuously emit JSONL snapshots")
parser.add_argument(
"--retry-failed-now",
action="store_true",
help="Rerun failed jobs for current failed workflow runs when policy allows",
)
parser.add_argument(
"--json",
action="store_true",
help="Emit machine-readable output (default behavior for --once and --retry-failed-now)",
)
args = parser.parse_args()
if args.poll_seconds <= 0:
parser.error("--poll-seconds must be > 0")
if args.max_flaky_retries < 0:
parser.error("--max-flaky-retries must be >= 0")
if args.watch and args.retry_failed_now:
parser.error("--watch cannot be combined with --retry-failed-now")
if not args.once and not args.watch and not args.retry_failed_now:
args.once = True
return args
def _format_gh_error(cmd, err):
stdout = (err.stdout or "").strip()
stderr = (err.stderr or "").strip()
parts = [f"GitHub CLI command failed: {' '.join(cmd)}"]
if stdout:
parts.append(f"stdout: {stdout}")
if stderr:
parts.append(f"stderr: {stderr}")
return "\n".join(parts)
def gh_text(args, repo=None):
cmd = ["gh"]
# `gh api` does not accept `-R/--repo` on all gh versions. The watcher's
# API calls use explicit endpoints (e.g. repos/{owner}/{repo}/...), so the
# repo flag is unnecessary there.
if repo and (not args or args[0] != "api"):
cmd.extend(["-R", repo])
cmd.extend(args)
try:
proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
except FileNotFoundError as err:
raise GhCommandError("`gh` command not found") from err
except subprocess.CalledProcessError as err:
raise GhCommandError(_format_gh_error(cmd, err)) from err
return proc.stdout
def gh_json(args, repo=None):
raw = gh_text(args, repo=repo).strip()
if not raw:
return None
try:
return json.loads(raw)
except json.JSONDecodeError as err:
raise GhCommandError(f"Failed to parse JSON from gh output for {' '.join(args)}") from err
def parse_pr_spec(pr_spec):
if pr_spec == "auto":
return {"mode": "auto", "value": None}
if re.fullmatch(r"\d+", pr_spec):
return {"mode": "number", "value": pr_spec}
parsed = urlparse(pr_spec)
if parsed.scheme and parsed.netloc and "/pull/" in parsed.path:
return {"mode": "url", "value": pr_spec}
raise ValueError("--pr must be 'auto', a PR number, or a PR URL")
def pr_view_fields():
return (
"number,url,state,mergedAt,closedAt,headRefName,headRefOid,"
"headRepository,headRepositoryOwner,mergeable,mergeStateStatus,reviewDecision"
)
def checks_fields():
return "name,state,bucket,link,workflow,event,startedAt,completedAt"
def resolve_pr(pr_spec, repo_override=None):
parsed = parse_pr_spec(pr_spec)
cmd = ["pr", "view"]
if parsed["value"] is not None:
cmd.append(parsed["value"])
cmd.extend(["--json", pr_view_fields()])
data = gh_json(cmd, repo=repo_override)
if not isinstance(data, dict):
raise GhCommandError("Unexpected PR payload from `gh pr view`")
pr_url = str(data.get("url") or "")
repo = (
repo_override
or extract_repo_from_pr_url(pr_url)
or extract_repo_from_pr_view(data)
)
if not repo:
raise GhCommandError("Unable to determine OWNER/REPO for the PR")
state = str(data.get("state") or "")
merged = bool(data.get("mergedAt"))
closed = bool(data.get("closedAt")) or state.upper() == "CLOSED"
return {
"number": int(data["number"]),
"url": pr_url,
"repo": repo,
"head_sha": str(data.get("headRefOid") or ""),
"head_branch": str(data.get("headRefName") or ""),
"state": state,
"merged": merged,
"closed": closed,
"mergeable": str(data.get("mergeable") or ""),
"merge_state_status": str(data.get("mergeStateStatus") or ""),
"review_decision": str(data.get("reviewDecision") or ""),
}
def extract_repo_from_pr_view(data):
head_repo = data.get("headRepository")
head_owner = data.get("headRepositoryOwner")
owner = None
name = None
if isinstance(head_owner, dict):
owner = head_owner.get("login") or head_owner.get("name")
elif isinstance(head_owner, str):
owner = head_owner
if isinstance(head_repo, dict):
name = head_repo.get("name")
repo_owner = head_repo.get("owner")
if not owner and isinstance(repo_owner, dict):
owner = repo_owner.get("login") or repo_owner.get("name")
elif isinstance(head_repo, str):
name = head_repo
if owner and name:
return f"{owner}/{name}"
return None
def extract_repo_from_pr_url(pr_url):
parsed = urlparse(pr_url)
parts = [p for p in parsed.path.split("/") if p]
if len(parts) >= 4 and parts[2] == "pull":
return f"{parts[0]}/{parts[1]}"
return None
def load_state(path):
if path.exists():
try:
data = json.loads(path.read_text())
except json.JSONDecodeError as err:
raise RuntimeError(f"State file is not valid JSON: {path}") from err
if not isinstance(data, dict):
raise RuntimeError(f"State file must contain an object: {path}")
return data, False
return {
"pr": {},
"started_at": None,
"last_seen_head_sha": None,
"retries_by_sha": {},
"seen_issue_comment_ids": [],
"seen_review_comment_ids": [],
"seen_review_ids": [],
"last_snapshot_at": None,
}, True
def save_state(path, state):
path.parent.mkdir(parents=True, exist_ok=True)
payload = json.dumps(state, indent=2, sort_keys=True) + "\n"
fd, tmp_name = tempfile.mkstemp(prefix=f"{path.name}.", suffix=".tmp", dir=path.parent)
tmp_path = Path(tmp_name)
try:
with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
tmp_file.write(payload)
os.replace(tmp_path, path)
except Exception:
try:
tmp_path.unlink(missing_ok=True)
except OSError:
pass
raise
def default_state_file_for(pr):
repo_slug = pr["repo"].replace("/", "-")
return Path(f"/tmp/codex-babysit-pr-{repo_slug}-pr{pr['number']}.json")
def get_pr_checks(pr_spec, repo):
parsed = parse_pr_spec(pr_spec)
cmd = ["pr", "checks"]
if parsed["value"] is not None:
cmd.append(parsed["value"])
cmd.extend(["--json", checks_fields()])
data = gh_json(cmd, repo=repo)
if data is None:
return []
if not isinstance(data, list):
raise GhCommandError("Unexpected payload from `gh pr checks`")
return data
def is_pending_check(check):
bucket = str(check.get("bucket") or "").lower()
state = str(check.get("state") or "").upper()
return bucket == "pending" or state in PENDING_CHECK_STATES
def summarize_checks(checks):
pending_count = 0
failed_count = 0
passed_count = 0
for check in checks:
bucket = str(check.get("bucket") or "").lower()
if is_pending_check(check):
pending_count += 1
if bucket == "fail":
failed_count += 1
if bucket == "pass":
passed_count += 1
return {
"pending_count": pending_count,
"failed_count": failed_count,
"passed_count": passed_count,
"all_terminal": pending_count == 0,
}
def get_workflow_runs_for_sha(repo, head_sha):
endpoint = f"repos/{repo}/actions/runs"
data = gh_json(
["api", endpoint, "-X", "GET", "-f", f"head_sha={head_sha}", "-f", "per_page=100"],
repo=repo,
)
if not isinstance(data, dict):
raise GhCommandError("Unexpected payload from actions runs API")
runs = data.get("workflow_runs") or []
if not isinstance(runs, list):
raise GhCommandError("Expected `workflow_runs` to be a list")
return runs
def failed_runs_from_workflow_runs(runs, head_sha):
failed_runs = []
for run in runs:
if not isinstance(run, dict):
continue
if str(run.get("head_sha") or "") != head_sha:
continue
conclusion = str(run.get("conclusion") or "")
if conclusion not in FAILED_RUN_CONCLUSIONS:
continue
failed_runs.append(
{
"run_id": run.get("id"),
"workflow_name": run.get("name") or run.get("display_title") or "",
"status": str(run.get("status") or ""),
"conclusion": conclusion,
"html_url": str(run.get("html_url") or ""),
}
)
failed_runs.sort(key=lambda item: (str(item.get("workflow_name") or ""), str(item.get("run_id") or "")))
return failed_runs
def get_authenticated_login():
data = gh_json(["api", "user"])
if not isinstance(data, dict) or not data.get("login"):
raise GhCommandError("Unable to determine authenticated GitHub login from `gh api user`")
return str(data["login"])
def comment_endpoints(repo, pr_number):
return {
"issue_comment": f"repos/{repo}/issues/{pr_number}/comments",
"review_comment": f"repos/{repo}/pulls/{pr_number}/comments",
"review": f"repos/{repo}/pulls/{pr_number}/reviews",
}
def gh_api_list_paginated(endpoint, repo=None, per_page=100):
items = []
page = 1
while True:
sep = "&" if "?" in endpoint else "?"
page_endpoint = f"{endpoint}{sep}per_page={per_page}&page={page}"
payload = gh_json(["api", page_endpoint], repo=repo)
if payload is None:
break
if not isinstance(payload, list):
raise GhCommandError(f"Unexpected paginated payload from gh api {endpoint}")
items.extend(payload)
if len(payload) < per_page:
break
page += 1
return items
def normalize_issue_comments(items):
out = []
for item in items:
if not isinstance(item, dict):
continue
out.append(
{
"kind": "issue_comment",
"id": str(item.get("id") or ""),
"author": extract_login(item.get("user")),
"author_association": str(item.get("author_association") or ""),
"created_at": str(item.get("created_at") or ""),
"body": str(item.get("body") or ""),
"path": None,
"line": None,
"url": str(item.get("html_url") or ""),
}
)
return out
def normalize_review_comments(items):
out = []
for item in items:
if not isinstance(item, dict):
continue
line = item.get("line")
if line is None:
line = item.get("original_line")
out.append(
{
"kind": "review_comment",
"id": str(item.get("id") or ""),
"author": extract_login(item.get("user")),
"author_association": str(item.get("author_association") or ""),
"created_at": str(item.get("created_at") or ""),
"body": str(item.get("body") or ""),
"path": item.get("path"),
"line": line,
"url": str(item.get("html_url") or ""),
}
)
return out
def normalize_reviews(items):
out = []
for item in items:
if not isinstance(item, dict):
continue
out.append(
{
"kind": "review",
"id": str(item.get("id") or ""),
"author": extract_login(item.get("user")),
"author_association": str(item.get("author_association") or ""),
"created_at": str(item.get("submitted_at") or item.get("created_at") or ""),
"body": str(item.get("body") or ""),
"path": None,
"line": None,
"url": str(item.get("html_url") or ""),
}
)
return out
def extract_login(user_obj):
if isinstance(user_obj, dict):
return str(user_obj.get("login") or "")
return ""
def is_bot_login(login):
return bool(login) and login.endswith("[bot]")
def is_actionable_review_bot_login(login):
if not is_bot_login(login):
return False
lower_login = login.lower()
return any(keyword in lower_login for keyword in REVIEW_BOT_LOGIN_KEYWORDS)
def is_trusted_human_review_author(item, authenticated_login):
author = str(item.get("author") or "")
if not author:
return False
if authenticated_login and author == authenticated_login:
return True
association = str(item.get("author_association") or "").upper()
return association in TRUSTED_AUTHOR_ASSOCIATIONS
def fetch_new_review_items(pr, state, fresh_state, authenticated_login=None):
repo = pr["repo"]
pr_number = pr["number"]
endpoints = comment_endpoints(repo, pr_number)
issue_payload = gh_api_list_paginated(endpoints["issue_comment"], repo=repo)
review_comment_payload = gh_api_list_paginated(endpoints["review_comment"], repo=repo)
review_payload = gh_api_list_paginated(endpoints["review"], repo=repo)
issue_items = normalize_issue_comments(issue_payload)
review_comment_items = normalize_review_comments(review_comment_payload)
review_items = normalize_reviews(review_payload)
all_items = issue_items + review_comment_items + review_items
seen_issue = {str(x) for x in state.get("seen_issue_comment_ids") or []}
seen_review_comment = {str(x) for x in state.get("seen_review_comment_ids") or []}
seen_review = {str(x) for x in state.get("seen_review_ids") or []}
# On a brand-new state file, surface existing review activity instead of
# silently treating it as seen. This avoids missing already-pending review
# feedback when monitoring starts after comments were posted.
new_items = []
for item in all_items:
item_id = item.get("id")
if not item_id:
continue
author = item.get("author") or ""
if not author:
continue
if is_bot_login(author):
if not is_actionable_review_bot_login(author):
continue
elif not is_trusted_human_review_author(item, authenticated_login):
continue
kind = item["kind"]
if kind == "issue_comment" and item_id in seen_issue:
continue
if kind == "review_comment" and item_id in seen_review_comment:
continue
if kind == "review" and item_id in seen_review:
continue
new_items.append(item)
if kind == "issue_comment":
seen_issue.add(item_id)
elif kind == "review_comment":
seen_review_comment.add(item_id)
elif kind == "review":
seen_review.add(item_id)
new_items.sort(key=lambda item: (item.get("created_at") or "", item.get("kind") or "", item.get("id") or ""))
state["seen_issue_comment_ids"] = sorted(seen_issue)
state["seen_review_comment_ids"] = sorted(seen_review_comment)
state["seen_review_ids"] = sorted(seen_review)
return new_items
def current_retry_count(state, head_sha):
retries = state.get("retries_by_sha") or {}
value = retries.get(head_sha, 0)
try:
return int(value)
except (TypeError, ValueError):
return 0
def set_retry_count(state, head_sha, count):
retries = state.get("retries_by_sha")
if not isinstance(retries, dict):
retries = {}
retries[head_sha] = int(count)
state["retries_by_sha"] = retries
def unique_actions(actions):
out = []
seen = set()
for action in actions:
if action not in seen:
out.append(action)
seen.add(action)
return out
def is_pr_ready_to_merge(pr, checks_summary, new_review_items):
if pr["closed"] or pr["merged"]:
return False
if not checks_summary["all_terminal"]:
return False
if checks_summary["failed_count"] > 0 or checks_summary["pending_count"] > 0:
return False
if new_review_items:
return False
if str(pr.get("mergeable") or "") != "MERGEABLE":
return False
if str(pr.get("merge_state_status") or "") in MERGE_CONFLICT_OR_BLOCKING_STATES:
return False
if str(pr.get("review_decision") or "") in MERGE_BLOCKING_REVIEW_DECISIONS:
return False
return True
def recommend_actions(pr, checks_summary, failed_runs, new_review_items, retries_used, max_retries):
actions = []
if pr["closed"] or pr["merged"]:
if new_review_items:
actions.append("process_review_comment")
actions.append("stop_pr_closed")
return unique_actions(actions)
if is_pr_ready_to_merge(pr, checks_summary, new_review_items):
actions.append("stop_ready_to_merge")
return unique_actions(actions)
if new_review_items:
actions.append("process_review_comment")
has_failed_pr_checks = checks_summary["failed_count"] > 0
if has_failed_pr_checks:
if checks_summary["all_terminal"] and retries_used >= max_retries:
actions.append("stop_exhausted_retries")
else:
actions.append("diagnose_ci_failure")
if checks_summary["all_terminal"] and failed_runs and retries_used < max_retries:
actions.append("retry_failed_checks")
if not actions:
actions.append("idle")
return unique_actions(actions)
def collect_snapshot(args):
pr = resolve_pr(args.pr, repo_override=args.repo)
state_path = Path(args.state_file) if args.state_file else default_state_file_for(pr)
state, fresh_state = load_state(state_path)
if not state.get("started_at"):
state["started_at"] = int(time.time())
# `gh pr checks -R <repo>` requires an explicit PR/branch/url argument.
# After resolving `--pr auto`, reuse the concrete PR number.
checks = get_pr_checks(str(pr["number"]), repo=pr["repo"])
checks_summary = summarize_checks(checks)
workflow_runs = get_workflow_runs_for_sha(pr["repo"], pr["head_sha"])
failed_runs = failed_runs_from_workflow_runs(workflow_runs, pr["head_sha"])
authenticated_login = get_authenticated_login()
new_review_items = fetch_new_review_items(
pr,
state,
fresh_state=fresh_state,
authenticated_login=authenticated_login,
)
retries_used = current_retry_count(state, pr["head_sha"])
actions = recommend_actions(
pr,
checks_summary,
failed_runs,
new_review_items,
retries_used,
args.max_flaky_retries,
)
state["pr"] = {"repo": pr["repo"], "number": pr["number"]}
state["last_seen_head_sha"] = pr["head_sha"]
state["last_snapshot_at"] = int(time.time())
save_state(state_path, state)
snapshot = {
"pr": pr,
"checks": checks_summary,
"failed_runs": failed_runs,
"new_review_items": new_review_items,
"actions": actions,
"retry_state": {
"current_sha_retries_used": retries_used,
"max_flaky_retries": args.max_flaky_retries,
},
}
return snapshot, state_path
def retry_failed_now(args):
snapshot, state_path = collect_snapshot(args)
pr = snapshot["pr"]
checks_summary = snapshot["checks"]
failed_runs = snapshot["failed_runs"]
retries_used = snapshot["retry_state"]["current_sha_retries_used"]
max_retries = snapshot["retry_state"]["max_flaky_retries"]
result = {
"snapshot": snapshot,
"state_file": str(state_path),
"rerun_attempted": False,
"rerun_count": 0,
"rerun_run_ids": [],
"reason": None,
}
if pr["closed"] or pr["merged"]:
result["reason"] = "pr_closed"
return result
if checks_summary["failed_count"] <= 0:
result["reason"] = "no_failed_pr_checks"
return result
if not failed_runs:
result["reason"] = "no_failed_runs"
return result
if not checks_summary["all_terminal"]:
result["reason"] = "checks_still_pending"
return result
if retries_used >= max_retries:
result["reason"] = "retry_budget_exhausted"
return result
for run in failed_runs:
run_id = run.get("run_id")
if run_id in (None, ""):
continue
gh_text(["run", "rerun", str(run_id), "--failed"], repo=pr["repo"])
result["rerun_run_ids"].append(run_id)
if result["rerun_run_ids"]:
state, _ = load_state(state_path)
new_count = current_retry_count(state, pr["head_sha"]) + 1
set_retry_count(state, pr["head_sha"], new_count)
state["last_snapshot_at"] = int(time.time())
save_state(state_path, state)
result["rerun_attempted"] = True
result["rerun_count"] = len(result["rerun_run_ids"])
result["reason"] = "rerun_triggered"
else:
result["reason"] = "failed_runs_missing_ids"
return result
def print_json(obj):
sys.stdout.write(json.dumps(obj, sort_keys=True) + "\n")
sys.stdout.flush()
def print_event(event, payload):
print_json({"event": event, "payload": payload})
def is_ci_green(snapshot):
checks = snapshot.get("checks") or {}
return (
bool(checks.get("all_terminal"))
and int(checks.get("failed_count") or 0) == 0
and int(checks.get("pending_count") or 0) == 0
)
def snapshot_change_key(snapshot):
pr = snapshot.get("pr") or {}
checks = snapshot.get("checks") or {}
review_items = snapshot.get("new_review_items") or []
return (
str(pr.get("head_sha") or ""),
str(pr.get("state") or ""),
str(pr.get("mergeable") or ""),
str(pr.get("merge_state_status") or ""),
str(pr.get("review_decision") or ""),
int(checks.get("passed_count") or 0),
int(checks.get("failed_count") or 0),
int(checks.get("pending_count") or 0),
tuple(
(str(item.get("kind") or ""), str(item.get("id") or ""))
for item in review_items
if isinstance(item, dict)
),
tuple(snapshot.get("actions") or []),
)
def run_watch(args):
poll_seconds = args.poll_seconds
last_change_key = None
while True:
snapshot, state_path = collect_snapshot(args)
print_event(
"snapshot",
{
"snapshot": snapshot,
"state_file": str(state_path),
"next_poll_seconds": poll_seconds,
},
)
actions = set(snapshot.get("actions") or [])
if (
"stop_pr_closed" in actions
or "stop_exhausted_retries" in actions
or "stop_ready_to_merge" in actions
):
print_event("stop", {"actions": snapshot.get("actions"), "pr": snapshot.get("pr")})
return 0
current_change_key = snapshot_change_key(snapshot)
changed = current_change_key != last_change_key
green = is_ci_green(snapshot)
if not green:
poll_seconds = args.poll_seconds
elif changed or last_change_key is None:
poll_seconds = args.poll_seconds
else:
poll_seconds = min(poll_seconds * 2, GREEN_STATE_MAX_POLL_SECONDS)
last_change_key = current_change_key
time.sleep(poll_seconds)
def main():
args = parse_args()
try:
if args.retry_failed_now:
print_json(retry_failed_now(args))
return 0
if args.watch:
return run_watch(args)
snapshot, state_path = collect_snapshot(args)
snapshot["state_file"] = str(state_path)
print_json(snapshot)
return 0
except (GhCommandError, RuntimeError, ValueError) as err:
sys.stderr.write(f"gh_pr_watch.py error: {err}\n")
return 1
except KeyboardInterrupt:
sys.stderr.write("gh_pr_watch.py interrupted\n")
return 130
if __name__ == "__main__":
raise SystemExit(main())

1
codex-rs/Cargo.lock generated
View File

@@ -1787,6 +1787,7 @@ dependencies = [
"codex-protocol",
"codex-shell-command",
"codex-utils-cargo-bin",
"core_test_support",
"exec_server_test_support",
"libc",
"maplit",

View File

@@ -23,10 +23,12 @@ struct AppServerArgs {
}
fn main() -> anyhow::Result<()> {
if codex_core::maybe_run_zsh_exec_wrapper_mode()? {
return Ok(());
}
arg0_dispatch_or_else(|codex_linux_sandbox_exe| async move {
// Run wrapper mode only after arg0 dispatch so `codex-linux-sandbox`
// invocations don't get misclassified as zsh exec-wrapper calls.
if codex_core::maybe_run_zsh_exec_wrapper_mode()? {
return Ok(());
}
let args = AppServerArgs::parse();
let managed_config_path = managed_config_path_from_debug_env();
let loader_overrides = LoaderOverrides {

View File

@@ -2,18 +2,15 @@
//
// Running these tests with the patched zsh fork:
//
// The suite uses `CODEX_TEST_ZSH_PATH` when set. Example:
// CODEX_TEST_ZSH_PATH="$HOME/.local/codex-zsh-77045ef/bin/zsh" \
// cargo test -p codex-app-server turn_start_zsh_fork -- --nocapture
//
// For a single test:
// CODEX_TEST_ZSH_PATH="$HOME/.local/codex-zsh-77045ef/bin/zsh" \
// cargo test -p codex-app-server turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2 -- --nocapture
// The suite resolves the shared test-only zsh DotSlash file at
// `exec-server/tests/suite/zsh` via DotSlash on first use, so `dotslash` and
// network access are required the first time the artifact is fetched.
use anyhow::Result;
use app_test_support::McpProcess;
use app_test_support::create_final_assistant_message_sse_response;
use app_test_support::create_mock_responses_server_sequence;
use app_test_support::create_mock_responses_server_sequence_unchecked;
use app_test_support::create_shell_command_sse_response;
use app_test_support::to_response;
use codex_app_server_protocol::CommandExecutionApprovalDecision;
@@ -38,6 +35,7 @@ use core_test_support::responses;
use core_test_support::skip_if_no_network;
use pretty_assertions::assert_eq;
use std::collections::BTreeMap;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use tempfile::TempDir;
use tokio::time::timeout;
@@ -57,7 +55,7 @@ async fn turn_start_shell_zsh_fork_executes_command_v2() -> Result<()> {
let workspace = tmp.path().join("workspace");
std::fs::create_dir(&workspace)?;
let Some(zsh_path) = find_test_zsh_path() else {
let Some(zsh_path) = find_test_zsh_path()? else {
eprintln!("skipping zsh fork test: no zsh executable found");
return Ok(());
};
@@ -82,7 +80,7 @@ async fn turn_start_shell_zsh_fork_executes_command_v2() -> Result<()> {
&zsh_path,
)?;
let mut mcp = McpProcess::new(&codex_home).await?;
let mut mcp = create_zsh_test_mcp_process(&codex_home, &workspace).await?;
timeout(DEFAULT_READ_TIMEOUT, mcp.initialize()).await??;
let start_id = mcp
@@ -167,7 +165,7 @@ async fn turn_start_shell_zsh_fork_exec_approval_decline_v2() -> Result<()> {
let workspace = tmp.path().join("workspace");
std::fs::create_dir(&workspace)?;
let Some(zsh_path) = find_test_zsh_path() else {
let Some(zsh_path) = find_test_zsh_path()? else {
eprintln!("skipping zsh fork decline test: no zsh executable found");
return Ok(());
};
@@ -199,7 +197,7 @@ async fn turn_start_shell_zsh_fork_exec_approval_decline_v2() -> Result<()> {
&zsh_path,
)?;
let mut mcp = McpProcess::new(&codex_home).await?;
let mut mcp = create_zsh_test_mcp_process(&codex_home, &workspace).await?;
timeout(DEFAULT_READ_TIMEOUT, mcp.initialize()).await??;
let start_id = mcp
@@ -303,7 +301,7 @@ async fn turn_start_shell_zsh_fork_exec_approval_cancel_v2() -> Result<()> {
let workspace = tmp.path().join("workspace");
std::fs::create_dir(&workspace)?;
let Some(zsh_path) = find_test_zsh_path() else {
let Some(zsh_path) = find_test_zsh_path()? else {
eprintln!("skipping zsh fork cancel test: no zsh executable found");
return Ok(());
};
@@ -332,7 +330,7 @@ async fn turn_start_shell_zsh_fork_exec_approval_cancel_v2() -> Result<()> {
&zsh_path,
)?;
let mut mcp = McpProcess::new(&codex_home).await?;
let mut mcp = create_zsh_test_mcp_process(&codex_home, &workspace).await?;
timeout(DEFAULT_READ_TIMEOUT, mcp.initialize()).await??;
let start_id = mcp
@@ -434,7 +432,7 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
let workspace = tmp.path().join("workspace");
std::fs::create_dir(&workspace)?;
let Some(zsh_path) = find_test_zsh_path() else {
let Some(zsh_path) = find_test_zsh_path()? else {
eprintln!("skipping zsh fork subcommand decline test: no zsh executable found");
return Ok(());
};
@@ -446,6 +444,29 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
return Ok(());
}
eprintln!("using zsh path for zsh-fork test: {}", zsh_path.display());
let zsh_path_for_config = {
// App-server config accepts only a zsh path, not extra argv. Use a
// wrapper so this test can force `-df` and downgrade `-lc` to `-c`
// to avoid rc/login-shell startup noise.
let path = workspace.join("zsh-no-rc");
std::fs::write(
&path,
format!(
r#"#!/bin/sh
if [ "$1" = "-lc" ]; then
shift
set -- -c "$@"
fi
exec "{}" -df "$@"
"#,
zsh_path.display()
),
)?;
let mut permissions = std::fs::metadata(&path)?.permissions();
permissions.set_mode(0o755);
std::fs::set_permissions(&path, permissions)?;
path
};
let tool_call_arguments = serde_json::to_string(&serde_json::json!({
"command": "/usr/bin/true && /usr/bin/true",
@@ -461,7 +482,16 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
),
responses::ev_completed("resp-1"),
]);
let server = create_mock_responses_server_sequence(vec![response]).await;
let no_op_response = responses::sse(vec![
responses::ev_response_created("resp-2"),
responses::ev_completed("resp-2"),
]);
// Linux CI has occasionally issued a second `/responses` POST after the
// subcommand-decline flow. This test is about approval/decline behavior in
// the zsh fork, not exact model request count, so allow an extra request
// and return a harmless no-op response if it arrives.
let server =
create_mock_responses_server_sequence_unchecked(vec![response, no_op_response]).await;
create_config_toml(
&codex_home,
&server.uri(),
@@ -471,10 +501,10 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
(Feature::UnifiedExec, false),
(Feature::ShellSnapshot, false),
]),
&zsh_path,
&zsh_path_for_config,
)?;
let mut mcp = McpProcess::new(&codex_home).await?;
let mut mcp = create_zsh_test_mcp_process(&codex_home, &workspace).await?;
timeout(DEFAULT_READ_TIMEOUT, mcp.initialize()).await??;
let start_id = mcp
@@ -500,8 +530,16 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
}],
cwd: Some(workspace.clone()),
approval_policy: Some(codex_app_server_protocol::AskForApproval::OnRequest),
sandbox_policy: Some(codex_app_server_protocol::SandboxPolicy::ReadOnly {
access: codex_app_server_protocol::ReadOnlyAccess::FullAccess,
sandbox_policy: Some(if cfg!(target_os = "linux") {
// The zsh exec-bridge wrapper uses a Unix socket back to the parent
// process. Linux restricted sandbox seccomp denies connect(2), so use
// full access here; this test is validating zsh approval/decline
// behavior, not Linux sandboxing.
codex_app_server_protocol::SandboxPolicy::DangerFullAccess
} else {
codex_app_server_protocol::SandboxPolicy::ReadOnly {
access: codex_app_server_protocol::ReadOnlyAccess::FullAccess,
}
}),
model: Some("mock-model".to_string()),
effort: Some(codex_protocol::openai_models::ReasoningEffort::Medium),
@@ -517,10 +555,13 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
let TurnStartResponse { turn } = to_response::<TurnStartResponse>(turn_resp)?;
let mut approval_ids = Vec::new();
for decision in [
let mut saw_parent_approval = false;
let target_decisions = [
CommandExecutionApprovalDecision::Accept,
CommandExecutionApprovalDecision::Cancel,
] {
];
let mut target_decision_index = 0;
while target_decision_index < target_decisions.len() {
let server_req = timeout(
DEFAULT_READ_TIMEOUT,
mcp.read_stream_until_request_message(),
@@ -531,13 +572,40 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
panic!("expected CommandExecutionRequestApproval request");
};
assert_eq!(params.item_id, "call-zsh-fork-subcommand-decline");
approval_ids.push(
params
.approval_id
.clone()
.expect("approval_id must be present for zsh subcommand approvals"),
);
assert_eq!(params.thread_id, thread.id);
let is_target_subcommand = params.command.as_deref() == Some("/usr/bin/true");
if is_target_subcommand {
approval_ids.push(
params
.approval_id
.clone()
.expect("approval_id must be present for zsh subcommand approvals"),
);
}
let decision = if is_target_subcommand {
let decision = target_decisions[target_decision_index].clone();
target_decision_index += 1;
decision
} else {
let command = params
.command
.as_deref()
.expect("approval command should be present");
assert!(
!saw_parent_approval,
"unexpected extra non-target approval: {command}"
);
assert!(
command.contains("zsh-no-rc"),
"expected parent zsh wrapper approval, got: {command}"
);
assert!(
command.contains("/usr/bin/true && /usr/bin/true"),
"expected tool command in parent approval, got: {command}"
);
saw_parent_approval = true;
CommandExecutionApprovalDecision::Accept
};
mcp.send_response(
request_id,
serde_json::to_value(CommandExecutionRequestApprovalResponse { decision })?,
@@ -545,6 +613,8 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
.await?;
}
assert_eq!(approval_ids.len(), 2);
assert_ne!(approval_ids[0], approval_ids[1]);
let parent_completed_command_execution = timeout(DEFAULT_READ_TIMEOUT, async {
loop {
let completed_notif = mcp
@@ -563,32 +633,61 @@ async fn turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2()
}
}
})
.await??;
.await;
let ThreadItem::CommandExecution {
id,
status,
aggregated_output,
..
} = parent_completed_command_execution
else {
unreachable!("loop ensures we break on parent command execution item");
};
assert_eq!(id, "call-zsh-fork-subcommand-decline");
assert_eq!(status, CommandExecutionStatus::Declined);
assert!(
aggregated_output.is_none()
|| aggregated_output == Some("exec command rejected by user".to_string())
);
assert_eq!(approval_ids.len(), 2);
assert_ne!(approval_ids[0], approval_ids[1]);
match parent_completed_command_execution {
Ok(Ok(parent_completed_command_execution)) => {
let ThreadItem::CommandExecution {
id,
status,
aggregated_output,
..
} = parent_completed_command_execution
else {
unreachable!("loop ensures we break on parent command execution item");
};
assert_eq!(id, "call-zsh-fork-subcommand-decline");
assert_eq!(status, CommandExecutionStatus::Declined);
assert!(
aggregated_output.is_none()
|| aggregated_output == Some("exec command rejected by user".to_string())
);
mcp.interrupt_turn_and_wait_for_aborted(thread.id, turn.id, DEFAULT_READ_TIMEOUT)
.await?;
mcp.interrupt_turn_and_wait_for_aborted(
thread.id.clone(),
turn.id.clone(),
DEFAULT_READ_TIMEOUT,
)
.await?;
}
Ok(Err(error)) => return Err(error),
Err(_) => {
// Some zsh builds abort the turn immediately after the rejected
// subcommand without emitting a parent `item/completed`.
let completed_notif = timeout(
DEFAULT_READ_TIMEOUT,
mcp.read_stream_until_notification_message("turn/completed"),
)
.await??;
let completed: TurnCompletedNotification = serde_json::from_value(
completed_notif
.params
.expect("turn/completed params must be present"),
)?;
assert_eq!(completed.thread_id, thread.id);
assert_eq!(completed.turn.id, turn.id);
assert_eq!(completed.turn.status, TurnStatus::Interrupted);
}
}
Ok(())
}
async fn create_zsh_test_mcp_process(codex_home: &Path, zdotdir: &Path) -> Result<McpProcess> {
let zdotdir = zdotdir.to_string_lossy().into_owned();
McpProcess::new_with_env(codex_home, &[("ZDOTDIR", Some(zdotdir.as_str()))]).await
}
fn create_config_toml(
codex_home: &Path,
server_uri: &str,
@@ -640,36 +739,24 @@ stream_max_retries = 0
)
}
fn find_test_zsh_path() -> Option<std::path::PathBuf> {
if let Some(path) = std::env::var_os("CODEX_TEST_ZSH_PATH") {
let path = std::path::PathBuf::from(path);
if path.is_file() {
return Some(path);
}
panic!(
"CODEX_TEST_ZSH_PATH is set but is not a file: {}",
path.display()
fn find_test_zsh_path() -> Result<Option<std::path::PathBuf>> {
let repo_root = codex_utils_cargo_bin::repo_root()?;
let dotslash_zsh = repo_root.join("codex-rs/exec-server/tests/suite/zsh");
if !dotslash_zsh.is_file() {
eprintln!(
"skipping zsh fork test: shared zsh DotSlash file not found at {}",
dotslash_zsh.display()
);
return Ok(None);
}
for candidate in ["/bin/zsh", "/usr/bin/zsh"] {
let path = Path::new(candidate);
if path.is_file() {
return Some(path.to_path_buf());
match core_test_support::fetch_dotslash_file(&dotslash_zsh, None) {
Ok(path) => return Ok(Some(path)),
Err(error) => {
eprintln!("failed to fetch vendored zsh via dotslash: {error:#}");
}
}
let shell = std::env::var_os("SHELL")?;
let shell_path = std::path::PathBuf::from(shell);
if shell_path
.file_name()
.is_some_and(|file_name| file_name == "zsh")
&& shell_path.is_file()
{
return Some(shell_path);
}
None
Ok(None)
}
fn supports_exec_wrapper_intercept(zsh_path: &Path) -> bool {

View File

@@ -543,10 +543,12 @@ fn stage_str(stage: codex_core::features::Stage) -> &'static str {
}
fn main() -> anyhow::Result<()> {
if codex_core::maybe_run_zsh_exec_wrapper_mode()? {
return Ok(());
}
arg0_dispatch_or_else(|codex_linux_sandbox_exe| async move {
// Run wrapper mode only after arg0 dispatch so `codex-linux-sandbox`
// invocations don't get misclassified as zsh exec-wrapper calls.
if codex_core::maybe_run_zsh_exec_wrapper_mode()? {
return Ok(());
}
cli_main(codex_linux_sandbox_exe).await?;
Ok(())
})

View File

@@ -368,13 +368,25 @@ fn legacy_usage_notice(alias: &str, feature: Feature) -> (String, Option<String>
(summary, Some(web_search_details().to_string()))
}
_ => {
let summary = format!("`{alias}` is deprecated. Use `[features].{canonical}` instead.");
let details = if alias == canonical {
None
let (summary, details) = if alias == "collab" && feature == Feature::Collab {
(
"Your configuration file has an error.".to_string(),
Some(
"Change collab=true to multi_agent=true in your config.toml or enable it by running codex --enable multi_agent".to_string(),
),
)
} else if alias == canonical {
(
format!("`{alias}` is deprecated. Use `[features].{canonical}` instead."),
None,
)
} else {
Some(format!(
"Enable it with `--enable {canonical}` or `[features].{canonical}` in config.toml. See https://developers.openai.com/codex/config-basic#feature-flags for details."
))
(
format!("`{alias}` is deprecated. Use `[features].{canonical}` instead."),
Some(format!(
"Enable it with `--enable {canonical}` or `[features].{canonical}` in config.toml. See https://developers.openai.com/codex/config-basic#feature-flags for details."
)),
)
};
(summary, details)
}
@@ -764,4 +776,17 @@ mod tests {
assert_eq!(feature_for_key("multi_agent"), Some(Feature::Collab));
assert_eq!(feature_for_key("collab"), Some(Feature::Collab));
}
#[test]
fn collab_legacy_notice_uses_config_error_text() {
let (summary, details) = legacy_usage_notice("collab", Feature::Collab);
assert_eq!(summary, "Your configuration file has an error.".to_string());
assert_eq!(
details,
Some(
"Change collab=true to multi_agent=true in your config.toml or enable it by running codex --enable multi_agent".to_string()
)
);
}
}

View File

@@ -166,6 +166,10 @@ impl ZshExecBridge {
})?;
let mut cmd = tokio::process::Command::new(&command[0]);
#[cfg(unix)]
if let Some(arg0) = &req.arg0 {
cmd.arg0(arg0);
}
if command.len() > 1 {
cmd.args(&command[1..]);
}
@@ -459,7 +463,6 @@ fn run_exec_wrapper_mode() -> anyhow::Result<()> {
argv: argv.clone(),
cwd,
};
let mut stream = StdUnixStream::connect(&socket_path)
.with_context(|| format!("connect to wrapper socket at {socket_path}"))?;
let encoded = serde_json::to_string(&request).context("serialize wrapper request")?;

View File

@@ -1,5 +1,7 @@
#![expect(clippy::expect_used)]
use anyhow::Context as _;
use anyhow::ensure;
use codex_utils_cargo_bin::CargoBinError;
use ctor::ctor;
use tempfile::TempDir;
@@ -99,6 +101,42 @@ pub fn test_tmp_path_buf() -> PathBuf {
test_tmp_path().into_path_buf()
}
/// Fetch a DotSlash resource and return the resolved executable/file path.
pub fn fetch_dotslash_file(
dotslash_file: &std::path::Path,
dotslash_cache: Option<&std::path::Path>,
) -> anyhow::Result<PathBuf> {
let mut command = std::process::Command::new("dotslash");
command.arg("--").arg("fetch").arg(dotslash_file);
if let Some(dotslash_cache) = dotslash_cache {
command.env("DOTSLASH_CACHE", dotslash_cache);
}
let output = command.output().with_context(|| {
format!(
"failed to run dotslash to fetch resource {}",
dotslash_file.display()
)
})?;
ensure!(
output.status.success(),
"dotslash fetch failed for {}: {}",
dotslash_file.display(),
String::from_utf8_lossy(&output.stderr).trim()
);
let fetched_path = String::from_utf8(output.stdout)
.context("dotslash fetch output was not utf8")?
.trim()
.to_string();
ensure!(!fetched_path.is_empty(), "dotslash fetch output was empty");
let fetched_path = PathBuf::from(fetched_path);
ensure!(
fetched_path.is_file(),
"dotslash returned non-file path: {}",
fetched_path.display()
);
Ok(fetched_path)
}
/// Returns a default `Config` whose on-disk state is confined to the provided
/// temporary directory. Using a per-test directory keeps tests hermetic and
/// avoids clobbering a developers real `~/.codex`.

View File

@@ -58,6 +58,7 @@ tracing = { workspace = true }
tracing-subscriber = { workspace = true, features = ["env-filter", "fmt"] }
[dev-dependencies]
core_test_support = { workspace = true }
codex-utils-cargo-bin = { workspace = true }
codex-protocol = { workspace = true }
exec_server_test_support = { workspace = true }

View File

@@ -61,15 +61,9 @@ prefix_rule(
/// Verify the same prompt/escalation flow works when the server is launched
/// with a patched zsh binary.
///
/// Set CODEX_TEST_ZSH_PATH to enable this test locally or in CI.
/// The suite resolves `tests/suite/zsh` via DotSlash on first use.
#[tokio::test(flavor = "current_thread")]
async fn accept_elicitation_for_prompt_rule_with_zsh() -> Result<()> {
let Some(zsh_path) = std::env::var_os("CODEX_TEST_ZSH_PATH") else {
eprintln!("skipping zsh test: CODEX_TEST_ZSH_PATH is not set");
return Ok(());
};
let zsh_path = PathBuf::from(zsh_path);
let codex_home = TempDir::new()?;
write_default_execpolicy(
r#"
@@ -87,6 +81,11 @@ prefix_rule(
.await?;
let dotslash_cache_temp_dir = TempDir::new()?;
let dotslash_cache = dotslash_cache_temp_dir.path();
let zsh_path = resolve_test_zsh_path(dotslash_cache).await?;
eprintln!(
"using zsh path for exec-server test: {}",
zsh_path.display()
);
let transport =
create_transport_with_shell_path(codex_home.as_ref(), dotslash_cache, &zsh_path).await?;
run_accept_elicitation_for_prompt_rule_with_transport(transport).await
@@ -95,13 +94,13 @@ prefix_rule(
async fn run_accept_elicitation_for_prompt_rule_with_transport(
transport: rmcp::transport::TokioChildProcess,
) -> Result<()> {
// Create an MCP client that approves expected elicitation messages.
// Create an MCP client that approves the expected elicitation message.
let project_root = TempDir::new()?;
let project_root_path = project_root.path().canonicalize().unwrap();
let git_path = resolve_git_path(USE_LOGIN_SHELL).await?;
let git_init_command = format!("{git_path} init --quiet .");
let expected_elicitation_message = format!(
"Allow agent to run `{} init .` in `{}`?",
git_path,
"Allow agent to run `{git_path} init --quiet .` in `{}`?",
project_root_path.display()
);
let elicitation_requests: Arc<Mutex<Vec<CreateElicitationRequestParams>>> = Default::default();
@@ -142,7 +141,7 @@ async fn run_accept_elicitation_for_prompt_rule_with_transport(
arguments: Some(object(json!(
{
"login": USE_LOGIN_SHELL,
"command": "git init .",
"command": git_init_command,
"workdir": project_root_path.to_string_lossy(),
}
))),
@@ -157,15 +156,11 @@ async fn run_accept_elicitation_for_prompt_rule_with_transport(
let ExecResult {
exit_code, output, ..
} = serde_json::from_str::<ExecResult>(&tool_call_content.text)?;
let git_init_succeeded = format!(
"Initialized empty Git repository in {}/.git/\n",
project_root_path.display()
);
// Normally, this would be an exact match, but it might include extra output
// if `git config set advice.defaultBranchName false` has not been set.
// `git init --quiet` is expected to suppress the usual initialization
// banner, so assert on success and filesystem effects instead of output.
assert!(
output.contains(&git_init_succeeded),
"expected output `{output}` to contain `{git_init_succeeded}`"
output.is_empty(),
"expected no output from `git init --quiet .`, got `{output}`"
);
assert_eq!(exit_code, 0, "command should succeed");
assert_eq!(is_error, Some(false), "command should succeed");
@@ -192,6 +187,12 @@ async fn run_accept_elicitation_for_prompt_rule_with_transport(
Ok(())
}
async fn resolve_test_zsh_path(dotslash_cache: &std::path::Path) -> Result<PathBuf> {
let dotslash_zsh = codex_utils_cargo_bin::find_resource!("tests/suite/zsh")?;
core_test_support::fetch_dotslash_file(&dotslash_zsh, Some(dotslash_cache))
.with_context(|| format!("failed to fetch test zsh from {}", dotslash_zsh.display()))
}
fn ensure_codex_cli() -> Result<PathBuf> {
let codex_cli = codex_utils_cargo_bin::cargo_bin("codex")?;

View File

@@ -0,0 +1,72 @@
#!/usr/bin/env dotslash
// This is the patched zsh fork built by
// `.github/workflows/shell-tool-mcp.yml` for the shell-tool-mcp package.
// Fetching the prebuilt version via DotSlash makes it easier to write
// integration tests that exercise the zsh fork behavior in exec-server tests.
//
// TODO(mbolin): Currently, we use a .tgz artifact that includes binaries for
// multiple platforms, but we could save a bit of space by making arch-specific
// artifacts available in the GitHub releases and referencing those here.
{
"name": "codex-zsh",
"platforms": {
// macOS 13 builds (and therefore x86_64) were dropped in
// https://github.com/openai/codex/pull/7295, so we only provide an
// Apple Silicon build for now.
"macos-aarch64": {
"size": 53771483,
"hash": "blake3",
"digest": "ff664f63f5e1fa62762c9aff0aafa66cf196faf9b157f98ec98f59c152fc7bd3",
"format": "tar.gz",
"path": "package/vendor/aarch64-apple-darwin/zsh/macos-15/zsh",
"providers": [
{
"url": "https://github.com/openai/codex/releases/download/rust-v0.104.0/codex-shell-tool-mcp-npm-0.104.0.tgz"
},
{
"type": "github-release",
"repo": "openai/codex",
"tag": "rust-v0.104.0",
"name": "codex-shell-tool-mcp-npm-0.104.0.tgz"
}
]
},
"linux-x86_64": {
"size": 53771483,
"hash": "blake3",
"digest": "ff664f63f5e1fa62762c9aff0aafa66cf196faf9b157f98ec98f59c152fc7bd3",
"format": "tar.gz",
"path": "package/vendor/x86_64-unknown-linux-musl/zsh/ubuntu-24.04/zsh",
"providers": [
{
"url": "https://github.com/openai/codex/releases/download/rust-v0.104.0/codex-shell-tool-mcp-npm-0.104.0.tgz"
},
{
"type": "github-release",
"repo": "openai/codex",
"tag": "rust-v0.104.0",
"name": "codex-shell-tool-mcp-npm-0.104.0.tgz"
}
]
},
"linux-aarch64": {
"size": 53771483,
"hash": "blake3",
"digest": "ff664f63f5e1fa62762c9aff0aafa66cf196faf9b157f98ec98f59c152fc7bd3",
"format": "tar.gz",
"path": "package/vendor/aarch64-unknown-linux-musl/zsh/ubuntu-24.04/zsh",
"providers": [
{
"url": "https://github.com/openai/codex/releases/download/rust-v0.104.0/codex-shell-tool-mcp-npm-0.104.0.tgz"
},
{
"type": "github-release",
"repo": "openai/codex",
"tag": "rust-v0.104.0",
"name": "codex-shell-tool-mcp-npm-0.104.0.tgz"
}
]
},
}
}