Commit Graph

19 Commits

Author SHA1 Message Date
pakrym-oai
413c1e1fdf [codex] reduce module visibility (#16978)
## Summary
- reduce public module visibility across Rust crates, preferring private
or crate-private modules with explicit crate-root public exports
- update external call sites and tests to use the intended public crate
APIs instead of reaching through module trees
- add the module visibility guideline to AGENTS.md

## Validation
- `cargo check --workspace --all-targets --message-format=short` passed
before the final fix/format pass
- `just fix` completed successfully
- `just fmt` completed successfully
- `git diff --check` passed
2026-04-07 08:03:35 -07:00
xl-openai
81abb44f68 plugins: Clean up stale curated plugin sync temp dirs and add sync metrics (#16035)
1. Keep curated plugin staging directories under TempDir ownership until
activation succeeds, so failed git/HTTP sync attempts do not leak
plugins-clone-*.
2. Best-effort clean up stale plugins-clone-* directories before
creating a new staged repo, using a conservative age threshold.
3. Emit OTEL counters for curated plugin startup sync transport attempts
and final outcome across git and HTTP paths.
2026-03-27 14:21:18 -07:00
jif-oai
047ea642d2 chore: tty metric (#15766) 2026-03-25 13:34:43 +00:00
jif-oai
2cf4d5ef35 chore: add metrics for profile (#15180) 2026-03-19 15:48:02 +00:00
Owen Lin
6ea041032b fix(core): prevent hanging turn/start due to websocket warming issues (#14838)
## Description

This PR fixes a bad first-turn failure mode in app-server when the
startup websocket prewarm hangs. Before this change, `initialize ->
thread/start -> turn/start` could sit behind the prewarm for up to five
minutes, so the client would not see `turn/started`, and even
`turn/interrupt` would block because the turn had not actually started
yet.

Now, we:
- set a (configurable) timeout of 15s for websocket startup time,
exposed as `websocket_startup_timeout_ms` in config.toml
- `turn/started` is sent immediately on `turn/start` even if the
websocket is still connecting
- `turn/interrupt` can be used to cancel a turn that is still waiting on
the websocket warmup
- the turn task will wait for the full 15s websocket warming timeout
before falling back

## Why

The old behavior made app-server feel stuck at exactly the moment the
client expects turn lifecycle events to start flowing. That was
especially painful for external clients, because from their point of
view the server had accepted the request but then went silent for
minutes.

## Configuring the websocket startup timeout
Can set it in config.toml like this:
```
[model_providers.openai]
supports_websockets = true
websocket_connect_timeout_ms = 15000
```
2026-03-17 10:07:46 -07:00
viyatb-oai
52a3bde6cc feat(core): emit turn metric for network proxy state (#14250)
## Summary
- add a per-turn `codex.turn.network_proxy` metric constant
- emit the metric from turn completion using the live managed proxy
enabled state
- add focused tests for active and inactive tag emission
2026-03-11 12:33:10 -07:00
Owen Lin
da991bdf3a feat(otel): Centralize OTEL metric names and shared tag builders (#14117)
This cleans up a bunch of metric plumbing that had started to drift.

The main change is making `codex-otel` the canonical home for shared
metric definitions and metric tag helpers. I moved the `turn/thread`
metric names that were still duplicated into the OTEL metric registry,
added a shared `metrics::tags` module for common tag keys and session
tag construction, and updated `SessionTelemetry` to build its metadata
tags through that shared path.

On the codex-core side, TTFT/TTFM now use the shared metric-name
constants instead of local string definitions. I also switched the
obvious remaining turn/thread metric callsites over to the shared
constants, and added a small helper so TTFT/TTFM can attach an optional
sanitized client.name tag from TurnContext.

This should make follow-on telemetry work less ad hoc:
- one canonical place for metric names
- one canonical place for common metric tag keys/builders
- less duplication between `codex-core` and `codex-otel`
2026-03-09 12:46:42 -07:00
Owen Lin
3449e00bc9 feat(otel, core): record turn TTFT and TTFM metrics in codex-core (#13630)
### Summary
This adds turn-level latency metrics for the first model output and the
first completed agent message.
- `codex.turn.ttft.duration_ms` starts at turn start and records on the
first output signal we see from the model. That includes normal
assistant text, reasoning deltas, and non-text outputs like tool-call
items.
- `codex.turn.ttfm.duration_ms` also starts at turn start, but it
records when the first agent message finishes streaming rather than when
its first delta arrives.

### Implementation notes
The timing is tracked in codex-core, not app-server, so the definition
stays consistent across CLI, TUI, and app-server clients.

I reused the existing turn lifecycle boundary that already drives
`codex.turn.e2e_duration_ms`, stored the turn start timestamp in turn
state, and record each metric once per turn.

I also wired the new metric names into the OTEL runtime metrics summary
so they show up in the same in-memory/debug snapshot path as the
existing timing metrics.
2026-03-06 10:23:48 -08:00
Anton Panasenko
4ee039744e feat: expose detailed metrics to runtime metrics (#10699) 2026-02-05 18:22:30 -08:00
iceweasel-oai
f2ffc4e5d0 Include real OS info in metrics. (#10425)
calculated a hashed user ID from either auth user id or API key
Also correctly populates OS.

These will make our metrics more useful and powerful for analysis.
2026-02-05 06:30:31 -08:00
Anton Panasenko
fcaed4cb88 feat: log webscocket timing into runtime metrics (#10577) 2026-02-03 18:04:07 -08:00
Anton Panasenko
101d359cd7 Add websocket telemetry metrics and labels (#10316)
Summary
- expose websocket telemetry hooks through the responses client so
request durations and event processing can be reported
- record websocket request/event metrics and emit runtime telemetry
events that the history UI now surfaces
- improve tests to cover websocket telemetry reporting and guard runtime
summary updates


<img width="824" height="79" alt="Screenshot 2026-01-31 at 5 28 12 PM"
src="https://github.com/user-attachments/assets/ea9a7965-d8b4-4e3c-a984-ef4fdc44c81d"
/>
2026-01-31 19:16:44 -08:00
Anton Panasenko
8660ad6c64 feat: show runtime metrics in console (#10278)
Summary of changes:

- Adds a new feature flag: runtime_metrics
  - Declared in core/src/features.rs
  - Added to core/config.schema.json
  - Wired into OTEL init in core/src/otel_init.rs

- Enables on-demand runtime metric snapshots in OTEL
  - Adds runtime_metrics: bool to otel/src/config.rs
  - Enables experimental custom reader features in otel/Cargo.toml
  - Adds snapshot/reset/summary APIs in:
    - otel/src/lib.rs
    - otel/src/metrics/client.rs
    - otel/src/metrics/config.rs
    - otel/src/metrics/error.rs

- Defines metric names and a runtime summary builder
  - New files:
    - otel/src/metrics/names.rs
    - otel/src/metrics/runtime_metrics.rs
  - Summarizes totals for:
    - Tool calls
    - API requests
    - SSE/streaming events

- Instruments metrics collection in OTEL manager
  - otel/src/traces/otel_manager.rs now records:
    - API call counts + durations
    - SSE event counts + durations (success/failure)
    - Tool call metrics now use shared constants

- Surfaces runtime metrics in the TUI
  - Resets runtime metrics at turn start in tui/src/chatwidget.rs
- Displays metrics in the final separator line in
tui/src/history_cell.rs

- Adds tests
  - New OTEL tests:
    - otel/tests/suite/snapshot.rs
    - otel/tests/suite/runtime_summary.rs
  - New TUI test:
- final_message_separator_includes_runtime_metrics in
tui/src/history_cell.rs

Scope:
- 19 files changed
- ~652 insertions, 38 deletions


<img width="922" height="169" alt="Screenshot 2026-01-30 at 4 11 34 PM"
src="https://github.com/user-attachments/assets/1efd754d-a16d-4564-83a5-f4442fd2f998"
/>
2026-01-30 22:20:02 -08:00
jif-oai
129787493f feat: backfill timing metric (#10218)
1. Add a metric to measure the backfill time
2. Add a unit to the timing histogram
2026-01-30 10:19:41 +01:00
jif-oai
3878c3dc7c feat: sqlite 1 (#10004)
Add a `.sqlite` database to be used to store rollout metatdata (and
later logs)
This PR is phase 1:
* Add the database and the required infrastructure
* Add a backfill of the database
* Persist the newly created rollout both in files and in the DB
* When we need to get metadata or a rollout, consider the `JSONL` as the
source of truth but compare the results with the DB and show any errors
2026-01-28 15:29:14 +01:00
jif-oai
a3a97f3ea9 feat: record timer with additional tags (#9529) 2026-01-20 13:01:55 +00:00
jif-oai
7ebe13f692 feat: timer total turn metrics (#9382) 2026-01-19 10:44:31 +00:00
jif-oai
bc92dc5cf0 chore: update metrics temporality (#8901) 2026-01-09 14:57:42 +00:00
jif-oai
634650dd25 feat: metrics capabilities (#8318)
Add metrics capabilities to Codex. The `README.md` is up to date.

This will not be merged with the metrics before this PR of course:
https://github.com/openai/codex/pull/8350
2026-01-08 11:47:36 +00:00