## Why While investigating `codex exec hi` startup latency, the useful questions were not "is startup slow?" but "which durable bucket is slow in production?" The path we observed has a few distinct stages: 1. `thread/start` creates the session 2. startup prewarm builds the turn context, tools, and prompt 3. startup prewarm warms the websocket 4. the first real turn resolves the prewarm 5. the model produces the first token Before this PR, production telemetry had some of the raw measurements already: - aggregate startup-prewarm duration / age-at-first-turn metrics - TTFT as a metric - websocket request telemetry But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation. ## What changed Add durable production telemetry on the existing `SessionTelemetry` path: - new `codex.startup_phase` OTel log/trace events plus `codex.startup.phase.duration_ms` - new `codex.turn_ttft` OTel log/trace events while preserving the existing TTFT metric The startup phase event is emitted for the coarse buckets we actually observed while running `exec hi`: - `thread_start_create_thread` - `startup_prewarm_total` - `startup_prewarm_create_turn_context` - `startup_prewarm_build_tools` - `startup_prewarm_build_prompt` - `startup_prewarm_websocket_warmup` - `startup_prewarm_resolve` These phases are intentionally low-cardinality so they remain safe as production telemetry tags. ## Why this shape This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior: - prewarm still runs - no control flow changes - no extra remote calls - no user-visible behavior changes One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after `thread/start` are now covered durably. ## Verification - `cargo test -p codex-otel` - `cargo test -p codex-core turn_timing` - `cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm` - `cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted` - `cargo test -p codex-app-server thread_start` - `just fix -p codex-otel -p codex-core -p codex-app-server` I also ran `cargo test -p codex-core`; it built successfully and then hit an existing unrelated stack overflow in `tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.
codex-otel
codex-otel is the OpenTelemetry integration crate for Codex. It provides:
- Provider wiring for log/trace/metric exporters (
codex_otel::OtelProviderandcodex_otel::provider). - Session-scoped business event emission via
codex_otel::SessionTelemetry. - Low-level metrics APIs via
codex_otel::metrics. - Trace-context helpers via
codex_otel::trace_contextand crate-root re-exports.
Tracing and logs
Create an OTEL provider from OtelSettings. The provider also configures
metrics (when enabled), then attach its layers to your tracing_subscriber
registry:
use codex_otel::config::OtelExporter;
use codex_otel::config::OtelHttpProtocol;
use codex_otel::config::OtelSettings;
use codex_otel::OtelProvider;
use tracing_subscriber::prelude::*;
let settings = OtelSettings {
environment: "dev".to_string(),
service_name: "codex-cli".to_string(),
service_version: env!("CARGO_PKG_VERSION").to_string(),
codex_home: std::path::PathBuf::from("/tmp"),
exporter: OtelExporter::OtlpHttp {
endpoint: "https://otlp.example.com".to_string(),
headers: std::collections::HashMap::new(),
protocol: OtelHttpProtocol::Binary,
tls: None,
},
trace_exporter: OtelExporter::OtlpHttp {
endpoint: "https://otlp.example.com".to_string(),
headers: std::collections::HashMap::new(),
protocol: OtelHttpProtocol::Binary,
tls: None,
},
metrics_exporter: OtelExporter::None,
span_attributes: std::collections::BTreeMap::new(),
tracestate: std::collections::BTreeMap::new(),
};
if let Some(provider) = OtelProvider::from(&settings)? {
let registry = tracing_subscriber::registry()
.with(provider.logger_layer())
.with(provider.tracing_layer());
registry.init();
}
Configured span attributes and W3C tracestate member fields are applied to exported trace spans and propagated trace context:
[otel.span_attributes]
"example.trace_attr" = "enabled"
[otel.tracestate.example]
alpha = "one"
beta = "two"
Configured tracestate members and encoded values must be valid W3C tracestate.
Each nested table is encoded as semicolon-separated key:value fields inside
that member. If propagated trace context already has the named member, Codex
upserts configured fields and preserves other fields in that member. This
config shape does not support setting opaque tracestate member values. Invalid
trace metadata entries are ignored during config load and reported as startup
warnings.
SessionTelemetry (events)
SessionTelemetry adds consistent metadata to tracing events and helps record
Codex-specific session events. Rich session/business events should go through
SessionTelemetry; subsystem-owned audit events can stay with the owning subsystem.
use codex_otel::SessionTelemetry;
let manager = SessionTelemetry::new(
conversation_id,
model,
slug,
account_id,
account_email,
auth_mode,
originator,
log_user_prompts,
terminal_type,
session_source,
);
manager.user_prompt(&prompt_items);
Metrics (OTLP or in-memory)
Modes:
- OTLP: exports metrics via the OpenTelemetry OTLP exporter (HTTP or gRPC).
- In-memory: records via
opentelemetry_sdk::metrics::InMemoryMetricExporterfor tests/assertions; callshutdown()to flush.
codex-otel also provides OtelExporter::Statsig, a shorthand for exporting OTLP/HTTP JSON metrics
to Statsig using Codex-internal defaults.
Statsig ingestion (OTLP/HTTP JSON) example:
use codex_otel::config::{OtelExporter, OtelHttpProtocol};
let metrics = MetricsClient::new(MetricsConfig::otlp(
"dev",
"codex-cli",
env!("CARGO_PKG_VERSION"),
OtelExporter::OtlpHttp {
endpoint: "https://api.statsig.com/otlp".to_string(),
headers: std::collections::HashMap::from([(
"statsig-api-key".to_string(),
std::env::var("STATSIG_SERVER_SDK_SECRET")?,
)]),
protocol: OtelHttpProtocol::Json,
tls: None,
},
))?;
metrics.counter("codex.session_started", 1, &[("source", "tui")])?;
metrics.histogram("codex.request_latency", 83, &[("route", "chat")])?;
In-memory (tests):
let exporter = InMemoryMetricExporter::default();
let metrics = MetricsClient::new(MetricsConfig::in_memory(
"test",
"codex-cli",
env!("CARGO_PKG_VERSION"),
exporter.clone(),
))?;
metrics.counter("codex.turns", 1, &[("model", "gpt-5.1")])?;
metrics.shutdown()?; // flushes in-memory exporter
Trace context
Trace propagation helpers remain separate from the session event emitter:
use codex_otel::current_span_w3c_trace_context;
use codex_otel::set_parent_from_w3c_trace_context;
Shutdown
OtelProvider::shutdown()stops the OTEL exporter.SessionTelemetry::shutdown_metrics()flushes and shuts down the metrics provider.
Both are optional because drop performs best-effort shutdown, but calling them explicitly gives deterministic flushing (or a shutdown error if flushing does not complete in time).