Add production startup and TTFT telemetry (#22198)

## Why While investigating `codex exec hi` startup latency, the useful questions were not "is startup slow?" but "which durable bucket is slow in production?" The path we observed has a few distinct stages: 1. `thread/start` creates the session 2. startup prewarm builds the turn context, tools, and prompt 3. startup prewarm warms the websocket 4. the first real turn resolves the prewarm 5. the model produces the first token Before this PR, production telemetry had some of the raw measurements already: - aggregate startup-prewarm duration / age-at-first-turn metrics - TTFT as a metric - websocket request telemetry But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation. ## What changed Add durable production telemetry on the existing `SessionTelemetry` path: - new `codex.startup_phase` OTel log/trace events plus `codex.startup.phase.duration_ms` - new `codex.turn_ttft` OTel log/trace events while preserving the existing TTFT metric The startup phase event is emitted for the coarse buckets we actually observed while running `exec hi`: - `thread_start_create_thread` - `startup_prewarm_total` - `startup_prewarm_create_turn_context` - `startup_prewarm_build_tools` - `startup_prewarm_build_prompt` - `startup_prewarm_websocket_warmup` - `startup_prewarm_resolve` These phases are intentionally low-cardinality so they remain safe as production telemetry tags. ## Why this shape This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior: - prewarm still runs - no control flow changes - no extra remote calls - no user-visible behavior changes One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after `thread/start` are now covered durably. ## Verification - `cargo test -p codex-otel` - `cargo test -p codex-core turn_timing` - `cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm` - `cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted` - `cargo test -p codex-app-server thread_start` - `just fix -p codex-otel -p codex-core -p codex-app-server` I also ran `cargo test -p codex-core`; it built successfully and then hit an existing unrelated stack overflow in `tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.
2026-05-15 00:32:51 +00:00 · 2026-05-11 16:58:36 -07:00
parent 22e84c49d0
commit e15ecc9c35
6 changed files with 108 additions and 4 deletions
--- a/codex-rs/app-server/src/request_processors/thread_processor.rs
+++ b/codex-rs/app-server/src/request_processors/thread_processor.rs
@@ -960,6 +960,7 @@ impl ThreadRequestProcessor {
        experimental_raw_events: bool,
        request_trace: Option<W3cTraceContext>,
    ) -> Result<(), JSONRPCErrorError> {
+        let thread_start_started_at = std::time::Instant::now();
        let requested_cwd = typesafe_overrides.cwd.clone();
        let mut config = config_manager
            .load_with_overrides(config_overrides.clone(), typesafe_overrides.clone())
@@ -1054,6 +1055,7 @@ impl ThreadRequestProcessor {
                .collect()
        };
        let core_dynamic_tool_count = core_dynamic_tools.len();
+        let create_thread_started_at = std::time::Instant::now();
        let NewThread {
            thread_id,
            thread,
@@ -1088,6 +1090,12 @@ impl ThreadRequestProcessor {
                CodexErr::InvalidRequest(message) => invalid_request(message),
                err => internal_error(format!("error creating thread: {err}")),
            })?;
+        let session_telemetry = thread.session_telemetry();
+        session_telemetry.record_startup_phase(
+            "thread_start_create_thread",
+            create_thread_started_at.elapsed(),
+            Some("ready"),
+        );

        Self::set_app_server_client_info(
            thread.as_ref(),
@@ -1189,6 +1197,11 @@ impl ThreadRequestProcessor {
                otel.name = "app_server.thread_start.notify_started",
            ))
            .await;
+        session_telemetry.record_startup_phase(
+            "thread_start_total",
+            thread_start_started_at.elapsed(),
+            Some("ready"),
+        );
        Ok(())
    }

--- a/codex-rs/core/src/codex_thread.rs
+++ b/codex-rs/core/src/codex_thread.rs
@@ -6,6 +6,7 @@ use crate::session::Codex;
 use crate::session::SessionSettingsUpdate;
 use crate::session::SteerInputError;
 use codex_features::Feature;
+use codex_otel::SessionTelemetry;
 use codex_protocol::config_types::ApprovalsReviewer;
 use codex_protocol::config_types::CollaborationMode;
 use codex_protocol::config_types::Personality;
@@ -125,6 +126,11 @@ impl CodexThread {
        self.codex.submit(op).await
    }

+    /// Returns the session telemetry handle for thread-scoped production instrumentation.
+    pub fn session_telemetry(&self) -> SessionTelemetry {
+        self.codex.session.services.session_telemetry.clone()
+    }
+
    pub async fn shutdown_and_wait(&self) -> CodexResult<()> {
        self.codex.shutdown_and_wait().await
    }
--- a/codex-rs/core/src/session_startup_prewarm.rs
+++ b/codex-rs/core/src/session_startup_prewarm.rs
@@ -52,6 +52,7 @@ impl SessionStartupPrewarmHandle {
        session_telemetry: &SessionTelemetry,
        cancellation_token: &CancellationToken,
    ) -> SessionStartupPrewarmResolution {
+        let resolve_started_at = Instant::now();
        let Self {
            mut task,
            started_at,
@@ -78,6 +79,11 @@ impl SessionStartupPrewarmHandle {
                }
                None => {
                    task.abort();
+                    session_telemetry.record_startup_phase(
+                        "startup_prewarm_resolve",
+                        resolve_started_at.elapsed(),
+                        Some("cancelled"),
+                    );
                    session_telemetry.record_duration(
                        STARTUP_PREWARM_AGE_AT_FIRST_TURN_METRIC,
                        age_at_first_turn,
@@ -92,6 +98,16 @@ impl SessionStartupPrewarmHandle {
                }
            }
        };
+        let status = match &resolution {
+            SessionStartupPrewarmResolution::Cancelled => "cancelled",
+            SessionStartupPrewarmResolution::Ready(_) => "ready",
+            SessionStartupPrewarmResolution::Unavailable { status, .. } => status,
+        };
+        session_telemetry.record_startup_phase(
+            "startup_prewarm_resolve",
+            resolve_started_at.elapsed(),
+            Some(status),
+        );

        match resolution {
            SessionStartupPrewarmResolution::Cancelled => {
@@ -165,6 +181,11 @@ impl Session {
            let result =
                schedule_startup_prewarm_inner(startup_prewarm_session, base_instructions).await;
            let status = if result.is_ok() { "ready" } else { "failed" };
+            session_telemetry.record_startup_phase(
+                "startup_prewarm_total",
+                started_at.elapsed(),
+                Some(status),
+            );
            session_telemetry.record_duration(
                STARTUP_PREWARM_DURATION_METRIC,
                started_at.elapsed(),
@@ -200,10 +221,17 @@ async fn schedule_startup_prewarm_inner(
    session: Arc<Session>,
    base_instructions: String,
 ) -> CodexResult<ModelClientSession> {
+    let prewarm_started_at = Instant::now();
    let startup_turn_context = session
        .new_default_turn_with_sub_id(INITIAL_SUBMIT_ID.to_owned())
        .await;
+    startup_turn_context.session_telemetry.record_startup_phase(
+        "startup_prewarm_create_turn_context",
+        prewarm_started_at.elapsed(),
+        /*status*/ None,
+    );
    let startup_cancellation_token = CancellationToken::new();
+    let built_tools_started_at = Instant::now();
    let startup_router = built_tools(
        session.as_ref(),
        startup_turn_context.as_ref(),
@@ -213,6 +241,12 @@ async fn schedule_startup_prewarm_inner(
        &startup_cancellation_token,
    )
    .await?;
+    startup_turn_context.session_telemetry.record_startup_phase(
+        "startup_prewarm_build_tools",
+        built_tools_started_at.elapsed(),
+        /*status*/ None,
+    );
+    let build_prompt_started_at = Instant::now();
    let startup_prompt = build_prompt(
        Vec::new(),
        startup_router.as_ref(),
@@ -221,10 +255,16 @@ async fn schedule_startup_prewarm_inner(
            text: base_instructions,
        },
    );
+    startup_turn_context.session_telemetry.record_startup_phase(
+        "startup_prewarm_build_prompt",
+        build_prompt_started_at.elapsed(),
+        /*status*/ None,
+    );
    let startup_turn_metadata_header = startup_turn_context
        .turn_metadata_state
        .current_header_value();
    let mut client_session = session.services.model_client.new_session();
+    let websocket_warmup_started_at = Instant::now();
    client_session
        .prewarm_websocket(
            &startup_prompt,
@@ -236,6 +276,11 @@ async fn schedule_startup_prewarm_inner(
            startup_turn_metadata_header.as_deref(),
        )
        .await?;
+    startup_turn_context.session_telemetry.record_startup_phase(
+        "startup_prewarm_websocket_warmup",
+        websocket_warmup_started_at.elapsed(),
+        /*status*/ None,
+    );

    Ok(client_session)
 }
--- a/codex-rs/core/src/turn_timing.rs
+++ b/codex-rs/core/src/turn_timing.rs
@@ -4,7 +4,6 @@ use std::time::SystemTime;
 use std::time::UNIX_EPOCH;

 use codex_otel::TURN_TTFM_DURATION_METRIC;
-use codex_otel::TURN_TTFT_DURATION_METRIC;
 use codex_protocol::items::TurnItem;
 use codex_protocol::models::ResponseItem;
 use tokio::sync::Mutex;
@@ -21,9 +20,7 @@ pub(crate) async fn record_turn_ttft_metric(turn_context: &TurnContext, event: &
    else {
        return;
    };
-    turn_context
-        .session_telemetry
-        .record_duration(TURN_TTFT_DURATION_METRIC, duration, &[]);
+    turn_context.session_telemetry.record_turn_ttft(duration);
 }

 pub(crate) async fn record_turn_ttfm_metric(turn_context: &TurnContext, item: &TurnItem) {
--- a/codex-rs/otel/src/events/session_telemetry.rs
+++ b/codex-rs/otel/src/events/session_telemetry.rs
@@ -18,9 +18,11 @@ use crate::metrics::RESPONSES_API_OVERHEAD_DURATION_METRIC;
 use crate::metrics::Result as MetricsResult;
 use crate::metrics::SSE_EVENT_COUNT_METRIC;
 use crate::metrics::SSE_EVENT_DURATION_METRIC;
+use crate::metrics::STARTUP_PHASE_DURATION_METRIC;
 use crate::metrics::SessionMetricTagValues;
 use crate::metrics::TOOL_CALL_COUNT_METRIC;
 use crate::metrics::TOOL_CALL_DURATION_METRIC;
+use crate::metrics::TURN_TTFT_DURATION_METRIC;
 use crate::metrics::WEBSOCKET_EVENT_COUNT_METRIC;
 use crate::metrics::WEBSOCKET_EVENT_DURATION_METRIC;
 use crate::metrics::WEBSOCKET_REQUEST_COUNT_METRIC;
@@ -183,6 +185,45 @@ impl SessionTelemetry {
        }
    }

+    /// Records a coarse startup phase for production latency breakdowns.
+    pub fn record_startup_phase(
+        &self,
+        phase: &'static str,
+        duration: Duration,
+        status: Option<&'static str>,
+    ) {
+        let tags = match status {
+            Some(status) => vec![("phase", phase), ("status", status)],
+            None => vec![("phase", phase)],
+        };
+        self.record_duration(STARTUP_PHASE_DURATION_METRIC, duration, &tags);
+        log_and_trace_event!(
+            self,
+            common: {
+                event.name = "codex.startup_phase",
+                startup.phase = phase,
+                startup.status = status,
+                duration_ms = %duration.as_millis(),
+            },
+            log: {},
+            trace: {},
+        );
+    }
+
+    /// Records time to first token as both a metric and a production telemetry event.
+    pub fn record_turn_ttft(&self, duration: Duration) {
+        self.record_duration(TURN_TTFT_DURATION_METRIC, duration, &[]);
+        log_and_trace_event!(
+            self,
+            common: {
+                event.name = "codex.turn_ttft",
+                duration_ms = %duration.as_millis(),
+            },
+            log: {},
+            trace: {},
+        );
+    }
+
    pub fn start_timer(&self, name: &str, tags: &[(&str, &str)]) -> Result<Timer, MetricsError> {
        let Some(metrics) = &self.metrics else {
            return Err(MetricsError::ExporterDisabled);
--- a/codex-rs/otel/src/metrics/names.rs
+++ b/codex-rs/otel/src/metrics/names.rs
@@ -38,6 +38,8 @@ pub const CURATED_PLUGINS_STARTUP_SYNC_METRIC: &str = "codex.plugins.startup_syn
 pub const CURATED_PLUGINS_STARTUP_SYNC_FINAL_METRIC: &str = "codex.plugins.startup_sync.final";
 pub const HOOK_RUN_METRIC: &str = "codex.hooks.run";
 pub const HOOK_RUN_DURATION_METRIC: &str = "codex.hooks.run.duration_ms";
+/// Duration for coarse startup phases, tagged by low-cardinality phase and status.
+pub const STARTUP_PHASE_DURATION_METRIC: &str = "codex.startup.phase.duration_ms";
 /// Total runtime of a startup prewarm attempt until it completes, tagged by final status.
 pub const STARTUP_PREWARM_DURATION_METRIC: &str = "codex.startup_prewarm.duration_ms";
 /// Age of the startup prewarm attempt when the first real turn resolves it, tagged by outcome.