fix(tui_app_server): preserve transcript events under backpressure (#15759)

## TL;DR

When running codex with `-c features.tui_app_server=true` we see
corruption when streaming large amounts of data. This PR marks other
event types as _critical_ by making them _must-deliver_.

## Problem

When the TUI consumer falls behind the app-server event stream, the
bounded `mpsc` channel fills up and the forwarding layer drops events
via `try_send`. Previously only `TurnCompleted` was marked as
must-deliver. Streamed assistant text (`AgentMessageDelta`) and the
authoritative final item (`ItemCompleted`) were treated as droppable —
the same as ephemeral command output deltas. Because the TUI renders
markdown incrementally from these deltas, dropping any of them produces
permanently corrupted or incomplete paragraphs that persist for the rest
of the session.

## Mental model

The app-server event stream has two tiers of importance:

1. **Lossless (transcript + terminal):** Events that form the
authoritative record of what the assistant said or that signal turn
lifecycle transitions. Losing any of these corrupts the visible output
or leaves surfaces waiting forever. These are: `AgentMessageDelta`,
`PlanDelta`, `ReasoningSummaryTextDelta`, `ReasoningTextDelta`,
`ItemCompleted`, and `TurnCompleted`.

2. **Best-effort (everything else):** Ephemeral status events like
`CommandExecutionOutputDelta` and progress notifications. Dropping these
under load causes cosmetic gaps but no permanent corruption.

The forwarding layer uses `try_send` for best-effort events (dropping on
backpressure) and blocking `send().await` for lossless events (applying
back-pressure to the producer until the consumer catches up).

## Non-goals

- Eliminating backpressure entirely. The bounded queue is intentional;
this change only widens the set of events that survive it.
- Changing the event protocol or adding new notification types.
- Addressing root causes of consumer slowness (e.g. TUI render cost).

## Tradeoffs

Blocking on transcript events means a slow consumer can now stall the
producer for the duration of those events. This is acceptable because:
(a) the alternative is permanently broken output, which is worse; (b)
the consumer already had to keep up with `TurnCompleted` blocking sends;
and (c) transcript events arrive at model-output speed, not burst speed,
so sustained saturation is unlikely in practice.

## Architecture

Two parallel changes, one per transport:

- **In-process path** (`lib.rs`): The inline forwarding logic was
extracted into `forward_in_process_event`, a standalone async function
that encapsulates the lag-marker / must-deliver / try-send decision
tree. The worker loop now delegates to it. A new
`server_notification_requires_delivery` function (shared `pub(crate)`)
centralizes the notification classification.

- **Remote path** (`remote.rs`): The local `event_requires_delivery` now
delegates to the same shared `server_notification_requires_delivery`,
keeping both transports in sync.

## Observability

No new metrics or log lines. The existing `warn!` on event drops
continues to fire for best-effort events. Lossless events that block
will not produce a log line (they simply wait).

## Tests

- `event_requires_delivery_marks_transcript_and_terminal_events`: unit
test confirming the expanded classification covers `AgentMessageDelta`,
`ItemCompleted`, `TurnCompleted`, and excludes
`CommandExecutionOutputDelta` and `Lagged`.
-
`forward_in_process_event_preserves_transcript_notifications_under_backpressure`:
integration-style test that fills a capacity-1 channel, verifies a
best-effort event is dropped (skipped count increments), then sends
lossless transcript events and confirms they all arrive in order with
the correct lag marker preceding them.
- `remote_backpressure_preserves_transcript_notifications`: end-to-end
test over a real websocket that verifies the remote transport preserves
transcript events under the same backpressure scenario.
- `event_requires_delivery_marks_transcript_and_disconnect_events`
(remote): unit test confirming the remote-side classification covers
transcript events and `Disconnected`.

---------

Co-authored-by: Eric Traut <etraut@openai.com>
This commit is contained in:
Felipe Coury
2026-03-25 16:50:39 -03:00
committed by GitHub
parent 6124564297
commit e9996ec62a
2 changed files with 457 additions and 89 deletions

View File

@@ -21,6 +21,7 @@ use crate::RequestResult;
use crate::SHUTDOWN_TIMEOUT;
use crate::TypedRequestError;
use crate::request_method_name;
use crate::server_notification_requires_delivery;
use codex_app_server_protocol::ClientInfo;
use codex_app_server_protocol::ClientNotification;
use codex_app_server_protocol::ClientRequest;
@@ -854,11 +855,11 @@ async fn reject_if_server_request_dropped(
fn event_requires_delivery(event: &AppServerEvent) -> bool {
match event {
AppServerEvent::ServerNotification(ServerNotification::TurnCompleted(_)) => true,
AppServerEvent::ServerNotification(notification) => {
server_notification_requires_delivery(notification)
}
AppServerEvent::Disconnected { .. } => true,
AppServerEvent::Lagged { .. }
| AppServerEvent::ServerNotification(_)
| AppServerEvent::ServerRequest(_) => false,
AppServerEvent::Lagged { .. } | AppServerEvent::ServerRequest(_) => false,
}
}
@@ -905,3 +906,40 @@ async fn write_jsonrpc_message(
))
})
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn event_requires_delivery_marks_transcript_and_disconnect_events() {
assert!(event_requires_delivery(
&AppServerEvent::ServerNotification(ServerNotification::AgentMessageDelta(
codex_app_server_protocol::AgentMessageDeltaNotification {
thread_id: "thread".to_string(),
turn_id: "turn".to_string(),
item_id: "item".to_string(),
delta: "hello".to_string(),
},
),)
));
assert!(event_requires_delivery(
&AppServerEvent::ServerNotification(ServerNotification::ItemCompleted(
codex_app_server_protocol::ItemCompletedNotification {
thread_id: "thread".to_string(),
turn_id: "turn".to_string(),
item: codex_app_server_protocol::ThreadItem::Plan {
id: "item".to_string(),
text: "step".to_string(),
},
}
),)
));
assert!(event_requires_delivery(&AppServerEvent::Disconnected {
message: "closed".to_string(),
}));
assert!(!event_requires_delivery(&AppServerEvent::Lagged {
skipped: 1
}));
}
}