mirror of
https://github.com/openai/codex.git
synced 2026-04-29 00:55:38 +00:00
fix(tui_app_server): preserve transcript events under backpressure (#15759)
## TL;DR When running codex with `-c features.tui_app_server=true` we see corruption when streaming large amounts of data. This PR marks other event types as _critical_ by making them _must-deliver_. ## Problem When the TUI consumer falls behind the app-server event stream, the bounded `mpsc` channel fills up and the forwarding layer drops events via `try_send`. Previously only `TurnCompleted` was marked as must-deliver. Streamed assistant text (`AgentMessageDelta`) and the authoritative final item (`ItemCompleted`) were treated as droppable — the same as ephemeral command output deltas. Because the TUI renders markdown incrementally from these deltas, dropping any of them produces permanently corrupted or incomplete paragraphs that persist for the rest of the session. ## Mental model The app-server event stream has two tiers of importance: 1. **Lossless (transcript + terminal):** Events that form the authoritative record of what the assistant said or that signal turn lifecycle transitions. Losing any of these corrupts the visible output or leaves surfaces waiting forever. These are: `AgentMessageDelta`, `PlanDelta`, `ReasoningSummaryTextDelta`, `ReasoningTextDelta`, `ItemCompleted`, and `TurnCompleted`. 2. **Best-effort (everything else):** Ephemeral status events like `CommandExecutionOutputDelta` and progress notifications. Dropping these under load causes cosmetic gaps but no permanent corruption. The forwarding layer uses `try_send` for best-effort events (dropping on backpressure) and blocking `send().await` for lossless events (applying back-pressure to the producer until the consumer catches up). ## Non-goals - Eliminating backpressure entirely. The bounded queue is intentional; this change only widens the set of events that survive it. - Changing the event protocol or adding new notification types. - Addressing root causes of consumer slowness (e.g. TUI render cost). ## Tradeoffs Blocking on transcript events means a slow consumer can now stall the producer for the duration of those events. This is acceptable because: (a) the alternative is permanently broken output, which is worse; (b) the consumer already had to keep up with `TurnCompleted` blocking sends; and (c) transcript events arrive at model-output speed, not burst speed, so sustained saturation is unlikely in practice. ## Architecture Two parallel changes, one per transport: - **In-process path** (`lib.rs`): The inline forwarding logic was extracted into `forward_in_process_event`, a standalone async function that encapsulates the lag-marker / must-deliver / try-send decision tree. The worker loop now delegates to it. A new `server_notification_requires_delivery` function (shared `pub(crate)`) centralizes the notification classification. - **Remote path** (`remote.rs`): The local `event_requires_delivery` now delegates to the same shared `server_notification_requires_delivery`, keeping both transports in sync. ## Observability No new metrics or log lines. The existing `warn!` on event drops continues to fire for best-effort events. Lossless events that block will not produce a log line (they simply wait). ## Tests - `event_requires_delivery_marks_transcript_and_terminal_events`: unit test confirming the expanded classification covers `AgentMessageDelta`, `ItemCompleted`, `TurnCompleted`, and excludes `CommandExecutionOutputDelta` and `Lagged`. - `forward_in_process_event_preserves_transcript_notifications_under_backpressure`: integration-style test that fills a capacity-1 channel, verifies a best-effort event is dropped (skipped count increments), then sends lossless transcript events and confirms they all arrive in order with the correct lag marker preceding them. - `remote_backpressure_preserves_transcript_notifications`: end-to-end test over a real websocket that verifies the remote transport preserves transcript events under the same backpressure scenario. - `event_requires_delivery_marks_transcript_and_disconnect_events` (remote): unit test confirming the remote-side classification covers transcript events and `Disconnected`. --------- Co-authored-by: Eric Traut <etraut@openai.com>
This commit is contained in:
@@ -21,6 +21,7 @@ use crate::RequestResult;
|
||||
use crate::SHUTDOWN_TIMEOUT;
|
||||
use crate::TypedRequestError;
|
||||
use crate::request_method_name;
|
||||
use crate::server_notification_requires_delivery;
|
||||
use codex_app_server_protocol::ClientInfo;
|
||||
use codex_app_server_protocol::ClientNotification;
|
||||
use codex_app_server_protocol::ClientRequest;
|
||||
@@ -854,11 +855,11 @@ async fn reject_if_server_request_dropped(
|
||||
|
||||
fn event_requires_delivery(event: &AppServerEvent) -> bool {
|
||||
match event {
|
||||
AppServerEvent::ServerNotification(ServerNotification::TurnCompleted(_)) => true,
|
||||
AppServerEvent::ServerNotification(notification) => {
|
||||
server_notification_requires_delivery(notification)
|
||||
}
|
||||
AppServerEvent::Disconnected { .. } => true,
|
||||
AppServerEvent::Lagged { .. }
|
||||
| AppServerEvent::ServerNotification(_)
|
||||
| AppServerEvent::ServerRequest(_) => false,
|
||||
AppServerEvent::Lagged { .. } | AppServerEvent::ServerRequest(_) => false,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -905,3 +906,40 @@ async fn write_jsonrpc_message(
|
||||
))
|
||||
})
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn event_requires_delivery_marks_transcript_and_disconnect_events() {
|
||||
assert!(event_requires_delivery(
|
||||
&AppServerEvent::ServerNotification(ServerNotification::AgentMessageDelta(
|
||||
codex_app_server_protocol::AgentMessageDeltaNotification {
|
||||
thread_id: "thread".to_string(),
|
||||
turn_id: "turn".to_string(),
|
||||
item_id: "item".to_string(),
|
||||
delta: "hello".to_string(),
|
||||
},
|
||||
),)
|
||||
));
|
||||
assert!(event_requires_delivery(
|
||||
&AppServerEvent::ServerNotification(ServerNotification::ItemCompleted(
|
||||
codex_app_server_protocol::ItemCompletedNotification {
|
||||
thread_id: "thread".to_string(),
|
||||
turn_id: "turn".to_string(),
|
||||
item: codex_app_server_protocol::ThreadItem::Plan {
|
||||
id: "item".to_string(),
|
||||
text: "step".to_string(),
|
||||
},
|
||||
}
|
||||
),)
|
||||
));
|
||||
assert!(event_requires_delivery(&AppServerEvent::Disconnected {
|
||||
message: "closed".to_string(),
|
||||
}));
|
||||
assert!(!event_requires_delivery(&AppServerEvent::Lagged {
|
||||
skipped: 1
|
||||
}));
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user