mirror of
https://github.com/openai/codex.git
synced 2026-04-30 17:36:40 +00:00
## 🐛 Problem Users running commands with non-ASCII characters (like Russian text "пример") in Windows/WSL environments experience garbled text in VSCode's shell preview window, with Unicode replacement characters (�) appearing instead of the actual text. **Issue**: https://github.com/openai/codex/issues/6178 ## 🔧 Root Cause The issue was in `StreamOutput<Vec<u8>>::from_utf8_lossy()` method in `codex-rs/core/src/exec.rs`, which used `String::from_utf8_lossy()` to convert shell output bytes to strings. This function immediately replaces any invalid UTF-8 byte sequences with replacement characters, without attempting to decode using other common encodings. In Windows/WSL environments, shell output often uses encodings like: - Windows-1252 (common Windows encoding) - Latin-1/ISO-8859-1 (extended ASCII) ## 🛠️ Solution Replaced the simple `String::from_utf8_lossy()` call with intelligent encoding detection via a new `bytes_to_string_smart()` function that tries multiple encoding strategies: 1. **UTF-8** (fast path for valid UTF-8) 2. **Windows-1252** (handles Windows-specific characters in 0x80-0x9F range) 3. **Latin-1** (fallback for extended ASCII) 4. **Lossy UTF-8** (final fallback, same as before) ## 📁 Changes ### New Files - `codex-rs/core/src/text_encoding.rs` - Smart encoding detection module - `codex-rs/core/tests/suite/text_encoding_fix.rs` - Integration tests ### Modified Files - `codex-rs/core/src/lib.rs` - Added text_encoding module - `codex-rs/core/src/exec.rs` - Updated StreamOutput::from_utf8_lossy() - `codex-rs/core/tests/suite/mod.rs` - Registered new test module ## ✅ Testing - **5 unit tests** covering UTF-8, Windows-1252, Latin-1, and fallback scenarios - **2 integration tests** simulating the exact Issue #6178 scenario - **Demonstrates improvement** over the previous `String::from_utf8_lossy()` approach All tests pass: ```bash cargo test -p codex-core text_encoding cargo test -p codex-core test_shell_output_encoding_issue_6178 ``` ## 🎯 Impact - ✅ **Eliminates garbled text** in VSCode shell preview for non-ASCII content - ✅ **Supports Windows/WSL environments** with proper encoding detection - ✅ **Zero performance impact** for UTF-8 text (fast path) - ✅ **Backward compatible** - UTF-8 content works exactly as before - ✅ **Handles edge cases** with robust fallback mechanism ## 🧪 Test Scenarios The fix has been tested with: - Russian text ("пример") - Windows-1252 quotation marks (""test") - Latin-1 accented characters ("café") - Mixed encoding content - Invalid byte sequences (graceful fallback) ## 📋 Checklist - [X] Addresses the reported issue - [X] Includes comprehensive tests - [X] Maintains backward compatibility - [X] Follows project coding conventions - [X] No breaking changes --------- Co-authored-by: Josh McKinney <joshka@openai.com>
This commit is contained in:
@@ -49,6 +49,7 @@ mod seatbelt;
|
||||
mod shell_serialization;
|
||||
mod stream_error_allows_next_turn;
|
||||
mod stream_no_completed;
|
||||
mod text_encoding_fix;
|
||||
mod tool_harness;
|
||||
mod tool_parallelism;
|
||||
mod tools;
|
||||
|
||||
77
codex-rs/core/tests/suite/text_encoding_fix.rs
Normal file
77
codex-rs/core/tests/suite/text_encoding_fix.rs
Normal file
@@ -0,0 +1,77 @@
|
||||
//! Integration test for the text encoding fix for issue #6178.
|
||||
//!
|
||||
//! These tests simulate VSCode's shell preview on Windows/WSL where the output
|
||||
//! may be encoded with a legacy code page before it reaches Codex.
|
||||
|
||||
use codex_core::exec::StreamOutput;
|
||||
use pretty_assertions::assert_eq;
|
||||
|
||||
#[test]
|
||||
fn test_utf8_shell_output() {
|
||||
// Baseline: UTF-8 output should bypass the detector and remain unchanged.
|
||||
assert_eq!(decode_shell_output("пример".as_bytes()), "пример");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_cp1251_shell_output() {
|
||||
// VS Code shells on Windows frequently surface CP1251 bytes for Cyrillic text.
|
||||
assert_eq!(decode_shell_output(b"\xEF\xF0\xE8\xEC\xE5\xF0"), "пример");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_cp866_shell_output() {
|
||||
// Native cmd.exe still defaults to CP866; make sure we recognize that too.
|
||||
assert_eq!(decode_shell_output(b"\xAF\xE0\xA8\xAC\xA5\xE0"), "пример");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_windows_1252_smart_decoding() {
|
||||
// Smart detection should turn fancy quotes/dashes into the proper Unicode glyphs.
|
||||
assert_eq!(
|
||||
decode_shell_output(b"\x93\x94 test \x96 dash"),
|
||||
"\u{201C}\u{201D} test \u{2013} dash"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_smart_decoding_improves_over_lossy_utf8() {
|
||||
// Regression guard: String::from_utf8_lossy() alone used to emit replacement chars here.
|
||||
let bytes = b"\x93\x94 test \x96 dash";
|
||||
assert!(
|
||||
String::from_utf8_lossy(bytes).contains('\u{FFFD}'),
|
||||
"lossy UTF-8 should inject replacement chars"
|
||||
);
|
||||
assert_eq!(
|
||||
decode_shell_output(bytes),
|
||||
"\u{201C}\u{201D} test \u{2013} dash",
|
||||
"smart decoding should keep curly quotes intact"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_mixed_ascii_and_legacy_encoding() {
|
||||
// Commands tend to mix ASCII status text with Latin-1 bytes (e.g. café).
|
||||
assert_eq!(decode_shell_output(b"Output: caf\xE9"), "Output: café"); // codespell:ignore caf
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_pure_latin1_shell_output() {
|
||||
// Latin-1 by itself should still decode correctly (regression coverage for the older tests).
|
||||
assert_eq!(decode_shell_output(b"caf\xE9"), "café"); // codespell:ignore caf
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_invalid_bytes_still_fall_back_to_lossy() {
|
||||
// If detection fails, we still want the user to see replacement characters.
|
||||
let bytes = b"\xFF\xFE\xFD";
|
||||
assert_eq!(decode_shell_output(bytes), String::from_utf8_lossy(bytes));
|
||||
}
|
||||
|
||||
fn decode_shell_output(bytes: &[u8]) -> String {
|
||||
StreamOutput {
|
||||
text: bytes.to_vec(),
|
||||
truncated_after_lines: None,
|
||||
}
|
||||
.from_utf8_lossy()
|
||||
.text
|
||||
}
|
||||
Reference in New Issue
Block a user