feat: resumable backfill (#10745)

## Summary

This PR makes the SQLite rollout backfill resumable and repeatable
instead of running once, only when the DB file is first created.

## What changed

- Added a persisted backfill state table:
  - state/migrations/0008_backfill_state.sql
  - Tracks status (pending|running|complete), last_watermark, and
    last_success_at.
- Added backfill state model/types in codex-state:
  - BackfillState, BackfillStatus (state/src/model/backfill_state.rs)
- Added runtime APIs to manage backfill lifecycle/progress:
  - get_backfill_state
  - mark_backfill_running
  - checkpoint_backfill
  - mark_backfill_complete
- Updated core startup behavior:
  - Backfill now runs whenever state is not Complete (not only when the
    DB file is newly created).
- Reworked backfill execution:
  - Collect rollout files, derive a deterministic watermark per path,
    sort, and resume from last_watermark.
  - Process in batches (BACKFILL_BATCH_SIZE = 200), checkpointing after
    each batch.
  - Mark complete with last_success_at at the end.
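
The execution flow listed above can be sketched as follows. This is a
minimal illustration, not the code in this PR: the in-memory
BackfillState struct, the watermark_for and run_backfill helpers, and
the max_batches parameter (used only to simulate the process exiting
early) are all hypothetical, and the sketch assumes the watermark is
simply the path string, since the real derivation is not shown here.

```rust
use std::time::SystemTime;

/// The three statuses named in the commit message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BackfillStatus {
    Pending,
    Running,
    Complete,
}

/// Hypothetical in-memory stand-in for the persisted backfill state row.
#[derive(Debug, Clone)]
pub struct BackfillState {
    pub status: BackfillStatus,
    pub last_watermark: Option<String>,
    pub last_success_at: Option<SystemTime>,
}

/// Batch size quoted in the commit message.
pub const BACKFILL_BATCH_SIZE: usize = 200;

/// Assumption: the watermark is the path itself. Any deterministic,
/// totally ordered key per path would serve the same purpose.
fn watermark_for(path: &str) -> String {
    path.to_string()
}

/// Processes not-yet-checkpointed paths in sorted batches and records a
/// checkpoint after each batch. `max_batches` (hypothetical) stops the run
/// early to imitate a process exit. Returns the paths handled in this run.
pub fn run_backfill(
    state: &mut BackfillState,
    paths: &[&str],
    batch_size: usize,
    max_batches: Option<usize>,
) -> Vec<String> {
    let mut items: Vec<(String, String)> = paths
        .iter()
        .map(|p| (watermark_for(p), p.to_string()))
        .collect();
    items.sort();

    // Resume: drop everything at or below the last checkpointed watermark.
    if let Some(last) = state.last_watermark.clone() {
        items.retain(|(w, _)| *w > last);
    }

    state.status = BackfillStatus::Running;
    let mut processed = Vec::new();
    for (i, batch) in items.chunks(batch_size.max(1)).enumerate() {
        if max_batches.map_or(false, |m| i >= m) {
            // Simulated exit: status stays Running, checkpoint is preserved.
            return processed;
        }
        for (_, path) in batch {
            processed.push(path.clone());
        }
        // Checkpoint: remember the highest watermark of the finished batch.
        state.last_watermark = batch.last().map(|(w, _)| w.clone());
    }

    state.status = BackfillStatus::Complete;
    state.last_success_at = Some(SystemTime::now());
    processed
}
```

Calling run_backfill a second time with the same state picks up after
the last checkpoint rather than from the first file, which is the
property this change relies on. Because the checkpoint is written only
after a whole batch finishes, a batch that was interrupted midway is
retried in full on the next run, so per-file processing in a design
like this needs to be idempotent.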

## Why

Previous behavior could leave users permanently partially backfilled if
the process exited during initial async backfill. This change allows
safe continuation across restarts and avoids restarting from scratch.
jif-oai
2026-02-05 14:34:34 +00:00
committed by GitHub
parent f2ffc4e5d0
commit 4033f905c6
8 changed files with 528 additions and 68 deletions


@@ -31,11 +31,9 @@ pub(crate) async fn init_if_enabled(
     config: &Config,
     otel: Option<&OtelManager>,
 ) -> Option<StateDbHandle> {
-    let state_path = codex_state::state_db_path(config.codex_home.as_path());
     if !config.features.enabled(Feature::Sqlite) {
         return None;
     }
-    let existed = tokio::fs::try_exists(&state_path).await.unwrap_or(false);
     let runtime = match codex_state::StateRuntime::init(
         config.codex_home.clone(),
         config.model_provider_id.clone(),
@@ -55,7 +53,17 @@ pub(crate) async fn init_if_enabled(
             return None;
         }
     };
-    if !existed {
+    let should_backfill = match runtime.get_backfill_state().await {
+        Ok(state) => state.status != codex_state::BackfillStatus::Complete,
+        Err(err) => {
+            warn!(
+                "failed to read backfill state at {}: {err}",
+                config.codex_home.display()
+            );
+            true
+        }
+    };
+    if should_backfill {
         let runtime_for_backfill = Arc::clone(&runtime);
         let config_for_backfill = config.clone();
         let otel_for_backfill = otel.cloned();