diff --git a/docs/agent-guide/db-worker/003-server-list-write-lock.md b/docs/agent-guide/db-worker/003-server-list-write-lock.md new file mode 100644 index 0000000000..06d7406cc6 --- /dev/null +++ b/docs/agent-guide/db-worker/003-server-list-write-lock.md @@ -0,0 +1,534 @@ +# db-worker-node Server List Write Lock Implementation Plan + +Goal: Make `db-worker-node` server-list reads and writes safe across multiple CLI and Electron processes without blocking readers. + +Architecture: Keep `server-list` as the shared discovery file under the configured root-dir, and serialize every mutation with a sibling `server-list.lock` file. +Architecture: Reads remain lock-free and parse the latest complete `server-list` content, while writes acquire the lock, read the current file inside the critical section, write through a temporary file, and atomically rename it into place. + +Tech Stack: ClojureScript, Node.js `fs` and `path`, Promesa, existing `logseq.db-worker.server-list`, existing `logseq.cli.server`, existing `frontend.worker.db-worker-node` lifecycle code. + +Related: Builds on `docs/agent-guide/db-worker/001-db-worker-node-restart-on-version-mismatch.md`. +Related: Relates to `docs/agent-guide/db-worker/002-desktop-db-worker-request-cap-switch-graph.md`. +Related: Relates to `docs/cli/logseq-cli.md` server lifecycle documentation. + +## Problem statement + +The current `server-list` file is a root-dir level text file used to discover running `db-worker-node` servers. + +The current file path is derived by `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` through `logseq.db-worker.server-list/path`, which returns `/server-list`. + +Each line currently contains one `pid port` pair. + +`logseq.db-worker.server-list/read-entries` currently checks existence, reads the whole file with `fs/readFileSync`, splits lines, parses valid `pid port` pairs, ignores invalid lines, and returns a vector. + +`logseq.db-worker.server-list/append-entry!` currently creates the parent directory and calls `fs/appendFileSync` with one line. + +`logseq.db-worker.server-list/remove-entry!` currently reads all entries, removes the matching `pid port`, and calls `rewrite-entries!`. + +`logseq.db-worker.server-list/rewrite-entries!` currently creates the parent directory and overwrites the entire file with `fs/writeFileSync`. + +`/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` appends the current process entry after the HTTP server starts listening. + +`/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` removes the current process entry when the daemon runtime clears state during stop. + +`/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs` discovers servers by reading `server-list`, checking process and `/healthz` state for each entry, then rewriting the list with retained entries. + +The current read-modify-write cleanup path can lose data when another process appends or removes entries between the initial read and the later rewrite. + +A concrete lost-update sequence is possible today. + +1. Process A calls `discover-servers` and reads stale entry `S` from `server-list`. +2. Process B starts a new `db-worker-node` and appends live entry `B` to `server-list`. +3. Process A finishes stale-entry cleanup using its old snapshot and rewrites `server-list` without entry `B`. +4. Later discovery cannot see process B even though process B is alive. + +The requested fix is to add a lock when writing `server-list`. + +The requested lock file name is exactly `server-list.lock`. + +The requested lock file is only used by write operations. + +Read operations must not acquire, wait for, or delete `server-list.lock`. + +## Testing Plan + +I will use @Test-Driven Development (TDD) before changing implementation behavior. + +I will add unit tests in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs` for lock path derivation and lock-free reads. + +I will add a unit test in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs` proving `read-entries` can read `server-list` while `server-list.lock` already exists. + +I will add behavior tests in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs` for `append-entry!` and `remove-entry!` preserving unrelated entries through the new locked update path. + +I will add a regression test in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs` that simulates stale cleanup racing with a concurrent append. + +The CLI regression test should create an initial stale entry, make the stale check append a live entry before cleanup finishes, call `logseq.cli.server/list-servers`, and assert that the live entry remains in `server-list` after stale cleanup. + +This test should fail on the current implementation because `discover-servers` rewrites from a stale snapshot. + +I will add a regression test in `/Users/rcmerci/gh-repos/logseq/src/test/frontend/worker/db_worker_node_test.cljs` only if the public `append-entry!` or `remove-entry!` return contract changes. + +The preferred implementation should keep those return contracts stable, so existing daemon registration tests should continue to cover daemon integration. + +I will run the focused RED tests before implementation and confirm they fail because of missing write locking or stale-snapshot cleanup behavior. + +I will run the same focused tests after implementation and confirm they pass. + +I will run the existing daemon and CLI server-list tests to ensure the lifecycle still registers, unregisters, lists, and lazily cleans stale entries. + +NOTE: I will write *all* tests before I add any implementation behavior. + +## Current implementation snapshot + +| Concern | Current location | Current behavior | Risk | +|---------|------------------|------------------|------| +| Path derivation | `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` | `path` returns `/server-list`. | No lock path exists. | +| Reads | `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` | `read-entries` reads without locking and ignores invalid lines. | Existence check plus read has a small TOCTOU window. | +| Append | `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` | `append-entry!` appends one line without coordination. | Append can race with a full rewrite. | +| Remove | `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` | `remove-entry!` reads and rewrites without coordination. | Remove can overwrite a concurrent append. | +| Full cleanup rewrite | `/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs` | `discover-servers` rewrites retained entries after async checks. | Cleanup can overwrite changes made after the initial read. | +| Daemon publish | `/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` | A started daemon calls `append-entry!`. | Publish can be lost by concurrent cleanup. | +| Daemon unpublish | `/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` | A stopped daemon calls `remove-entry!`. | Unpublish can lose unrelated entries. | + +## Target behavior + +All writes to `/server-list` must be serialized by `/server-list.lock`. + +The lock file basename must be exactly `server-list.lock`. + +The lock file must live in the same directory as `server-list`. + +Reads must remain lock-free. + +Reads must not block when `server-list.lock` exists. + +Reads must not remove stale `server-list.lock` files. + +Writes must acquire `server-list.lock` before reading current entries for a mutation. + +Writes must release `server-list.lock` in a `finally` path after the file mutation finishes or fails. + +Writes must fail fast with a structured error after a bounded lock acquisition timeout. + +Writes must not silently skip server-list publication or cleanup when lock acquisition fails. + +Stale cleanup must remove only entries that the cleanup pass proved stale. + +Stale cleanup must not rewrite the file from an old snapshot that can exclude newer entries. + +Readers must only ever observe either the previous complete `server-list` file or the next complete `server-list` file. + +The implementation must not add backward compatibility for old lock names or old `--server-list-file` arguments. + +## Design + +### 1. Add server-list lock path derivation + +Add `logseq.db-worker.server-list/lock-path` in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. + +`lock-path` should accept the existing `server-list` file path and return `(node-path/join (node-path/dirname file-path) "server-list.lock")`. + +`lock-path` should reject a missing or blank file path with a clear `js/Error` or `ex-info` error. + +Do not derive the lock path as `server-list.cljs.lock`, `server-list.locked`, or `.lock` if that can produce a basename other than `server-list.lock`. + +### 2. Add a write-lock helper + +Add a private helper such as `with-write-lock!` in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. + +The helper should create the parent directory before acquiring the lock. + +The helper should acquire `server-list.lock` with exclusive file creation through Node `fs.openSync` using an exclusive mode such as `"wx"`. + +The helper should write small metadata into the lock file after acquisition. + +The metadata should include the owning process pid and a unique lock id. + +The metadata makes stale-lock cleanup safer and prevents a releasing writer from deleting a lock owned by a later writer. + +The helper should close the lock file descriptor after writing metadata. + +The helper should run the supplied write operation only after acquiring the lock. + +The helper should release the lock in `finally` by reading the lock metadata and unlinking only when the lock id still matches the current writer. + +The helper should tolerate `ENOENT` while releasing because another process may have removed a stale lock after a crash. + +The helper should treat `EEXIST` as normal contention while waiting for the lock. + +The helper should retry acquisition on a short fixed interval until timeout. + +Use a small poll interval such as `25` milliseconds. + +Use a bounded timeout of `2000` milliseconds. + +If timeout expires and the lock owner process is still alive or cannot be inspected, throw `ex-info` with `:code :server-list-lock-timeout` and include `:file-path`, `:lock-path`, and lock metadata when available. + +If timeout sees a stale lock whose metadata pid is not alive, remove that stale lock and retry acquisition. + +If the lock file exists but metadata cannot be parsed, treat it as active until timeout and then fail fast rather than guessing. + +Do not use `js/Buffer` in this helper. + +### 3. Make writes atomic + +Add a private `write-entries-unlocked!` helper in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. + +The helper should write the next payload to a unique temporary file in the same directory as `server-list`. + +The helper should then call `fs/renameSync` from the temporary file to `server-list`. + +The temporary file name should include `server-list`, the current pid, and a unique suffix. + +The temporary file must not be named `server-list.lock`. + +The helper should remove the temporary file in a best-effort `catch` path if writing or renaming fails. + +Atomic rename keeps lock-free readers from observing a partially truncated `server-list` file. + +### 4. Centralize mutations through an update helper + +Add a public helper such as `update-entries!` in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. + +`update-entries!` should acquire `server-list.lock`. + +`update-entries!` should call `read-entries` after the lock is acquired. + +`update-entries!` should apply a pure transform to the current entries. + +`update-entries!` should normalize the result before writing. + +Normalization should keep only valid positive integer `:pid` and `:port` pairs. + +Normalization should remove exact duplicate `pid port` entries while preserving first-seen order. + +`update-entries!` should write with `write-entries-unlocked!`. + +`update-entries!` should return the normalized entries that were written. + +`append-entry!` should become a thin wrapper around `update-entries!`. + +`append-entry!` should append the entry only when both `pid` and `port` are positive integers. + +`append-entry!` should avoid adding a duplicate exact `pid port` entry. + +`append-entry!` should keep returning the appended entry for valid input to preserve current callers. + +`remove-entry!` should become a thin wrapper around `update-entries!`. + +`remove-entry!` should remove only exact matches for the target `pid port` pair. + +`remove-entry!` should preserve unrelated entries added by other writers. + +`rewrite-entries!` should remain public only as a locked full replacement helper for compatibility with explicit full-replacement callers. + +Production code must not use `rewrite-entries!` with a stale snapshot from before asynchronous work. + +Prefer mutation-specific helpers such as `update-entries!` and `remove-entries!` for normal production server-list changes. + +### 5. Replace stale cleanup rewrite in discovery + +Update `/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs`. + +In `logseq.cli.server/discover-servers`, keep the existing lock-free read for discovery input. + +Keep health checks outside the write lock because health checks are asynchronous and can take up to the request timeout. + +Compute the set of entries that should be removed because they are proven stale. + +After checks finish, call a server-list helper such as `remove-entries!` or `update-entries!` to remove only that stale set from the current file under `server-list.lock`. + +Do not rewrite `server-list` from `cleaned-entries` computed from the initial snapshot. + +Do not hold `server-list.lock` while calling `pid-status` or `/healthz`. + +Return discovered server payloads from the original discovery pass as today. + +The purpose of the write lock is to protect file mutations, not to serialize discovery network work. + +### 6. Keep daemon call sites simple + +Keep `/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` call sites using `server-list/append-entry!` on publish and `server-list/remove-entry!` on stop if the public signatures stay the same. + +Do not add a separate lock concern in the daemon namespace. + +The server-list namespace should own all lock-file details. + +### 7. Improve lock-free reads without changing semantics + +Update `read-entries` in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` to avoid a separate `fs/existsSync` check before reading. + +Read the file directly and catch only missing-file errors such as `ENOENT` to return `[]`. + +Continue returning `[]` for blank or missing paths only if that is the existing accepted behavior. + +Continue ignoring invalid lines unless implementation review decides to fail fast for corrupt server-list content. + +Do not inspect `server-list.lock` from `read-entries`. + +This removes one filesystem call from the common read path and avoids an existence-check TOCTOU race. + +## Integration overview + +```text +CLI or Electron process + | + | lock-free read + v +/server-list + | + | parse pid port entries + v +pid-status and /healthz checks + | + | write mutation only + v +acquire /server-list.lock + | + | read current server-list inside lock + | apply append/remove/prune transform + | write temp file + | rename temp file to server-list + v +release /server-list.lock +``` + +## Affected files + +| File | Planned change | +|------|----------------| +| `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs` | Add `lock-path`, write-lock helper, atomic write helper, update helper, locked append/remove/prune behavior, and optimized read behavior. | +| `/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs` | Replace stale full rewrite in `discover-servers` with locked removal of only proven stale entries. | +| `/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` | Prefer no code change, but verify existing publish and unpublish call sites use the updated locked helpers. | +| `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs` | Add focused server-list lock, read, append, remove, dedupe, and stale-lock tests. | +| `/Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs` | Add a regression test for discovery cleanup preserving entries added during cleanup. | +| `/Users/rcmerci/gh-repos/logseq/src/test/frontend/worker/db_worker_node_test.cljs` | Keep existing daemon registration tests green, and add tests only if signatures or return contracts change. | +| `/Users/rcmerci/gh-repos/logseq/docs/cli/logseq-cli.md` | Update only if user-facing server-list troubleshooting needs to mention `server-list.lock`. | + +## Implementation tasks + +### Phase 0: Preparation + +1. Read `/Users/rcmerci/gh-repos/logseq/AGENTS.md`. +2. Check for directory-specific `AGENTS.md` files before editing each target file. +3. Load @Test-Driven Development (TDD) before implementation. +4. Re-read `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. +5. Re-read `/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs` around `discover-servers`. +6. Re-read `/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs` around server-list append and remove call sites. + +### Phase 1: Write RED tests + +1. Add `server-list-lock-path-derives-sibling-lock-file` in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs`. +2. Assert `(server-list/lock-path "/tmp/logseq-root/server-list")` returns `/tmp/logseq-root/server-list.lock`. +3. Add `read-entries-ignores-server-list-lock` in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs`. +4. Create both `server-list` and `server-list.lock` in a temporary root-dir. +5. Assert `server-list/read-entries` returns parsed entries without waiting for or deleting `server-list.lock`. +6. Add `append-entry-deduplicates-valid-entry-under-update` in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs`. +7. Assert two identical appends leave one `pid port` line. +8. Add `remove-entry-preserves-unrelated-current-entry` in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs`. +9. Assert removing one entry does not remove another existing entry. +10. Add `list-servers-preserves-concurrent-server-list-writes` in `/Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs`. +11. In that test, write one stale entry to `server-list`. +12. In that test, redefine `daemon/pid-status` so checking the stale pid appends a second live entry to `server-list` before returning `:not-found`. +13. In that test, call `cli-server/list-servers`. +14. In that test, assert the final `server-list` still contains the second live entry. +15. Run the focused tests and confirm the new race regression test fails on current behavior. + +Expected RED command examples: + +```shell +bb dev:test -v logseq.db-worker.server-list-test/server-list-lock-path-derives-sibling-lock-file +bb dev:test -v logseq.db-worker.server-list-test/read-entries-ignores-server-list-lock +bb dev:test -v logseq.cli.server-test/list-servers-preserves-concurrent-server-list-writes +``` + +The first command should fail until `lock-path` exists. + +The race regression command should fail until discovery cleanup stops rewriting stale snapshots. + +### Phase 2: Implement locked server-list mutations + +1. Add `lock-path` to `/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs`. +2. Add private constants for write-lock timeout and poll interval. +3. Set the write-lock timeout constant to `2000` milliseconds. +4. Set the write-lock poll interval constant to `25` milliseconds unless tests show this is too aggressive. +5. Add a private lock metadata writer. +6. Add a private lock metadata reader. +7. Add a private stale-lock predicate using current lock metadata and process status. +8. Use a local process-status helper rather than depending on `logseq.db-worker.daemon` if adding the dependency would create a cycle. +9. Add `acquire-write-lock!` with exclusive file creation. +10. Add `release-write-lock!` with lock-id verification. +11. Add `with-write-lock!` that wraps acquire, body execution, and release. +12. Add `read-entries-unlocked` only if it keeps the public `read-entries` simple and testable. +13. Update `read-entries` to read directly and catch missing-file errors. +14. Add `entry-key` or equivalent exact `pid port` identity helper. +15. Add `normalize-entries` to validate entries and remove exact duplicate `pid port` entries while preserving order. +16. Add `payload-for-entries` or equivalent formatting helper. +17. Add `write-entries-unlocked!` that writes through a temporary file and renames to `server-list`. +18. Add `update-entries!` that reads current entries inside the lock and writes normalized transformed entries. +19. Reimplement `append-entry!` through `update-entries!`. +20. Reimplement `remove-entry!` through `update-entries!`. +21. Keep `rewrite-entries!` public as a locked full-replacement helper. +22. Add `remove-entries!` or `prune-entries!` if this makes `discover-servers` clearer. +23. Keep public return values compatible for current daemon call sites unless a test proves a change is needed. + +### Phase 3: Replace stale discovery cleanup + +1. Update `logseq.cli.server/discover-servers` in `/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs`. +2. Keep `entries` as the discovery input read before health checks. +3. Keep the existing `p/all` health-check structure. +4. Compute `stale-entries` as entries whose result has `:retain? false`. +5. Call `server-list/remove-entries!` or `server-list/update-entries!` only when `stale-entries` is non-empty. +6. Remove the current `cleaned-entries` full rewrite based on the initial snapshot. +7. Preserve existing `retained-results` and returned `:server` values. +8. Verify `list-servers-lazily-cleans-stale-server-list-entries` still passes. +9. Verify `ensure-server-preserves-live-server-list-entry-after-transient-healthz-failure` still passes. + +### Phase 4: Verify integration behavior + +1. Run the new focused server-list tests. +2. Run the new focused CLI race regression test. +3. Run existing CLI server-list cleanup tests. +4. Run existing db-worker-node daemon registration tests. +5. Run the full lint and unit test suite when the focused tests are green. + +Expected GREEN command examples: + +```shell +bb dev:test -v logseq.db-worker.server-list-test +bb dev:test -v logseq.cli.server-test/list-servers-preserves-concurrent-server-list-writes +bb dev:test -v logseq.cli.server-test/list-servers-lazily-cleans-stale-server-list-entries +bb dev:test -v logseq.cli.server-test/ensure-server-preserves-live-server-list-entry-after-transient-healthz-failure +bb dev:test -v frontend.worker.db-worker-node-test/db-worker-node-start-daemon-registers-and-unregisters-derived-server-list-entry +bb dev:lint-and-test +``` + +Each focused command should pass without warnings. + +`bb dev:lint-and-test` should pass before opening a PR. + +## Edge cases + +A reader may run while a writer holds `server-list.lock`. + +The reader must ignore the lock and return entries from the current complete `server-list` file. + +A reader may run while a writer is replacing `server-list` through atomic rename. + +The reader should see either the old complete file or the new complete file. + +A writer may crash after acquiring `server-list.lock`. + +A later writer should remove the stale lock only when metadata proves the owner pid is gone. + +A writer may see `server-list.lock` from a live process. + +The writer should wait for the `2000` millisecond bounded timeout and then fail fast with `:server-list-lock-timeout`. + +A process may not have permission to inspect the pid in the lock file. + +Treat that lock as active until timeout rather than deleting it. + +A lock file may contain malformed metadata. + +Treat malformed metadata as active until timeout and then fail fast rather than deleting it speculatively. + +Two processes may append the same `pid port` entry. + +The normalized mutation path should avoid duplicate exact entries. + +Discovery may prove entry `S` stale while another process appends entry `B`. + +The cleanup mutation must remove only `S` from the current file and preserve `B`. + +Discovery may fail `/healthz` transiently for a live process. + +The current behavior retains entries after transient health failure, and the new implementation must preserve that behavior. + +A daemon may stop and remove its entry while discovery is checking health. + +The locked remove path should remove only the daemon entry and preserve unrelated entries. + +The root-dir directory may not exist before the first write. + +The write path should create the directory before lock acquisition. + +The root-dir may be missing during read. + +The read path should return an empty vector for a missing `server-list` file. + +## Acceptance criteria + +`server-list.lock` is the only lock filename used for `server-list` writes. + +Every production mutation of `server-list` acquires `server-list.lock` first. + +No production read of `server-list` acquires or waits for `server-list.lock`. + +Write-lock acquisition times out after `2000` milliseconds with a structured `:server-list-lock-timeout` error. + +Malformed `server-list.lock` metadata is not deleted speculatively and fails fast after the timeout. + +`append-entry!` does not lose entries when another process is cleaning stale entries. + +`remove-entry!` does not lose unrelated entries. + +`discover-servers` no longer rewrites the list from a stale pre-health-check snapshot. + +A stale cleanup pass preserves entries appended after the pass started. + +Exact duplicate `pid port` entries are normalized away while preserving first-seen order. + +`rewrite-entries!` remains public only as a locked full-replacement helper. + +Readers never intentionally observe temporary files. + +Focused tests for `logseq.db-worker.server-list-test` pass. + +Focused tests for `logseq.cli.server-test` server-list discovery behavior pass. + +Existing `frontend.worker.db-worker-node-test` daemon registration behavior passes. + +`bb dev:lint-and-test` passes before PR submission. + +## Testing Details + +The server-list unit tests should verify externally visible file behavior rather than private helper details wherever possible. + +The lock path test is acceptable because the requested lock filename is part of the behavior contract. + +The lock-free read test should create `server-list.lock` and prove reads still return entries without deleting the lock. + +The CLI race regression should test the actual `list-servers` or `discover-servers` behavior and prove a live entry appended during stale cleanup survives. + +The daemon registration test should verify the existing start and stop behavior still publishes and unpublishes the current process entry through the updated helpers. + +## Implementation Details + +- Keep the lock implementation inside `logseq.db-worker.server-list` so callers cannot bypass it accidentally. +- Use `/server-list.lock` and no other lock filename. +- Use exclusive file creation to acquire the lock. +- Use a `2000` millisecond bounded wait and fail fast with structured error data when the lock cannot be acquired. +- Use lock metadata with pid and lock id to support safe release and stale cleanup. +- Treat malformed `server-list.lock` metadata as active until timeout and then fail fast. +- Keep reads lock-free and independent from `server-list.lock`. +- Read current entries inside the write lock before applying append, remove, or prune transforms. +- Normalize exact duplicate `pid port` entries while preserving first-seen order. +- Keep `rewrite-entries!` public only as a locked full-replacement helper. +- Replace full stale-snapshot rewrites in discovery with targeted stale-entry removal. +- Write through a temporary file and atomic rename so readers do not see partial writes. +- Preserve current public call-site simplicity in `frontend.worker.db-worker-node`. + +## Question + +No open questions remain for this plan. + +The write-lock timeout is `2000` milliseconds. + +Malformed `server-list.lock` metadata fails fast after the normal timeout and is not deleted speculatively. + +`rewrite-entries!` remains public only as a locked full-replacement helper, while production code should prefer mutation-specific helpers. + +Exact duplicate `pid port` entries are normalized away while preserving first-seen order. + +--- diff --git a/src/main/logseq/cli/server.cljs b/src/main/logseq/cli/server.cljs index 87ba28bc49..d073a7d991 100644 --- a/src/main/logseq/cli/server.cljs +++ b/src/main/logseq/cli/server.cljs @@ -247,9 +247,12 @@ (update :owner-source normalize-owner-source))})) (p/catch (fn [_] {:entry entry :retain? true}))))))) - retained-results (filter :retain? results) - cleaned-entries (mapv :entry retained-results) - _ (server-list/rewrite-entries! path cleaned-entries)] + retained-results (filterv :retain? results) + stale-entries (->> results + (remove :retain?) + (mapv :entry)) + _ (when (seq stale-entries) + (server-list/remove-entries! path stale-entries))] (->> retained-results (keep :server) vec)))) diff --git a/src/main/logseq/db_worker/server_list.cljs b/src/main/logseq/db_worker/server_list.cljs index c4cbd0bb18..cf469b3df1 100644 --- a/src/main/logseq/db_worker/server_list.cljs +++ b/src/main/logseq/db_worker/server_list.cljs @@ -4,12 +4,21 @@ ["fs" :as fs] ["path" :as node-path])) +(def ^:private write-lock-timeout-ms 2000) +(def ^:private write-lock-poll-interval-ms 25) + (defn path [root-dir-path] (when-not (seq root-dir-path) (throw (js/Error. "root-dir is required"))) (node-path/join root-dir-path "server-list")) +(defn lock-path + [file-path] + (when-not (seq file-path) + (throw (js/Error. "server-list file path is required"))) + (node-path/join (node-path/dirname file-path) "server-list.lock")) + (defn- parse-int [value] (when (re-matches #"\d+" value) @@ -28,38 +37,223 @@ (defn read-entries [file-path] - (if (and (seq file-path) (fs/existsSync file-path)) - (->> (.toString (fs/readFileSync file-path) "utf8") - string/split-lines - (keep parse-line) - vec) + (if (seq file-path) + (try + (->> (.toString (fs/readFileSync file-path) "utf8") + string/split-lines + (keep parse-line) + vec) + (catch :default e + (if (= "ENOENT" (.-code e)) + [] + (throw e)))) [])) +(defn- entry-key + [{:keys [pid port]}] + [pid port]) + +(defn- valid-entry? + [{:keys [pid port]}] + (and (pos-int? pid) (pos-int? port))) + +(defn- normalize-entries + [entries] + (:entries + (reduce (fn [{:keys [seen] :as acc} entry] + (let [key (entry-key entry)] + (if (and (valid-entry? entry) + (not (contains? seen key))) + (-> acc + (update :seen conj key) + (update :entries conj {:pid (:pid entry) + :port (:port entry)})) + acc))) + {:seen #{} + :entries []} + (or entries [])))) + +(defn- payload-for-entries + [entries] + (if (seq entries) + (str (string/join "\n" (map (fn [{:keys [pid port]}] + (str pid " " port)) + entries)) + "\n") + "")) + +(defn- sleep-sync! + [ms] + (let [shared (js/SharedArrayBuffer. 4) + view (js/Int32Array. shared)] + (js/Atomics.wait view 0 0 ms))) + +(defn- process-status + [pid] + (try + (.kill js/process pid 0) + :alive + (catch :default e + (case (.-code e) + "ESRCH" :not-found + "EPERM" :no-permission + :unknown)))) + +(defn- read-lock-metadata + [lock-file] + (try + (let [raw (.toString (fs/readFileSync lock-file) "utf8")] + (try + {:raw raw + :metadata (js->clj (js/JSON.parse raw) :keywordize-keys true)} + (catch :default e + {:raw raw + :parse-error e}))) + (catch :default e + (if (= "ENOENT" (.-code e)) + nil + {:read-error e})))) + +(defn- lock-stale? + [lock-info] + (let [pid (get-in lock-info [:metadata :pid])] + (and (pos-int? pid) + (= :not-found (process-status pid))))) + +(defn- unlink-if-exists! + [file-path] + (try + (fs/unlinkSync file-path) + (catch :default e + (when-not (= "ENOENT" (.-code e)) + (throw e))))) + +(defn- lock-timeout-error + [file-path lock-file lock-info] + (ex-info (str "Timed out acquiring server-list lock: " lock-file) + (cond-> {:code :server-list-lock-timeout + :file-path file-path + :lock-path lock-file} + (:metadata lock-info) (assoc :lock-metadata (:metadata lock-info)) + (:raw lock-info) (assoc :lock-raw (:raw lock-info))))) + +(defn- write-lock-metadata! + [fd metadata] + (fs/writeFileSync fd (js/JSON.stringify (clj->js metadata)) "utf8")) + +(defn- acquire-write-lock! + [file-path] + (fs/mkdirSync (node-path/dirname file-path) #js {:recursive true}) + (let [lock-file (lock-path file-path) + deadline (+ (js/Date.now) write-lock-timeout-ms)] + (loop [] + (let [lock-id (str (random-uuid)) + metadata {:pid (.-pid js/process) + :lock-id lock-id + :created-at (.toISOString (js/Date.))} + result (try + (let [fd (fs/openSync lock-file "wx") + write-error (try + (write-lock-metadata! fd metadata) + nil + (catch :default e + e))] + (fs/closeSync fd) + (when write-error + (unlink-if-exists! lock-file) + (throw write-error)) + {:file-path file-path + :lock-path lock-file + :metadata metadata}) + (catch :default e + (if (= "EEXIST" (.-code e)) + (let [lock-info (read-lock-metadata lock-file)] + (when (lock-stale? lock-info) + (unlink-if-exists! lock-file)) + (if (>= (js/Date.now) deadline) + (throw (lock-timeout-error file-path lock-file lock-info)) + ::retry)) + (throw e))))] + (if (= ::retry result) + (do + (sleep-sync! write-lock-poll-interval-ms) + (recur)) + result))))) + +(defn- release-write-lock! + [{lock-file :lock-path metadata :metadata}] + (let [lock-info (read-lock-metadata lock-file)] + (when (= (:lock-id metadata) (get-in lock-info [:metadata :lock-id])) + (unlink-if-exists! lock-file)))) + +(defn- with-write-lock! + [file-path f] + (let [lock (acquire-write-lock! file-path)] + (try + (f) + (finally + (release-write-lock! lock))))) + +(defn- write-entries-unlocked! + [file-path entries] + (let [dir (node-path/dirname file-path) + tmp-file (node-path/join dir (str "server-list." + (.-pid js/process) + "." + (random-uuid) + ".tmp")) + payload (payload-for-entries entries)] + (try + (fs/writeFileSync tmp-file payload "utf8") + (fs/renameSync tmp-file file-path) + (catch :default e + (try + (fs/rmSync tmp-file #js {:force true}) + (catch :default _)) + (throw e))))) + +(defn update-entries! + [file-path transform] + (when (seq file-path) + (with-write-lock! + file-path + (fn [] + (let [entries (read-entries file-path) + next-entries (normalize-entries (transform entries))] + (write-entries-unlocked! file-path next-entries) + next-entries))))) + (defn rewrite-entries! [file-path entries] (when (seq file-path) - (fs/mkdirSync (node-path/dirname file-path) #js {:recursive true}) - (let [payload (if (seq entries) - (str (string/join "\n" (map (fn [{:keys [pid port]}] - (str pid " " port)) - entries)) - "\n") - "")] - (fs/writeFileSync file-path payload "utf8")))) + (update-entries! file-path (constantly entries)) + nil)) (defn append-entry! [file-path {:keys [pid port] :as entry}] (when (and (seq file-path) (pos-int? pid) (pos-int? port)) - (fs/mkdirSync (node-path/dirname file-path) #js {:recursive true}) - (fs/appendFileSync file-path (str pid " " port "\n") "utf8") + (update-entries! file-path #(conj (vec %) {:pid pid :port port})) entry)) (defn remove-entry! [file-path {:keys [pid port]}] (when (seq file-path) - (let [entries (->> (read-entries file-path) - (remove (fn [entry] - (and (= pid (:pid entry)) - (= port (:port entry))))) - vec)] - (rewrite-entries! file-path entries)))) + (update-entries! + file-path + (fn [entries] + (remove (fn [entry] + (and (= pid (:pid entry)) + (= port (:port entry)))) + entries))) + nil)) + +(defn remove-entries! + [file-path entries-to-remove] + (when (seq file-path) + (let [keys-to-remove (set (map entry-key (normalize-entries entries-to-remove)))] + (update-entries! + file-path + (fn [entries] + (remove (fn [entry] + (contains? keys-to-remove (entry-key entry))) + entries)))))) diff --git a/src/test/logseq/cli/server_test.cljs b/src/test/logseq/cli/server_test.cljs index 30c5b9f772..e1572a48d3 100644 --- a/src/test/logseq/cli/server_test.cljs +++ b/src/test/logseq/cli/server_test.cljs @@ -13,6 +13,7 @@ [logseq.cli.test-helper :as test-helper] [logseq.common.version :as version] [logseq.db-worker.daemon :as daemon] + [logseq.db-worker.server-list :as server-list] [promesa.core :as p])) (deftest spawn-server-omits-host-and-port-flags @@ -969,6 +970,34 @@ (is false (str "unexpected error: " e)))) (p/finally done))))) +(deftest list-servers-preserves-concurrent-server-list-writes + (async done + (let [root-dir (node-helper/create-tmp-dir "cli-server-list-race") + config-path (node-path/join root-dir "cli.edn") + server-list-file (cli-config/server-list-path root-dir) + stale-entry {:pid 999999 :port 65535} + live-entry {:pid (.-pid js/process) :port 65432} + appended? (atom false)] + (fs/writeFileSync server-list-file + (str (:pid stale-entry) " " (:port stale-entry) "\n") + "utf8") + (-> (p/with-redefs [daemon/pid-status (fn [pid] + (when (and (= (:pid stale-entry) pid) + (not @appended?)) + (reset! appended? true) + (server-list/append-entry! server-list-file live-entry)) + :not-found)] + (cli-server/list-servers {:root-dir root-dir + :config-path config-path})) + (p/then (fn [servers] + (is (empty? servers)) + (is @appended?) + (is (= [live-entry] + (server-list/read-entries server-list-file))))) + (p/catch (fn [e] + (is false (str "unexpected error: " e)))) + (p/finally done))))) + (deftest cleanup-revision-mismatched-servers-kills-only-cli-owned-targets (async done (let [stop-calls (atom [])] diff --git a/src/test/logseq/db_worker/server_list_test.cljs b/src/test/logseq/db_worker/server_list_test.cljs index 2f79364c82..7188a7e7fc 100644 --- a/src/test/logseq/db_worker/server_list_test.cljs +++ b/src/test/logseq/db_worker/server_list_test.cljs @@ -1,5 +1,7 @@ (ns logseq.db-worker.server-list-test - (:require [cljs.test :refer [deftest is]] + (:require ["fs" :as fs] + [cljs.test :refer [deftest is]] + [frontend.test.node-helper :as node-helper] [logseq.db-worker.server-list :as server-list])) (deftest path-derives-server-list-from-root-dir @@ -10,3 +12,69 @@ (is (thrown-with-msg? js/Error #"root-dir is required" (server-list/path nil)))) + +(deftest server-list-lock-path-derives-sibling-lock-file + (is (= "/tmp/logseq-root/server-list.lock" + (server-list/lock-path "/tmp/logseq-root/server-list")))) + +(deftest read-entries-ignores-server-list-lock + (let [root-dir (node-helper/create-tmp-dir "server-list-read-lock-free") + file-path (server-list/path root-dir) + lock-file (server-list/lock-path file-path)] + (fs/writeFileSync file-path "123 456\n" "utf8") + (fs/writeFileSync lock-file "locked" "utf8") + (is (= [{:pid 123 :port 456}] + (server-list/read-entries file-path))) + (is (fs/existsSync lock-file)) + (is (= "locked" + (.toString (fs/readFileSync lock-file) "utf8"))))) + +(deftest append-entry-deduplicates-valid-entry-under-update + (let [root-dir (node-helper/create-tmp-dir "server-list-append-dedupe") + file-path (server-list/path root-dir) + entry {:pid 123 :port 456}] + (is (= entry (server-list/append-entry! file-path entry))) + (is (= entry (server-list/append-entry! file-path entry))) + (is (= "123 456\n" + (.toString (fs/readFileSync file-path) "utf8"))) + (is (= [entry] + (server-list/read-entries file-path))))) + +(deftest remove-entry-preserves-unrelated-current-entry + (let [root-dir (node-helper/create-tmp-dir "server-list-remove-preserve") + file-path (server-list/path root-dir)] + (server-list/rewrite-entries! file-path [{:pid 111 :port 222} + {:pid 333 :port 444}]) + (server-list/remove-entry! file-path {:pid 111 :port 222}) + (is (= "333 444\n" + (.toString (fs/readFileSync file-path) "utf8"))) + (is (= [{:pid 333 :port 444}] + (server-list/read-entries file-path))))) + +(deftest append-entry-repairs-stale-server-list-lock + (let [root-dir (node-helper/create-tmp-dir "server-list-stale-lock") + file-path (server-list/path root-dir) + lock-file (server-list/lock-path file-path)] + (fs/writeFileSync lock-file + (js/JSON.stringify (clj->js {:pid 999999 + :lock-id "stale-lock"})) + "utf8") + (server-list/append-entry! file-path {:pid 123 :port 456}) + (is (= [{:pid 123 :port 456}] + (server-list/read-entries file-path))) + (is (not (fs/existsSync lock-file))))) + +(deftest append-entry-times-out-on-malformed-server-list-lock + (let [root-dir (node-helper/create-tmp-dir "server-list-malformed-lock") + file-path (server-list/path root-dir) + lock-file (server-list/lock-path file-path)] + (fs/writeFileSync lock-file "not-json" "utf8") + (try + (server-list/append-entry! file-path {:pid 123 :port 456}) + (is false "expected server-list lock timeout") + (catch :default e + (is (= :server-list-lock-timeout (:code (ex-data e)))) + (is (= file-path (:file-path (ex-data e)))) + (is (= lock-file (:lock-path (ex-data e)))) + (is (= "not-json" + (.toString (fs/readFileSync lock-file) "utf8")))))))