Files
logseq/docs/agent-guide/db-worker/003-server-list-write-lock.md

28 KiB

db-worker-node Server List Write Lock Implementation Plan

Goal: Make db-worker-node server-list reads and writes safe across multiple CLI and Electron processes without blocking readers.

Architecture: Keep server-list as the shared discovery file under the configured root-dir, and serialize every mutation with a sibling server-list.lock file. Architecture: Reads remain lock-free and parse the latest complete server-list content, while writes acquire the lock, read the current file inside the critical section, write through a temporary file, and atomically rename it into place.

Tech Stack: ClojureScript, Node.js fs and path, Promesa, existing logseq.db-worker.server-list, existing logseq.cli.server, existing frontend.worker.db-worker-node lifecycle code.

Related: Builds on docs/agent-guide/db-worker/001-db-worker-node-restart-on-version-mismatch.md. Related: Relates to docs/agent-guide/db-worker/002-desktop-db-worker-request-cap-switch-graph.md. Related: Relates to docs/cli/logseq-cli.md server lifecycle documentation.

Problem statement

The current server-list file is a root-dir level text file used to discover running db-worker-node servers.

The current file path is derived by /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs through logseq.db-worker.server-list/path, which returns <root-dir>/server-list.

Each line currently contains one pid port pair.

logseq.db-worker.server-list/read-entries currently checks existence, reads the whole file with fs/readFileSync, splits lines, parses valid pid port pairs, ignores invalid lines, and returns a vector.

logseq.db-worker.server-list/append-entry! currently creates the parent directory and calls fs/appendFileSync with one line.

logseq.db-worker.server-list/remove-entry! currently reads all entries, removes the matching pid port, and calls rewrite-entries!.

logseq.db-worker.server-list/rewrite-entries! currently creates the parent directory and overwrites the entire file with fs/writeFileSync.

/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs appends the current process entry after the HTTP server starts listening.

/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs removes the current process entry when the daemon runtime clears state during stop.

/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs discovers servers by reading server-list, checking process and /healthz state for each entry, then rewriting the list with retained entries.

The current read-modify-write cleanup path can lose data when another process appends or removes entries between the initial read and the later rewrite.

A concrete lost-update sequence is possible today.

  1. Process A calls discover-servers and reads stale entry S from server-list.
  2. Process B starts a new db-worker-node and appends live entry B to server-list.
  3. Process A finishes stale-entry cleanup using its old snapshot and rewrites server-list without entry B.
  4. Later discovery cannot see process B even though process B is alive.

The requested fix is to add a lock when writing server-list.

The requested lock file name is exactly server-list.lock.

The requested lock file is only used by write operations.

Read operations must not acquire, wait for, or delete server-list.lock.

Testing Plan

I will use @Test-Driven Development (TDD) before changing implementation behavior.

I will add unit tests in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs for lock path derivation and lock-free reads.

I will add a unit test in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs proving read-entries can read server-list while server-list.lock already exists.

I will add behavior tests in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs for append-entry! and remove-entry! preserving unrelated entries through the new locked update path.

I will add a regression test in /Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs that simulates stale cleanup racing with a concurrent append.

The CLI regression test should create an initial stale entry, make the stale check append a live entry before cleanup finishes, call logseq.cli.server/list-servers, and assert that the live entry remains in server-list after stale cleanup.

This test should fail on the current implementation because discover-servers rewrites from a stale snapshot.

I will add a regression test in /Users/rcmerci/gh-repos/logseq/src/test/frontend/worker/db_worker_node_test.cljs only if the public append-entry! or remove-entry! return contract changes.

The preferred implementation should keep those return contracts stable, so existing daemon registration tests should continue to cover daemon integration.

I will run the focused RED tests before implementation and confirm they fail because of missing write locking or stale-snapshot cleanup behavior.

I will run the same focused tests after implementation and confirm they pass.

I will run the existing daemon and CLI server-list tests to ensure the lifecycle still registers, unregisters, lists, and lazily cleans stale entries.

NOTE: I will write all tests before I add any implementation behavior.

Current implementation snapshot

Concern Current location Current behavior Risk
Path derivation /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs path returns <root-dir>/server-list. No lock path exists.
Reads /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs read-entries reads without locking and ignores invalid lines. Existence check plus read has a small TOCTOU window.
Append /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs append-entry! appends one line without coordination. Append can race with a full rewrite.
Remove /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs remove-entry! reads and rewrites without coordination. Remove can overwrite a concurrent append.
Full cleanup rewrite /Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs discover-servers rewrites retained entries after async checks. Cleanup can overwrite changes made after the initial read.
Daemon publish /Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs A started daemon calls append-entry!. Publish can be lost by concurrent cleanup.
Daemon unpublish /Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs A stopped daemon calls remove-entry!. Unpublish can lose unrelated entries.

Target behavior

All writes to <root-dir>/server-list must be serialized by <root-dir>/server-list.lock.

The lock file basename must be exactly server-list.lock.

The lock file must live in the same directory as server-list.

Reads must remain lock-free.

Reads must not block when server-list.lock exists.

Reads must not remove stale server-list.lock files.

Writes must acquire server-list.lock before reading current entries for a mutation.

Writes must release server-list.lock in a finally path after the file mutation finishes or fails.

Writes must fail fast with a structured error after a bounded lock acquisition timeout.

Writes must not silently skip server-list publication or cleanup when lock acquisition fails.

Stale cleanup must remove only entries that the cleanup pass proved stale.

Stale cleanup must not rewrite the file from an old snapshot that can exclude newer entries.

Readers must only ever observe either the previous complete server-list file or the next complete server-list file.

The implementation must not add backward compatibility for old lock names or old --server-list-file arguments.

Design

1. Add server-list lock path derivation

Add logseq.db-worker.server-list/lock-path in /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.

lock-path should accept the existing server-list file path and return (node-path/join (node-path/dirname file-path) "server-list.lock").

lock-path should reject a missing or blank file path with a clear js/Error or ex-info error.

Do not derive the lock path as server-list.cljs.lock, server-list.locked, or <server-list>.lock if that can produce a basename other than server-list.lock.

2. Add a write-lock helper

Add a private helper such as with-write-lock! in /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.

The helper should create the parent directory before acquiring the lock.

The helper should acquire server-list.lock with exclusive file creation through Node fs.openSync using an exclusive mode such as "wx".

The helper should write small metadata into the lock file after acquisition.

The metadata should include the owning process pid and a unique lock id.

The metadata makes stale-lock cleanup safer and prevents a releasing writer from deleting a lock owned by a later writer.

The helper should close the lock file descriptor after writing metadata.

The helper should run the supplied write operation only after acquiring the lock.

The helper should release the lock in finally by reading the lock metadata and unlinking only when the lock id still matches the current writer.

The helper should tolerate ENOENT while releasing because another process may have removed a stale lock after a crash.

The helper should treat EEXIST as normal contention while waiting for the lock.

The helper should retry acquisition on a short fixed interval until timeout.

Use a small poll interval such as 25 milliseconds.

Use a bounded timeout of 2000 milliseconds.

If timeout expires and the lock owner process is still alive or cannot be inspected, throw ex-info with :code :server-list-lock-timeout and include :file-path, :lock-path, and lock metadata when available.

If timeout sees a stale lock whose metadata pid is not alive, remove that stale lock and retry acquisition.

If the lock file exists but metadata cannot be parsed, treat it as active until timeout and then fail fast rather than guessing.

Do not use js/Buffer in this helper.

3. Make writes atomic

Add a private write-entries-unlocked! helper in /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.

The helper should write the next payload to a unique temporary file in the same directory as server-list.

The helper should then call fs/renameSync from the temporary file to server-list.

The temporary file name should include server-list, the current pid, and a unique suffix.

The temporary file must not be named server-list.lock.

The helper should remove the temporary file in a best-effort catch path if writing or renaming fails.

Atomic rename keeps lock-free readers from observing a partially truncated server-list file.

4. Centralize mutations through an update helper

Add a public helper such as update-entries! in /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.

update-entries! should acquire server-list.lock.

update-entries! should call read-entries after the lock is acquired.

update-entries! should apply a pure transform to the current entries.

update-entries! should normalize the result before writing.

Normalization should keep only valid positive integer :pid and :port pairs.

Normalization should remove exact duplicate pid port entries while preserving first-seen order.

update-entries! should write with write-entries-unlocked!.

update-entries! should return the normalized entries that were written.

append-entry! should become a thin wrapper around update-entries!.

append-entry! should append the entry only when both pid and port are positive integers.

append-entry! should avoid adding a duplicate exact pid port entry.

append-entry! should keep returning the appended entry for valid input to preserve current callers.

remove-entry! should become a thin wrapper around update-entries!.

remove-entry! should remove only exact matches for the target pid port pair.

remove-entry! should preserve unrelated entries added by other writers.

rewrite-entries! should remain public only as a locked full replacement helper for compatibility with explicit full-replacement callers.

Production code must not use rewrite-entries! with a stale snapshot from before asynchronous work.

Prefer mutation-specific helpers such as update-entries! and remove-entries! for normal production server-list changes.

5. Replace stale cleanup rewrite in discovery

Update /Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs.

In logseq.cli.server/discover-servers, keep the existing lock-free read for discovery input.

Keep health checks outside the write lock because health checks are asynchronous and can take up to the request timeout.

Compute the set of entries that should be removed because they are proven stale.

After checks finish, call a server-list helper such as remove-entries! or update-entries! to remove only that stale set from the current file under server-list.lock.

Do not rewrite server-list from cleaned-entries computed from the initial snapshot.

Do not hold server-list.lock while calling pid-status or /healthz.

Return discovered server payloads from the original discovery pass as today.

The purpose of the write lock is to protect file mutations, not to serialize discovery network work.

6. Keep daemon call sites simple

Keep /Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs call sites using server-list/append-entry! on publish and server-list/remove-entry! on stop if the public signatures stay the same.

Do not add a separate lock concern in the daemon namespace.

The server-list namespace should own all lock-file details.

7. Improve lock-free reads without changing semantics

Update read-entries in /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs to avoid a separate fs/existsSync check before reading.

Read the file directly and catch only missing-file errors such as ENOENT to return [].

Continue returning [] for blank or missing paths only if that is the existing accepted behavior.

Continue ignoring invalid lines unless implementation review decides to fail fast for corrupt server-list content.

Do not inspect server-list.lock from read-entries.

This removes one filesystem call from the common read path and avoids an existence-check TOCTOU race.

Integration overview

CLI or Electron process
  |
  | lock-free read
  v
<root-dir>/server-list
  |
  | parse pid port entries
  v
pid-status and /healthz checks
  |
  | write mutation only
  v
acquire <root-dir>/server-list.lock
  |
  | read current server-list inside lock
  | apply append/remove/prune transform
  | write temp file
  | rename temp file to server-list
  v
release <root-dir>/server-list.lock

Affected files

File Planned change
/Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs Add lock-path, write-lock helper, atomic write helper, update helper, locked append/remove/prune behavior, and optimized read behavior.
/Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs Replace stale full rewrite in discover-servers with locked removal of only proven stale entries.
/Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs Prefer no code change, but verify existing publish and unpublish call sites use the updated locked helpers.
/Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs Add focused server-list lock, read, append, remove, dedupe, and stale-lock tests.
/Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs Add a regression test for discovery cleanup preserving entries added during cleanup.
/Users/rcmerci/gh-repos/logseq/src/test/frontend/worker/db_worker_node_test.cljs Keep existing daemon registration tests green, and add tests only if signatures or return contracts change.
/Users/rcmerci/gh-repos/logseq/docs/cli/logseq-cli.md Update only if user-facing server-list troubleshooting needs to mention server-list.lock.

Implementation tasks

Phase 0: Preparation

  1. Read /Users/rcmerci/gh-repos/logseq/AGENTS.md.
  2. Check for directory-specific AGENTS.md files before editing each target file.
  3. Load @Test-Driven Development (TDD) before implementation.
  4. Re-read /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.
  5. Re-read /Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs around discover-servers.
  6. Re-read /Users/rcmerci/gh-repos/logseq/src/main/frontend/worker/db_worker_node.cljs around server-list append and remove call sites.

Phase 1: Write RED tests

  1. Add server-list-lock-path-derives-sibling-lock-file in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs.
  2. Assert (server-list/lock-path "/tmp/logseq-root/server-list") returns /tmp/logseq-root/server-list.lock.
  3. Add read-entries-ignores-server-list-lock in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs.
  4. Create both server-list and server-list.lock in a temporary root-dir.
  5. Assert server-list/read-entries returns parsed entries without waiting for or deleting server-list.lock.
  6. Add append-entry-deduplicates-valid-entry-under-update in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs.
  7. Assert two identical appends leave one pid port line.
  8. Add remove-entry-preserves-unrelated-current-entry in /Users/rcmerci/gh-repos/logseq/src/test/logseq/db_worker/server_list_test.cljs.
  9. Assert removing one entry does not remove another existing entry.
  10. Add list-servers-preserves-concurrent-server-list-writes in /Users/rcmerci/gh-repos/logseq/src/test/logseq/cli/server_test.cljs.
  11. In that test, write one stale entry to server-list.
  12. In that test, redefine daemon/pid-status so checking the stale pid appends a second live entry to server-list before returning :not-found.
  13. In that test, call cli-server/list-servers.
  14. In that test, assert the final server-list still contains the second live entry.
  15. Run the focused tests and confirm the new race regression test fails on current behavior.

Expected RED command examples:

bb dev:test -v logseq.db-worker.server-list-test/server-list-lock-path-derives-sibling-lock-file
bb dev:test -v logseq.db-worker.server-list-test/read-entries-ignores-server-list-lock
bb dev:test -v logseq.cli.server-test/list-servers-preserves-concurrent-server-list-writes

The first command should fail until lock-path exists.

The race regression command should fail until discovery cleanup stops rewriting stale snapshots.

Phase 2: Implement locked server-list mutations

  1. Add lock-path to /Users/rcmerci/gh-repos/logseq/src/main/logseq/db_worker/server_list.cljs.
  2. Add private constants for write-lock timeout and poll interval.
  3. Set the write-lock timeout constant to 2000 milliseconds.
  4. Set the write-lock poll interval constant to 25 milliseconds unless tests show this is too aggressive.
  5. Add a private lock metadata writer.
  6. Add a private lock metadata reader.
  7. Add a private stale-lock predicate using current lock metadata and process status.
  8. Use a local process-status helper rather than depending on logseq.db-worker.daemon if adding the dependency would create a cycle.
  9. Add acquire-write-lock! with exclusive file creation.
  10. Add release-write-lock! with lock-id verification.
  11. Add with-write-lock! that wraps acquire, body execution, and release.
  12. Add read-entries-unlocked only if it keeps the public read-entries simple and testable.
  13. Update read-entries to read directly and catch missing-file errors.
  14. Add entry-key or equivalent exact pid port identity helper.
  15. Add normalize-entries to validate entries and remove exact duplicate pid port entries while preserving order.
  16. Add payload-for-entries or equivalent formatting helper.
  17. Add write-entries-unlocked! that writes through a temporary file and renames to server-list.
  18. Add update-entries! that reads current entries inside the lock and writes normalized transformed entries.
  19. Reimplement append-entry! through update-entries!.
  20. Reimplement remove-entry! through update-entries!.
  21. Keep rewrite-entries! public as a locked full-replacement helper.
  22. Add remove-entries! or prune-entries! if this makes discover-servers clearer.
  23. Keep public return values compatible for current daemon call sites unless a test proves a change is needed.

Phase 3: Replace stale discovery cleanup

  1. Update logseq.cli.server/discover-servers in /Users/rcmerci/gh-repos/logseq/src/main/logseq/cli/server.cljs.
  2. Keep entries as the discovery input read before health checks.
  3. Keep the existing p/all health-check structure.
  4. Compute stale-entries as entries whose result has :retain? false.
  5. Call server-list/remove-entries! or server-list/update-entries! only when stale-entries is non-empty.
  6. Remove the current cleaned-entries full rewrite based on the initial snapshot.
  7. Preserve existing retained-results and returned :server values.
  8. Verify list-servers-lazily-cleans-stale-server-list-entries still passes.
  9. Verify ensure-server-preserves-live-server-list-entry-after-transient-healthz-failure still passes.

Phase 4: Verify integration behavior

  1. Run the new focused server-list tests.
  2. Run the new focused CLI race regression test.
  3. Run existing CLI server-list cleanup tests.
  4. Run existing db-worker-node daemon registration tests.
  5. Run the full lint and unit test suite when the focused tests are green.

Expected GREEN command examples:

bb dev:test -v logseq.db-worker.server-list-test
bb dev:test -v logseq.cli.server-test/list-servers-preserves-concurrent-server-list-writes
bb dev:test -v logseq.cli.server-test/list-servers-lazily-cleans-stale-server-list-entries
bb dev:test -v logseq.cli.server-test/ensure-server-preserves-live-server-list-entry-after-transient-healthz-failure
bb dev:test -v frontend.worker.db-worker-node-test/db-worker-node-start-daemon-registers-and-unregisters-derived-server-list-entry
bb dev:lint-and-test

Each focused command should pass without warnings.

bb dev:lint-and-test should pass before opening a PR.

Edge cases

A reader may run while a writer holds server-list.lock.

The reader must ignore the lock and return entries from the current complete server-list file.

A reader may run while a writer is replacing server-list through atomic rename.

The reader should see either the old complete file or the new complete file.

A writer may crash after acquiring server-list.lock.

A later writer should remove the stale lock only when metadata proves the owner pid is gone.

A writer may see server-list.lock from a live process.

The writer should wait for the 2000 millisecond bounded timeout and then fail fast with :server-list-lock-timeout.

A process may not have permission to inspect the pid in the lock file.

Treat that lock as active until timeout rather than deleting it.

A lock file may contain malformed metadata.

Treat malformed metadata as active until timeout and then fail fast rather than deleting it speculatively.

Two processes may append the same pid port entry.

The normalized mutation path should avoid duplicate exact entries.

Discovery may prove entry S stale while another process appends entry B.

The cleanup mutation must remove only S from the current file and preserve B.

Discovery may fail /healthz transiently for a live process.

The current behavior retains entries after transient health failure, and the new implementation must preserve that behavior.

A daemon may stop and remove its entry while discovery is checking health.

The locked remove path should remove only the daemon entry and preserve unrelated entries.

The root-dir directory may not exist before the first write.

The write path should create the directory before lock acquisition.

The root-dir may be missing during read.

The read path should return an empty vector for a missing server-list file.

Acceptance criteria

server-list.lock is the only lock filename used for server-list writes.

Every production mutation of server-list acquires server-list.lock first.

No production read of server-list acquires or waits for server-list.lock.

Write-lock acquisition times out after 2000 milliseconds with a structured :server-list-lock-timeout error.

Malformed server-list.lock metadata is not deleted speculatively and fails fast after the timeout.

append-entry! does not lose entries when another process is cleaning stale entries.

remove-entry! does not lose unrelated entries.

discover-servers no longer rewrites the list from a stale pre-health-check snapshot.

A stale cleanup pass preserves entries appended after the pass started.

Exact duplicate pid port entries are normalized away while preserving first-seen order.

rewrite-entries! remains public only as a locked full-replacement helper.

Readers never intentionally observe temporary files.

Focused tests for logseq.db-worker.server-list-test pass.

Focused tests for logseq.cli.server-test server-list discovery behavior pass.

Existing frontend.worker.db-worker-node-test daemon registration behavior passes.

bb dev:lint-and-test passes before PR submission.

Testing Details

The server-list unit tests should verify externally visible file behavior rather than private helper details wherever possible.

The lock path test is acceptable because the requested lock filename is part of the behavior contract.

The lock-free read test should create server-list.lock and prove reads still return entries without deleting the lock.

The CLI race regression should test the actual list-servers or discover-servers behavior and prove a live entry appended during stale cleanup survives.

The daemon registration test should verify the existing start and stop behavior still publishes and unpublishes the current process entry through the updated helpers.

Implementation Details

  • Keep the lock implementation inside logseq.db-worker.server-list so callers cannot bypass it accidentally.
  • Use <root-dir>/server-list.lock and no other lock filename.
  • Use exclusive file creation to acquire the lock.
  • Use a 2000 millisecond bounded wait and fail fast with structured error data when the lock cannot be acquired.
  • Use lock metadata with pid and lock id to support safe release and stale cleanup.
  • Treat malformed server-list.lock metadata as active until timeout and then fail fast.
  • Keep reads lock-free and independent from server-list.lock.
  • Read current entries inside the write lock before applying append, remove, or prune transforms.
  • Normalize exact duplicate pid port entries while preserving first-seen order.
  • Keep rewrite-entries! public only as a locked full-replacement helper.
  • Replace full stale-snapshot rewrites in discovery with targeted stale-entry removal.
  • Write through a temporary file and atomic rename so readers do not see partial writes.
  • Preserve current public call-site simplicity in frontend.worker.db-worker-node.

Question

No open questions remain for this plan.

The write-lock timeout is 2000 milliseconds.

Malformed server-list.lock metadata fails fast after the normal timeout and is not deleted speculatively.

rewrite-entries! remains public only as a locked full-replacement helper, while production code should prefer mutation-specific helpers.

Exact duplicate pid port entries are normalized away while preserving first-seen order.