fix(fleet): guard mosaic fleet restart against tight-loop re-entry race
#680
Reference in New Issue
Block a user
Delete Branch "fix/fleet-restart-tightloop-guard"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Hardens
mosaic fleet restartagainst a tight-loop re-entry race (logged backlog item).mosaic fleet restartruns as a fresh process each invocation, issuingsystemctl --user restartfor the tmux holder and then each agent. Agent sessions live inside the holder's tmux, so restarting the holder tears them down. With no mutual exclusion, a second restart entering while the first is still mid-teardown —mosaic update --relaunch, a watchdog, or a hurried operator — interleaves: agents relaunch against a half-torn-down holder, fail, and tight-loop.Fix
A cross-process teardown-settle guard — an
O_CREAT|O_EXCLlock file at<mosaicHome>/fleet/run/restart.lock:RESTART_LOCK_STALE_MS, or corrupt/unparseable) is broken immediately rather than waited on.RESTART_LOCK_MAX_WAIT_MSthe lock is broken anyway, so a dead owner can never deadlock the fleet permanently.start/stop/statusare unchanged; the systemctl call sequence is identical (the lock is filesystem-only, adds no runner calls).Regression test (fails on
main)waits for an in-flight restart to clear before relaunching (re-entry guard)pre-creates a held lock and asserts the restart waits before issuing any systemctl command. Verified it fails on the unguarded code path (expected 0 to be greater than 0) and passes with the guard. Adds stale-lock-break and lock-release coverage.Gates
pnpm typecheck✅pnpm lint✅pnpm format:check✅pnpm test(mosaic) ✅ — 621 passed, fleet.spec 186 passed (4 restart tests)Incidental: pre-push unblock (separate commit)
The repo-wide pre-push
format:checkhook was red fleet-wide —docs/federation/TASKS.mdlanded via #678 with an unformatted markdown table, blocking every push offmain. A separate commit applies a pure prettier table-alignment reformat (no content change) so this branch can pass the hook without--no-verify. Flagging for awareness — root cause is upstream of this PR.🤖 Generated with Claude Code
mosaic fleet restartagainst tight-loop re-entry race 906620dd54REVIEW-OF-RECORD — REQUEST CHANGES (#680). BLOCKER: restart lock ownership is unsafe. In acquireRestartLock(), release() unconditionally unlinks restart.lock, and the stale/max-wait breaker calls that same release() even when this process does not own the current lock. This can delete another process's freshly acquired lock: e.g. A holds lock, B times out/stale-breaks and acquires a new lock, then A exits and its finally release() unlinks B's lock; C can then acquire and interleave while B is still restarting. Similar race exists with two stale-lock breakers that both read the stale file before one acquires. That violates the serialization/no tight-loop guarantee under exactly the re-entry pressure this PR is meant to harden. Please make release/break ownership-safe (e.g. token/owner identity and do not remove a lock that is no longer yours; stale break must not delete a newly-created owner) and add a regression test for that race.
SHOULD FIX: add explicit single-agent restart guard coverage. The code guards mosaic fleet restart , but current new tests exercise only full-fleet restart.
Reviewer evidence: fleet.spec 186/186 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts/docs/federation/TASKS.md. mosaic package typecheck is not usable evidence in this worktree due pre-existing fleet-backlog/@mosaicstack/db export errors unrelated to this PR. Ignored the docs/federation/TASKS.md prettier-only commit per route; review focused on fleet restart source.
df905b7a34to43ad813e0dREVIEW-OF-RECORD — REQUEST CHANGES (#680, head
43ad813e0d). BLOCKER: the v2 ownership-safe release fixes the old-owner-release case, but stale/max-waitbreakAndOwnRestartLock()is still not safe against concurrent breakers. It unconditionallyrename(tmpPath, lockPath)over whatever is currently atrestart.lock; there is no compare-against-the-owner-token-that-was-judged-stale. Race: R1 holds a stale lock; R2 and R3 both observe it as stale; R2 renames token R2 and returns/starts restart; R3 then renames token R3 over R2 and also returns/starts restart. Both think they own the lock, so the tight-loop interleave can still happen. The new three-restart regression only covers old R1 releasing after R2 owns the lock; it does not exercise two concurrent stale/max-wait breakers racing with each other. Please make stale/max-wait takeover conditional on the lock content/token still being the stale/timed-out owner observed before takeover (or use a lock directory / other atomic CAS-like pattern), and add a regression where two breakers race and only one proceeds while the other waits/re-evaluates.Verified positives: single-agent restart path now has guard coverage; old-owner release no longer drops the newer owner's lock; normal wait/stale/release tests pass. Reviewer verification on head
43ad813e: fleet.spec 188/188 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI is green, but merge should wait on the concurrent-breaker ownership fix.Stale/max-wait takeover was not safe against concurrent breakers: two breakers could both judge the lock stale and both proceed, re-introducing the tight-loop. POSIX/Node has no content- or inode-conditional unlink or rename, so "judge stale, then replace" can never be atomic with pure path ops. Serialize ALL lock transitions (acquire, release, takeover) under one short-lived registry mutex held only across a few fs ops, never across the restart itself. This makes check-then-mutate atomic, so exactly one breaker can take over a stale lock while the others wait and re-evaluate. The mutex itself uses mtime-based staleness (open('wx') creates an empty inode before the token is written; a content check would reap a lock that is still being acquired). The mutex populates-or-cleans-up on write failure so a half-created mutex never leaks. Regression coverage at two widths: a 2-breaker barrier test (exactly one takes over, the other waits) and the existing 3-breaker test (maxActive===1, distinct tokens, final lock released). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>REVIEW-OF-RECORD — APPROVE (#680, head
786762587d). Re-review confirms the prior concurrent-breaker blocker is closed. v3 serializes all restart-lock state transitions (free acquire, stale/timeout takeover, and token-checked release) behind the short-lived registry mutex, so the check-then-mutate sequence is atomic and two stale/max-wait breakers cannot both proceed. The mutex reclaim path uses mtime staleness and a reclaim lock, avoiding the empty-file/content-stale acquisition window and the file-leak/error path called out by Codex.Tests now cover the relevant races: normal re-entry wait before systemctl, stale lock break, sequential release, single-agent restart guarded, old owner release cannot drop a new owner, two-breaker barrier where one waits, and three-breaker concurrency with maxActive==1/distinct tokens/final lock removed. The acknowledged residual stat/unlink TOCTOU inherent to pure Node named-path locks is acceptable for this PR's scope; moving to flock would be a separate hardening design, not a blocker here.
Reviewer verification on head
78676258: fleet.spec 190/190 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI reported green by orchestrator. APPROVE.