fix(fleet): guard mosaic fleet restart against tight-loop re-entry race #680

Merged
jason.woltje merged 3 commits from fix/fleet-restart-tightloop-guard into main 2026-06-25 01:44:49 +00:00
Owner

Summary

Hardens mosaic fleet restart against a tight-loop re-entry race (logged backlog item).

mosaic fleet restart runs as a fresh process each invocation, issuing systemctl --user restart for the tmux holder and then each agent. Agent sessions live inside the holder's tmux, so restarting the holder tears them down. With no mutual exclusion, a second restart entering while the first is still mid-teardown — mosaic update --relaunch, a watchdog, or a hurried operator — interleaves: agents relaunch against a half-torn-down holder, fail, and tight-loop.

Fix

A cross-process teardown-settle guard — an O_CREAT|O_EXCL lock file at <mosaicHome>/fleet/run/restart.lock:

  • A re-entrant restart waits (bounded, injectable sleep) for the in-flight restart to release the lock — i.e. for clean shutdown to settle — before relaunching.
  • A stale lock left by a crashed owner (age > RESTART_LOCK_STALE_MS, or corrupt/unparseable) is broken immediately rather than waited on.
  • After RESTART_LOCK_MAX_WAIT_MS the lock is broken anyway, so a dead owner can never deadlock the fleet permanently.
  • Both the full-fleet and single-agent restart paths are guarded. start / stop / status are unchanged; the systemctl call sequence is identical (the lock is filesystem-only, adds no runner calls).

Regression test (fails on main)

waits for an in-flight restart to clear before relaunching (re-entry guard) pre-creates a held lock and asserts the restart waits before issuing any systemctl command. Verified it fails on the unguarded code path (expected 0 to be greater than 0) and passes with the guard. Adds stale-lock-break and lock-release coverage.

Gates

  • pnpm typecheck
  • pnpm lint
  • pnpm format:check
  • pnpm test (mosaic) 621 passed, fleet.spec 186 passed (4 restart tests)

Incidental: pre-push unblock (separate commit)

The repo-wide pre-push format:check hook was red fleet-widedocs/federation/TASKS.md landed via #678 with an unformatted markdown table, blocking every push off main. A separate commit applies a pure prettier table-alignment reformat (no content change) so this branch can pass the hook without --no-verify. Flagging for awareness — root cause is upstream of this PR.

🤖 Generated with Claude Code

## Summary Hardens `mosaic fleet restart` against a tight-loop re-entry race (logged backlog item). `mosaic fleet restart` runs as a **fresh process** each invocation, issuing `systemctl --user restart` for the tmux holder and then each agent. Agent sessions live *inside* the holder's tmux, so restarting the holder tears them down. With no mutual exclusion, a second restart entering while the first is still mid-teardown — `mosaic update --relaunch`, a watchdog, or a hurried operator — **interleaves**: agents relaunch against a half-torn-down holder, fail, and tight-loop. ## Fix A cross-process **teardown-settle guard** — an `O_CREAT|O_EXCL` lock file at `<mosaicHome>/fleet/run/restart.lock`: - A re-entrant restart **waits** (bounded, injectable sleep) for the in-flight restart to release the lock — i.e. for clean shutdown to settle — before relaunching. - A **stale** lock left by a crashed owner (age > `RESTART_LOCK_STALE_MS`, or corrupt/unparseable) is broken immediately rather than waited on. - After `RESTART_LOCK_MAX_WAIT_MS` the lock is broken anyway, so a dead owner can never deadlock the fleet permanently. - Both the full-fleet and single-agent restart paths are guarded. `start` / `stop` / `status` are unchanged; the systemctl call sequence is identical (the lock is filesystem-only, adds no runner calls). ## Regression test (fails on `main`) `waits for an in-flight restart to clear before relaunching (re-entry guard)` pre-creates a held lock and asserts the restart **waits before issuing any systemctl command**. Verified it fails on the unguarded code path (`expected 0 to be greater than 0`) and passes with the guard. Adds stale-lock-break and lock-release coverage. ## Gates - `pnpm typecheck` ✅ - `pnpm lint` ✅ - `pnpm format:check` ✅ - `pnpm test` (mosaic) ✅ — **621 passed**, fleet.spec **186 passed** (4 restart tests) ## Incidental: pre-push unblock (separate commit) The repo-wide pre-push `format:check` hook was **red fleet-wide** — `docs/federation/TASKS.md` landed via #678 with an unformatted markdown table, blocking every push off `main`. A separate commit applies a pure prettier table-alignment reformat (no content change) so this branch can pass the hook **without** `--no-verify`. Flagging for awareness — root cause is upstream of this PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
jason.woltje added 2 commits 2026-06-24 22:08:52 +00:00
`mosaic fleet restart` runs as a fresh process each invocation, issuing
`systemctl --user restart` for the tmux holder and then each agent. The
agent sessions live inside the holder's tmux, so restarting the holder
tears them down. With no mutual exclusion, a second restart entering
while the first is still mid-teardown (upgrade `--relaunch`, a watchdog,
or a hurried operator) interleaves: agents relaunch against a
half-torn-down holder, fail, and tight-loop.

Add a cross-process teardown-settle guard: a lock file under
`<mosaicHome>/fleet/run/restart.lock` acquired with O_CREAT|O_EXCL.
A re-entrant restart waits (bounded, injectable sleep) for the in-flight
restart to release the lock before relaunching, breaks a stale lock left
by a crashed owner, and after a max wait breaks the lock to avoid a
permanent deadlock. Both full-fleet and single-agent restart paths are
guarded; start/stop/status are unchanged.

Regression test reproduces the race: with an in-flight lock held, the
restart must wait before issuing any systemctl command — fails on the
unguarded code path, passes with the guard. Adds stale-lock-break and
lock-release coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chore(docs): prettier-format federation TASKS.md table (unblock pre-push)
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/pr/ci Pipeline failed
df905b7a34
`docs/federation/TASKS.md` landed via #678 with an unformatted markdown
table, leaving the repo-wide pre-push `format:check` gate red for every
branch off `main`. This is a pure prettier table-alignment reformat
(depends_on column re-padding) — no content change — applied so this
branch (and any other) can pass the pre-push hook without `--no-verify`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Owner

REVIEW-OF-RECORD — REQUEST CHANGES (#680). BLOCKER: restart lock ownership is unsafe. In acquireRestartLock(), release() unconditionally unlinks restart.lock, and the stale/max-wait breaker calls that same release() even when this process does not own the current lock. This can delete another process's freshly acquired lock: e.g. A holds lock, B times out/stale-breaks and acquires a new lock, then A exits and its finally release() unlinks B's lock; C can then acquire and interleave while B is still restarting. Similar race exists with two stale-lock breakers that both read the stale file before one acquires. That violates the serialization/no tight-loop guarantee under exactly the re-entry pressure this PR is meant to harden. Please make release/break ownership-safe (e.g. token/owner identity and do not remove a lock that is no longer yours; stale break must not delete a newly-created owner) and add a regression test for that race.

SHOULD FIX: add explicit single-agent restart guard coverage. The code guards mosaic fleet restart , but current new tests exercise only full-fleet restart.

Reviewer evidence: fleet.spec 186/186 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts/docs/federation/TASKS.md. mosaic package typecheck is not usable evidence in this worktree due pre-existing fleet-backlog/@mosaicstack/db export errors unrelated to this PR. Ignored the docs/federation/TASKS.md prettier-only commit per route; review focused on fleet restart source.

REVIEW-OF-RECORD — REQUEST CHANGES (#680). BLOCKER: restart lock ownership is unsafe. In acquireRestartLock(), release() unconditionally unlinks restart.lock, and the stale/max-wait breaker calls that same release() even when this process does not own the current lock. This can delete another process's freshly acquired lock: e.g. A holds lock, B times out/stale-breaks and acquires a new lock, then A exits and its finally release() unlinks B's lock; C can then acquire and interleave while B is still restarting. Similar race exists with two stale-lock breakers that both read the stale file before one acquires. That violates the serialization/no tight-loop guarantee under exactly the re-entry pressure this PR is meant to harden. Please make release/break ownership-safe (e.g. token/owner identity and do not remove a lock that is no longer yours; stale break must not delete a newly-created owner) and add a regression test for that race. SHOULD FIX: add explicit single-agent restart guard coverage. The code guards mosaic fleet restart <agent>, but current new tests exercise only full-fleet restart. Reviewer evidence: fleet.spec 186/186 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts/docs/federation/TASKS.md. mosaic package typecheck is not usable evidence in this worktree due pre-existing fleet-backlog/@mosaicstack/db export errors unrelated to this PR. Ignored the docs/federation/TASKS.md prettier-only commit per route; review focused on fleet restart source.
jason.woltje force-pushed fix/fleet-restart-tightloop-guard from df905b7a34 to 43ad813e0d 2026-06-25 00:19:44 +00:00 Compare
Author
Owner

REVIEW-OF-RECORD — REQUEST CHANGES (#680, head 43ad813e0d). BLOCKER: the v2 ownership-safe release fixes the old-owner-release case, but stale/max-wait breakAndOwnRestartLock() is still not safe against concurrent breakers. It unconditionally rename(tmpPath, lockPath) over whatever is currently at restart.lock; there is no compare-against-the-owner-token-that-was-judged-stale. Race: R1 holds a stale lock; R2 and R3 both observe it as stale; R2 renames token R2 and returns/starts restart; R3 then renames token R3 over R2 and also returns/starts restart. Both think they own the lock, so the tight-loop interleave can still happen. The new three-restart regression only covers old R1 releasing after R2 owns the lock; it does not exercise two concurrent stale/max-wait breakers racing with each other. Please make stale/max-wait takeover conditional on the lock content/token still being the stale/timed-out owner observed before takeover (or use a lock directory / other atomic CAS-like pattern), and add a regression where two breakers race and only one proceeds while the other waits/re-evaluates.

Verified positives: single-agent restart path now has guard coverage; old-owner release no longer drops the newer owner's lock; normal wait/stale/release tests pass. Reviewer verification on head 43ad813e: fleet.spec 188/188 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI is green, but merge should wait on the concurrent-breaker ownership fix.

REVIEW-OF-RECORD — REQUEST CHANGES (#680, head 43ad813e0d4bb0c648b09b8288a48dc1681135a0). BLOCKER: the v2 ownership-safe release fixes the old-owner-release case, but stale/max-wait `breakAndOwnRestartLock()` is still not safe against concurrent breakers. It unconditionally `rename(tmpPath, lockPath)` over whatever is currently at `restart.lock`; there is no compare-against-the-owner-token-that-was-judged-stale. Race: R1 holds a stale lock; R2 and R3 both observe it as stale; R2 renames token R2 and returns/starts restart; R3 then renames token R3 over R2 and also returns/starts restart. Both think they own the lock, so the tight-loop interleave can still happen. The new three-restart regression only covers old R1 releasing after R2 owns the lock; it does not exercise two concurrent stale/max-wait breakers racing with each other. Please make stale/max-wait takeover conditional on the lock content/token still being the stale/timed-out owner observed before takeover (or use a lock directory / other atomic CAS-like pattern), and add a regression where two breakers race and only one proceeds while the other waits/re-evaluates. Verified positives: single-agent restart path now has guard coverage; old-owner release no longer drops the newer owner's lock; normal wait/stale/release tests pass. Reviewer verification on head 43ad813e: fleet.spec 188/188 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI is green, but merge should wait on the concurrent-breaker ownership fix.
jason.woltje added 1 commit 2026-06-25 01:27:39 +00:00
fix(fleet): serialize restart-lock transitions to close concurrent-breaker race (review #680)
All checks were successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/push/ci Pipeline was successful
786762587d
Stale/max-wait takeover was not safe against concurrent breakers: two
breakers could both judge the lock stale and both proceed, re-introducing
the tight-loop. POSIX/Node has no content- or inode-conditional unlink or
rename, so "judge stale, then replace" can never be atomic with pure path
ops.

Serialize ALL lock transitions (acquire, release, takeover) under one
short-lived registry mutex held only across a few fs ops, never across the
restart itself. This makes check-then-mutate atomic, so exactly one breaker
can take over a stale lock while the others wait and re-evaluate.

The mutex itself uses mtime-based staleness (open('wx') creates an empty
inode before the token is written; a content check would reap a lock that is
still being acquired). The mutex populates-or-cleans-up on write failure so a
half-created mutex never leaks.

Regression coverage at two widths: a 2-breaker barrier test (exactly one
takes over, the other waits) and the existing 3-breaker test (maxActive===1,
distinct tokens, final lock released).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Owner

REVIEW-OF-RECORD — APPROVE (#680, head 786762587d). Re-review confirms the prior concurrent-breaker blocker is closed. v3 serializes all restart-lock state transitions (free acquire, stale/timeout takeover, and token-checked release) behind the short-lived registry mutex, so the check-then-mutate sequence is atomic and two stale/max-wait breakers cannot both proceed. The mutex reclaim path uses mtime staleness and a reclaim lock, avoiding the empty-file/content-stale acquisition window and the file-leak/error path called out by Codex.

Tests now cover the relevant races: normal re-entry wait before systemctl, stale lock break, sequential release, single-agent restart guarded, old owner release cannot drop a new owner, two-breaker barrier where one waits, and three-breaker concurrency with maxActive==1/distinct tokens/final lock removed. The acknowledged residual stat/unlink TOCTOU inherent to pure Node named-path locks is acceptable for this PR's scope; moving to flock would be a separate hardening design, not a blocker here.

Reviewer verification on head 78676258: fleet.spec 190/190 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI reported green by orchestrator. APPROVE.

REVIEW-OF-RECORD — APPROVE (#680, head 786762587d435856382a758607cc7894bf68c52f). Re-review confirms the prior concurrent-breaker blocker is closed. v3 serializes all restart-lock state transitions (free acquire, stale/timeout takeover, and token-checked release) behind the short-lived registry mutex, so the check-then-mutate sequence is atomic and two stale/max-wait breakers cannot both proceed. The mutex reclaim path uses mtime staleness and a reclaim lock, avoiding the empty-file/content-stale acquisition window and the file-leak/error path called out by Codex. Tests now cover the relevant races: normal re-entry wait before systemctl, stale lock break, sequential release, single-agent restart guarded, old owner release cannot drop a new owner, two-breaker barrier where one waits, and three-breaker concurrency with maxActive==1/distinct tokens/final lock removed. The acknowledged residual stat/unlink TOCTOU inherent to pure Node named-path locks is acceptable for this PR's scope; moving to flock would be a separate hardening design, not a blocker here. Reviewer verification on head 78676258: fleet.spec 190/190 pass; mosaic lint pass; prettier check pass for fleet.ts/fleet.spec.ts. CI reported green by orchestrator. APPROVE.
jason.woltje merged commit 67135d3822 into main 2026-06-25 01:44:49 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: mosaicstack/stack#680