Files
stack/packages
Jarvis 9c2e4f0b2d fix(fleet): guard mosaic fleet restart against tight-loop re-entry race
`mosaic fleet restart` runs as a fresh process each invocation, issuing
`systemctl --user restart` for the tmux holder and then each agent. The
agent sessions live inside the holder's tmux, so restarting the holder
tears them down. With no mutual exclusion, a second restart entering
while the first is still mid-teardown (upgrade `--relaunch`, a watchdog,
or a hurried operator) interleaves: agents relaunch against a
half-torn-down holder, fail, and tight-loop.

Add a cross-process teardown-settle guard: a lock file under
`<mosaicHome>/fleet/run/restart.lock` acquired with O_CREAT|O_EXCL.
A re-entrant restart waits (bounded, injectable sleep) for the in-flight
restart to release the lock before relaunching, breaks a stale lock left
by a crashed owner, and after a max wait breaks the lock to avoid a
permanent deadlock. Both full-fleet and single-agent restart paths are
guarded; start/stop/status are unchanged.

Regression test reproduces the race: with an in-flight lock held, the
restart must wait before issuing any systemctl command — fails on the
unguarded code path, passes with the guard. Adds stale-lock-break and
lock-release coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 19:18:38 -05:00
..