feat(fleet): Phase-2 observability — fleet ps + watch + send verify #579

Merged
jason.woltje merged 9 commits from feat/fleet-observability into main 2026-06-21 04:23:52 +00:00
Owner

Closes #578. Workstream W-FLEET under mvp-20260312. North star: docs/fleet/north-star.md

Operator observability for the durable tmux fleet (per docs/fleet/PRD.md):

  • mosaic fleet ps: joins systemd + tmux + process + idle + heartbeat; flags DRIFT and BOOT-ENABLE; JSON carries tenant_id + host.
  • mosaic agent watch: read-only join (no keystrokes / no resize).
  • mosaic agent send (verify flag): delivery/acceptance receipt.
  • Heartbeat protocol v1: per-agent file under fleet/run; healthy = within 3x interval.

Tests: 62 (31 new). Live-verified on mosaic-factory: fleet ps surfaced canary-pi DRIFT + BOOT-ENABLE; HB healthy after responder landed.
Review: dual-engine (Claude + gpt-5.5/Codex), Lead-serialized merge.

Closes #578. Workstream W-FLEET under mvp-20260312. North star: docs/fleet/north-star.md Operator observability for the durable tmux fleet (per docs/fleet/PRD.md): - mosaic fleet ps: joins systemd + tmux + process + idle + heartbeat; flags DRIFT and BOOT-ENABLE; JSON carries tenant_id + host. - mosaic agent watch: read-only join (no keystrokes / no resize). - mosaic agent send (verify flag): delivery/acceptance receipt. - Heartbeat protocol v1: per-agent file under fleet/run; healthy = within 3x interval. Tests: 62 (31 new). Live-verified on mosaic-factory: fleet ps surfaced canary-pi DRIFT + BOOT-ENABLE; HB healthy after responder landed. Review: dual-engine (Claude + gpt-5.5/Codex), Lead-serialized merge.
jason.woltje added 6 commits 2026-06-21 03:35:37 +00:00
Establish Fleet workstream doctrine under mvp-20260312: north star (incl.
fleet-as-means-of-production), Phase-2 observability PRD, workstream tasks,
and scratchpad. Collision-safe: scoped to docs/fleet/, touches none of the
MVP single-writer control-plane files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
FR-1 fleet ps: joins systemd show (ActiveState/SubState/UnitFileState),
tmux list-panes (pid/command/dead/activity), and file-based heartbeat
(~/.config/mosaic/fleet/run/<name>.hb) into one table per roster agent.
Flags DRIFT (roster runtime ≠ actual pane command) and BOOT-ENABLE
(active but UnitFileState=disabled). --json output includes tenant_id
and host on every record (FR-6 zero-foreclosure for multi-tenant/host).

FR-3 agent watch: read-only tmux attach (-r flag) so the operator can
observe any session without injecting keystrokes or resizing the window.
Registered as a new verb alongside tail/send/reset in registerFleetAgentCommands.

FR-5 agent send --verify: after keystroke injection, captures the last 5
pane lines and checks for draft heuristic (last non-empty line starts
with '> '). Exits non-zero and writes to stderr if the message appears
unsubmitted. Default send behavior is unchanged when --verify is omitted.

New pure exported helpers (all unit-testable without real tmux/systemd):
buildSystemdShowCommand, buildTmuxListPanesCommand, buildAgentWatchCommand,
buildAgentVerifyAcceptedCommand, parseHeartbeat, parseSystemdShow,
parseTmuxListPanes, detectDrift, getDefaultTenantAndHost, isSendAccepted,
heartbeatPath. Added 31 new spec cases (62 total) covering exact command
construction, JSON shape, heartbeat parsing, drift detection, and verify flow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
style(fleet): prettier-format workstream docs
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/pr/ci Pipeline was canceled
ddeb200fdf
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
jason.woltje added 1 commit 2026-06-21 03:47:06 +00:00
fix(fleet): verify fails-closed on unverifiable + interactive watch
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/pr/ci Pipeline was canceled
aec560162b
- isSendAccepted now returns 'accepted' | 'draft' | 'unverifiable' (was bool)
- Blank/empty capture => 'unverifiable' => process.exitCode=1 with distinct
  "could not verify delivery (blank/no response captured)" message; previously
  blank was treated as success, violating FR-5 fail-closed semantics
- Draft line ('^> ') => process.exitCode=1 with "left as unsubmitted draft"
  message; distinct wording from unverifiable case
- agent watch now dispatched through injectable InteractiveRunner (stdio:inherit)
  instead of the capturing CommandRunner; tmux attach requires TTY passthrough
- Default spawnInteractive implementation uses node:child_process spawn with
  stdio:'inherit'; injectable via FleetCommandDeps.interactiveRunner for tests
- Removed buildSystemdIsActiveCommand (dead code — exported but unused)
- Tests: blank=>exitCode=1, draft=>exitCode=1, real response=>exitCode=0,
  watch dispatched through interactiveRunner not capturing runner
- PRD: added "Known limitations" section (heuristic verify, blank fails closed,
  non-pi/claude draft detection is best-effort, watch requires TTY passthrough)
- Code comment on isSendAccepted notes pi/claude-specific draft heuristic

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
jason.woltje added 1 commit 2026-06-21 03:58:12 +00:00
fix(fleet): verify via pane-change diff + non-resizing watch
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/pr/ci Pipeline was canceled
8466ca2d81
Blocker fix: send --verify now captures a BEFORE snapshot immediately
before the send and an AFTER snapshot after the delay, then uses
classifySendResult(before, after) to classify. A wedged pane showing
stale non-empty content is no longer falsely reported as 'accepted' —
BEFORE==AFTER maps to 'unverifiable' (exit 1, "no pane change after
send"). Blank AFTER still fails closed as 'unverifiable'. Only
AFTER != BEFORE without a draft suffix counts as 'accepted' (exit 0).

Should-fix: agent watch now uses a GROUPED VIEWER SESSION instead of a
bare 'tmux attach -r' against the agent session. A bare attach lets the
viewer terminal shrink the agent's window; a grouped session has
independent sizing so the agent's window is never affected.
Sequence: new-session -d -t '=<agent>' -s '<agent>-watch-<pid>' (runner),
attach -r to viewer session (interactiveRunner), kill-session on detach
(runner). New builder functions exported: buildAgentWatchCreateViewerCommand,
buildAgentWatchAttachCommand, buildAgentWatchKillViewerCommand,
buildViewerSessionName. buildAgentWatchCommand kept but deprecated.

New exports: classifySendResult(before, after) — the testable classifier.

Tests added:
- classifySendResult unit suite (6 cases): accepted/draft/unverifiable/
  stale-pane/both-blank/before-blank-after-response
- send --verify regression: stale (before==after non-empty) => exit 1
- send --verify regression: blank AFTER => exit 1
- send --verify regression: draft after pane change => exit 1
- send --verify regression: changed non-draft => exit 0
- send --verify: 3-call sequence assertion (before-capture, send, after-capture)
- watch dispatch: grouped viewer session created/attached/killed; no bare
  attach against agent session; viewer name matches <agent>-watch-<pid>

PRD Known-limitations updated: pane-change check rationale, Phase-3
heartbeat-ack requirement, grouped-session watch design.

All gates pass: pnpm typecheck, pnpm lint, pnpm --filter @mosaicstack/mosaic test
(382 tests, 74 fleet), prettier --check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
jason.woltje added 1 commit 2026-06-21 04:08:51 +00:00
fix(fleet): bounded-poll send --verify to eliminate false unverifiable on slow TUIs
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/pr/ci Pipeline was successful
dd10f0046b
Replace the single fixed 300ms capture-pane delay in `agent send --verify` with a
bounded polling loop. After sending, the loop polls `capture-pane` every 400ms
(VERIFY_POLL_INTERVAL_MS) up to a configurable total timeout (default 6000ms,
VERIFY_DEFAULT_TIMEOUT_MS). classifySendResult is called on each poll: accepted/draft
return immediately; unverifiable keeps polling until timeout, then fails closed with
the existing "no pane change after send" message.

New `--verify-timeout <ms>` option on `agent send` (default 6000ms documented).
Injectable SleepFn added to FleetCommandDeps for test isolation — no real sleeps in
tests. Exports VERIFY_POLL_INTERVAL_MS and VERIFY_DEFAULT_TIMEOUT_MS as constants.
classifySendResult and all other pure functions remain unchanged.

Tests: multi-poll acceptance on 2nd/3rd poll => exit 0; pane unchanged until timeout
=> exit 1; draft detected on first poll => exit 1. All 386 tests pass.

docs/fleet/PRD.md Known-limitations updated: verify now polls up to bounded timeout
(default ~6s, --verify-timeout); definitive acceptance still deferred to Phase-3
heartbeat-ack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
jason.woltje merged commit af2eede7a9 into main 2026-06-21 04:23:52 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: mosaicstack/stack#579