Replace the single fixed 300ms capture-pane delay in `agent send --verify` with a bounded polling loop.

After sending, the loop polls `capture-pane` every 400ms (VERIFY_POLL_INTERVAL_MS) up to a configurable total timeout (default 6000ms, VERIFY_DEFAULT_TIMEOUT_MS). classifySendResult is called on each poll: accepted/draft return immediately; unverifiable keeps polling until timeout, then fails closed with the existing "no pane change after send" message.

- New `--verify-timeout <ms>` option on `agent send` (default 6000ms documented).
- Injectable SleepFn added to FleetCommandDeps for test isolation; no real sleeps in tests.
- Exports VERIFY_POLL_INTERVAL_MS and VERIFY_DEFAULT_TIMEOUT_MS as constants.
- classifySendResult and all other pure functions remain unchanged.

Tests: multi-poll acceptance on 2nd/3rd poll => exit 0; pane unchanged until timeout => exit 1; draft detected on first poll => exit 1. All 386 tests pass.

docs/fleet/PRD.md Known-limitations updated: verify now polls up to bounded timeout (default ~6s, --verify-timeout); definitive acceptance still deferred to Phase-3 heartbeat-ack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
PRD — Fleet Phase 2: Operator Observability
Workstream: W-FLEET under mvp-20260312 · Phase: 2
North star: docs/fleet/north-star.md
Source umbrella PRD: docs/PRD.md (Mosaic Stack v0.1.0)
Tracks task: fleet-observability-1 (restore operator observability into fleet agent sessions).
Problem
The durable tmux fleet runs on the isolated mosaic-factory socket. That isolation
(which protects the operator's default tmux) makes the fleet invisible to default
tooling, and truth is split across three planes no single command joins — systemd
(systemctl --user), tmux (-L mosaic-factory), and the process tree (pstree).
agent tail (capture-pane) returns blank for full-screen TUIs, and agent send
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
Goals
- One command shows the whole fleet's real state, joining all three planes.
- Liveness is truthful: healthy = answered a heartbeat, not "pane alive".
- The operator can watch any session read-only without disrupting it.
- send reports delivered-and-accepted, not just injected.
- Every record/address carries tenant_id + host (zero foreclosure for multi-tenant/multi-host).
Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No fleetd daemon or persistent history store.
- No real-runtime swap (Phase 3); instrument the live dogfood stub fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
Functional requirements
| ID | Requirement |
|---|---|
| FR-1 | mosaic fleet ps [--json] prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · last-heartbeat age · drift flag (roster runtime ≠ actual pane command) · boot-enable warning (active but UnitFileState=disabled). |
| FR-2 | Heartbeat protocol v1 (see below); dogfood-agent.py implements the responder. fleet ps issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | mosaic agent watch <name> opens a read-only view of the pane (grouped session or tmux attach -r) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | mosaic agent attach <name> remains the explicit interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | mosaic agent send <name> --verify confirms the message was accepted (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (--json) includes tenant_id and host fields. |
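The two derived flags in FR-1 can be illustrated with a small sketch. This is not the shipped implementation: the `RawAgentState` and `FleetPsRow` shapes, their field names, and the plain string comparison used for drift are all assumptions made for illustration. Only the rules themselves (drift = roster runtime ≠ actual pane command; boot-enable warning = active but UnitFileState=disabled; tenant_id and host always present) come from the requirements above.

```typescript
// Hypothetical inputs gathered from the three planes; all field names are assumptions.
interface RawAgentState {
  name: string;
  tenant_id: string;          // required in structured output by FR-6
  host: string;               // required in structured output by FR-6
  runtime: string;            // runtime declared in the roster
  activeState: string;        // systemd ActiveState, e.g. "active"
  unitFileState: string;      // systemd UnitFileState, e.g. "enabled" or "disabled"
  paneCommand: string | null; // command running in the tmux pane, null if the pane is dead
  pid: number | null;
}

// Hypothetical output row: the raw state plus the derived FR-1 flags.
interface FleetPsRow extends RawAgentState {
  paneAlive: boolean;
  drift: boolean;
  bootEnableWarning: boolean;
}

function buildRow(raw: RawAgentState): FleetPsRow {
  const paneAlive = raw.paneCommand !== null;
  return {
    ...raw,
    paneAlive,
    // Drift: the roster runtime differs from what the pane actually runs.
    // Assumption: a plain string comparison; a dead pane reports no drift.
    drift: paneAlive && raw.paneCommand !== raw.runtime,
    // Boot-enable warning: the unit is running now but would not start at boot.
    bootEnableWarning: raw.activeState === "active" && raw.unitFileState === "disabled",
  };
}

// Example modelled on the canary-pi case from the acceptance criteria.
const row = buildRow({
  name: "canary-pi",
  tenant_id: "tenant-a", // illustrative value
  host: "host-a",        // illustrative value
  runtime: "pi",
  activeState: "active",
  unitFileState: "disabled",
  paneCommand: "dogfood-agent.py",
  pid: 1234,
});
console.log(JSON.stringify(row));
```

Keeping the derivation in a pure function like this would let the flags be unit-tested without touching tmux or systemd.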
Heartbeat protocol v1
- Probe: operator/fleet ps writes a sentinel line to the agent's input or a well-known
  per-agent heartbeat file path ~/.config/mosaic/fleet/run/<agent>.hb.
- Response: the runtime updates <agent>.hb with ts=<iso8601> pid=<pid> status=<ok|busy>
  on a fixed interval (default 15s) and on demand when probed.
- Health rule: healthy if now - ts <= 3 × interval; else stale; missing file = unknown.
- Contract: every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
  MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
  full-screen TUIs alike (no capture-pane dependency).

ASSUMPTION: file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
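The health rule above can be sketched as a pure function over the contents of the heartbeat file. This is a minimal illustration under stated assumptions: the function names (`parseHeartbeat`, `classifyHealth`), the whitespace-separated key=value parsing, and the choice to treat a malformed file as unknown are not specified by the protocol. Only the line format, the 15s default interval, and the 3 × interval threshold come from the text.

```typescript
type Health = "healthy" | "stale" | "unknown";

const DEFAULT_INTERVAL_MS = 15_000; // protocol default: 15s

interface Heartbeat {
  ts: Date;
  pid: number;
  status: "ok" | "busy";
}

// Parse a line such as "ts=2026-03-12T00:00:30Z pid=42 status=ok".
// Assumption: fields are whitespace-separated key=value pairs.
function parseHeartbeat(line: string): Heartbeat | null {
  const fields = new Map<string, string>();
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }
  const ts = new Date(fields.get("ts") ?? "");
  const pid = Number(fields.get("pid"));
  const status = fields.get("status");
  if (Number.isNaN(ts.getTime())) return null;
  if (!Number.isInteger(pid) || pid <= 0) return null;
  if (status !== "ok" && status !== "busy") return null;
  return { ts, pid, status };
}

// fileContents is null when <agent>.hb does not exist.
function classifyHealth(
  fileContents: string | null,
  now: Date,
  intervalMs: number = DEFAULT_INTERVAL_MS,
): Health {
  if (fileContents === null) return "unknown"; // missing file = unknown
  const hb = parseHeartbeat(fileContents);
  if (hb === null) return "unknown"; // assumption: malformed content is treated like a missing file
  const ageMs = now.getTime() - hb.ts.getTime();
  return ageMs <= 3 * intervalMs ? "healthy" : "stale";
}
```

With the default interval, a heartbeat up to 45 seconds old still counts as healthy, so a runtime can miss two consecutive writes before its row turns stale.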
Acceptance criteria
- mosaic fleet ps shows all 5 live sessions on mosaic-factory with correct pane/pid/idle
  and flags the dogfood drift (canary-pi runtime=pi but pane runs dogfood-agent.py) and
  the boot-enable gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one interval.
- agent watch shows live output and provably cannot type into the pane; detaching leaves
  the agent's window size unchanged.
- agent send --verify returns success on an accepting pane and non-zero on a
  wedged/draft pane.
- Quality gates green: pnpm typecheck, pnpm lint, pnpm format:check, plus
  pnpm --filter @mosaicstack/mosaic test.
- Independent review passed; dogfood evidence captured against the live fleet.
Test plan
- Unit/CLI specs in packages/mosaic/src/commands/fleet.spec.ts (and a new
  fleet-ps/watch/send-verify spec) using the injected CommandRunner to assert exact
  tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live mosaic-factory fleet; capture fleet ps output, a
  kill-and-detect cycle, a read-only watch, and a send --verify pass/fail pair.
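As an illustration of asserting exact command construction through an injected runner, the sketch below records every command a hypothetical `watchAgent` function issues for the grouped viewer session described under Known limitations. The `CommandRunner` signature shown here, the `watchAgent` and `recordingRunner` names, and the use of a single runner for all three steps are assumptions; the real code in packages/mosaic may differ (for instance, the PRD notes that the attach step goes through an interactiveRunner).

```typescript
// Hypothetical runner signature; the real CommandRunner may look different.
type CommandRunner = (
  cmd: string,
  args: string[],
) => Promise<{ exitCode: number; stdout: string }>;

const FLEET_SOCKET = "mosaic-factory";

// Sketch of the watch flow: create a grouped session, attach read-only, clean up.
async function watchAgent(
  agent: string,
  viewerPid: number,
  run: CommandRunner,
): Promise<void> {
  const viewer = `${agent}-watch-${viewerPid}`;
  // Grouped session: shares the agent's windows but sizes independently.
  await run("tmux", ["-L", FLEET_SOCKET, "new-session", "-d", "-t", `=${agent}`, "-s", viewer]);
  try {
    // Read-only attach against the throwaway viewer session, not the agent session.
    await run("tmux", ["-L", FLEET_SOCKET, "attach-session", "-r", "-t", `=${viewer}`]);
  } finally {
    // Always remove the viewer session on detach, even if attach failed.
    await run("tmux", ["-L", FLEET_SOCKET, "kill-session", "-t", `=${viewer}`]);
  }
}

// Test double: records each invocation instead of spawning a process.
function recordingRunner(): { run: CommandRunner; calls: string[][] } {
  const calls: string[][] = [];
  const run: CommandRunner = async (cmd, args) => {
    calls.push([cmd, ...args]);
    return { exitCode: 0, stdout: "" };
  };
  return { run, calls };
}
```

A spec would call `watchAgent("canary-pi", 4242, run)` and compare `calls` against the three expected argument vectors, which keeps the test independent of a running tmux server.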
Known limitations
- Verify heuristic is best-effort: agent send --verify uses a >-prefix draft heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- Pane-change check is the best Phase-2 signal; verify now polls up to a bounded timeout: agent send --verify captures a BEFORE snapshot, sends the message, then polls capture-pane every ~400 ms up to a configurable total timeout (default ~6 s, controlled by --verify-timeout <ms>). On each poll it runs classifySendResult: if the pane shows 'accepted' or 'draft' the loop exits immediately; while the result is 'unverifiable' (no pane change yet) it keeps polling. After the timeout with no definitive result, it fails closed: exit 1 with "no pane change after send". This eliminates false 'unverifiable' failures for slow/loaded TUIs that were previously caused by the old fixed 300 ms single-capture. Definitive acceptance ultimately requires a runtime acknowledgement (Phase-3 heartbeat-ack); the bounded pane-change poll is the best signal available against an opaque TUI for Phase-2.
- Blank AFTER capture fails closed: Full-screen TUIs (claude, codex, opencode, pi) render blank for tmux capture-pane. When the AFTER snapshot is empty, send --verify returns non-zero with an "unverifiable" message rather than silently succeeding. This is an intentional fail-closed design (FR-5).
- agent watch uses a grouped viewer session: tmux attach -r directly against the agent session lets the viewer terminal shrink the agent's window. agent watch instead creates a throwaway grouped session (tmux new-session -d -t '=<agent>' -s '<agent>-watch-<pid>'), attaches read-only to that session, and kills it on detach. The grouped session shares the agent's windows but has independent sizing, so the agent's window is never affected. tmux attach is still interactive and requires inherited stdio; the interactiveRunner handles TTY passthrough.
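The bounded verify poll described above can be sketched as follows. This is an illustration, not the code in packages/mosaic: the `VerifyDeps` shape, the `pollForVerification` name, the `classify` signature (a stand-in for classifySendResult), and the draft message text are hypothetical. The constant names and values, the injectable sleep, the three result labels, the exit codes, and the "no pane change after send" message come from the document. Elapsed time is counted in sleep intervals rather than wall-clock time, which is an assumption that keeps tests deterministic with a fake sleep.

```typescript
const VERIFY_POLL_INTERVAL_MS = 400;
const VERIFY_DEFAULT_TIMEOUT_MS = 6000;

type SendResult = "accepted" | "draft" | "unverifiable";
type SleepFn = (ms: number) => Promise<void>;

// Hypothetical dependency bundle; the real FleetCommandDeps is not shown in this document.
interface VerifyDeps {
  capturePane: () => Promise<string>;
  classify: (before: string, after: string) => SendResult; // stand-in for classifySendResult
  sleep: SleepFn;
}

async function pollForVerification(
  before: string,
  deps: VerifyDeps,
  timeoutMs: number = VERIFY_DEFAULT_TIMEOUT_MS,
): Promise<{ exitCode: 0 | 1; message: string }> {
  let elapsedMs = 0;
  while (elapsedMs < timeoutMs) {
    await deps.sleep(VERIFY_POLL_INTERVAL_MS);
    elapsedMs += VERIFY_POLL_INTERVAL_MS;
    const after = await deps.capturePane();
    const result = deps.classify(before, after);
    // Definitive results end the loop immediately.
    if (result === "accepted") return { exitCode: 0, message: "accepted" };
    if (result === "draft") {
      return { exitCode: 1, message: "message left as unsubmitted draft" }; // hypothetical wording
    }
    // "unverifiable": no pane change yet, keep polling until the budget is spent.
  }
  // Fail closed once the timeout elapses without a definitive result.
  return { exitCode: 1, message: "no pane change after send" };
}
```

With the defaults this makes at most 15 captures (6000 / 400). Passing a sleep function that resolves immediately lets a test drive the full timeout path without waiting.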
Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the packages/mosaic wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.