Blocker fix: send --verify now captures a BEFORE snapshot immediately before the send and an AFTER snapshot after the delay, then uses classifySendResult(before, after) to classify. A wedged pane showing stale non-empty content is no longer falsely reported as 'accepted': BEFORE==AFTER maps to 'unverifiable' (exit 1, "no pane change after send"). Blank AFTER still fails closed as 'unverifiable'. Only AFTER != BEFORE without a draft suffix counts as 'accepted' (exit 0).

Should-fix: agent watch now uses a GROUPED VIEWER SESSION instead of a bare 'tmux attach -r' against the agent session. A bare attach lets the viewer terminal shrink the agent's window; a grouped session has independent sizing so the agent's window is never affected. Sequence: new-session -d -t '=<agent>' -s '<agent>-watch-<pid>' (runner), attach -r to viewer session (interactiveRunner), kill-session on detach (runner).

New builder functions exported: buildAgentWatchCreateViewerCommand, buildAgentWatchAttachCommand, buildAgentWatchKillViewerCommand, buildViewerSessionName. buildAgentWatchCommand kept but deprecated.

New exports: classifySendResult(before, after), the testable classifier.

Tests added:
- classifySendResult unit suite (6 cases): accepted/draft/unverifiable/stale-pane/both-blank/before-blank-after-response
- send --verify regression: stale (before==after non-empty) => exit 1
- send --verify regression: blank AFTER => exit 1
- send --verify regression: draft after pane change => exit 1
- send --verify regression: changed non-draft => exit 0
- send --verify: 3-call sequence assertion (before-capture, send, after-capture)
- watch dispatch: grouped viewer session created/attached/killed; no bare attach against agent session; viewer name matches <agent>-watch-<pid>

PRD Known-limitations updated: pane-change check rationale, Phase-3 heartbeat-ack requirement, grouped-session watch design.

All gates pass: pnpm typecheck, pnpm lint, pnpm --filter @mosaicstack/mosaic test (382 tests, 74 fleet), prettier --check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
PRD — Fleet Phase 2: Operator Observability
Workstream: W-FLEET under mvp-20260312 · Phase: 2
North star: docs/fleet/north-star.md
Source umbrella PRD: docs/PRD.md (Mosaic Stack v0.1.0)
Tracks task: fleet-observability-1 (restore operator observability into fleet agent sessions).
Problem
The durable tmux fleet runs on the isolated mosaic-factory socket. That isolation
(which protects the operator's default tmux) makes the fleet invisible to default
tooling, and truth is split across three planes no single command joins — systemd
(systemctl --user), tmux (-L mosaic-factory), and the process tree (pstree).
agent tail (capture-pane) returns blank for full-screen TUIs, and agent send
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
Goals
- One command shows the whole fleet's real state, joining all three planes.
- Liveness is truthful: healthy = answered a heartbeat, not "pane alive".
- The operator can watch any session read-only without disrupting it.
- send reports delivered-and-accepted, not just injected.
- Every record/address carries tenant_id + host (zero foreclosure for multi-tenant/multi-host).
Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No fleetd daemon or persistent history store.
- No real-runtime swap (Phase 3); instrument the live dogfood stub fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
Functional requirements
| ID | Requirement |
|---|---|
| FR-1 | mosaic fleet ps [--json] prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · last-heartbeat age · drift flag (roster runtime ≠ actual pane command) · boot-enable warning (active but UnitFileState=disabled). |
| FR-2 | Heartbeat protocol v1 (see below); dogfood-agent.py implements the responder. fleet ps issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | mosaic agent watch <name> opens a read-only view of the pane (grouped session or tmux attach -r) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | mosaic agent attach <name> remains the explicit interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | mosaic agent send <name> --verify confirms the message was accepted (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (--json) includes tenant_id and host fields. |
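As an illustration of FR-1 and FR-6, the sketch below shows one possible shape for a single `fleet ps --json` row and how the two derived flags could be computed. The interface, its field names, and the `withFlags` helper are assumptions made for this example; the document only fixes that `tenant_id` and `host` are present and what the drift and boot-enable flags mean.

```typescript
// Hypothetical row shape for `mosaic fleet ps --json`.
// Only tenant_id and host are named by the PRD; every other field name is an assumption.
interface FleetPsRow {
  name: string;
  tenant_id: string;
  host: string;
  runtime: string; // runtime declared in the roster
  systemd: { active: boolean; enabled: boolean };
  pane: "alive" | "dead";
  paneCommand: string | null; // command actually running in the pane, if any
  pid: number | null;
  idleSeconds: number | null;
  heartbeatAgeSeconds: number | null;
  drift: boolean; // roster runtime differs from the actual pane command
  bootEnableWarning: boolean; // unit is active but not enabled at boot
}

type RowInput = Omit<FleetPsRow, "drift" | "bootEnableWarning">;

// Derive the two flags from the joined systemd/tmux facts.
function withFlags(input: RowInput): FleetPsRow {
  const drift = input.paneCommand !== null && input.paneCommand !== input.runtime;
  const bootEnableWarning = input.systemd.active && !input.systemd.enabled;
  return { ...input, drift, bootEnableWarning };
}

// Example modelled on the canary-pi case from the acceptance criteria;
// tenant, host, and numeric values are placeholders.
const exampleRow = withFlags({
  name: "canary-pi",
  tenant_id: "tenant-a",
  host: "host-a",
  runtime: "pi",
  systemd: { active: true, enabled: false },
  pane: "alive",
  paneCommand: "dogfood-agent.py",
  pid: 1234,
  idleSeconds: 12,
  heartbeatAgeSeconds: 7,
});
```

Keeping the flags as derived fields means the JSON consumer never has to re-implement the comparison rules.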
Heartbeat protocol v1
- Probe: operator/fleet ps writes a sentinel line to the agent's input or a well-known per-agent heartbeat file path ~/.config/mosaic/fleet/run/<agent>.hb.
- Response: the runtime updates <agent>.hb with ts=<iso8601> pid=<pid> status=<ok|busy> on a fixed interval (default 15s) and on demand when probed.
- Health rule: healthy if now - ts <= 3 × interval; else stale; missing file = unknown.
- Contract: every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3) MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and full-screen TUIs alike (no capture-pane dependency).

ASSUMPTION: file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
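The health rule can be sketched as two small pure functions: one that parses a heartbeat line and one that classifies it. This is a minimal sketch rather than the project's implementation; the function names, the `Heartbeat` type, and the choice to treat an unparseable line the same as a missing file are all assumptions. File reading is left to the caller, which passes `null` when the file is absent.

```typescript
type Health = "healthy" | "stale" | "unknown";

interface Heartbeat {
  ts: Date;
  pid: number;
  status: "ok" | "busy";
}

// Parse a line of the form "ts=<iso8601> pid=<pid> status=<ok|busy>".
// Returns null when any field is missing or malformed.
function parseHeartbeat(line: string): Heartbeat | null {
  const fields = new Map<string, string>();
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }
  const ts = new Date(fields.get("ts") ?? "");
  if (Number.isNaN(ts.getTime())) return null;
  const pid = Number(fields.get("pid"));
  if (!Number.isInteger(pid)) return null;
  const status = fields.get("status");
  if (status !== "ok" && status !== "busy") return null;
  return { ts, pid, status };
}

// healthy if now - ts <= 3 x interval; else stale; no heartbeat = unknown.
// Assumption: an unparseable heartbeat is passed in as null and also maps to unknown.
function classifyHealth(hb: Heartbeat | null, now: Date, intervalSeconds = 15): Health {
  if (hb === null) return "unknown";
  const ageSeconds = (now.getTime() - hb.ts.getTime()) / 1000;
  return ageSeconds <= 3 * intervalSeconds ? "healthy" : "stale";
}
```

With the default 15s interval, a heartbeat that is exactly 45 seconds old still counts as healthy and one that is 46 seconds old is stale.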
Acceptance criteria
- mosaic fleet ps shows all 5 live sessions on mosaic-factory with correct pane/pid/idle and flags the dogfood drift (canary-pi runtime=pi but pane runs dogfood-agent.py) and the boot-enable gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one interval.
- agent watch shows live output and provably cannot type into the pane; detaching leaves the agent's window size unchanged.
- agent send --verify returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: pnpm typecheck, pnpm lint, pnpm format:check, plus pnpm --filter @mosaicstack/mosaic test.
- Independent review passed; dogfood evidence captured against the live fleet.
Test plan
- Unit/CLI specs in packages/mosaic/src/commands/fleet.spec.ts (and a new fleet-ps/watch/send-verify spec) using the injected CommandRunner to assert exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live mosaic-factory fleet; capture fleet ps output, a kill-and-detect cycle, a read-only watch, and a send --verify pass/fail pair.
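The runner-injection approach in the test plan can be illustrated with the watch sequence: create a grouped viewer session, attach read-only, then kill the viewer. The sketch below uses stand-in helpers whose names, signatures, and argv layout are assumptions for illustration; the real exported builders and the `CommandRunner` type are not shown in this document. The tmux subcommands and flags themselves (`new-session -d -t -s`, `attach-session -r`, `kill-session -t`, `-L` for the socket) are standard tmux. The `try/finally` around the attach is also an assumption about cleanup behaviour.

```typescript
// Stand-in types and helpers; NOT the package's real API.
type Argv = string[];
type Runner = (argv: Argv) => void;

const SOCKET = "mosaic-factory";

function viewerSessionName(agent: string, pid: number): string {
  return `${agent}-watch-${pid}`;
}

function createViewerArgv(agent: string, viewer: string): Argv {
  // "-t =<agent>" targets the agent session by exact name; on new-session this
  // creates a session grouped with it, sharing windows but sized independently.
  return ["tmux", "-L", SOCKET, "new-session", "-d", "-t", `=${agent}`, "-s", viewer];
}

function attachViewerArgv(viewer: string): Argv {
  // -r attaches read-only; the target is the viewer session, never the agent session.
  return ["tmux", "-L", SOCKET, "attach-session", "-r", "-t", `=${viewer}`];
}

function killViewerArgv(viewer: string): Argv {
  return ["tmux", "-L", SOCKET, "kill-session", "-t", `=${viewer}`];
}

// Create (runner), attach read-only (interactiveRunner), kill on detach (runner).
function watchAgent(agent: string, pid: number, runner: Runner, interactiveRunner: Runner): void {
  const viewer = viewerSessionName(agent, pid);
  runner(createViewerArgv(agent, viewer));
  try {
    interactiveRunner(attachViewerArgv(viewer));
  } finally {
    runner(killViewerArgv(viewer)); // always clean up the throwaway session
  }
}

// A recording fake makes the exact command construction assertable in a spec.
const recorded: { via: string; argv: Argv }[] = [];
watchAgent(
  "canary-pi",
  4242,
  (argv) => recorded.push({ via: "runner", argv }),
  (argv) => recorded.push({ via: "interactiveRunner", argv }),
);
```

Because both runners are injected, a spec can check ordering and arguments without a tmux server being present.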
Known limitations
- Verify heuristic is best-effort: agent send --verify uses a >-prefix draft heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- Pane-change check is the best Phase-2 signal: agent send --verify compares a BEFORE snapshot (captured immediately before the send) to an AFTER snapshot (captured after the send delay). A pane that changed and does not end in a draft line is reported as 'accepted'. A pane that did not change (including a wedged pane showing stale non-empty content) is reported 'unverifiable' (exit 1, "no pane change after send"). Definitive acceptance ultimately requires a runtime acknowledgement (Phase-3 heartbeat-ack); the pane-change check is the best signal available against an opaque TUI for Phase-2.
- Blank AFTER capture fails closed: Full-screen TUIs (claude, codex, opencode, pi) render blank for tmux capture-pane. When the AFTER snapshot is empty, send --verify returns non-zero with an "unverifiable" message rather than silently succeeding. This is an intentional fail-closed design (FR-5).
- agent watch uses a grouped viewer session: tmux attach -r directly against the agent session lets the viewer terminal shrink the agent's window. agent watch instead creates a throwaway grouped session (tmux new-session -d -t '=<agent>' -s '<agent>-watch-<pid>'), attaches read-only to that session, and kills it on detach. The grouped session shares the agent's windows but has independent sizing, so the agent's window is never affected. tmux attach is still interactive and requires inherited stdio; the interactiveRunner handles TTY passthrough.
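The classification rules for send --verify described in the limitations above can be sketched as a pure function over the two snapshots. This is an illustrative reconstruction, not the exported classifySendResult: the function name, the result type, and in particular the draft test (last non-empty line starts with `>` followed by text) are assumptions about how the >-prefix heuristic might be expressed.

```typescript
type SendResult = "accepted" | "draft" | "unverifiable";

// Last non-empty line of a pane capture, trimmed; "" if there is none.
function lastNonEmptyLine(text: string): string {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return lines[lines.length - 1] ?? "";
}

// Assumed draft heuristic: the pane ends in a line like "> some unsent text".
function looksLikeDraft(after: string): boolean {
  return /^>\s*\S/.test(lastNonEmptyLine(after));
}

function classifySendResultSketch(before: string, after: string): SendResult {
  if (after.trim() === "") return "unverifiable"; // blank AFTER fails closed
  if (after === before) return "unverifiable"; // no pane change after send
  if (looksLikeDraft(after)) return "draft"; // changed, but input left unsubmitted
  return "accepted"; // changed and not a draft
}

// Only 'accepted' maps to exit 0; everything else is non-zero.
function exitCodeFor(result: SendResult): number {
  return result === "accepted" ? 0 : 1;
}
```

Checking the blank and unchanged cases before the draft test keeps the function fail-closed: a pane can only be reported as accepted after it has demonstrably changed.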
Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the packages/mosaic wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.