stack/docs/fleet/PRD.md
jason.woltje af2eede7a9
feat(fleet): Phase-2 observability — fleet ps + watch + send verify (#579)
2026-06-21 04:23:51 +00:00


PRD — Fleet Phase 2: Operator Observability

Workstream: W-FLEET under mvp-20260312 · Phase: 2
North star: docs/fleet/north-star.md
Source umbrella PRD: docs/PRD.md (Mosaic Stack v0.1.0)
Tracks task: fleet-observability-1 (restore operator observability into fleet agent sessions).

Problem

The durable tmux fleet runs on the isolated mosaic-factory socket. That isolation protects the operator's default tmux, but it also makes the fleet invisible to default tooling. The truth is split across three planes that no single command joins: systemd (systemctl --user), tmux (-L mosaic-factory), and the process tree (pstree). agent tail (capture-pane) returns blank for full-screen TUIs, and agent send confirms only keystroke injection, not acceptance. Net: the operator has near-zero observability and no safe way to watch a session.

Goals

  1. One command shows the whole fleet's real state, joining all three planes.
  2. Liveness is truthful: healthy = answered a heartbeat, not "pane alive".
  3. The operator can watch any session read-only without disrupting it.
  4. send reports delivered-and-accepted, not just injected.
  5. Every record/address carries tenant_id + host (zero foreclosure for multi-tenant/multi-host).

Non-goals (this phase)

  • No webUI (Phase 5; rides federation for cross-host).
  • No fleetd daemon or persistent history store.
  • No real-runtime swap (Phase 3) — instrument the live dogfood stub fleet.
  • No cross-host aggregation yet (addressing is host-tagged but queries stay local).

Functional requirements

ID | Requirement
FR-1 | mosaic fleet ps [--json] prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · last-heartbeat age · drift flag (roster runtime ≠ actual pane command) · boot-enable warning (active but UnitFileState=disabled).
FR-2 | Heartbeat protocol v1 (see below); dogfood-agent.py implements the responder. fleet ps issues probes (or reads last-seen) and reports health per FR-1.
FR-3 | mosaic agent watch <name> opens a read-only view of the pane (grouped session or tmux attach -r) that cannot send keystrokes and does not shrink the agent's window.
FR-4 | mosaic agent attach <name> remains the explicit interactive-takeover path (separate verb, documented as the only one that can type).
FR-5 | mosaic agent send <name> --verify confirms the message was accepted (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified.
FR-6 | All structured output (--json) includes tenant_id and host fields.
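
To make FR-1 and FR-6 concrete, here is a minimal sketch of how one fleet ps row might be derived from the three planes. Every name in it (PlaneFacts, FleetRow, buildRow, and all field names other than tenant_id and host) is a hypothetical illustration, not the shipped schema. It also assumes the pane command has already been normalized to a name comparable with the roster runtime.

```typescript
// Hypothetical inputs gathered from the roster, systemd, and tmux.
interface PlaneFacts {
  name: string;
  tenant_id: string;
  host: string;
  rosterRuntime: string;      // runtime declared in the roster
  activeState: string;        // systemd ActiveState, e.g. "active"
  unitFileState: string;      // systemd UnitFileState, e.g. "disabled"
  paneCommand: string | null; // null when the pane is dead (assumption)
  pid: number | null;
}

// Hypothetical JSON row; FR-6 only mandates tenant_id and host.
interface FleetRow {
  name: string;
  tenant_id: string;
  host: string;
  runtime: string;
  systemd: { active: boolean; enabled: boolean };
  pane: "alive" | "dead";
  pid: number | null;
  drift: boolean;
  boot_enable_warning: boolean;
}

function buildRow(f: PlaneFacts): FleetRow {
  const active = f.activeState === "active";
  const paneAlive = f.paneCommand !== null;
  return {
    name: f.name,
    tenant_id: f.tenant_id,
    host: f.host,
    runtime: f.rosterRuntime,
    systemd: { active, enabled: f.unitFileState === "enabled" },
    pane: paneAlive ? "alive" : "dead",
    pid: f.pid,
    // Drift: the roster runtime differs from what the pane actually runs.
    drift: paneAlive && f.paneCommand !== f.rosterRuntime,
    // Boot-enable gap: running now, but would not start on boot.
    boot_enable_warning: active && f.unitFileState === "disabled",
  };
}
```

In this sketch a dead pane never reports drift, since there is no pane command to compare against; whether the real command should behave that way is left open here.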

Heartbeat protocol v1

  • Probe: the operator or fleet ps writes a sentinel line either to the agent's input or to the well-known per-agent heartbeat file at ~/.config/mosaic/fleet/run/<agent>.hb.
  • Response: the runtime updates <agent>.hb with ts=<iso8601> pid=<pid> status=<ok|busy> on a fixed interval (default 15s) and on demand when probed.
  • Health rule: healthy if now - ts <= 3 × interval; else stale; missing file = unknown.
  • Contract: every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3) MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and full-screen TUIs alike (no capture-pane dependency).
  • ASSUMPTION: file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
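
The health rule above can be sketched as a small pure function. This is an illustration under stated assumptions, not the shipped implementation: the names parseHeartbeat and classifyHealth are hypothetical, the file is assumed to hold a single line of space-separated key=value pairs, and a malformed file is treated as unknown (the protocol text only defines the missing-file case).

```typescript
type Health = "healthy" | "stale" | "unknown";

interface Heartbeat {
  ts: Date;
  pid: number;
  status: "ok" | "busy";
}

// Parse a line such as: ts=2026-06-21T04:23:51Z pid=4242 status=ok
function parseHeartbeat(line: string): Heartbeat | null {
  const fields = new Map<string, string>();
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }
  const ts = new Date(fields.get("ts") ?? "");
  const pid = Number(fields.get("pid"));
  const status = fields.get("status");
  if (Number.isNaN(ts.getTime()) || !Number.isInteger(pid)) return null;
  if (status !== "ok" && status !== "busy") return null;
  return { ts, pid, status };
}

// content is the heartbeat file's text, or null when the file is missing.
function classifyHealth(
  content: string | null,
  now: Date,
  intervalMs: number = 15_000,
): Health {
  if (content === null) return "unknown"; // missing file
  const hb = parseHeartbeat(content);
  if (hb === null) return "unknown"; // assumption: malformed counts as unknown
  const ageMs = now.getTime() - hb.ts.getTime();
  return ageMs <= 3 * intervalMs ? "healthy" : "stale";
}
```

Passing now and the file content in as arguments keeps the rule testable without touching the clock or the filesystem.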

Acceptance criteria

  • mosaic fleet ps shows all 5 live sessions on mosaic-factory with correct pane/pid/idle and flags the dogfood drift (canary-pi runtime=pi but pane runs dogfood-agent.py) and the boot-enable gap (active but disabled).
  • Killing one agent's pane flips its row to dead/stale within one interval.
  • agent watch shows live output and provably cannot type into the pane; detaching leaves the agent's window size unchanged.
  • agent send --verify returns success on an accepting pane and non-zero on a wedged/draft pane.
  • Quality gates green: pnpm typecheck, pnpm lint, pnpm format:check, plus pnpm --filter @mosaicstack/mosaic test.
  • Independent review passed; dogfood evidence captured against the live fleet.

Test plan

  • Unit/CLI specs in packages/mosaic/src/commands/fleet.spec.ts (and a new fleet-ps/watch/send-verify spec) using the injected CommandRunner to assert exact tmux/systemd command construction and JSON shape (tenant+host present).
  • Situational: run against the live mosaic-factory fleet; capture fleet ps output, a kill-and-detect cycle, a read-only watch, and a send --verify pass/fail pair.
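
As an illustration of the command-construction assertions in the test plan, the sketch below records the tmux invocations for the grouped-session watch flow described under Known limitations. The CommandRunner interface shown here is a hypothetical stand-in (the real one in packages/mosaic may have a different shape), and RecordingRunner and watchAgent are names invented for this example.

```typescript
// Hypothetical minimal runner interface; the real CommandRunner may differ.
interface CommandRunner {
  run(command: string, args: string[]): void;
}

// Test double that records every invocation instead of executing it.
class RecordingRunner implements CommandRunner {
  readonly calls: string[][] = [];
  run(command: string, args: string[]): void {
    this.calls.push([command, ...args]);
  }
}

const SOCKET = "mosaic-factory";

// Create a throwaway grouped session, attach read-only, then clean up.
function watchAgent(runner: CommandRunner, agent: string, pid: number): void {
  const viewer = `${agent}-watch-${pid}`;
  const tmux = (...args: string[]): void =>
    runner.run("tmux", ["-L", SOCKET, ...args]);

  // "=" requests an exact session-name match in tmux targets.
  tmux("new-session", "-d", "-t", `=${agent}`, "-s", viewer);
  try {
    tmux("attach-session", "-r", "-t", `=${viewer}`);
  } finally {
    // Kill the viewer session even if the attach fails.
    tmux("kill-session", "-t", `=${viewer}`);
  }
}
```

A spec can then compare runner.calls against the exact expected argument lists, which catches regressions such as a missing -r or a wrong socket name.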

Known limitations

  • Verify heuristic is best-effort: agent send --verify detects an unsubmitted draft by a "> " input-line prefix, which is specific to the pi/claude TUIs. Draft detection for codex and opencode TUIs is best-effort only; those runtimes may not use the same input-line indicator.
  • Bounded pane-change polling is the best Phase-2 signal: agent send --verify captures a BEFORE snapshot, sends the message, then polls capture-pane every ~400 ms up to a configurable total timeout (default ~6 s, set with --verify-timeout <ms>). Each poll runs classifySendResult: a result of 'accepted' or 'draft' ends the loop immediately, while 'unverifiable' (no pane change yet) keeps it polling. If the timeout expires without a definitive result, the command fails closed: exit 1 with "no pane change after send". This removes the false 'unverifiable' failures that the previous fixed 300 ms single capture produced on slow or loaded TUIs. Definitive acceptance ultimately requires a runtime acknowledgement (Phase-3 heartbeat-ack); until then, the bounded pane-change poll is the best signal available against an opaque TUI.
  • Blank AFTER capture fails closed: Full-screen TUIs (claude, codex, opencode, pi) render blank for tmux capture-pane. When the AFTER snapshot is empty, send --verify returns non-zero with an "unverifiable" message rather than silently succeeding. This is an intentional fail-closed design (FR-5).
  • agent watch uses a grouped viewer session: tmux attach -r directly against the agent session lets the viewer terminal shrink the agent's window. agent watch instead creates a throwaway grouped session (tmux new-session -d -t '=<agent>' -s '<agent>-watch-<pid>'), attaches read-only to that session, and kills it on detach. The grouped session shares the agent's windows but has independent sizing, so the agent's window is never affected. tmux attach is still interactive and requires inherited stdio; the interactiveRunner handles TTY passthrough.
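
The bounded verify poll described in the list above can be sketched as follows. This is a hedged illustration only: the classifySendResult body here is a hypothetical stand-in for the real function (its actual signature and heuristics are not shown in this document), verifySend and VerifyOptions are invented names, and the wait between captures is omitted so the decision logic stays synchronous and easy to test.

```typescript
type SendResult = "accepted" | "draft" | "unverifiable";

// Hypothetical stand-in for classifySendResult. Assumptions: a blank or
// unchanged pane is unverifiable, and a last non-empty line that starts
// with "> " followed by text is an unsubmitted draft.
function classifySendResult(before: string, after: string): SendResult {
  if (after.trim() === "" || after === before) return "unverifiable";
  const lines = after.split("\n").filter((l) => l.trim() !== "");
  const last = lines[lines.length - 1] ?? "";
  return /^\s*> \S/.test(last) ? "draft" : "accepted";
}

interface VerifyOptions {
  timeoutMs?: number; // total budget, default ~6 s
  pollMs?: number;    // gap between captures, default ~400 ms
}

interface VerifyOutcome {
  ok: boolean; // true only when the message was accepted
  result: SendResult;
  polls: number;
}

// capture() returns the current pane text. A real implementation would
// sleep pollMs between calls; that wait is left out of this sketch.
function verifySend(
  before: string,
  capture: () => string,
  opts: VerifyOptions = {},
): VerifyOutcome {
  const timeoutMs = opts.timeoutMs ?? 6000;
  const pollMs = opts.pollMs ?? 400;
  const maxPolls = Math.max(1, Math.floor(timeoutMs / pollMs));
  for (let i = 1; i <= maxPolls; i++) {
    const result = classifySendResult(before, capture());
    // Definitive results end the loop immediately.
    if (result !== "unverifiable") {
      return { ok: result === "accepted", result, polls: i };
    }
  }
  // Fail closed: no pane change within the budget.
  return { ok: false, result: "unverifiable", polls: maxPolls };
}
```

With the defaults this allows 15 captures before failing closed. A caller would map ok === false to a non-zero exit code, per FR-5.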

Surfaces & parity (MVP-X1)

CLI lands this phase. TUI surface follows in the packages/mosaic wizard; webUI in Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.