stack/docs/fleet/PRD.md
Jarvis ddeb200fdf
style(fleet): prettier-format workstream docs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
2026-06-20 22:32:14 -05:00


PRD — Fleet Phase 2: Operator Observability

Workstream: W-FLEET under mvp-20260312 · Phase: 2
North star: docs/fleet/north-star.md
Source umbrella PRD: docs/PRD.md (Mosaic Stack v0.1.0)
Tracks task: fleet-observability-1 (restore operator observability into fleet agent sessions)

Problem

The durable tmux fleet runs on the isolated mosaic-factory socket. That isolation (which protects the operator's default tmux) makes the fleet invisible to default tooling, and truth is split across three planes no single command joins — systemd (systemctl --user), tmux (-L mosaic-factory), and the process tree (pstree). agent tail (capture-pane) returns blank for full-screen TUIs, and agent send confirms only keystroke injection, not acceptance. Net: the operator has near-zero observability and no safe way to watch a session.

Goals

  1. One command shows the whole fleet's real state, joining all three planes.
  2. Liveness is truthful: healthy = answered a heartbeat, not "pane alive".
  3. The operator can watch any session read-only without disrupting it.
  4. send reports delivered-and-accepted, not just injected.
  5. Every record/address carries tenant_id + host (zero foreclosure for multi-tenant/multi-host).

Non-goals (this phase)

  • No webUI (Phase 5; rides federation for cross-host).
  • No fleetd daemon or persistent history store.
  • No real-runtime swap (Phase 3) — instrument the live dogfood stub fleet.
  • No cross-host aggregation yet (addressing is host-tagged but queries stay local).

Functional requirements

| ID | Requirement |
| --- | --- |
| FR-1 | mosaic fleet ps [--json] prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · last-heartbeat age · drift flag (roster runtime ≠ actual pane command) · boot-enable warning (active but UnitFileState=disabled). |
| FR-2 | Heartbeat protocol v1 (see below); dogfood-agent.py implements the responder. fleet ps issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | mosaic agent watch <name> opens a read-only view of the pane (grouped session or tmux attach -r) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | mosaic agent attach <name> remains the explicit interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | mosaic agent send <name> --verify confirms the message was accepted (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (--json) includes tenant_id and host fields. |
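
To make the FR-1 join and the FR-6 fields concrete, here is a minimal sketch of one fleet ps row, assuming the three planes have already been queried. The class name FleetRow, every field name other than tenant_id and host, and the JSON keys drift and boot_enable_warning are hypothetical. The sketch also assumes the pane command has been normalized upstream so it can be compared directly with the roster runtime.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class FleetRow:
    """One `fleet ps` row (hypothetical shape; only tenant_id/host are named by the PRD)."""

    name: str
    tenant_id: str
    host: str
    runtime: str          # runtime declared in the roster
    active_state: str     # systemd ActiveState, e.g. "active"
    unit_file_state: str  # systemd UnitFileState, e.g. "enabled" or "disabled"
    pane_alive: bool      # from the tmux plane
    pane_command: str     # what the pane actually runs (assumed normalized upstream)
    pid: int | None       # from the process plane

    @property
    def drift(self) -> bool:
        # FR-1 drift flag: roster runtime differs from the actual pane command.
        return self.runtime != self.pane_command

    @property
    def boot_enable_warning(self) -> bool:
        # FR-1 boot-enable warning: unit is active now but would not start at boot.
        return self.active_state == "active" and self.unit_file_state == "disabled"

    def to_json(self) -> str:
        # FR-6: tenant_id and host are plain fields, so they are always emitted.
        record = asdict(self)
        record["drift"] = self.drift
        record["boot_enable_warning"] = self.boot_enable_warning
        return json.dumps(record)
```

With the values from the acceptance criteria (roster runtime pi, pane running dogfood-agent.py, unit active but disabled), both flags come out true.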

Heartbeat protocol v1

  • Probe: the operator or fleet ps writes a sentinel line either to the agent's input or to the well-known per-agent heartbeat file, ~/.config/mosaic/fleet/run/<agent>.hb.
  • Response: the runtime updates <agent>.hb with ts=<iso8601> pid=<pid> status=<ok|busy> on a fixed interval (default 15s) and on demand when probed.
  • Health rule: healthy if now - ts <= 3 × interval; else stale; missing file = unknown.
  • Contract: every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3) MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and full-screen TUIs alike (no capture-pane dependency).
  • ASSUMPTION: file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
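
The record format and health rule above can be sketched as follows. This is an illustration, not the dogfood-agent.py implementation: the function names are hypothetical, the atomic rename in write_heartbeat is an added assumption (the protocol text does not say how the file is updated), and malformed files are left unhandled because the PRD does not define that case.

```python
from __future__ import annotations

import os
from datetime import datetime, timedelta, timezone
from pathlib import Path

DEFAULT_INTERVAL_S = 15  # default interval from the protocol text


def format_heartbeat(ts: datetime, pid: int, status: str) -> str:
    """Render one record as `ts=<iso8601> pid=<pid> status=<ok|busy>`."""
    if status not in ("ok", "busy"):
        raise ValueError(f"unknown status: {status}")
    return f"ts={ts.isoformat()} pid={pid} status={status}"


def parse_heartbeat(line: str) -> dict[str, str]:
    """Split space-separated key=value pairs into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


def write_heartbeat(hb_path: Path, pid: int, status: str) -> None:
    """Responder side. Assumption: write a temp file and rename it over the
    target so a reader never sees a half-written line."""
    tmp = hb_path.with_name(hb_path.name + ".tmp")
    tmp.write_text(format_heartbeat(datetime.now(timezone.utc), pid, status) + "\n")
    os.replace(tmp, hb_path)


def classify(hb_path: Path, now: datetime, interval_s: int = DEFAULT_INTERVAL_S) -> str:
    """Health rule: healthy if now - ts <= 3 x interval, else stale;
    a missing file is unknown."""
    try:
        record = parse_heartbeat(hb_path.read_text().strip())
    except FileNotFoundError:
        return "unknown"
    ts = datetime.fromisoformat(record["ts"])
    return "healthy" if now - ts <= timedelta(seconds=3 * interval_s) else "stale"
```

With the default 15s interval, a record turns stale once it is more than 45 seconds old. Timestamps are written with an explicit UTC offset so the subtraction in classify compares timezone-aware values.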

Acceptance criteria

  • mosaic fleet ps shows all 5 live sessions on mosaic-factory with correct pane/pid/idle and flags the dogfood drift (canary-pi runtime=pi but pane runs dogfood-agent.py) and the boot-enable gap (active but disabled).
  • Killing one agent's pane flips its row to dead/stale within one interval.
  • agent watch shows live output and provably cannot type into the pane; detaching leaves the agent's window size unchanged.
  • agent send --verify returns success on an accepting pane and non-zero on a wedged/draft pane.
  • Quality gates green: pnpm typecheck, pnpm lint, pnpm format:check, plus pnpm --filter @mosaicstack/mosaic test.
  • Independent review passed; dogfood evidence captured against the live fleet.

Test plan

  • Unit/CLI specs in packages/mosaic/src/commands/fleet.spec.ts (and a new fleet-ps/watch/send-verify spec) using the injected CommandRunner to assert exact tmux/systemd command construction and JSON shape (tenant+host present).
  • Situational: run against the live mosaic-factory fleet; capture fleet ps output, a kill-and-detect cycle, a read-only watch, and a send --verify pass/fail pair.
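
The command-construction assertions in the first bullet could take roughly this shape. The sketch is in Python for illustration only; the real specs are the TypeScript files named above, and the injected CommandRunner's actual interface is not shown in this PRD. The names watch_argv, watch, and RecordingRunner are hypothetical, and the sketch assumes the tmux session is named after the agent.

```python
from __future__ import annotations

from typing import Callable, Sequence

SOCKET = "mosaic-factory"  # isolated tmux socket named in the PRD

# Hypothetical stand-in for the injected CommandRunner: takes argv, returns an exit code.
Runner = Callable[[Sequence[str]], int]


def watch_argv(session: str) -> list[str]:
    """Build the read-only attach command for FR-3 (the `tmux attach -r` variant)."""
    return ["tmux", "-L", SOCKET, "attach-session", "-r", "-t", session]


def watch(session: str, run: Runner) -> int:
    """Run the watch command through the injected runner."""
    return run(watch_argv(session))


class RecordingRunner:
    """Test double that records each argv instead of spawning a process."""

    def __init__(self) -> None:
        self.calls: list[list[str]] = []

    def __call__(self, argv: Sequence[str]) -> int:
        self.calls.append(list(argv))
        return 0
```

A spec would then call watch("canary-pi", runner) and compare runner.calls against the exact expected argv. Whether -r alone also meets the no-shrink requirement depends on the installed tmux version and should be checked against its man page.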

Surfaces & parity (MVP-X1)

The CLI lands this phase. The TUI surface follows in the packages/mosaic wizard, and the webUI arrives in Phase 5 via federation. This PRD records the parity debt explicitly so it is not lost.