# PRD · Fleet Phase 2: Operator Observability

> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1`: restore operator observability into fleet agent sessions.

## Problem

The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation (which protects the operator's default tmux) makes the fleet **invisible** to default tooling, and truth is split across three planes that no single command joins: systemd (`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`). `agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send` confirms only keystroke injection, not acceptance. Net: the operator has near-zero observability and no safe way to watch a session.

## Goals

1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).

## Non-goals (this phase)

- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3); instrument the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).

## Functional requirements

| ID | Requirement |
| ---- | ---- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch ` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach ` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |

## Heartbeat protocol v1

- **Probe:** operator/`fleet ps` writes a sentinel line to the agent's input or a well-known per-agent heartbeat file path `~/.config/mosaic/fleet/run/.hb`.
- **Response:** the runtime updates `.hb` with `ts= pid= status=` on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3) MUST emit the heartbeat.
  The protocol is file-based so it works for headless stubs and full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).

## Acceptance criteria

- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs `dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus `pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.

## Test plan

- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new `fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output, a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.

## Known limitations

- **Verify heuristic is best-effort:** `agent send --verify` uses a `>`-prefix draft heuristic that is specific to the pi and claude TUIs. Draft detection for the codex and opencode TUIs is best-effort only, because those runtimes may not use the same input-line indicator.
- **Pane-change check is the best Phase-2 signal; verify now polls up to a bounded timeout:** `agent send --verify` captures a BEFORE snapshot, sends the message, then polls `capture-pane` every ~400 ms up to a configurable total timeout (default ~6 s, controlled by `--verify-timeout `). On each poll it runs `classifySendResult`: if the pane shows `accepted` or `draft`, the loop exits immediately; while the result is `unverifiable` (no pane change yet), it keeps polling. If the timeout expires with no definitive result, it fails closed: exit 1 with "no pane change after send". This eliminates the false `unverifiable` failures on slow or loaded TUIs that the previous fixed 300 ms single capture caused. Definitive acceptance ultimately requires a runtime acknowledgement (Phase-3 heartbeat-ack); the bounded pane-change poll is the best signal available against an opaque TUI in Phase 2.
- **Blank AFTER capture fails closed:** Full-screen TUIs (claude, codex, opencode, pi) render blank for `tmux capture-pane`. When the AFTER snapshot is empty, `send --verify` returns non-zero with an "unverifiable" message rather than silently succeeding. This is an intentional fail-closed design (FR-5).
- **`agent watch` uses a grouped viewer session:** `tmux attach -r` directly against the agent session lets the viewer terminal shrink the agent's window. `agent watch` instead creates a throwaway grouped session (`tmux new-session -d -t '=' -s '-watch-'`), attaches read-only to that session, and kills it on detach. The grouped session shares the agent's windows but has independent sizing, so the agent's window is never affected. `tmux attach` is still interactive and requires inherited stdio; the `interactiveRunner` handles TTY passthrough.

## Surfaces & parity (MVP-X1)

CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in Phase 5 via federation. This PRD records the parity debt explicitly so it is not lost.
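
## Appendix: heartbeat health rule sketch

The health rule in "Heartbeat protocol v1" (`healthy` if `now - ts <= 3 × interval`, else `stale`, missing file = `unknown`) can be sketched as below. This is a minimal illustration, not the shipped responder or the `fleet ps` implementation. The function names (`write_heartbeat`, `classify_health`), the encoding of `ts` as Unix epoch seconds, the single-token `status` value, the temp-file-then-rename write, and the choice to report a malformed file as `unknown` are all assumptions; the PRD only fixes the `ts= pid= status=` field names, the 15 s default interval, and the three health states.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

DEFAULT_INTERVAL_S = 15.0  # default heartbeat interval stated in the protocol
STALE_FACTOR = 3           # healthy while age <= 3 x interval


def write_heartbeat(path: Path, pid: int, status: str,
                    now: Optional[float] = None) -> None:
    """Responder side: write one `ts= pid= status=` line.

    Assumption: ts is Unix epoch seconds; the PRD does not fix the encoding.
    The temp-file + rename keeps readers from seeing a half-written line.
    """
    ts = time.time() if now is None else now
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(f"ts={ts:.3f} pid={pid} status={status}\n")
    tmp.replace(path)  # atomic rename on the same filesystem


def classify_health(path: Path, interval_s: float = DEFAULT_INTERVAL_S,
                    now: Optional[float] = None) -> str:
    """Reader side: apply the health rule to one heartbeat file."""
    try:
        text = path.read_text()
    except FileNotFoundError:
        return "unknown"  # missing file = unknown (per the protocol)

    # Split "key=value" tokens into a dict, ignoring anything else.
    fields = dict(tok.split("=", 1) for tok in text.split() if "=" in tok)
    try:
        ts = float(fields["ts"])
    except (KeyError, ValueError):
        return "unknown"  # assumption: a malformed file is treated as unknown

    current = time.time() if now is None else now
    age = current - ts
    return "healthy" if age <= STALE_FACTOR * interval_s else "stale"


if __name__ == "__main__":
    # Hypothetical agent name and directory, used only for this demo.
    hb = Path(tempfile.mkdtemp()) / "demo-agent.hb"
    print(classify_health(hb))
    write_heartbeat(hb, pid=os.getpid(), status="ok")
    print(classify_health(hb))
```

Because the reader needs only the file's contents and a clock, the same check applies to headless stubs and full-screen TUIs, which is the property the file-based `ASSUMPTION:` relies on. Passing `now` explicitly keeps the rule testable without sleeping.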