feat(fleet): Phase-2 observability — fleet ps + watch + send verify (#579)

This commit was merged in pull request #579.
2026-06-21 04:23:51 +00:00
parent 5118be74cb
commit af2eede7a9
6 changed files with 2041 additions and 6 deletions

docs/fleet/PRD.md Normal file

@@ -0,0 +1,109 @@
# PRD — Fleet Phase 2: Operator Observability
> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1` — restore operator observability into fleet agent sessions.
## Problem
The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes no single command joins — systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
## Goals
1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).
## Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3) — instrument the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
## Functional requirements
| ID | Requirement |
| ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |
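
As a rough illustration of FR-1 and FR-6, the sketch below joins per-agent facts into one `--json` row. Only `tenant_id` and `host` are field names taken from this PRD; every other field name, the `AgentFacts` input type, and the `toRow` helper are hypothetical. The drift comparison (pane command versus roster runtime by plain string equality) and the choice to report no drift for a dead pane are assumptions as well.

```typescript
// Hypothetical row shape for `mosaic fleet ps --json`. Only `tenant_id` and
// `host` are names from the PRD (FR-6); the rest are illustrative.
interface FleetPsRow {
  name: string;
  tenant_id: string;
  host: string;
  runtime: string; // runtime declared in the roster
  systemd: { active: boolean; enabled: boolean };
  pane: "alive" | "dead";
  pid: number | null;
  idle_s: number | null;
  heartbeat_age_s: number | null;
  drift: boolean;
  boot_enable_warning: boolean;
}

// Raw facts gathered from the three planes (roster, systemd, tmux).
interface AgentFacts {
  name: string;
  tenantId: string;
  host: string;
  rosterRuntime: string;
  activeState: string; // systemd ActiveState, e.g. "active"
  unitFileState: string; // systemd UnitFileState, e.g. "enabled" or "disabled"
  paneCommand: string | null; // null when the pane is dead
  pid: number | null;
  idleSeconds: number | null;
  heartbeatAgeSeconds: number | null;
}

function toRow(f: AgentFacts): FleetPsRow {
  const active = f.activeState === "active";
  return {
    name: f.name,
    tenant_id: f.tenantId,
    host: f.host,
    runtime: f.rosterRuntime,
    systemd: { active, enabled: f.unitFileState === "enabled" },
    pane: f.paneCommand === null ? "dead" : "alive",
    pid: f.pid,
    idle_s: f.idleSeconds,
    heartbeat_age_s: f.heartbeatAgeSeconds,
    // Drift: roster runtime differs from the actual pane command.
    // Assumption: a dead pane has no command, so it is not reported as drift.
    drift: f.paneCommand !== null && f.paneCommand !== f.rosterRuntime,
    // Boot-enable warning: running now, but would not start at boot.
    boot_enable_warning: active && f.unitFileState === "disabled",
  };
}
```

Under these assumptions, the `canary-pi` case from the acceptance criteria (roster runtime `pi`, pane running `dogfood-agent.py`, unit active but disabled) would set both `drift` and `boot_enable_warning`.
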
## Heartbeat protocol v1
- **Probe:** the operator (or `fleet ps`) writes a sentinel line either to the agent's input or to a
well-known per-agent heartbeat file, `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and
uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
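
The health rule above can be sketched as a pure function over the contents of `<agent>.hb`. This is a minimal sketch, not the shipped implementation: `parseHeartbeat` and `classifyHealth` are hypothetical names, the whitespace-separated `key=value` parsing is inferred from the response format, and treating a malformed file as `unknown` is an assumption (the protocol only defines that outcome for a missing file).

```typescript
type Health = "healthy" | "stale" | "unknown";

interface Heartbeat {
  ts: Date;
  pid: number;
  status: "ok" | "busy";
}

// Parse one `ts=<iso8601> pid=<pid> status=<ok|busy>` line.
// Returns null when a field is missing or malformed.
function parseHeartbeat(line: string): Heartbeat | null {
  const fields = new Map<string, string>();
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }
  const ts = new Date(fields.get("ts") ?? "");
  const pid = Number(fields.get("pid"));
  const status = fields.get("status");
  if (Number.isNaN(ts.getTime())) return null;
  if (!Number.isInteger(pid) || pid <= 0) return null;
  if (status !== "ok" && status !== "busy") return null;
  return { ts, pid, status };
}

// `contents` is the heartbeat file text, or null when the file is missing.
// Default interval is 15 s, matching the protocol default.
function classifyHealth(contents: string | null, now: Date, intervalMs = 15_000): Health {
  if (contents === null) return "unknown"; // missing file
  const hb = parseHeartbeat(contents);
  if (hb === null) return "unknown"; // assumption: malformed is treated like missing
  const ageMs = now.getTime() - hb.ts.getTime();
  return ageMs <= 3 * intervalMs ? "healthy" : "stale";
}
```

With the default interval, a heartbeat exactly 45 s old still counts as `healthy`, and one 46 s old is `stale`.
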
## Acceptance criteria
- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
`dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
`pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.
## Test plan
- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
`fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.
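
To make the command-construction assertions concrete, here is a hedged sketch of the grouped-session `agent watch` flow driven through an injected runner. The `WatchRunner` interface is a hypothetical stand-in for the real `CommandRunner`, whose signature this PRD does not show; `watchPlan` and `watchAgent` are illustrative names. The `new-session` arguments follow the command quoted under Known limitations, while the exact `attach-session -r` and `kill-session` invocations (including the `=` exact-match prefix on the viewer session) are assumptions.

```typescript
type TmuxArgv = string[];

// Hypothetical stand-in for the injected CommandRunner.
interface WatchRunner {
  run(argv: TmuxArgv): void;
}

const FLEET_SOCKET = "mosaic-factory";

// Build the three tmux invocations for a read-only grouped viewer session.
function watchPlan(
  agent: string,
  viewerPid: number,
): { create: TmuxArgv; attach: TmuxArgv; cleanup: TmuxArgv } {
  const viewer = `${agent}-watch-${viewerPid}`;
  const tmux = ["tmux", "-L", FLEET_SOCKET];
  return {
    // Throwaway session grouped with the agent's session (shared windows,
    // independent sizing).
    create: [...tmux, "new-session", "-d", "-t", `=${agent}`, "-s", viewer],
    // Read-only attach to the viewer session, never to the agent session.
    attach: [...tmux, "attach-session", "-r", "-t", `=${viewer}`],
    cleanup: [...tmux, "kill-session", "-t", `=${viewer}`],
  };
}

function watchAgent(agent: string, viewerPid: number, runner: WatchRunner): void {
  const plan = watchPlan(agent, viewerPid);
  runner.run(plan.create);
  try {
    runner.run(plan.attach); // returns once the operator detaches
  } finally {
    runner.run(plan.cleanup); // always remove the viewer session
  }
}
```

A spec can pass a recording runner and compare the captured argv arrays against the expected tmux commands, without touching a live tmux server.
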
## Known limitations
- **Verify heuristic is best-effort:** `agent send --verify` detects drafts with a `>`-prefix
heuristic that is specific to the pi/claude TUIs. Codex and opencode TUIs may not use the
same input-line indicator, so draft detection for those runtimes is best-effort only.
- **Bounded pane-change polling is the best Phase-2 signal:** `agent send --verify`
captures a BEFORE snapshot, sends the message, then polls `capture-pane` every ~400 ms
up to a configurable total timeout (default ~6 s, set with `--verify-timeout <ms>`). Each
poll runs `classifySendResult`: an 'accepted' or 'draft' result exits the loop immediately,
while 'unverifiable' (no pane change yet) keeps polling. If the timeout expires without a
definitive result, verify fails closed: exit 1 with "no pane change after send". This
removes the false 'unverifiable' failures that the old fixed 300 ms single capture produced
on slow or loaded TUIs. Definitive acceptance ultimately requires a runtime
acknowledgement (Phase-3 heartbeat-ack); until then the bounded pane-change poll is the
best signal available against an opaque TUI.
- **Blank AFTER capture fails closed:** Full-screen TUIs (claude, codex, opencode, pi)
render blank for `tmux capture-pane`. When the AFTER snapshot is empty, `send --verify`
returns non-zero with an "unverifiable" message rather than silently succeeding. This
is an intentional fail-closed design (FR-5).
- **`agent watch` uses a grouped viewer session:** `tmux attach -r` directly against the
agent session lets the viewer terminal shrink the agent's window. `agent watch` instead
creates a throwaway grouped session (`tmux new-session -d -t '=<agent>' -s
'<agent>-watch-<pid>'`), attaches read-only to that session, and kills it on detach.
The grouped session shares the agent's windows but has independent sizing, so the
agent's window is never affected. `tmux attach` is still interactive and requires
inherited stdio; the `interactiveRunner` handles TTY passthrough.
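
The bounded verify poll described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the body of `classifySendResult` is hypothetical (the PRD names the function and its three outcomes but not its logic), the draft check is a simplified reading of the `>`-prefix heuristic, and the injected `capture`, `send`, and `sleep` functions are kept synchronous purely so the loop is easy to test. A real CLI would presumably await tmux calls and timers.

```typescript
type SendResult = "accepted" | "draft" | "unverifiable";

// Hypothetical classifier. Blank or unchanged panes are unverifiable;
// a last line like "> <message>" is treated as an unsubmitted draft.
function classifySendResult(before: string, after: string, message: string): SendResult {
  if (after.trim() === "" || after === before) return "unverifiable";
  const lines = after.split("\n").map((l) => l.trim()).filter((l) => l !== "");
  const last = lines[lines.length - 1] ?? "";
  if (last.startsWith(">") && last.includes(message)) return "draft";
  return "accepted";
}

interface VerifyDeps {
  capture: () => string; // e.g. wraps `tmux capture-pane -p`
  send: (message: string) => void; // injects the keystrokes
  sleep: (ms: number) => void; // injected so tests need no real timers
}

interface VerifyOutcome {
  exitCode: 0 | 1;
  result: SendResult;
  detail: string;
}

function sendAndVerify(
  message: string,
  deps: VerifyDeps,
  timeoutMs = 6000,
  pollMs = 400,
): VerifyOutcome {
  const before = deps.capture(); // BEFORE snapshot
  deps.send(message);
  for (let waited = 0; waited < timeoutMs; waited += pollMs) {
    deps.sleep(pollMs);
    const result = classifySendResult(before, deps.capture(), message);
    if (result === "accepted") return { exitCode: 0, result, detail: "accepted" };
    if (result === "draft") {
      return { exitCode: 1, result, detail: "message left as unsubmitted draft" };
    }
    // "unverifiable": no pane change yet, keep polling.
  }
  // Fail closed after the bounded timeout.
  return { exitCode: 1, result: "unverifiable", detail: "no pane change after send" };
}
```

With the defaults this performs at most 15 polls after the BEFORE capture, and a pane that stays blank or unchanged for the whole window ends in exit code 1.
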
## Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.