Blocker fix: send --verify now captures a BEFORE snapshot immediately before the send and an AFTER snapshot after the delay, then uses classifySendResult(before, after) to classify. A wedged pane showing stale non-empty content is no longer falsely reported as 'accepted': BEFORE==AFTER maps to 'unverifiable' (exit 1, "no pane change after send"). Blank AFTER still fails closed as 'unverifiable'. Only AFTER != BEFORE without a draft suffix counts as 'accepted' (exit 0).

Should-fix: agent watch now uses a GROUPED VIEWER SESSION instead of a bare 'tmux attach -r' against the agent session. A bare attach lets the viewer terminal shrink the agent's window; a grouped session has independent sizing so the agent's window is never affected. Sequence: new-session -d -t '=<agent>' -s '<agent>-watch-<pid>' (runner), attach -r to viewer session (interactiveRunner), kill-session on detach (runner).

New builder functions exported: buildAgentWatchCreateViewerCommand, buildAgentWatchAttachCommand, buildAgentWatchKillViewerCommand, buildViewerSessionName. buildAgentWatchCommand kept but deprecated.

New exports: classifySendResult(before, after), the testable classifier.

Tests added:
- classifySendResult unit suite (6 cases): accepted/draft/unverifiable/stale-pane/both-blank/before-blank-after-response
- send --verify regression: stale (before==after non-empty) => exit 1
- send --verify regression: blank AFTER => exit 1
- send --verify regression: draft after pane change => exit 1
- send --verify regression: changed non-draft => exit 0
- send --verify: 3-call sequence assertion (before-capture, send, after-capture)
- watch dispatch: grouped viewer session created/attached/killed; no bare attach against agent session; viewer name matches <agent>-watch-<pid>

PRD Known-limitations updated: pane-change check rationale, Phase-3 heartbeat-ack requirement, grouped-session watch design.

All gates pass: pnpm typecheck, pnpm lint, pnpm --filter @mosaicstack/mosaic test (382 tests, 74 fleet), prettier --check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
107 lines
8.1 KiB
Markdown
# PRD — Fleet Phase 2: Operator Observability

> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1` (restore operator observability into fleet agent sessions).

## Problem

The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes that no single command joins: systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.

## Goals

1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (so multi-tenant and multi-host support is not foreclosed).

## Non-goals (this phase)

- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3); this phase instruments the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).

## Functional requirements

| ID | Requirement |
| ---- | ----------- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (a grouped viewer session attached with `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |
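
To make FR-1 and FR-6 concrete, here is a minimal sketch of what one `fleet ps --json` row could look like. Only `tenant_id` and `host` are named by this PRD; every other field name, the `withFlags` helper, and the sample values (`tenant-a`, `host-a`, pid `4242`) are assumptions for illustration, not the shipped schema.

```typescript
// Hypothetical shape of one `mosaic fleet ps --json` row. Only `tenant_id` and
// `host` are named by the PRD (FR-6); every other field name is an assumption.
interface FleetPsRow {
  name: string;
  tenant_id: string;
  host: string;
  runtime: string; // runtime declared in the roster
  systemd: { active: boolean; unitFileState: string };
  pane: 'alive' | 'dead';
  paneCommand: string | null; // command observed in the pane, null if the pane is dead
  pid: number | null;
  idleSeconds: number | null;
  lastHeartbeatAgeSeconds: number | null;
  drift: boolean;
  bootEnableWarning: boolean;
}

type RowInput = Omit<FleetPsRow, 'drift' | 'bootEnableWarning'>;

// Fill in the two FR-1 warning flags from the joined systemd/tmux facts.
function withFlags(row: RowInput): FleetPsRow {
  // Assumption: drift is detected by comparing the roster runtime name with the
  // pane command; a dead pane has no command, so it is never reported as drift.
  const drift = row.paneCommand !== null && row.paneCommand !== row.runtime;
  // Boot-enable gap: the unit is running now but would not start at boot.
  const bootEnableWarning =
    row.systemd.active && row.systemd.unitFileState === 'disabled';
  return { ...row, drift, bootEnableWarning };
}

// Sample values are invented; they mirror the dogfood drift case in the
// acceptance criteria (roster says `pi`, pane runs `dogfood-agent.py`).
const exampleRow = withFlags({
  name: 'canary-pi',
  tenant_id: 'tenant-a',
  host: 'host-a',
  runtime: 'pi',
  systemd: { active: true, unitFileState: 'disabled' },
  pane: 'alive',
  paneCommand: 'dogfood-agent.py',
  pid: 4242,
  idleSeconds: 12,
  lastHeartbeatAgeSeconds: 7,
});
console.log(JSON.stringify(exampleRow, null, 2));
```

Keeping the flags as derived fields means the `--json` consumer sees the same drift and boot-enable verdicts as the human-readable table, rather than recomputing them.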

## Heartbeat protocol v1

- **Probe:** the operator or `fleet ps` writes a sentinel line to the agent's input or to a
  well-known per-agent heartbeat file, `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
  on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
  MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
  full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and
  uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
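
The response format and health rule above can be sketched as two pure functions. This is an illustration under stated assumptions, not the shipped reader: the names `parseHeartbeat` and `classifyHealth` are hypothetical, and treating a malformed line the same as a missing file (`unknown`) is a choice the protocol text does not specify.

```typescript
type Health = 'healthy' | 'stale' | 'unknown';

interface Heartbeat {
  ts: Date;
  pid: number;
  status: 'ok' | 'busy';
}

// Parse one `ts=<iso8601> pid=<pid> status=<ok|busy>` line.
// Returns null when any field is missing or malformed.
function parseHeartbeat(line: string): Heartbeat | null {
  const fields: Record<string, string | undefined> = {};
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf('=');
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  const ts = new Date(fields['ts'] ?? '');
  const pid = Number(fields['pid']);
  const status = fields['status'];
  if (isNaN(ts.getTime()) || isNaN(pid) || Math.floor(pid) !== pid) return null;
  if (status !== 'ok' && status !== 'busy') return null;
  return { ts, pid, status };
}

// Health rule from the protocol: healthy if now - ts <= 3 x interval, else stale.
// Assumption: a null heartbeat (missing file or unparseable line) is `unknown`.
function classifyHealth(
  hb: Heartbeat | null,
  now: Date,
  intervalSeconds: number = 15,
): Health {
  if (hb === null) return 'unknown';
  const ageSeconds = (now.getTime() - hb.ts.getTime()) / 1000;
  return ageSeconds <= 3 * intervalSeconds ? 'healthy' : 'stale';
}

const sampleHb = parseHeartbeat('ts=2026-01-01T00:00:00Z pid=4242 status=ok');
const atLimit = classifyHealth(sampleHb, new Date('2026-01-01T00:00:45Z'));
const pastLimit = classifyHealth(sampleHb, new Date('2026-01-01T00:00:46Z'));
const missing = classifyHealth(null, new Date('2026-01-01T00:00:46Z'));
console.log(atLimit, pastLimit, missing);
```

Passing `now` in as a parameter keeps the rule deterministic under test; with the default 15s interval the boundary between `healthy` and `stale` falls at a heartbeat age of 45 seconds.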

## Acceptance criteria

- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
  pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
  `dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
  leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
  `pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.

## Test plan

- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
  `fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
  exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
  a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.
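
A command-construction assertion of the kind described above might look like the following sketch for the grouped-session `watch` flow. The `CommandRunner` signature, the helper names (`viewerSessionName`, `createViewerArgv`, `attachViewerArgv`, `killViewerArgv`, `watchAgent`), and the sample agent name and pid are all assumptions for illustration; the real builders and runner interface in `packages/mosaic` may differ.

```typescript
type Argv = string[];
// Assumption: a runner takes an argv array and runs it; the real interface may differ.
type CommandRunner = (argv: Argv) => void;

const FLEET_SOCKET = 'mosaic-factory';

function viewerSessionName(agent: string, pid: number): string {
  return `${agent}-watch-${pid}`;
}

// Create a detached session grouped with the agent session ('=' forces an exact match).
function createViewerArgv(agent: string, viewer: string): Argv {
  return ['tmux', '-L', FLEET_SOCKET, 'new-session', '-d', '-t', `=${agent}`, '-s', viewer];
}

// Attach read-only to the viewer session, never to the agent session itself.
function attachViewerArgv(viewer: string): Argv {
  return ['tmux', '-L', FLEET_SOCKET, 'attach', '-r', '-t', `=${viewer}`];
}

function killViewerArgv(viewer: string): Argv {
  return ['tmux', '-L', FLEET_SOCKET, 'kill-session', '-t', `=${viewer}`];
}

// Create, attach, then always clean up the throwaway viewer session.
function watchAgent(
  agent: string,
  pid: number,
  runner: CommandRunner,
  interactiveRunner: CommandRunner,
): void {
  const viewer = viewerSessionName(agent, pid);
  runner(createViewerArgv(agent, viewer));
  try {
    interactiveRunner(attachViewerArgv(viewer));
  } finally {
    runner(killViewerArgv(viewer));
  }
}

// Recording fakes stand in for the injected runners so a spec can assert the
// exact sequence without touching a real tmux server.
const recordedCalls: string[] = [];
watchAgent(
  'canary-pi',
  4242,
  (argv) => { recordedCalls.push('run: ' + argv.join(' ')); },
  (argv) => { recordedCalls.push('tty: ' + argv.join(' ')); },
);
console.log(recordedCalls.join('\n'));
```

Recording which runner received each command lets a spec check both the order (create, attach, kill) and that only the attach step goes through the interactive path.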

## Known limitations

- **Verify heuristic is best-effort:** `agent send --verify` uses a `>`-prefix draft
  heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode
  TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- **Pane-change check is the best Phase-2 signal:** `agent send --verify` compares a
  BEFORE snapshot (captured immediately before the send) to an AFTER snapshot (captured
  after the send delay). A pane that changed and does not end in a draft line is reported
  as 'accepted'. A pane that did not change, including a wedged pane showing stale
  non-empty content, is reported as 'unverifiable' (exit 1, "no pane change after send").
  Definitive acceptance ultimately requires a runtime acknowledgement (Phase-3
  heartbeat-ack); the pane-change check is the best signal available against an opaque
  TUI in Phase 2.
- **Blank AFTER capture fails closed:** Full-screen TUIs (claude, codex, opencode, pi)
  render blank for `tmux capture-pane`. When the AFTER snapshot is empty, `send --verify`
  returns non-zero with an "unverifiable" message rather than silently succeeding. This
  is an intentional fail-closed design (FR-5).
- **`agent watch` uses a grouped viewer session:** `tmux attach -r` directly against the
  agent session lets the viewer terminal shrink the agent's window. `agent watch` instead
  creates a throwaway grouped session
  (`tmux new-session -d -t '=<agent>' -s '<agent>-watch-<pid>'`), attaches read-only to
  that session, and kills it on detach. The grouped session shares the agent's windows
  but has independent sizing, so the agent's window is never affected. `tmux attach` is
  still interactive and requires inherited stdio; the `interactiveRunner` handles TTY
  passthrough.
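
The BEFORE/AFTER classification described in the bullets above can be sketched as a pure function. This is not the shipped `classifySendResult`: the name `classifySendResultSketch`, the result shape, the reason strings other than "no pane change after send", and the exact draft test (last non-blank line starts with `>` followed by text) are assumptions for illustration.

```typescript
type SendVerdict = 'accepted' | 'draft' | 'unverifiable';

interface SendResult {
  verdict: SendVerdict;
  exitCode: 0 | 1;
  reason: string;
}

function lastNonBlankLine(pane: string): string {
  const lines = pane.split('\n');
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i] ?? '';
    if (line.trim() !== '') return line;
  }
  return '';
}

// Assumption: a draft is a final non-blank line of the form `> some text`,
// i.e. the `>`-prefix heuristic applied to the last line of the capture.
function looksLikeDraft(pane: string): boolean {
  return /^\s*>\s*\S/.test(lastNonBlankLine(pane));
}

function classifySendResultSketch(before: string, after: string): SendResult {
  // Fail closed: a blank AFTER capture proves nothing about acceptance.
  if (after.trim() === '') {
    return { verdict: 'unverifiable', exitCode: 1, reason: 'blank pane capture after send' };
  }
  // A wedged pane showing stale content looks identical before and after.
  if (after === before) {
    return { verdict: 'unverifiable', exitCode: 1, reason: 'no pane change after send' };
  }
  // The pane changed, but the message is still sitting on the input line.
  if (looksLikeDraft(after)) {
    return { verdict: 'draft', exitCode: 1, reason: 'message left as unsubmitted draft' };
  }
  return { verdict: 'accepted', exitCode: 0, reason: 'pane changed after send' };
}

const staleResult = classifySendResultSketch('old output', 'old output');
const draftResult = classifySendResultSketch('idle', 'idle\n> do the thing');
const acceptedResult = classifySendResultSketch('', 'agent: working on it');
console.log(staleResult.verdict, draftResult.verdict, acceptedResult.verdict);
```

Ordering the checks as blank, unchanged, draft, accepted means only one path can produce exit 0, which keeps the fail-closed behaviour easy to verify in unit tests.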

## Surfaces & parity (MVP-X1)

The CLI lands this phase. The TUI surface follows in the `packages/mosaic` wizard; the webUI
follows in Phase 5 via federation. This PRD records the parity debt explicitly so it is not lost.