Files
stack/docs/fleet/PRD.md
Jarvis b2071dc898 docs(fleet): north star + Phase-2 observability PRD/tasks (W-FLEET)
Establish Fleet workstream doctrine under mvp-20260312: north star (incl.
fleet-as-means-of-production), Phase-2 observability PRD, workstream tasks,
and scratchpad. Collision-safe: scoped to docs/fleet/, touches none of the
MVP single-writer control-plane files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
2026-06-20 22:30:34 -05:00

82 lines
4.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# PRD — Fleet Phase 2: Operator Observability
> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1` — restore operator observability into fleet agent sessions.
## Problem
The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes no single command joins — systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
## Goals
1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).
## Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3) — instrument the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
## Functional requirements
| ID | Requirement |
|---|---|
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |
## Heartbeat protocol v1
- **Probe:** operator/`fleet ps` writes a sentinel line to the agent's input or a
well-known per-agent heartbeat file path `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and
uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
## Acceptance criteria
- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
`dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
`pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.
## Test plan
- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
`fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.
## Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.