# PRD — Fleet Phase 2: Operator Observability

> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1`: restore operator observability into fleet agent sessions.

## Problem

The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes that no single command joins: systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.

## Goals

1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy means the agent answered a heartbeat, not merely that its pane is alive.
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`**, so nothing in this phase forecloses later multi-tenant/multi-host support.

## Non-goals (this phase)

- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3); this phase instruments the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).

## Functional requirements

| ID   | Requirement |
| ---- | ----------- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |

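
As a rough illustration of FR-1 and FR-6, the sketch below models one `fleet ps` row and derives the drift flag and boot-enable warning from it. The field names, the `FleetRow` class, the plain string comparison used for drift, and the `tenant-a` / `host-a` values are all assumptions made for this example, not the shipped schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FleetRow:
    """One `fleet ps` row. Field names are illustrative, not the real schema."""

    name: str
    tenant_id: str
    host: str
    runtime: str                 # runtime declared in the roster
    systemd_active: bool         # unit is currently active
    unit_file_state: str         # e.g. "enabled" or "disabled"
    pane_alive: bool
    pane_command: str            # command actually running in the pane
    pid: Optional[int]
    idle_seconds: Optional[float]
    heartbeat_age_seconds: Optional[float]

    @property
    def drift(self) -> bool:
        # FR-1: roster runtime differs from the actual pane command.
        # Assumption: only meaningful while the pane is alive, and a plain
        # string comparison is enough for this sketch.
        return self.pane_alive and self.runtime != self.pane_command

    @property
    def boot_enable_warning(self) -> bool:
        # FR-1: active now, but the unit would not start at boot.
        return self.systemd_active and self.unit_file_state == "disabled"

    def to_json(self) -> str:
        # FR-6: tenant_id and host are ordinary fields, so they are always present.
        payload = asdict(self)
        payload["drift"] = self.drift
        payload["boot_enable_warning"] = self.boot_enable_warning
        return json.dumps(payload, sort_keys=True)


# Example row mirroring the dogfood case: roster says `pi`, pane runs the stub.
row = FleetRow(
    name="canary-pi", tenant_id="tenant-a", host="host-a", runtime="pi",
    systemd_active=True, unit_file_state="disabled", pane_alive=True,
    pane_command="dogfood-agent.py", pid=4242, idle_seconds=3.0,
    heartbeat_age_seconds=7.5,
)
print(row.to_json())
```

Keeping `drift` and `boot_enable_warning` as derived properties rather than stored fields means they cannot disagree with the raw systemd and tmux facts in the same row.
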
## Heartbeat protocol v1

- **Probe:** the operator or `fleet ps` writes a sentinel line to the agent's input or to the
  well-known per-agent heartbeat file, `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
  on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
  MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
  full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and
  uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).

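
The response format and health rule above can be sketched in a few lines of Python. This is a minimal sketch, not the `dogfood-agent.py` implementation: the function names are hypothetical, the atomic write via a temporary file is an added assumption, and the protocol text does not say how malformed files should be treated (here they simply raise).

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path

DEFAULT_INTERVAL_S = 15  # default interval stated in the protocol


def format_heartbeat(ts: datetime, pid: int, status: str) -> str:
    # Line format from the protocol: ts=<iso8601> pid=<pid> status=<ok|busy>
    return f"ts={ts.isoformat()} pid={pid} status={status}"


def write_heartbeat(path: Path, pid: int, status: str = "ok") -> None:
    """Responder side. Assumption: write to a temp file and rename so a
    reader never sees a half-written line."""
    line = format_heartbeat(datetime.now(timezone.utc), pid, status)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(line + "\n", encoding="utf-8")
    os.replace(tmp, path)


def classify(path: Path, now: datetime, interval_s: float = DEFAULT_INTERVAL_S) -> str:
    """Prober side: healthy if now - ts <= 3 x interval, else stale;
    a missing file is unknown. Malformed content raises."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return "unknown"
    fields = dict(part.split("=", 1) for part in text.split())
    ts = datetime.fromisoformat(fields["ts"])
    age = (now - ts).total_seconds()
    return "healthy" if age <= 3 * interval_s else "stale"


# Demo in a throwaway directory; the real file lives at
# ~/.config/mosaic/fleet/run/<agent>.hb.
with tempfile.TemporaryDirectory() as run_dir:
    hb = Path(run_dir) / "canary-pi.hb"
    before = classify(hb, datetime.now(timezone.utc))   # no file yet
    write_heartbeat(hb, pid=os.getpid())
    now = datetime.now(timezone.utc)
    fresh = classify(hb, now)                            # just written
    late = classify(hb, now + timedelta(seconds=46))     # past 3 x 15s
print(before, fresh, late)
```

With the default 15s interval the staleness threshold is 45s, so a single missed beat does not flip an agent to `stale`.
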
## Acceptance criteria

- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
  pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
  `dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
  leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
  `pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.

## Test plan

- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
  `fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
  exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
  a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.

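
The injected-runner idea in the first bullet can be illustrated language-neutrally. The real specs live in TypeScript; the Python sketch below only shows the pattern of recording the exact argv instead of executing it. `RecordingRunner`, `watch_argv`, `unit_state`, the unit name `canary-pi.service`, and the assumption that the tmux session is named after the agent are all hypothetical.

```python
from typing import Dict, List, Sequence

TMUX_SOCKET = "mosaic-factory"  # socket name from this PRD


def watch_argv(session: str) -> List[str]:
    # Read-only attach on the fleet socket (`attach-session -r`).
    # Assumption: the session is named after the agent.
    return ["tmux", "-L", TMUX_SOCKET, "attach-session", "-r", "-t", session]


def unit_state_argv(unit: str) -> List[str]:
    # Ask systemd for the two properties FR-1 needs.
    return ["systemctl", "--user", "show", unit,
            "--property=ActiveState,UnitFileState"]


class RecordingRunner:
    """Test double: records each argv and returns canned output."""

    def __init__(self, canned: str = "") -> None:
        self.calls: List[List[str]] = []
        self.canned = canned

    def __call__(self, argv: Sequence[str]) -> str:
        self.calls.append(list(argv))
        return self.canned


def unit_state(run: RecordingRunner, unit: str) -> Dict[str, str]:
    # Parse `Key=Value` lines; systemd chooses the order, so use a dict.
    out = run(unit_state_argv(unit))
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)


runner = RecordingRunner("ActiveState=active\nUnitFileState=disabled\n")
state = unit_state(runner, "canary-pi.service")
print(runner.calls[0], state)
```

Because the runner is injected, a spec can assert on `runner.calls` without a live tmux server or systemd user session.
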
## Surfaces & parity (MVP-X1)

CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. This PRD records the parity debt explicitly so that it is not lost.