- isSendAccepted now returns 'accepted' | 'draft' | 'unverifiable' (was bool)
- Blank/empty capture => 'unverifiable' => process.exitCode=1 with distinct
"could not verify delivery (blank/no response captured)" message; previously
blank was treated as success, violating FR-5 fail-closed semantics
- Draft line ('^> ') => process.exitCode=1 with "left as unsubmitted draft"
message; distinct wording from unverifiable case
- agent watch now dispatched through injectable InteractiveRunner (stdio:inherit)
instead of the capturing CommandRunner; tmux attach requires TTY passthrough
- Default spawnInteractive implementation uses node:child_process spawn with
stdio:'inherit'; injectable via FleetCommandDeps.interactiveRunner for tests
- Removed buildSystemdIsActiveCommand (dead code — exported but unused)
- Tests: blank=>exitCode=1, draft=>exitCode=1, real response=>exitCode=0,
watch dispatched through interactiveRunner not capturing runner
- PRD: added "Known limitations" section (heuristic verify, blank fails closed,
non-pi/claude draft detection is best-effort, watch requires TTY passthrough)
- Code comment on isSendAccepted notes pi/claude-specific draft heuristic
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
PRD — Fleet Phase 2: Operator Observability
Workstream: W-FLEET under mvp-20260312 · Phase: 2
North star: docs/fleet/north-star.md
Source umbrella PRD: docs/PRD.md (Mosaic Stack v0.1.0)
Tracks task: fleet-observability-1 (restore operator observability into fleet agent sessions).
Problem
The durable tmux fleet runs on the isolated mosaic-factory socket. That isolation
(which protects the operator's default tmux) makes the fleet invisible to default
tooling, and truth is split across three planes no single command joins — systemd
(systemctl --user), tmux (-L mosaic-factory), and the process tree (pstree).
agent tail (capture-pane) returns blank for full-screen TUIs, and agent send
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
Goals
- One command shows the whole fleet's real state, joining all three planes.
- Liveness is truthful: healthy = answered a heartbeat, not "pane alive".
- The operator can watch any session read-only without disrupting it.
- send reports delivered-and-accepted, not just injected.
- Every record/address carries tenant_id + host (zero foreclosure for multi-tenant/multi-host).
Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No fleetd daemon or persistent history store.
- No real-runtime swap (Phase 3); instrument the live dogfood stub fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
Functional requirements
| ID | Requirement |
|---|---|
| FR-1 | mosaic fleet ps [--json] prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · last-heartbeat age · drift flag (roster runtime ≠ actual pane command) · boot-enable warning (active but UnitFileState=disabled). |
| FR-2 | Heartbeat protocol v1 (see below); dogfood-agent.py implements the responder. fleet ps issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | mosaic agent watch <name> opens a read-only view of the pane (grouped session or tmux attach -r) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | mosaic agent attach <name> remains the explicit interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | mosaic agent send <name> --verify confirms the message was accepted (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (--json) includes tenant_id and host fields. |
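
The two derived flags in FR-1 can be read as simple predicates over a joined row. The sketch below is illustrative only: apart from tenant_id and host (required by FR-6), every field name, the FleetRow shape, and the deriveFlags helper are assumptions rather than the shipped schema, and treating a dead pane as "no drift" is a choice made here, not one the PRD states.

```typescript
// Hypothetical row shape for `mosaic fleet ps --json`.
// Only tenant_id and host are named by the PRD; the rest is assumed.
interface FleetRow {
  name: string;
  tenant_id: string;
  host: string;
  runtime: string;            // runtime declared in the roster
  systemdActive: boolean;     // unit is currently active
  unitFileState: string;      // e.g. "enabled" or "disabled"
  paneAlive: boolean;
  paneCommand: string | null; // command actually running in the pane, if any
}

interface RowFlags {
  drift: boolean;
  bootEnableWarning: boolean;
}

function deriveFlags(row: FleetRow): RowFlags {
  return {
    // Drift: the roster runtime differs from the actual pane command.
    // Assumption: a pane with no command (dead pane) is not reported as drift.
    drift: row.paneCommand !== null && row.paneCommand !== row.runtime,
    // Boot-enable warning: running now, but would not start after a reboot.
    bootEnableWarning: row.systemdActive && row.unitFileState === "disabled",
  };
}

// The dogfood case from the acceptance criteria: runtime=pi, pane runs the stub.
const canary: FleetRow = {
  name: "canary-pi",
  tenant_id: "example-tenant", // placeholder value
  host: "example-host",        // placeholder value
  runtime: "pi",
  systemdActive: true,
  unitFileState: "disabled",
  paneAlive: true,
  paneCommand: "dogfood-agent.py",
};

console.log(JSON.stringify({ ...canary, ...deriveFlags(canary) }));
```

Keeping the flags as pure functions of the row means they can be unit-tested without touching systemd or tmux.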
Heartbeat protocol v1
- Probe: operator/fleet ps writes a sentinel line to the agent's input or a well-known per-agent heartbeat file path ~/.config/mosaic/fleet/run/<agent>.hb.
- Response: the runtime updates <agent>.hb with ts=<iso8601> pid=<pid> status=<ok|busy> on a fixed interval (default 15s) and on demand when probed.
- Health rule: healthy if now - ts <= 3 × interval; else stale; missing file = unknown.
- Contract: every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3) MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and full-screen TUIs alike (no capture-pane dependency).
ASSUMPTION: file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
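
A minimal sketch of the health rule, assuming the heartbeat file holds a single line in the ts=… pid=… status=… form described above. The parseHeartbeat and classifyHealth names are hypothetical, file reading is left out so the logic stays pure, and mapping an unparseable line to unknown (the same result as a missing file) is an assumption of this sketch.

```typescript
type Health = "healthy" | "stale" | "unknown";

interface Heartbeat {
  ts: Date;
  pid: number;
  status: "ok" | "busy";
}

// Parse one line of the form: ts=<iso8601> pid=<pid> status=<ok|busy>
// Returns null when a field is missing or malformed.
function parseHeartbeat(line: string): Heartbeat | null {
  const fields: Record<string, string> = {};
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  const ts = new Date(fields["ts"] ?? "");
  const pid = Number(fields["pid"]);
  const status = fields["status"];
  if (Number.isNaN(ts.getTime())) return null;
  if (!Number.isInteger(pid) || pid <= 0) return null;
  if (status !== "ok" && status !== "busy") return null;
  return { ts, pid, status };
}

// Health rule from the protocol: healthy if now - ts <= 3 x interval, else stale.
// A null heartbeat (missing file, or unparseable by assumption) is unknown.
function classifyHealth(hb: Heartbeat | null, now: Date, intervalMs = 15000): Health {
  if (hb === null) return "unknown";
  const ageMs = now.getTime() - hb.ts.getTime();
  return ageMs <= 3 * intervalMs ? "healthy" : "stale";
}

const sample = parseHeartbeat("ts=2026-03-12T10:00:00Z pid=4242 status=ok");
console.log(classifyHealth(sample, new Date("2026-03-12T10:00:30Z"))); // 30s old, limit 45s: healthy
console.log(classifyHealth(sample, new Date("2026-03-12T10:01:00Z"))); // 60s old, limit 45s: stale
console.log(classifyHealth(null, new Date("2026-03-12T10:01:00Z")));   // no heartbeat: unknown
```

With the default 15s interval the staleness threshold is 45s, so a single missed beat does not flip a row to stale.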
Acceptance criteria
- mosaic fleet ps shows all 5 live sessions on mosaic-factory with correct pane/pid/idle and flags the dogfood drift (canary-pi runtime=pi but pane runs dogfood-agent.py) and the boot-enable gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one interval.
- agent watch shows live output and provably cannot type into the pane; detaching leaves the agent's window size unchanged.
- agent send --verify returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: pnpm typecheck, pnpm lint, pnpm format:check, plus pnpm --filter @mosaicstack/mosaic test.
- Independent review passed; dogfood evidence captured against the live fleet.
Test plan
- Unit/CLI specs in packages/mosaic/src/commands/fleet.spec.ts (and a new fleet-ps/watch/send-verify spec) using the injected CommandRunner to assert exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live mosaic-factory fleet; capture fleet ps output, a kill-and-detect cycle, a read-only watch, and a send --verify pass/fail pair.
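
The injected-runner approach can be sketched as follows for the watch path. The runner signature, buildWatchArgs, watchAgent, and makeRecordingRunner are all hypothetical names chosen for illustration; the real CommandRunner and InteractiveRunner types may differ. The sketch assumes the tmux attach -r variant of FR-3 rather than a grouped session, and uses only standard tmux options (-L for the socket name, -r for read-only, -t for the target).

```typescript
// Hypothetical runner signature: resolves with the child's exit code.
// Per the commit notes, the real default spawns with stdio: 'inherit';
// that part is deliberately omitted so this sketch has no Node dependency.
type InteractiveRunner = (cmd: string, args: string[]) => Promise<number>;

const FLEET_SOCKET = "mosaic-factory";

// Build the argv for a read-only attach on the isolated fleet socket.
function buildWatchArgs(agent: string): string[] {
  return ["-L", FLEET_SOCKET, "attach", "-r", "-t", agent];
}

// Dispatch the watch through whichever runner the caller injects.
async function watchAgent(agent: string, run: InteractiveRunner): Promise<number> {
  return run("tmux", buildWatchArgs(agent));
}

// Test double: records each call instead of spawning a process.
function makeRecordingRunner() {
  const calls: Array<{ cmd: string; args: string[] }> = [];
  const run: InteractiveRunner = async (cmd, args) => {
    calls.push({ cmd, args });
    return 0;
  };
  return { calls, run };
}

const recorder = makeRecordingRunner();
watchAgent("canary-pi", recorder.run).then((code) => {
  console.log(code, recorder.calls[0].cmd, recorder.calls[0].args.join(" "));
});
```

Because the fake records the exact argv, a spec can assert both that the watch went through the interactive runner and that the read-only flag is present.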
Known limitations
- Verify heuristic is best-effort: agent send --verify uses a '>'-prefix draft heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- Blank capture fails closed: Full-screen TUIs (claude, codex, opencode, pi) render blank for tmux capture-pane. When the captured output is empty, send --verify returns non-zero with an "unverifiable" message rather than silently succeeding. This is an intentional fail-closed design (FR-5).
- agent watch requires TTY passthrough: tmux attach is interactive and must be run with inherited stdio. It cannot be captured through a pipe. Tests inject a fake interactiveRunner; the real implementation spawns with stdio: 'inherit'.
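
The three-state verdict and its fail-closed exit mapping might look roughly like the sketch below. The verdict names and the two failure messages come from the commit notes; everything else is an assumption for illustration, including inspecting only the last non-blank line of the capture, the verdictToExit helper, and the success message. The real isSendAccepted may examine the capture differently.

```typescript
type SendVerdict = "accepted" | "draft" | "unverifiable";

// Classify captured pane output after a send.
function isSendAccepted(capture: string): SendVerdict {
  // Fail closed: a blank capture proves nothing about delivery.
  if (capture.trim() === "") return "unverifiable";

  const lines = capture
    .split("\n")
    .map((line) => line.replace(/\s+$/, "")) // strip trailing whitespace
    .filter((line) => line !== "");
  // Assumption: only the last non-blank line is checked for the input prompt.
  const last = lines[lines.length - 1] ?? "";

  // pi/claude-specific heuristic: text still sitting after a "> " prompt
  // means the message was typed but not submitted.
  if (/^> /.test(last)) return "draft";
  return "accepted";
}

// Map a verdict to an exit code and operator-facing message.
function verdictToExit(verdict: SendVerdict): { exitCode: number; message: string } {
  switch (verdict) {
    case "accepted":
      return { exitCode: 0, message: "delivered and accepted" }; // wording assumed
    case "draft":
      return { exitCode: 1, message: "left as unsubmitted draft" };
    case "unverifiable":
      return { exitCode: 1, message: "could not verify delivery (blank/no response captured)" };
  }
}

for (const capture of ["", "> fix the build", "Working on it...\nDone."]) {
  const verdict = isSendAccepted(capture);
  console.log(JSON.stringify(capture), verdict, verdictToExit(verdict).exitCode);
}
```

Returning a verdict instead of a boolean keeps the draft and unverifiable cases distinguishable, so the CLI can print different messages while both exit non-zero.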
Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the packages/mosaic wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.