docs(fleet): north star + Phase-2 observability PRD/tasks (W-FLEET)

Establish Fleet workstream doctrine under mvp-20260312: north star (incl.
fleet-as-means-of-production), Phase-2 observability PRD, workstream tasks,
and scratchpad. Collision-safe: scoped to docs/fleet/ plus the workstream's own
scratchpad under docs/scratchpads/; touches none of the MVP single-writer
control-plane files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
81
docs/fleet/PRD.md
Normal file
@@ -0,0 +1,81 @@
# PRD - Fleet Phase 2: Operator Observability

> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1`: restore operator observability into fleet agent sessions.

## Problem

The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes that no single command joins: systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.

## Goals

1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).

## Non-goals (this phase)

- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3); instrument the live **dogfood stub** fleet instead.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).

## Functional requirements

| ID | Requirement |
|---|---|
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |
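The two derived flags in FR-1 and the field requirement in FR-6 can be illustrated with a small sketch. It is written in Python for brevity and is not the `packages/mosaic` implementation; the `FleetRow` class, its field names other than `tenant_id` and `host`, and the plain string comparison used for drift are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class FleetRow:
    """One hypothetical `fleet ps` row (field names are illustrative)."""
    name: str
    tenant_id: str
    host: str
    roster_runtime: str          # runtime declared in the roster
    pane_command: Optional[str]  # command observed in the pane, None if no pane
    active_state: str            # systemd ActiveState, e.g. "active"
    unit_file_state: str         # systemd UnitFileState, e.g. "disabled"

    @property
    def drift(self) -> bool:
        # FR-1 drift: roster runtime differs from the actual pane command.
        # Assumption: a plain string comparison is enough for the sketch.
        return self.pane_command is not None and self.pane_command != self.roster_runtime

    @property
    def boot_enable_warning(self) -> bool:
        # FR-1 boot-enable: running now, but would not start after a reboot.
        return self.active_state == "active" and self.unit_file_state == "disabled"

    def to_json(self) -> str:
        # FR-6: tenant_id and host are ordinary fields, so they are always emitted.
        record = asdict(self)
        record["drift"] = self.drift
        record["boot_enable_warning"] = self.boot_enable_warning
        return json.dumps(record, sort_keys=True)


row = FleetRow(
    name="canary-pi", tenant_id="jarvis", host="W-jarvis",
    roster_runtime="pi", pane_command="dogfood-agent.py",
    active_state="active", unit_file_state="disabled",
)
print(row.to_json())
```

With the `canary-pi` values described in this PRD, both flags come out true, which is the situation the acceptance criteria expect `fleet ps` to surface.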

## Heartbeat protocol v1

- **Probe:** operator/`fleet ps` writes a sentinel line either to the agent's input or to a
  well-known per-agent heartbeat file, `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
  on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
  MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
  full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo), chosen because it is TUI-safe and
  uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
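The response format and health rule above can be sketched end to end. This is a minimal Python illustration, not the contents of `dogfood-agent.py`: the function names `write_heartbeat` and `classify`, the atomic-rename write, and the behavior on a malformed file (it raises) are assumptions that the protocol text does not specify.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

INTERVAL_S = 15  # default interval from the protocol


def write_heartbeat(path: Path, status: str = "ok", now: Optional[datetime] = None) -> None:
    """Responder side: replace <agent>.hb with a fresh record."""
    now = now or datetime.now(timezone.utc)
    line = f"ts={now.isoformat()} pid={os.getpid()} status={status}\n"
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(line)
    # Assumption: rename-over so a reader never sees a half-written file.
    os.replace(tmp, path)


def classify(path: Path, now: Optional[datetime] = None, interval_s: int = INTERVAL_S) -> str:
    """Reader side: apply the health rule to one heartbeat file."""
    if not path.exists():
        return "unknown"  # missing file = unknown
    fields = dict(part.split("=", 1) for part in path.read_text().split())
    ts = datetime.fromisoformat(fields["ts"])
    now = now or datetime.now(timezone.utc)
    age_s = (now - ts).total_seconds()
    return "healthy" if age_s <= 3 * interval_s else "stale"


with tempfile.TemporaryDirectory() as run_dir:
    hb = Path(run_dir) / "canary-pi.hb"
    t0 = datetime(2026, 6, 20, 12, 0, 0, tzinfo=timezone.utc)
    print(classify(hb, now=t0))                          # unknown
    write_heartbeat(hb, now=t0)
    print(classify(hb, now=t0 + timedelta(seconds=45)))  # healthy
    print(classify(hb, now=t0 + timedelta(seconds=46)))  # stale
```

With the default 15s interval the boundary sits at 45 seconds: a record exactly 45 seconds old is still `healthy`, and one second later it is `stale`.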

## Acceptance criteria

- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
  pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
  `dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
  leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
  `pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.

## Test plan

- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
  `fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
  exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
  a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.
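The injected-runner pattern named in the first bullet can be sketched as follows. The real specs live in TypeScript; this Python version only illustrates the idea, and the `CommandRunner` type alias, the helper names, the tab-separated format string, and the fake pane output are hypothetical. The tmux format variables and the `systemctl show --property=` option are standard, but the exact commands the CLI builds are not stated in this PRD.

```python
from typing import Callable, Dict, List

# A runner takes an argv list and returns stdout; tests inject a fake one.
CommandRunner = Callable[[List[str]], str]

SOCKET = "mosaic-factory"


def tmux_list_panes_argv(socket: str = SOCKET) -> List[str]:
    # Tab-separated so each line can be split unambiguously.
    fmt = "#{session_name}\t#{pane_pid}\t#{pane_current_command}\t#{pane_dead}"
    return ["tmux", "-L", socket, "list-panes", "-a", "-F", fmt]


def systemd_show_argv(unit: str) -> List[str]:
    return ["systemctl", "--user", "show", unit, "--property=ActiveState,UnitFileState"]


def list_panes(run: CommandRunner, socket: str = SOCKET) -> List[Dict[str, object]]:
    """Run the tmux query through the injected runner and parse each line."""
    rows: List[Dict[str, object]] = []
    for line in run(tmux_list_panes_argv(socket)).splitlines():
        session, pid, command, dead = line.split("\t")
        rows.append({"session": session, "pid": int(pid),
                     "command": command, "dead": dead == "1"})
    return rows


# Test double: records every argv and returns canned (made-up) output.
calls: List[List[str]] = []


def fake_runner(argv: List[str]) -> str:
    calls.append(argv)
    return "canary-pi\t4242\tpython3\t0\n"


panes = list_panes(fake_runner)
print(calls[0][:4])
print(panes)
```

Because no process is spawned, a spec built this way can assert the exact argv (including `-L mosaic-factory`) without touching the live fleet; the situational run in the second bullet covers the real tmux and systemd behavior.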

## Surfaces & parity (MVP-X1)

CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.
27
docs/fleet/TASKS.md
Normal file
@@ -0,0 +1,27 @@
# Tasks - W-FLEET (Fleet) Phase 2: Observability

> Workstream task file for the Fleet. Single-writer: Fleet workstream lead (orchestrator).
> Workers read but never modify. This is **not** the MVP rollup (`docs/TASKS.md`); a
> rollup row is proposed to the MVP orchestrator, not written here.
>
> Mission: `mvp-20260312` · PRD: [docs/fleet/PRD.md](./PRD.md) · North star: [docs/fleet/north-star.md](./north-star.md)
> Status: `not-started` | `in-progress` | `done` | `blocked` | `failed`

| id | status | description | depends_on | agent | pr | notes |
|---|---|---|---|---|---|---|
| FLEET-OBS-000 | done | Plan: north-star + Phase-2 PRD + workstream scaffolding | - | lead | - | persisted 2026-06-20 on `feat/fleet-observability` |
| FLEET-OBS-001 | not-started | Heartbeat protocol v1 spec finalized in PRD + framework doc | FLEET-OBS-000 | lead | - | file-based `~/.config/mosaic/fleet/run/<agent>.hb` |
| FLEET-OBS-002 | not-started | Implement heartbeat responder in `dogfood-agent.py` | FLEET-OBS-001 | worker | - | emits ts/pid/status every 15s + on probe |
| FLEET-OBS-003 | not-started | `mosaic fleet ps`: join systemd+tmux+proc+idle+heartbeat; tenant+host tagged; drift + boot-enable flags; `--json` | FLEET-OBS-001 | worker | - | extend `packages/mosaic/src/commands/fleet.ts` |
| FLEET-OBS-004 | not-started | `mosaic agent watch <name>`: read-only join (no resize, no keystrokes) | FLEET-OBS-000 | worker | - | grouped session or `attach -r`; keep `attach` as takeover |
| FLEET-OBS-005 | not-started | `mosaic agent send --verify`: delivery/acceptance receipt | FLEET-OBS-000 | worker | - | non-zero on wedged/draft pane |
| FLEET-OBS-006 | not-started | CLI specs for ps/watch/send-verify (tenant+host shape, command construction) | FLEET-OBS-003,004,005 | worker | - | alongside impl (TDD where risk-bearing) |
| FLEET-OBS-007 | not-started | Framework doc: fleet observability guide + verbs | FLEET-OBS-003,004,005 | lead | - | `docs/guides/` or `framework/tools/.../README` |
| FLEET-OBS-008 | not-started | Independent review + dogfood verification on live fleet | FLEET-OBS-002..007 | reviewer | - | author ≠ reviewer; capture evidence in scratchpad |
| FLEET-OBS-009 | not-started | Open PR → green CI (queue guard) → squash-merge → close `fleet-observability-1` | FLEET-OBS-008 | lead | - | trunk merge; no direct push to main |
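The `depends_on` column forms a small dependency graph, and the sketch below shows one way to read it: which tasks are unblocked given what is `done`, and one valid overall order. It is illustrative only. The `ready` helper is hypothetical, and expanding `FLEET-OBS-003,004,005` and `FLEET-OBS-002..007` into full ids (the latter as an inclusive range) is an assumed reading of the table's shorthand.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

CORE = ["FLEET-OBS-003", "FLEET-OBS-004", "FLEET-OBS-005"]

# task id -> ids it depends on, transcribed from the table above
deps: Dict[str, List[str]] = {
    "FLEET-OBS-000": [],
    "FLEET-OBS-001": ["FLEET-OBS-000"],
    "FLEET-OBS-002": ["FLEET-OBS-001"],
    "FLEET-OBS-003": ["FLEET-OBS-001"],
    "FLEET-OBS-004": ["FLEET-OBS-000"],
    "FLEET-OBS-005": ["FLEET-OBS-000"],
    "FLEET-OBS-006": list(CORE),
    "FLEET-OBS-007": list(CORE),
    # "002..007" read as the inclusive range 002 through 007 (assumption)
    "FLEET-OBS-008": [f"FLEET-OBS-{n:03d}" for n in range(2, 8)],
    "FLEET-OBS-009": ["FLEET-OBS-008"],
}


def ready(graph: Dict[str, List[str]], done: Set[str]) -> List[str]:
    """Tasks not yet done whose dependencies are all done."""
    return sorted(
        task for task, needs in graph.items()
        if task not in done and all(n in done for n in needs)
    )


print(ready(deps, {"FLEET-OBS-000"}))
# ['FLEET-OBS-001', 'FLEET-OBS-004', 'FLEET-OBS-005']

order = list(TopologicalSorter(deps).static_order())
print(order[0], order[-1])  # FLEET-OBS-000 FLEET-OBS-009
```

Read this way, `watch` (004) and `send --verify` (005) can start in parallel with the heartbeat spec (001), while `fleet ps` (003) and the responder (002) wait on it.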

## Proposed MVP rollup row (for the MVP orchestrator; not written by this workstream)

```
| W-FLEET | in-progress | Fleet (agent-session execution layer) | Phase 2/5 | docs/fleet/TASKS.md | observability dogfooded on live stub fleet; control plane rides federation (W1) |
```
128
docs/fleet/north-star.md
Normal file
@@ -0,0 +1,128 @@
# Mosaic Fleet: North Star

> **Workstream:** W-FLEET (Fleet) under mission `mvp-20260312`
> **Umbrella:** [docs/MISSION-MANIFEST.md](../MISSION-MANIFEST.md) · [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Status:** doctrine, authored 2026-06-20. Owner of this file: Fleet workstream lead.
> This document does **not** modify the MVP rollup; a rollup row is proposed, not written here.

## Vision

A **customizable, multi-tenant fleet of always-on AI agents**, each defined by role,
materialized as a durable, joinable runtime session, coordinated by the proven
orchestrator/worker model, and observable end-to-end across hosts. Coding today;
finance, analytics, research as roster entries tomorrow: same primitives, different
roster. The fleet is the **agent-session execution layer** of the Mosaic Stack MVP:
the thing federation makes reachable across hosts and the webUI/TUI/CLI make visible.

The USC tmux PoC (durable sessions + `agent-send` comms) proved the model. This
workstream makes it an official, observable, multi-tenant Mosaic Stack capability.

## The Fleet as means of production (bootstrapping)

The Fleet has a **dual role**, and that is the point:

- **As product**: a multi-tenant agent-fleet capability of Mosaic Stack (this workstream).
- **As means of production**: the orchestrator/worker fleet that *actually builds the
  entire MVP* (federation W1, webUI, TUI, CLI, and the Fleet itself).

We are **building the system that builds the system.** Every other MVP workstream is
delivered *by* the fleet, so fleet observability and control are not merely product
features; they are the **operational floor of the whole delivery effort**. If we cannot
see and steer the agents, we cannot trust what they ship. This is why Phase 2
(observability) leads: it is the instrument panel for the factory, dogfooded on the live
fleet that is, recursively, building Mosaic Stack.

The discipline that makes great power safe is the same gate chain the fleet enforces:
independent review before merge, green CI, honest completion, decide-and-inform cadence,
and no irreversible action without authority. The bootstrap is only as trustworthy as
those gates.

## Alignment with MVP cross-cutting requirements

The Fleet inherits, and does not re-invent, the MVP's hard requirements:

| MVP req | What it means for the Fleet |
|---|---|
| MVP-X1 three-surface parity | fleet observability/control reachable via **CLI + TUI + webUI** (CLI first; webUI is required for parity, not optional) |
| MVP-X2 multi-tenant isolation | one tenant = one **Linux uid** (own `systemd --user`, socket, `~/.config/mosaic`); no cross-tenant leakage |
| MVP-X3 auth (BetterAuth/SSO) | operator→fleet and cross-host views are auth-gated through the platform's existing auth |
| MVP-X4 quality gates | `pnpm typecheck`/`lint`/`format:check` green before any push |
| MVP-X5 federated topology | cross-host fleet visibility rides the **federation** boundary (W1), not a bespoke broker |
| MVP-X6 OTEL tracing | heartbeats, sends, and lifecycle events emit spans; `traceparent` crosses the federation boundary |
| MVP-X7 trunk merge | branch from `main`, squash-merge via PR, never push to `main` |

## The stack: where every concern lives

One **definition** is the source of truth; the **session** is how it runs.

| Layer | Owner | Phase-2 reality | Destination |
|---|---|---|---|
| **Definition + identity + auth** | gateway / `mosaic-as` (scoped tokens, #541) | `roster.yaml` (tenant-tagged) | one definition; `mosaic agent --new` materializes it |
| **Tenancy boundary** | **Linux uid per tenant** (linger, own `systemd --user`, own socket, own `~/.config/mosaic`) | one tenant: `jarvis` = tenant zero | uid-per-tenant; federation aggregates across hosts |
| **Runtime** | per-tenant tmux session on isolated socket | dogfood stub sessions (live now on `mosaic-factory`) | claude/codex/pi/opencode TUIs |
| **Liveness** | **heartbeat protocol** every runtime answers | protocol defined + dogfood stub answers it | all runtimes answer; "healthy" ≠ "pane alive" |
| **Observation** | read-only `watch` (native tmux) + `pipe-pane` stream | CLI `watch`/`ps`; explicit opt-in `attach` for control | + auth-gated webUI streams |
| **Control plane** | **federation** across hosts × tenants | records already carry `tenant_id` + `host` | federated gateways expose fleet state; webUI in Phase 5 |

## Operating model (inherited, not reinvented)

The AI-guide law stands: one accountable **orchestrator**, isolated **workers** that
stop at PR-open, the serialized **gate chain** (independent review → green CI →
diff-sanity → squash-merge → verify), **decide-and-inform** cadence, and a durable
**board** so missions survive session death. The Fleet is the infrastructure *under*
this model. See `mosaicstack-aiguide` whitepapers 01 (inter-agent comms) and 03
(orchestration model) for the rationale.

## Invariants: "maximal vision, incremental delivery, zero foreclosure"

Every artifact, starting Phase 2, MUST:

1. Carry **`tenant_id` + `host`** in schema and message addressing, even with one of each today.
2. Treat **isolation socket ≠ invisibility**: anything isolated is surfaced by one command.
3. Define **healthy = answered a heartbeat within N seconds**, never just "pane alive".
4. Make **observation read-only by default**; control is an explicit, separate, opt-in verb.

## Observation model

| Verb | Behavior |
|---|---|
| `mosaic fleet ps` | one table joining systemd + tmux + process + idle + last-heartbeat, with drift + boot-enable flags |
| `mosaic agent watch <name>` | **read-only** join (grouped session / `-r`), no resize tyranny, no keystrokes |
| `mosaic agent attach <name>` | explicit interactive takeover (the only path that can type) |
| `mosaic agent send <name> --verify` | confirms message **accepted**, not merely keystroke-injected |

> Why the current PoC blocks observation: sessions live on the isolated `mosaic-factory`
> socket (invisible to default `tmux ls`), the only sanctioned read is `capture-pane`
> (blank for full-screen TUIs), and `attach` is read-write + resizes the session. The
> verbs above restore "join and observe" safely.
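The difference between `watch` and `attach` comes down to the tmux client flags each one passes. The sketch below builds the two argv lists in Python for illustration; `watch_argv` and `attach_argv` are hypothetical helpers, not the CLI's functions. It assumes tmux 3.2 or later, where `attach-session` accepts `-f` with client flags such as `read-only` and `ignore-size`; on older versions only `-r` is available, so the behavior is worth verifying against the installed tmux.

```python
from typing import List

SOCKET = "mosaic-factory"


def watch_argv(session: str, socket: str = SOCKET) -> List[str]:
    """Read-only join: no keystrokes forwarded, no influence on window size."""
    return [
        "tmux", "-L", socket, "attach-session",
        # read-only: input from this client is not passed to the pane
        # ignore-size: this client's terminal size does not resize the window
        "-f", "read-only,ignore-size",
        "-t", session,
    ]


def attach_argv(session: str, socket: str = SOCKET) -> List[str]:
    """Explicit takeover: a normal read-write attach, the only path that can type."""
    return ["tmux", "-L", socket, "attach-session", "-t", session]


print(" ".join(watch_argv("canary-pi")))
print(" ".join(attach_argv("canary-pi")))
```

Keeping the two verbs as separate functions means the read-only flags cannot be dropped by accident: typing into a pane requires calling the takeover path by name.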

## Phased roadmap

| Phase | Outcome | Status |
|---|---|---|
| 0–1 | tmux PoC, hardening, published CLI v0.0.34 (#565–#568) | ✅ done |
| **2: Observability** | `fleet ps` (host+tenant aware join), heartbeat protocol + dogfood stub answers it, `agent watch` (read-only), `agent send --verify` receipts | ▶ now |
| 3: Real runtimes | claude/codex/pi/opencode answer heartbeat; **hybrid lifecycle** (core always-on: orchestrator+reviewer; ephemeral workers per lane) | planned |
| 4: Unified definition | one agent schema in gateway; `mosaic agent --new` → materialized per-tenant session; uid-tenant provisioning | planned |
| 5: Control plane | federation-backed cross-host × cross-tenant fleet view; **webUI** (surface chosen then) for MVP-X1 parity | planned |

## Decisions of record (2026-06-20, with Jason)

- Agent model: **config defines, session runs** (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: **multi-tenant from the start**; isolation = **per-tenant Linux uid**.
- Health: **heartbeat required** (dogfood stub implements the protocol now).
- Lifecycle: **hybrid** (core always-on + ephemeral workers per lane).
- Observation: **read-only default, opt-in takeover**.
- Multi-host: **designed-for from day one**; control plane **rides federation (W1)**.
- Delivery: **CLI-first now**, dogfood against the live stub fleet; webUI deferred to Phase 5.

## Assumptions (veto-able)

- `ASSUMPTION:` first-class runtimes = claude, codex, pi, opencode; a "role" (analyst,
  finance, researcher) = persona + skills + tools on top of a runtime, shipped as a
  starter role library in the framework.
- `ASSUMPTION:` the cross-host control plane is the **federation** layer (W1), not a
  separate `fleetd` daemon.
- `ASSUMPTION:` Fleet is workstream **W-FLEET** under `mvp-20260312`; a rollup row in
  `docs/TASKS.md` and a workstream declaration in `MISSION-MANIFEST.md` are proposed to
  the MVP orchestrator, not written by this workstream.
50
docs/scratchpads/fleet-observability-phase2.md
Normal file
@@ -0,0 +1,50 @@
# Scratchpad - Fleet Phase 2: Observability (W-FLEET)

> Append-only. Mission `mvp-20260312` / workstream W-FLEET.
> Lead: Jarvis (Claude) at `W-jarvis:mos-claude-18`. Coordinating with `jwoltje@dragon-lin:coder0-0`.

## Mission prompt (2026-06-20)

Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability
for delivery. The USC tmux PoC is the proven base. Jason granted lead authority:
"The fleet is a great way to actually build the MVP; we are building the system that
builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate
with a second agent on `dragon-lin`.

## Decisions of record (with Jason, 2026-06-20)

- Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
- Health: heartbeat required; dogfood stub implements protocol now.
- Lifecycle: hybrid (core always-on + ephemeral workers).
- Observation: read-only default, opt-in takeover.
- Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
- Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
- Fleet is dual-role: product AND means of production (bootstrapping the MVP).

## Environment facts (verified 2026-06-20)

- Fleet is live on `W-jarvis` (uid 1000, `jarvis`, `Linger=yes`) on tmux socket
  `mosaic-factory`: `_holder`, `canary-pi`, `dogfood-coder`, `dogfood-orchestrator`,
  `dogfood-reviewer`. All panes run `~/.config/mosaic/fleet/dogfood-agent.py` (stub),
  including `canary-pi` (roster says runtime=pi → **drift**).
- Holder + `mosaic-agent@*` units are `active (exited)` but `UnitFileState=disabled`
  (reboot loses fleet → boot-enable gap to surface).
- Observation blocked by: isolated socket (hidden from default `tmux ls`), `capture-pane`
  blank for TUIs, `attach` being read-write + resizing.
- Second agent: `jwoltje@dragon-lin`, session `coder0-0` (group `coder0`), running `node`,
  default socket. ssh forward reach confirmed.

## Governance / collision-safety

- `mosaicstack-stack` has active mission `mvp-20260312` with single-writer locks on
  `docs/MISSION-MANIFEST.md`, `docs/TASKS.md`, `docs/scratchpads/mvp-20260312.md`.
- This workstream touches NONE of those. All Fleet docs scoped under `docs/fleet/` +
  this scratchpad. Rollup row proposed, not written.

## Session log

- 2026-06-20: Researched AI guide + fleet code + live state. Established north star with
  Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
  `docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms
  with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).