Compare commits
2 Commits: feat/p3-1- ... main
| Author | SHA1 | Date | |
|---|---|---|---|
| | fc90c89913 | | |
| | af2eede7a9 | | |
109
docs/fleet/PRD.md
Normal file
@@ -0,0 +1,109 @@
# PRD — Fleet Phase 2: Operator Observability

> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1` — restore operator observability into fleet agent sessions.

## Problem

The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes no single command joins — systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.

## Goals

1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).

## Non-goals (this phase)

- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3) — instrument the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).

## Functional requirements

| ID | Requirement |
| ---- | ----------- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |

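As an illustration of FR-1 and FR-6, the sketch below joins already-collected plane data into one row. It is hypothetical: `build_row`, the input shapes, and every field name other than `tenant_id` and `host` are assumptions, not the CLI's actual JSON schema. The drift check is a naive string comparison, and the idle column is left out for brevity.

```python
from typing import Optional


def build_row(roster: dict, unit: dict, pane: Optional[dict],
              heartbeat_age_s: Optional[float]) -> dict:
    """Join roster, systemd, tmux, and heartbeat data into one `fleet ps` row."""
    pane_command = pane["command"] if pane else None
    return {
        "name": roster["name"],
        "tenant_id": roster["tenant_id"],
        "host": roster["host"],
        "runtime": roster["runtime"],
        "systemd": f'{unit["ActiveState"]}/{unit["UnitFileState"]}',
        "pane": "alive" if pane and not pane["dead"] else "dead",
        "pid": pane["pid"] if pane else None,
        "heartbeat_age_s": heartbeat_age_s,
        # Drift: the roster runtime differs from what the pane actually runs.
        "drift": pane_command is not None and pane_command != roster["runtime"],
        # Boot-enable warning: running now, but would not start after a reboot.
        "boot_enable_warning": unit["ActiveState"] == "active"
        and unit["UnitFileState"] == "disabled",
    }
```

A real implementation would likely need a mapping from runtime name to expected pane command, since the process a pane reports is often the interpreter rather than the runtime's name.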
## Heartbeat protocol v1

- **Probe:** operator/`fleet ps` writes a sentinel line to the agent's input or to the
  well-known per-agent heartbeat file, `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
  on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
  MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
  full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and
  uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).

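The record format and health rule above can be sketched in Python, the language of the `dogfood-agent.py` stub. This is a minimal sketch, not the stub's actual code: `format_heartbeat`, `parse_heartbeat`, `classify_health`, and `read_heartbeat` are assumed names, and treating a record without a `ts` field as `unknown` is an assumption the protocol does not spell out.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

DEFAULT_INTERVAL_S = 15  # default heartbeat interval from the protocol


def format_heartbeat(ts: datetime, pid: int, status: str) -> str:
    """Render one record as `ts=<iso8601> pid=<pid> status=<ok|busy>`."""
    if status not in ("ok", "busy"):
        raise ValueError(f"unsupported status: {status!r}")
    return f"ts={ts.isoformat()} pid={pid} status={status}"


def parse_heartbeat(line: str) -> dict:
    """Split a record into its key=value fields, ignoring malformed tokens."""
    return dict(field.split("=", 1) for field in line.split() if "=" in field)


def classify_health(hb_text: Optional[str], now: datetime,
                    interval_s: int = DEFAULT_INTERVAL_S) -> str:
    """Apply the health rule: healthy if now - ts <= 3 x interval, else stale."""
    if hb_text is None:  # missing heartbeat file
        return "unknown"
    fields = parse_heartbeat(hb_text.strip())
    if "ts" not in fields:  # assumption: a record without ts is also unknown
        return "unknown"
    # Expects timestamps in the form written by format_heartbeat above.
    age = now - datetime.fromisoformat(fields["ts"])
    return "healthy" if age <= timedelta(seconds=3 * interval_s) else "stale"


def read_heartbeat(path: Path) -> Optional[str]:
    """Return the file's text, or None when the heartbeat file is missing."""
    try:
        return path.read_text()
    except FileNotFoundError:
        return None
```

With the default 15 s interval, a record becomes `stale` once it is more than 45 s old.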
## Acceptance criteria

- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
  pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
  `dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
  leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
  `pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.

## Test plan

- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
  `fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
  exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
  a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.

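The spec pattern described above (inject a runner, then assert on the exact command) can be sketched in Python. This is not the TypeScript `CommandRunner` from `packages/mosaic`; `RecordingRunner` and `list_panes` are hypothetical stand-ins, and the `-F` format string is an assumed selection of standard tmux format variables joined by an assumed `|` delimiter.

```python
from typing import List


class RecordingRunner:
    """Test double: records each argv and returns canned stdout."""

    def __init__(self, stdout: str = "") -> None:
        self.calls: List[List[str]] = []
        self.stdout = stdout

    def run(self, argv: List[str]) -> str:
        self.calls.append(argv)
        return self.stdout


# Assumed format: four standard tmux format variables, '|' as the delimiter.
PANE_FORMAT = "#{session_name}|#{pane_pid}|#{pane_current_command}|#{pane_dead}"


def list_panes(runner, socket: str = "mosaic-factory") -> List[dict]:
    """Query every pane on the isolated socket and parse one dict per pane."""
    out = runner.run(["tmux", "-L", socket, "list-panes", "-a", "-F", PANE_FORMAT])
    panes = []
    for line in out.splitlines():
        name, pid, command, dead = line.split("|")
        panes.append({"name": name, "pid": int(pid),
                      "command": command, "dead": dead == "1"})
    return panes
```

Because the runner is injected, a spec can check both the command that would be sent to tmux and the parsed result without a tmux server.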
## Known limitations

- **Verify heuristic is best-effort:** `agent send --verify` uses a `>`-prefix draft
  heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode
  TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- **Pane-change check is the best Phase-2 signal; verify now polls up to a bounded
  timeout:** `agent send --verify` captures a BEFORE snapshot, sends the message, then
  polls `capture-pane` every ~400 ms up to a configurable total timeout (default ~6 s,
  controlled by `--verify-timeout <ms>`). On each poll it runs `classifySendResult`: if
  the pane shows 'accepted' or 'draft' the loop exits immediately; while the result is
  'unverifiable' (no pane change yet) it keeps polling. After the timeout with no
  definitive result, it fails closed: exit 1 with "no pane change after send". This
  eliminates the false 'unverifiable' failures that the old fixed 300 ms single capture
  caused on slow/loaded TUIs. Definitive acceptance ultimately
  requires a runtime acknowledgement (Phase-3 heartbeat-ack); the bounded pane-change
  poll is the best signal available against an opaque TUI in Phase 2.
- **Blank AFTER capture fails closed:** Full-screen TUIs (claude, codex, opencode, pi)
  render blank for `tmux capture-pane`. When the AFTER snapshot is empty, `send --verify`
  returns non-zero with an "unverifiable" message rather than silently succeeding. This
  is an intentional fail-closed design (FR-5).
- **`agent watch` uses a grouped viewer session:** `tmux attach -r` directly against the
  agent session lets the viewer terminal shrink the agent's window. `agent watch` instead
  creates a throwaway grouped session (`tmux new-session -d -t '=<agent>' -s
  '<agent>-watch-<pid>'`), attaches read-only to that session, and kills it on detach.
  The grouped session shares the agent's windows but has independent sizing, so the
  agent's window is never affected. `tmux attach` is still interactive and requires
  inherited stdio; the `interactiveRunner` handles TTY passthrough.

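The bounded poll described above can be sketched as follows. This is an illustration under stated assumptions, not the CLI's implementation: `classify_send_result` is a simplified stand-in for the real `classifySendResult` (whose rules the PRD does not spell out), and `capture`, `send`, `sleep`, and `clock` are injected so the loop can run without tmux.

```python
import time
from typing import Callable


def classify_send_result(before: str, after: str) -> str:
    """Simplified stand-in: returns 'accepted', 'draft', or 'unverifiable'."""
    if not after.strip() or after == before:
        return "unverifiable"  # blank capture, or no pane change yet
    last_line = after.rstrip().splitlines()[-1]
    # Assumed heuristic: a trailing '>'-prefixed line with text is an unsent draft.
    if last_line.startswith(">") and last_line[1:].strip():
        return "draft"
    return "accepted"


def verify_send(capture: Callable[[], str], send: Callable[[], None],
                timeout_ms: int = 6000, poll_ms: int = 400,
                sleep: Callable[[float], None] = time.sleep,
                clock: Callable[[], float] = time.monotonic) -> int:
    """Return an exit code: 0 when accepted, 1 on draft or timeout (fail closed)."""
    before = capture()
    send()
    deadline = clock() + timeout_ms / 1000
    while True:
        result = classify_send_result(before, capture())
        if result != "unverifiable":
            return 0 if result == "accepted" else 1
        if clock() >= deadline:
            return 1  # no pane change after send
        sleep(poll_ms / 1000)
```

A blank AFTER capture is classified as unverifiable on every poll, so it fails closed at the timeout instead of being counted as success.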
## Surfaces & parity (MVP-X1)

CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.
27
docs/fleet/TASKS.md
Normal file
@@ -0,0 +1,27 @@
# Tasks — W-FLEET (Fleet) Phase 2: Observability

> Workstream task file for the Fleet. Single-writer: Fleet workstream lead (orchestrator).
> Workers read but never modify. This is **not** the MVP rollup (`docs/TASKS.md`) — a
> rollup row is proposed to the MVP orchestrator, not written here.
>
> Mission: `mvp-20260312` · PRD: [docs/fleet/PRD.md](./PRD.md) · North star: [docs/fleet/north-star.md](./north-star.md)
> Status: `not-started` | `in-progress` | `done` | `blocked` | `failed`

| id | status | description | depends_on | agent | pr | notes |
| --- | --- | --- | --- | --- | --- | --- |
| FLEET-OBS-000 | done | Plan: north-star + Phase-2 PRD + workstream scaffolding | — | lead | — | persisted 2026-06-20 on `feat/fleet-observability` |
| FLEET-OBS-001 | done | Heartbeat protocol v1 spec finalized in PRD + framework doc | FLEET-OBS-000 | lead | — | file-based `~/.config/mosaic/fleet/run/<agent>.hb`; spec in PRD |
| FLEET-OBS-002 | in-progress | Implement heartbeat responder in `dogfood-agent.py` | FLEET-OBS-001 | fleet-coder | — | dispatched to ad-hoc `mosaic yolo` fleet agent (dogfood) |
| FLEET-OBS-003 | done | `mosaic fleet ps` — join systemd+tmux+proc+idle+heartbeat; tenant+host tagged; drift + boot-enable flags; `--json` | FLEET-OBS-001 | worker | — | commit ab47831; LIVE-verified on mosaic-factory; caught canary-pi DRIFT + BOOT-ENABLE. Polish: idleSeconds parse returns null |
| FLEET-OBS-004 | done | `mosaic agent watch <name>` — read-only join (no resize, no keystrokes) | FLEET-OBS-000 | worker | — | `attach -r`; verb wired |
| FLEET-OBS-005 | done | `mosaic agent send --verify` — delivery/acceptance receipt | FLEET-OBS-000 | worker | — | --verify flag; draft-heuristic verify |
| FLEET-OBS-006 | done | CLI specs for ps/watch/send-verify (tenant+host shape, command construction) | FLEET-OBS-003,004,005 | worker | — | 62 tests green (31 new); re-verified by lead |
| FLEET-OBS-007 | not-started | Framework doc: fleet observability guide + verbs | FLEET-OBS-003,004,005 | lead | — | `docs/guides/` or `framework/tools/.../README` |
| FLEET-OBS-008 | not-started | Independent review + dogfood verification on live fleet | FLEET-OBS-002..007 | reviewer | — | author ≠ reviewer; capture evidence in scratchpad |
| FLEET-OBS-009 | not-started | Open PR → green CI (queue guard) → squash-merge → close `fleet-observability-1` | FLEET-OBS-008 | lead | — | trunk merge; no direct push to main |

## Proposed MVP rollup row (for the MVP orchestrator — not written by this workstream)

```
| W-FLEET | in-progress | Fleet (agent-session execution layer) | Phase 2/5 | docs/fleet/TASKS.md | observability dogfooded on live stub fleet; control plane rides federation (W1) |
```
133
docs/fleet/north-star.md
Normal file
@@ -0,0 +1,133 @@
# Mosaic Fleet — North Star

> **Workstream:** W-FLEET (Fleet) under mission `mvp-20260312`
> **Umbrella:** [docs/MISSION-MANIFEST.md](../MISSION-MANIFEST.md) · [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Status:** doctrine — authored 2026-06-20. Owner of this file: Fleet workstream lead.
> This document does **not** modify the MVP rollup; a rollup row is proposed, not written here.

## Vision

A **customizable, multi-tenant fleet of always-on AI agents** — each defined by role,
materialized as a durable, joinable runtime session, coordinated by the proven
orchestrator/worker model, and observable end-to-end across hosts. Coding today;
finance, analytics, research as roster entries tomorrow — same primitives, different
roster. The fleet is the **agent-session execution layer** of the Mosaic Stack MVP:
the thing federation makes reachable across hosts and the webUI/TUI/CLI make visible.

The USC tmux PoC (durable sessions + `agent-send` comms) proved the model. This
workstream makes it an official, observable, multi-tenant Mosaic Stack capability.

## The Fleet as means of production (bootstrapping)

The Fleet has a **dual role**, and that is the point:

- **As product** — a multi-tenant agent-fleet capability of Mosaic Stack (this workstream).
- **As means of production** — the orchestrator/worker fleet that _actually builds the
  entire MVP_ (federation W1, webUI, TUI, CLI, and the Fleet itself).

We are **building the system that builds the system.** Every other MVP workstream is
delivered _by_ the fleet, so fleet observability and control are not merely product
features — they are the **operational floor of the whole delivery effort**. If we cannot
see and steer the agents, we cannot trust what they ship. This is why Phase 2
(observability) leads: it is the instrument panel for the factory, dogfooded on the live
fleet that is, recursively, building Mosaic Stack.

The discipline that makes great power safe is the same gate chain the fleet enforces:
independent review before merge, green CI, honest completion, decide-and-inform cadence,
and no irreversible action without authority. The bootstrap is only as trustworthy as
those gates.

## Alignment with MVP cross-cutting requirements

The Fleet inherits — does not re-invent — the MVP's hard requirements:

| MVP req | What it means for the Fleet |
| --- | --- |
| MVP-X1 three-surface parity | fleet observability/control reachable via **CLI + TUI + webUI** (CLI first; webUI is required for parity, not optional) |
| MVP-X2 multi-tenant isolation | one tenant = one **Linux uid** (own `systemd --user`, socket, `~/.config/mosaic`); no cross-tenant leakage |
| MVP-X3 auth (BetterAuth/SSO) | operator→fleet and cross-host views are auth-gated through the platform's existing auth |
| MVP-X4 quality gates | `pnpm typecheck`/`lint`/`format:check` green before any push |
| MVP-X5 federated topology | cross-host fleet visibility rides the **federation** boundary (W1), not a bespoke broker |
| MVP-X6 OTEL tracing | heartbeats, sends, and lifecycle events emit spans; `traceparent` crosses the federation boundary |
| MVP-X7 trunk merge | branch from `main`, squash-merge via PR, never push to `main` |

## The stack — where every concern lives

One **definition** is the source of truth; the **session** is how it runs.

| Layer | Owner | Phase-2 reality | Destination |
| --- | --- | --- | --- |
| **Definition + identity + auth** | gateway / `mosaic-as` (scoped tokens, #541) | `roster.yaml` (tenant-tagged) | one definition; `mosaic agent --new` materializes it |
| **Tenancy boundary** | **Linux uid per tenant** (linger, own `systemd --user`, own socket, own `~/.config/mosaic`) | one tenant: `jarvis` = tenant zero | uid-per-tenant; federation aggregates across hosts |
| **Runtime** | per-tenant tmux session on isolated socket | dogfood stub sessions (live now on `mosaic-factory`) | claude/codex/pi/opencode TUIs |
| **Liveness** | **heartbeat protocol** every runtime answers | protocol defined + dogfood stub answers it | all runtimes answer; "healthy" ≠ "pane alive" |
| **Observation** | read-only `watch` (native tmux) + `pipe-pane` stream | CLI `watch`/`ps`; explicit opt-in `attach` for control | + auth-gated webUI streams |
| **Control plane** | **federation** across hosts × tenants | records already carry `tenant_id` + `host` | federated gateways expose fleet state; webUI in Phase 5 |

## Operating model (inherited, not reinvented)

The AI-guide law stands: one accountable **orchestrator**, isolated **workers** that
stop at PR-open, the serialized **gate chain** (independent review → green CI →
diff-sanity → squash-merge → verify), **decide-and-inform** cadence, and a durable
**board** so missions survive session death. The Fleet is the infrastructure _under_
this model. See `mosaicstack-aiguide` whitepapers 01 (inter-agent comms) and 03
(orchestration model) for the rationale.

## Invariants — "maximal vision, incremental delivery, zero foreclosure"

Every artifact, starting Phase 2, MUST:

1. Carry **`tenant_id` + `host`** in schema and message addressing — even with one of each today.
2. Treat **isolation socket ≠ invisibility** — anything isolated is surfaced by one command.
3. Define **healthy = answered a heartbeat within N seconds**, never just "pane alive".
4. Make **observation read-only by default**; control is an explicit, separate, opt-in verb.

## Observation model

| Verb | Behavior |
| --- | --- |
| `mosaic fleet ps` | one table joining systemd + tmux + process + idle + last-heartbeat, with drift + boot-enable flags |
| `mosaic agent watch <name>` | **read-only** join (grouped session / `-r`), no resize tyranny, no keystrokes |
| `mosaic agent attach <name>` | explicit interactive takeover (the only path that can type) |
| `mosaic agent send <name> --verify` | confirms message **accepted**, not merely keystroke-injected |

> Why the current PoC blocks observation: sessions live on the isolated `mosaic-factory`
> socket (invisible to default `tmux ls`), the only sanctioned read is `capture-pane`
> (blank for full-screen TUIs), and `attach` is read-write + resizes the session. The
> verbs above restore "join and observe" safely.

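The read-only `watch` verb can be sketched as three tmux invocations, following the grouped-session approach recorded in the PRD's known limitations. `watch_commands` is a hypothetical helper. Only the `new-session -d -t '=<agent>' -s '<agent>-watch-<pid>'` form comes from the PRD; the `attach-session -r` and `kill-session` argument lists are assumptions assembled from standard tmux options.

```python
import os
from typing import List, Optional

SOCKET = "mosaic-factory"  # isolated fleet socket named in this document


def watch_commands(agent: str, pid: Optional[int] = None) -> List[List[str]]:
    """Build the argv lists for a read-only, independently sized viewer."""
    viewer = f"{agent}-watch-{pid if pid is not None else os.getpid()}"
    tmux = ["tmux", "-L", SOCKET]
    return [
        # Throwaway session grouped with the agent's session: it shares the
        # agent's windows but keeps its own size.
        tmux + ["new-session", "-d", "-t", f"={agent}", "-s", viewer],
        # Attach read-only to the viewer session, never to the agent session.
        tmux + ["attach-session", "-r", "-t", f"={viewer}"],
        # Remove the viewer session once the operator detaches.
        tmux + ["kill-session", "-t", f"={viewer}"],
    ]
```

Returning argv lists instead of running tmux directly keeps the construction testable and leaves TTY passthrough for the attach step to the caller.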
## Phased roadmap

| Phase | Outcome | Status |
| --- | --- | --- |
| 0–1 | tmux PoC, hardening, published CLI v0.0.34 (#565–#568) | ✅ done |
| **2 — Observability** | `fleet ps` (host+tenant aware join), heartbeat protocol + dogfood stub answers it, `agent watch` (read-only), `agent send --verify` receipts | ▶ now |
| 3 — Real runtimes | claude/codex/pi/opencode answer heartbeat; **hybrid lifecycle** (core always-on: orchestrator+reviewer; ephemeral workers per lane) | planned |
| 4 — Unified definition | one agent schema in gateway; `mosaic agent --new` → materialized per-tenant session; uid-tenant provisioning | planned |
| 5 — Control plane | federation-backed cross-host × cross-tenant fleet view; **webUI** (surface chosen then) for MVP-X1 parity | planned |

## Decisions of record (2026-06-20, with Jason)

- Agent model: **config defines, session runs** (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: **multi-tenant from the start**; isolation = **per-tenant Linux uid**.
- Health: **heartbeat required** (dogfood stub implements the protocol now).
- Lifecycle: **hybrid** — core always-on + ephemeral workers per lane.
- Observation: **read-only default, opt-in takeover**.
- Multi-host: **designed-for from day one**; control plane **rides federation (W1)**.
- Delivery: **CLI-first now**, dogfood against the live stub fleet; webUI deferred to Phase 5.
- Runtimes: fleet agents default to **Codex / pi-on-Codex**; **Claude is reserved for Claude
  Code only** (avoid alternate-harness API pricing). Validated durable recipe:
  `mosaic yolo pi --model openai-codex/gpt-5.5:high`. Durable detached launch requires the
  runtime-bin on PATH (baked into the pane command) + boot-survival (`enable` + linger),
  which `fleet init` should automate.

## Assumptions (veto-able)

- `ASSUMPTION:` first-class runtimes = claude, codex, pi, opencode; a "role" (analyst,
  finance, researcher) = persona + skills + tools on top of a runtime, shipped as a
  starter role library in the framework.
- `ASSUMPTION:` the cross-host control plane is the **federation** layer (W1), not a
  separate `fleetd` daemon.
- `ASSUMPTION:` Fleet is workstream **W-FLEET** under `mvp-20260312`; a rollup row in
  `docs/TASKS.md` and a workstream declaration in `MISSION-MANIFEST.md` are proposed to
  the MVP orchestrator, not written by this workstream.
100
docs/scratchpads/fleet-observability-phase2.md
Normal file
@@ -0,0 +1,100 @@
# Scratchpad — Fleet Phase 2: Observability (W-FLEET)

> Append-only. Mission `mvp-20260312` / workstream W-FLEET.
> Lead: Jarvis (Claude) at `W-jarvis:mos-claude-18`. Coordinating with `jwoltje@dragon-lin:coder0-0`.

## Mission prompt (2026-06-20)

Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability
for delivery. The USC tmux PoC is the proven base. Jason granted lead authority:
"The fleet is a great way to actually build the MVP — we are building the system that
builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate
with a second agent on `dragon-lin`.

## Decisions of record (with Jason, 2026-06-20)

- Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
- Health: heartbeat required; dogfood stub implements protocol now.
- Lifecycle: hybrid (core always-on + ephemeral workers).
- Observation: read-only default, opt-in takeover.
- Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
- Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
- Fleet is dual-role: product AND means of production (bootstrapping the MVP).
- Code review = **dual-engine**: Claude **and** gpt-5.5/Codex, run together (Jason: the
  combination produces the best results). Launch reviewers via `mosaic yolo pi` / `codex`
  (proven path) or `~/.config/mosaic/tools/codex/codex-code-review.sh`. Applies to all
  code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
- Worktree discipline: do fleet work in `~/src/mosaicstack-stack-worktrees/<branch>`, NOT
  the shared main checkout — concurrent processes mutate `main` there (learned 2026-06-20).

## Environment facts (verified 2026-06-20)

- Fleet is live on `W-jarvis` (uid 1000, `jarvis`, `Linger=yes`) on tmux socket
  `mosaic-factory`: `_holder`, `canary-pi`, `dogfood-coder`, `dogfood-orchestrator`,
  `dogfood-reviewer`. All panes run `~/.config/mosaic/fleet/dogfood-agent.py` (stub),
  including `canary-pi` (roster says runtime=pi → **drift**).
- Holder + `mosaic-agent@*` units are `active (exited)` but `UnitFileState=disabled`
  (reboot loses fleet → boot-enable gap to surface).
- Observation blocked by: isolated socket (hidden from default `tmux ls`), `capture-pane`
  blank for TUIs, `attach` being read-write + resizing.
- Second agent: `jwoltje@dragon-lin`, session `coder0-0` (group `coder0`), running `node`,
  default socket. ssh forward reach confirmed.

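The boot-enable gap noted above can be detected from `systemctl --user show` output, which prints one `Key=Value` pair per line. A minimal sketch follows; `parse_show` and `boot_enable_gap` are hypothetical helper names, and the sample text is illustrative rather than captured from the live fleet.

```python
from typing import Dict


def parse_show(output: str) -> Dict[str, str]:
    """Parse `systemctl show -p ...` output (one Key=Value per line)."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a '=' separator
            props[key] = value
    return props


def boot_enable_gap(props: Dict[str, str]) -> bool:
    """True when the unit is active now but will not start after a reboot."""
    return (props.get("ActiveState") == "active"
            and props.get("UnitFileState") == "disabled")


# Illustrative output of:
#   systemctl --user show -p ActiveState -p SubState -p UnitFileState <unit>
SAMPLE = "ActiveState=active\nSubState=exited\nUnitFileState=disabled\n"
print(boot_enable_gap(parse_show(SAMPLE)))
```

Checking `UnitFileState` separately from `ActiveState` matters because a unit started by hand looks healthy until the next reboot.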
## Governance / collision-safety

- `mosaicstack-stack` has active mission `mvp-20260312` with single-writer locks on
  `docs/MISSION-MANIFEST.md`, `docs/TASKS.md`, `docs/scratchpads/mvp-20260312.md`.
- This workstream touches NONE of those. All Fleet docs scoped under `docs/fleet/` +
  this scratchpad. Rollup row proposed, not written.

## Session log

- 2026-06-20: Researched AI guide + fleet code + live state. Established north star with
  Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
  `docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms
  with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`,
  `agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on
  mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON.
  Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — `fleet ps` HB now
  `healthy` for all 4 agents.
- Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572
  (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex caught an Alpine
  blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
- **FINDINGS (north-star / Phase-3 blockers):**
  1. Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a
     detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub
     survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY
     in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps`
     caught both dead panes instantly (tool validated).
  2. `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's
     global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via
     tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
- Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close
  `fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.
- 2026-06-21 (session 3): Phase-2 PR #579 merged (3 dual-engine rounds hardened
  verify+watch). Then closed the launch-path question with Jason's input — CORRECTING
  earlier findings:
  - The ad-hoc launch deaths were NOT a fundamental TTY blocker: (a) codex was a stale
    version (Jason updated it); (b) pi was misconfigured to Claude auth (Jason removed it;
    default is now Codex). The REAL durable-launch bug is **PATH**: the detached tmux
    launch shell is login+non-interactive, so it misses `~/.npm-global/bin` (added only in
    `~/.bashrc`) -> `mosaic: command not found` (127) -> pane dies. tmux panes inherit the
    tmux _server_ env, so PATH must be baked into the pane command.
  - **Durable real-agent recipe (validated live on gpt-5.5, Claude-free):**
    `mosaic yolo pi --model openai-codex/gpt-5.5:high` — pi tolerates detached tmux; a raw
    interactive TUI (codex CLI) exits without an attached client. Status line confirmed
    `(openai-codex) gpt-5.5 • high`.
  - PATH fix landed in `start-agent-session.sh` (commit 32efc13, branch
    feat/fleet-launch-path): derive runtime-bin prefix (MOSAIC_RUNTIME_BIN | npm prefix |
    ~/.npm-global/bin | ~/.local/bin), bake `export PATH=...; exec <cmd>` into the pane;
    `exec` also fixes the drift false-positive. Live-tested under stripped PATH -> durable.
  - Boot-survival: Jason ran `systemctl --user enable` (+ linger). TODO: auto-enable in
    **fleet init** so operators never have to remember it (agentic-enhancement cycle).
  - Future custom Pi harness build: pi cannot self-report its model (track
    runtime/model/effort as fleet metadata); drift detection should recognize `node` as
    pi's pane command (a node-wrapped pane can currently read as drift).
  - Findings recorded in AI Guide playbooks/tmux-fleet.md (aiguide PR #7, merged).
  - Policy: avoid Claude outside Claude Code (API pricing for alt-harness use) — fleet
    runtimes default to Codex / pi-on-Codex; Claude stays in Claude Code only.
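The PATH fix recorded in session 3 can be illustrated in Python. This sketch mirrors the precedence listed in that entry but is not the `start-agent-session.sh` code: `build_prefix` and `pane_command` are hypothetical helpers, and the npm prefix is passed in as an argument instead of being queried from npm.

```python
import os
import shlex
from typing import List, Optional


def build_prefix(home: str, runtime_bin: Optional[str] = None,
                 npm_prefix: Optional[str] = None) -> str:
    """Join existing candidate dirs in precedence order, without duplicates."""
    candidates: List[str] = []
    if runtime_bin:
        candidates.append(runtime_bin)
    if npm_prefix:
        candidates.append(os.path.join(npm_prefix, "bin"))
    candidates += [os.path.join(home, ".npm-global", "bin"),
                   os.path.join(home, ".local", "bin")]
    kept: List[str] = []
    for path in candidates:
        if os.path.isdir(path) and path not in kept:
            kept.append(path)
    return ":".join(kept)


def pane_command(prefix: str, agent_argv: List[str]) -> str:
    """Bake PATH into the pane's shell snippet and exec the runtime."""
    command = "exec " + " ".join(shlex.quote(arg) for arg in agent_argv)
    if not prefix:
        return command
    # $PATH is left unexpanded so it resolves inside the pane shell, that is,
    # against the tmux server's environment.
    return f'export PATH={shlex.quote(prefix)}":$PATH"; {command}'
```

Using `exec` makes the runtime the pane's foreground process, which is what the log entry credits with removing the drift false-positive.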
@@ -26,5 +26,75 @@ if [ -z "$MOSAIC_AGENT_COMMAND" ]; then
  MOSAIC_AGENT_COMMAND="mosaic yolo $MOSAIC_AGENT_RUNTIME"
fi

# ── Derive a runtime-bin PATH prefix ─────────────────────────────────────────
# Precedence:
# 1. $MOSAIC_RUNTIME_BIN (explicit override)
# 2. $(npm config get prefix)/bin (if npm is on PATH)
# 3. Fallbacks: $HOME/.npm-global/bin and $HOME/.local/bin
#
# Only directories that already exist are included. The prefix is baked into
# the pane command regardless of what the LAUNCHER process's $PATH contains,
# because the tmux pane inherits the tmux SERVER environment (not this script's
# environment). A dir on the launcher's PATH may be absent from the server PATH,
# so every existing candidate must always be included. Dedup within the
# constructed prefix avoids listing the same dir twice.
_build_runtime_bin_prefix() {
  local candidates=()

  if [ -n "${MOSAIC_RUNTIME_BIN:-}" ]; then
    candidates+=("$MOSAIC_RUNTIME_BIN")
  fi

  if command -v npm >/dev/null 2>&1; then
    local npm_prefix
    npm_prefix=$(npm config get prefix 2>/dev/null) || true
    if [ -n "$npm_prefix" ]; then
      candidates+=("${npm_prefix}/bin")
    fi
  fi

  candidates+=("$HOME/.npm-global/bin")
  candidates+=("$HOME/.local/bin")

  local prefix=""
  for dir in "${candidates[@]}"; do
    [ -d "$dir" ] || continue
    if [ -z "$prefix" ]; then
      prefix="$dir"
    else
      case ":${prefix}:" in
        *":${dir}:"*) ;; # already in our prefix — skip
        *) prefix="${prefix}:${dir}" ;;
      esac
    fi
  done

  printf '%s' "$prefix"
}

MOSAIC_RUNTIME_BIN_PREFIX=$(_build_runtime_bin_prefix)

# ── Build the pane command ────────────────────────────────────────────────────
|
||||||
|
# The pane command must:
|
||||||
|
# - Export the augmented PATH so the runtime binary is found.
|
||||||
|
# - exec the agent command so the runtime is the pane's foreground process
|
||||||
|
# (makes `fleet ps` pane_current_command check reliable; no DRIFT false-positive).
|
||||||
|
#
|
||||||
+# Quoting strategy: the snippet is built as a single double-quoted string.
+#   - MOSAIC_RUNTIME_BIN_PREFIX and MOSAIC_AGENT_COMMAND are expanded NOW (in
+#     this script), because the pane shell inherits the tmux server
+#     environment, not this script's env.
+#   - \${PATH} is escaped so it expands LATER, inside the pane shell, where it
+#     refers to the PATH the pane actually inherited.
+#   - Variable references inside the value of MOSAIC_AGENT_COMMAND are not
+#     re-expanded here; they expand when the pane shell runs the snippet.
+
+if [ -n "$MOSAIC_RUNTIME_BIN_PREFIX" ]; then
+  PANE_SHELL_SNIPPET="export PATH=\"${MOSAIC_RUNTIME_BIN_PREFIX}:\${PATH}\"; exec ${MOSAIC_AGENT_COMMAND}"
+else
+  PANE_SHELL_SNIPPET="exec ${MOSAIC_AGENT_COMMAND}"
+fi
+
 mkdir -p "$MOSAIC_AGENT_WORKDIR"
-exec tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "$AGENT_NAME" -c "$MOSAIC_AGENT_WORKDIR" "$MOSAIC_AGENT_COMMAND"
+exec tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "$AGENT_NAME" -c "$MOSAIC_AGENT_WORKDIR" \
+  bash -c "$PANE_SHELL_SNIPPET"
@@ -6,13 +6,26 @@ START="$SCRIPT_DIR/start-agent-session.sh"
 SOCKET="mosaic-agent-test-$RANDOM-$$"
 AGENT="agent-$RANDOM"
 WORKDIR=$(mktemp -d)
-trap 'tmux -L "$SOCKET" kill-server >/dev/null 2>&1 || true; rm -rf "$WORKDIR"' EXIT
+
+# Keep a single cleanup trap that accumulates resources.
+CLEANUP_DIRS=("$WORKDIR")
+CLEANUP_SOCKETS=("$SOCKET")
+trap '_cleanup' EXIT
+_cleanup() {
+  for s in "${CLEANUP_SOCKETS[@]:-}"; do
+    tmux -L "$s" kill-server >/dev/null 2>&1 || true
+  done
+  for d in "${CLEANUP_DIRS[@]:-}"; do
+    rm -rf "$d"
+  done
+}
 
 fail() {
   echo "FAIL: $*" >&2
   exit 1
 }
 
+# ── Test 1: basic session creation with workdir check ─────────────────────────
 MOSAIC_TMUX_SOCKET="$SOCKET" \
   MOSAIC_AGENT_WORKDIR="$WORKDIR" \
   MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
@@ -22,6 +35,7 @@ tmux -L "$SOCKET" has-session -t "=$AGENT:0.0" || fail "agent session was not cr
 actual_dir=$(tmux -L "$SOCKET" display-message -p -t "=$AGENT:0.0" '#{pane_current_path}')
 [ "$actual_dir" = "$WORKDIR" ] || fail "agent workdir mismatch: $actual_dir"
 
+# ── Test 2: idempotency (duplicate start prints 'already running') ─────────────
 MOSAIC_TMUX_SOCKET="$SOCKET" \
   MOSAIC_AGENT_WORKDIR="$WORKDIR" \
   MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
@@ -29,4 +43,166 @@ MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
 
 grep -qF 'already running' /tmp/mosaic-start-agent-idempotent.out || fail "duplicate start was not idempotent"
 
+# ── Test 3: runtime-bin PATH prefix is baked into the pane command ────────────
+#
+# We capture the command the script would hand to tmux by injecting a fake
+# 'tmux' shim into PATH. The shim:
+#   - Intercepts 'new-session' calls and records its arguments to a file.
+#   - For 'has-session' calls, exits 1 (session does not exist) so the script
+#     proceeds to launch instead of printing "already running".
+#   - For all other subcommands, exits 0.
+#
+# Assertions:
+#   a) 'export PATH=' with the synthetic MOSAIC_RUNTIME_BIN prefix appears.
+#   b) 'exec' appears so the runtime replaces the wrapper shell.
+#   c) MOSAIC_AGENT_COMMAND with flags is forwarded intact.
+
+FAKE_BIN=$(mktemp -d)
+FAKE_RUNTIME_BIN=$(mktemp -d)
+TMUX_ARGS_FILE=$(mktemp)
+CLEANUP_DIRS+=("$FAKE_BIN" "$FAKE_RUNTIME_BIN")
+
+# Write the fake tmux shim (uses only positional args, no sourced vars).
+cat > "$FAKE_BIN/tmux" <<SHIM
+#!/usr/bin/env bash
+# Fake tmux: record new-session args; report has-session as missing.
+subcmd="\$3" # argv: tmux -L <socket> <subcmd> ...
+if [ "\$subcmd" = "has-session" ]; then
+  exit 1 # session not found → script will attempt new-session
+fi
+if [ "\$subcmd" = "new-session" ]; then
+  printf '%s\n' "\$@" > "$TMUX_ARGS_FILE"
+  exit 0
+fi
+exit 0
+SHIM
+chmod +x "$FAKE_BIN/tmux"
+
+SOCKET3="mosaic-agent-test3-$RANDOM-$$"
+AGENT3="agent3-$RANDOM"
+WORKDIR3=$(mktemp -d)
+CLEANUP_DIRS+=("$WORKDIR3")
+
+PATH="$FAKE_BIN:$PATH" \
+  MOSAIC_TMUX_SOCKET="$SOCKET3" \
+  MOSAIC_AGENT_WORKDIR="$WORKDIR3" \
+  MOSAIC_AGENT_RUNTIME="pi" \
+  MOSAIC_RUNTIME_BIN="$FAKE_RUNTIME_BIN" \
+  MOSAIC_AGENT_COMMAND="mosaic yolo pi --model openai-codex/gpt-5.5:high" \
+  "$START" "$AGENT3"
+
+all_args=$(cat "$TMUX_ARGS_FILE" 2>/dev/null || true)
+rm -f "$TMUX_ARGS_FILE"
+
+echo "--- captured tmux new-session args ---"
+echo "$all_args"
+echo "--- end args ---"
+
+# a) PATH prefix containing FAKE_RUNTIME_BIN must appear.
+echo "$all_args" | grep -qF "export PATH=" || fail "pane command does not export PATH"
+echo "$all_args" | grep -qF "$FAKE_RUNTIME_BIN" || fail "pane command does not include MOSAIC_RUNTIME_BIN in PATH prefix"
+
+# b) exec must appear so the runtime replaces the wrapper shell.
+echo "$all_args" | grep -qF "exec " || fail "pane command does not use exec"
+
+# c) Full MOSAIC_AGENT_COMMAND (with flags) must be forwarded.
+echo "$all_args" | grep -qF "mosaic yolo pi --model openai-codex/gpt-5.5:high" || \
+  fail "pane command does not forward MOSAIC_AGENT_COMMAND with flags intact"
+
+# ── Test 4: when no extra runtime-bin dirs exist, exec still appears ───────────
+TMUX_ARGS_FILE2=$(mktemp)
+FAKE_BIN2=$(mktemp -d)
+CLEANUP_DIRS+=("$FAKE_BIN2")
+
+cat > "$FAKE_BIN2/tmux" <<SHIM2
+#!/usr/bin/env bash
+subcmd="\$3"
+if [ "\$subcmd" = "has-session" ]; then exit 1; fi
+if [ "\$subcmd" = "new-session" ]; then
+  printf '%s\n' "\$@" > "$TMUX_ARGS_FILE2"
+  exit 0
+fi
+exit 0
+SHIM2
+chmod +x "$FAKE_BIN2/tmux"
+
+SOCKET4="mosaic-agent-test4-$RANDOM-$$"
+AGENT4="agent4-$RANDOM"
+WORKDIR4=$(mktemp -d)
+CLEANUP_DIRS+=("$WORKDIR4")
+
+# MOSAIC_RUNTIME_BIN points to a non-existent dir, so it adds nothing to the
+# prefix. The npm prefix, .npm-global/bin and .local/bin may or may not exist,
+# so the prefix may or may not be empty; either way we just want exec.
+PATH="$FAKE_BIN2:$PATH" \
+  MOSAIC_TMUX_SOCKET="$SOCKET4" \
+  MOSAIC_AGENT_WORKDIR="$WORKDIR4" \
+  MOSAIC_AGENT_RUNTIME="pi" \
+  MOSAIC_RUNTIME_BIN="/nonexistent-dir-$$" \
+  MOSAIC_AGENT_COMMAND="mosaic yolo pi" \
+  "$START" "$AGENT4"
+
+all_args4=$(cat "$TMUX_ARGS_FILE2" 2>/dev/null || true)
+rm -f "$TMUX_ARGS_FILE2"
+rm -rf "$WORKDIR4"
+
+echo "$all_args4" | grep -qF "exec " || fail "pane command (no prefix dirs) does not use exec"
+echo "$all_args4" | grep -qF "mosaic yolo pi" || fail "pane command does not include agent command when no prefix"
+
+# ── Test 5: candidate dir already in LAUNCHER $PATH is still baked into pane ──
+#
+# Regression guard for the bug where _build_runtime_bin_prefix() used to skip
+# a candidate because it was already present in the launcher process's $PATH.
+# That check was wrong: the pane inherits the tmux SERVER environment, not the
+# launcher's env. Even if a dir is on the launcher's PATH it must always be
+# baked into the pane's PATH export.
+#
+# We prove this by setting PATH to include FAKE_RUNTIME_BIN5 (the candidate),
+# then asserting the generated new-session command still exports it.
+TMUX_ARGS_FILE5=$(mktemp)
+FAKE_BIN5=$(mktemp -d)
+FAKE_RUNTIME_BIN5=$(mktemp -d) # this dir IS on the launcher's PATH below
+CLEANUP_DIRS+=("$FAKE_BIN5" "$FAKE_RUNTIME_BIN5")
+
+cat > "$FAKE_BIN5/tmux" <<SHIM5
+#!/usr/bin/env bash
+subcmd="\$3"
+if [ "\$subcmd" = "has-session" ]; then exit 1; fi
+if [ "\$subcmd" = "new-session" ]; then
+  printf '%s\n' "\$@" > "$TMUX_ARGS_FILE5"
+  exit 0
+fi
+exit 0
+SHIM5
+chmod +x "$FAKE_BIN5/tmux"
+
+SOCKET5="mosaic-agent-test5-$RANDOM-$$"
+AGENT5="agent5-$RANDOM"
+WORKDIR5=$(mktemp -d)
+CLEANUP_DIRS+=("$WORKDIR5")
+CLEANUP_SOCKETS+=("$SOCKET5")
+
+# FAKE_RUNTIME_BIN5 is deliberately placed on the LAUNCHER PATH so that the
+# old (buggy) code would have skipped it. The correct code must still include
+# it in the pane PATH export.
+PATH="$FAKE_BIN5:$FAKE_RUNTIME_BIN5:$PATH" \
+  MOSAIC_TMUX_SOCKET="$SOCKET5" \
+  MOSAIC_AGENT_WORKDIR="$WORKDIR5" \
+  MOSAIC_AGENT_RUNTIME="pi" \
+  MOSAIC_RUNTIME_BIN="$FAKE_RUNTIME_BIN5" \
+  MOSAIC_AGENT_COMMAND="mosaic yolo pi" \
+  "$START" "$AGENT5"
+
+all_args5=$(cat "$TMUX_ARGS_FILE5" 2>/dev/null || true)
+rm -f "$TMUX_ARGS_FILE5"
+rm -rf "$WORKDIR5"
+
+echo "--- test 5: launcher-PATH candidate must still appear in pane export ---"
+echo "$all_args5"
+echo "--- end test 5 args ---"
+
+echo "$all_args5" | grep -qF "export PATH=" || \
+  fail "test5: pane command does not export PATH when candidate is on launcher PATH"
+echo "$all_args5" | grep -qF "$FAKE_RUNTIME_BIN5" || \
+  fail "test5: candidate dir (already on launcher PATH) was NOT baked into pane PATH — regression"
+
 echo "ok - start-agent-session"
File diff suppressed because it is too large
@@ -1,12 +1,19 @@
 import { constants } from 'node:fs';
 import { access, chmod, copyFile, mkdir, readFile, writeFile } from 'node:fs/promises';
-import { homedir, hostname } from 'node:os';
+import { homedir, hostname, userInfo } from 'node:os';
 import { dirname, join, resolve } from 'node:path';
 import { fileURLToPath } from 'node:url';
 import { spawn } from 'node:child_process';
 import type { Command } from 'commander';
 import YAML from 'yaml';
 
+/**
+ * A function that spawns a command with inherited stdio (TTY passthrough).
+ * Used for interactive commands like `tmux attach` that need a real terminal.
+ * Resolves with the process exit code.
+ */
+export type InteractiveRunner = (command: string, args: string[]) => Promise<number>;
+
 export interface CommandResult {
   stdout: string;
   stderr: string;
@@ -15,8 +22,23 @@ export interface CommandResult {
 
 export type CommandRunner = (command: string, args: string[]) => Promise<CommandResult>;
 
+/**
+ * Injectable sleep helper used by the send --verify polling loop.
+ * Tests stub this to avoid real delays; production uses the default
+ * implementation backed by setTimeout.
+ */
+export type SleepFn = (ms: number) => Promise<void>;
+
 export interface FleetCommandDeps {
   runner?: CommandRunner;
+  /** Injectable interactive runner for commands needing inherited TTY (e.g., `tmux attach`). */
+  interactiveRunner?: InteractiveRunner;
+  /**
+   * Injectable sleep function for the send --verify polling loop.
+   * Defaults to a real setTimeout-based sleep. Tests stub this to avoid
+   * real delays; the default is used in production.
+   */
+  sleepFn?: SleepFn;
   mosaicHome?: string;
   frameworkRoot?: string;
 }
@@ -92,6 +114,18 @@ type FleetServiceAction = 'start' | 'stop' | 'restart' | 'status';
 const DEFAULT_SOCKET_NAME = 'mosaic-factory';
 const DEFAULT_HOLDER_SESSION = '_holder';
 const DEFAULT_WORKING_DIRECTORY = '~/src';
+
+/**
+ * Default poll interval (ms) between capture-pane checks in `send --verify`.
+ * Kept short enough to react quickly while not hammering tmux on busy hosts.
+ */
+export const VERIFY_POLL_INTERVAL_MS = 400;
+
+/**
+ * Default total timeout (ms) for the `send --verify` polling loop.
+ * Configurable via `--verify-timeout <ms>` on `agent send`.
+ */
+export const VERIFY_DEFAULT_TIMEOUT_MS = 6_000;
 const DEFAULT_RUNTIME_RESETS: Record<string, { resetCommand: string }> = {
   claude: { resetCommand: '/clear' },
   codex: { resetCommand: '/clear' },
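With the defaults in this hunk, a verify run that never gets a usable answer makes at most 6 000 / 400 = 15 checks. The loop itself is not part of this hunk, so the following is only a minimal sketch of how such a poll could be driven by an injectable `SleepFn`. The name `pollVerdict`, the `check` callback, and the rule "stop as soon as the verdict is no longer `unverifiable`" are assumptions for illustration, not the PR's implementation.

```typescript
const VERIFY_POLL_INTERVAL_MS = 400;
const VERIFY_DEFAULT_TIMEOUT_MS = 6_000;

type SleepFn = (ms: number) => Promise<void>;
type Verdict = 'accepted' | 'draft' | 'unverifiable';

// Hypothetical loop: sleep, re-check, and stop once the verdict is no longer
// 'unverifiable' or the time budget is spent.
async function pollVerdict(
  check: () => Promise<Verdict>,
  sleep: SleepFn,
  timeoutMs = VERIFY_DEFAULT_TIMEOUT_MS,
  intervalMs = VERIFY_POLL_INTERVAL_MS,
): Promise<{ verdict: Verdict; polls: number }> {
  let polls = 0;
  let verdict: Verdict = 'unverifiable';
  for (let waited = 0; waited < timeoutMs; waited += intervalMs) {
    await sleep(intervalMs);
    polls += 1;
    verdict = await check();
    if (verdict !== 'unverifiable') break;
  }
  return { verdict, polls };
}

// Stubbed sleep, as a test would inject it: no real delay.
const noSleep: SleepFn = async () => {};
pollVerdict(async () => 'unverifiable', noSleep).then((r) => {
  console.log(r.verdict, r.polls); // unverifiable 15
});
```

Injecting the sleep keeps the timing budget testable without waiting six real seconds.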
@@ -236,6 +270,401 @@ export function buildAgentTailCommand(
   ];
 }
 
+// ---------------------------------------------------------------------------
+// Fleet ps — phase 2 observability helpers
+// ---------------------------------------------------------------------------
+
+export const HEARTBEAT_INTERVAL_MS = 15_000;
+export const HEARTBEAT_HEALTHY_MULTIPLIER = 3;
+
+export interface HeartbeatInfo {
+  ts: Date | null;
+  pid: number | null;
+  status: 'ok' | 'busy' | null;
+  /** healthy | stale | unknown */
+  health: 'healthy' | 'stale' | 'unknown';
+  ageMs: number | null;
+}
+
+export interface AgentPsRow {
+  name: string;
+  tenant_id: string;
+  host: string;
+  runtime: string;
+  systemdActive: string;
+  systemdEnabled: string;
+  paneAlive: boolean;
+  panePid: number | null;
+  paneCommand: string | null;
+  idleSeconds: number | null;
+  heartbeat: HeartbeatInfo;
+  /** roster runtime !== actual pane command */
+  driftFlag: boolean;
+  /** active but UnitFileState=disabled */
+  bootEnableWarning: boolean;
+}
+
+/**
+ * Returns the systemd show command for an agent unit (active+enabled state).
+ * Returns: `systemctl --user show <unit> -p ActiveState -p SubState -p UnitFileState`
+ */
+export function buildSystemdShowCommand(agentName: string): string[] {
+  const unit = `mosaic-agent@${agentName}.service`;
+  return [
+    'systemctl',
+    '--user',
+    'show',
+    unit,
+    '-p',
+    'ActiveState',
+    '-p',
+    'SubState',
+    '-p',
+    'UnitFileState',
+  ];
+}
+
+/**
+ * Returns the tmux list-panes command for an agent pane.
+ * Format: `#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}`
+ */
+export function buildTmuxListPanesCommand(
+  agentName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+): string[] {
+  return [
+    'tmux',
+    '-L',
+    socketName,
+    'list-panes',
+    '-t',
+    `=${agentName}:0.0`,
+    '-F',
+    '#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}',
+  ];
+}
+
+/**
+ * Returns the heartbeat file path for an agent.
+ */
+export function heartbeatPath(agentName: string, mosaicHome = defaultMosaicHome()): string {
+  return join(mosaicHome, 'fleet', 'run', `${agentName}.hb`);
+}
+
+/**
+ * Parse a heartbeat file's contents into a HeartbeatInfo.
+ * File format (one key=value per line):
+ *   ts=<iso8601>
+ *   pid=<pid>
+ *   status=<ok|busy>
+ */
+export function parseHeartbeat(content: string | null, nowMs = Date.now()): HeartbeatInfo {
+  if (content === null) {
+    return { ts: null, pid: null, status: null, health: 'unknown', ageMs: null };
+  }
+  const lines = content.split('\n');
+  let ts: Date | null = null;
+  let pid: number | null = null;
+  let status: 'ok' | 'busy' | null = null;
+  for (const line of lines) {
+    const [key, ...rest] = line.split('=');
+    const val = rest.join('=').trim();
+    if (key === 'ts' && val) {
+      const d = new Date(val);
+      if (!Number.isNaN(d.getTime())) ts = d;
+    } else if (key === 'pid' && val) {
+      const n = Number.parseInt(val, 10);
+      if (Number.isFinite(n)) pid = n;
+    } else if (key === 'status' && (val === 'ok' || val === 'busy')) {
+      status = val;
+    }
+  }
+  const thresholdMs = HEARTBEAT_INTERVAL_MS * HEARTBEAT_HEALTHY_MULTIPLIER;
+  let health: 'healthy' | 'stale' | 'unknown' = 'unknown';
+  let ageMs: number | null = null;
+  if (ts !== null) {
+    ageMs = nowMs - ts.getTime();
+    health = ageMs <= thresholdMs ? 'healthy' : 'stale';
+  }
+  return { ts, pid, status, health, ageMs };
+}
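A worked example of the staleness rule in `parseHeartbeat`: the threshold is `HEARTBEAT_INTERVAL_MS * HEARTBEAT_HEALTHY_MULTIPLIER` = 15 000 × 3 = 45 000 ms, so a heartbeat up to 45 s old reads as healthy and anything older as stale. The sketch below restates only that age check. `classifyHeartbeatAge` is a hypothetical helper name, and the timestamps are made-up sample values.

```typescript
// Constants copied from the diff above.
const HEARTBEAT_INTERVAL_MS = 15_000;
const HEARTBEAT_HEALTHY_MULTIPLIER = 3;

type Health = 'healthy' | 'stale' | 'unknown';

// Hypothetical helper: mirrors the age comparison inside parseHeartbeat.
function classifyHeartbeatAge(tsIso: string | null, nowMs: number): Health {
  if (tsIso === null) return 'unknown';
  const tsMs = new Date(tsIso).getTime();
  if (Number.isNaN(tsMs)) return 'unknown';
  const thresholdMs = HEARTBEAT_INTERVAL_MS * HEARTBEAT_HEALTHY_MULTIPLIER; // 45_000
  return nowMs - tsMs <= thresholdMs ? 'healthy' : 'stale';
}

const now = Date.parse('2026-01-01T00:01:00Z');
console.log(classifyHeartbeatAge('2026-01-01T00:00:30Z', now)); // healthy (30 s old)
console.log(classifyHeartbeatAge('2026-01-01T00:00:10Z', now)); // stale (50 s old)
console.log(classifyHeartbeatAge(null, now)); // unknown (no heartbeat file)
```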
+
+/**
+ * Parse the output of `systemctl --user show ... -p ActiveState -p SubState -p UnitFileState`
+ * Returns an object with the three properties.
+ */
+export function parseSystemdShow(output: string): {
+  ActiveState: string;
+  SubState: string;
+  UnitFileState: string;
+} {
+  const result: Record<string, string> = {};
+  for (const line of output.split('\n')) {
+    const eq = line.indexOf('=');
+    if (eq !== -1) {
+      result[line.slice(0, eq)] = line.slice(eq + 1).trim();
+    }
+  }
+  return {
+    ActiveState: result['ActiveState'] ?? 'unknown',
+    SubState: result['SubState'] ?? 'unknown',
+    UnitFileState: result['UnitFileState'] ?? 'unknown',
+  };
+}
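To make the link between `parseSystemdShow` and the `bootEnableWarning` field concrete, here is a small sketch that parses a sample `systemctl show` output and applies the rule stated in the `AgentPsRow` comment ("active but UnitFileState=disabled"). `parseShow` and `needsBootEnableWarning` are hypothetical names, and the sample output is hand-written rather than captured from a real unit.

```typescript
// Hypothetical restatement of the key=value parsing shown above.
function parseShow(output: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const line of output.split('\n')) {
    const eq = line.indexOf('=');
    if (eq !== -1) result[line.slice(0, eq)] = line.slice(eq + 1).trim();
  }
  return result;
}

// Assumed rule, taken from the AgentPsRow comment: the unit runs now but
// would not come back after a reboot.
function needsBootEnableWarning(props: Record<string, string>): boolean {
  return props['ActiveState'] === 'active' && props['UnitFileState'] === 'disabled';
}

const sample = 'ActiveState=active\nSubState=running\nUnitFileState=disabled\n';
const props = parseShow(sample);
console.log(props['SubState']); // running
console.log(needsBootEnableWarning(props)); // true
```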
+
+/**
+ * Parse the output of `tmux list-panes -F '#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}'`
+ * pane_activity is a Unix epoch timestamp (seconds).
+ */
+export function parseTmuxListPanes(
+  output: string,
+  nowMs = Date.now(),
+): { pid: number | null; command: string | null; dead: boolean; idleSeconds: number | null } {
+  const line = output.trim().split('\n')[0];
+  if (!line) {
+    return { pid: null, command: null, dead: true, idleSeconds: null };
+  }
+  // format: <pid> <command> <dead(0|1)> <activity_epoch>
+  const parts = line.split(' ');
+  const pid = parts[0] ? (Number.isFinite(Number(parts[0])) ? Number(parts[0]) : null) : null;
+  const command = parts[1] ?? null;
+  const dead = parts[2] === '1';
+  const activityEpoch = parts[3] ? Number(parts[3]) : NaN;
+  const idleSeconds =
+    Number.isFinite(activityEpoch) && activityEpoch > 0
+      ? Math.floor((nowMs - activityEpoch * 1000) / 1000)
+      : null;
+  return { pid, command, dead, idleSeconds };
+}
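The idle-time arithmetic in `parseTmuxListPanes` mixes units: `pane_activity` is in epoch seconds while `nowMs` is in milliseconds. A short worked example with made-up sample values (the PID, command, and timestamps are illustrative only):

```typescript
// Sample line in the format `#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}`.
const sampleLine = '4242 node 0 1700000000';
const nowMs = 1_700_000_090_500; // 90.5 s after the pane's last activity

const [pidText, command, deadFlag, activityText] = sampleLine.split(' ');
const activityEpoch = Number(activityText); // seconds
const idleSeconds = Math.floor((nowMs - activityEpoch * 1000) / 1000);

console.log(Number(pidText), command, deadFlag === '1', idleSeconds); // 4242 node false 90
```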
+
+/**
+ * Determine if there is a runtime drift: roster says one runtime but the pane
+ * is actually running something from a different runtime. We detect this by
+ * checking if the pane command doesn't match a known canonical command for the
+ * roster's declared runtime.
+ *
+ * Known canonical commands per runtime:
+ *   claude → claude
+ *   codex → codex
+ *   opencode → opencode
+ *   pi → pi
+ *
+ * If the pane is running something else (e.g., python3/dogfood-agent.py) for
+ * an agent whose roster runtime is "pi", that's a drift.
+ */
+export function detectDrift(rosterRuntime: string, paneCommand: string | null): boolean {
+  if (!paneCommand) return false;
+  const knownCommands: Record<string, string[]> = {
+    claude: ['claude'],
+    codex: ['codex'],
+    opencode: ['opencode'],
+    pi: ['pi'],
+  };
+  const expected = knownCommands[rosterRuntime];
+  if (!expected) return false;
+  return !expected.includes(paneCommand);
+}
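A few concrete inputs show where `detectDrift` fires. This includes the node-wrapped case that the notes earlier in this compare flag as a known gap: a pane whose command reads `node` is reported as drift for a `pi` roster entry. The function below is a condensed copy of the one in the diff, with the lookup table hoisted out for brevity; the runtime name `custom` is a made-up example of a runtime with no canonical command.

```typescript
// Canonical pane commands per runtime, copied from detectDrift above.
const knownCommands: Record<string, string[]> = {
  claude: ['claude'],
  codex: ['codex'],
  opencode: ['opencode'],
  pi: ['pi'],
};

function detectDrift(rosterRuntime: string, paneCommand: string | null): boolean {
  if (!paneCommand) return false;
  const expected = knownCommands[rosterRuntime];
  if (!expected) return false;
  return !expected.includes(paneCommand);
}

console.log(detectDrift('pi', 'pi')); // false
console.log(detectDrift('pi', 'python3')); // true
console.log(detectDrift('pi', 'node')); // true: a node-wrapped pi pane reads as drift
console.log(detectDrift('custom', 'bash')); // false: unknown runtimes are never flagged
console.log(detectDrift('codex', null)); // false: no pane command to compare
```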
+
+/**
+ * Returns the default tenant_id (OS username) and host (short hostname).
+ * These MUST appear in every --json record for multi-tenant/multi-host zero-foreclosure.
+ */
+export function getDefaultTenantAndHost(): { tenant_id: string; host: string } {
+  let tenant_id: string;
+  try {
+    tenant_id = userInfo().username;
+  } catch {
+    tenant_id = process.env['USER'] ?? process.env['LOGNAME'] ?? 'unknown';
+  }
+  const host = hostname().split('.')[0] || 'localhost';
+  return { tenant_id, host };
+}
+
+/**
+ * Builds the command to create a grouped viewer session targeting an agent session.
+ * A grouped session shares the same windows as the target but gets INDEPENDENT sizing,
+ * so attaching the viewer never resizes the agent's window.
+ *
+ * The viewer session name is derived from the agent name and a unique suffix (typically
+ * the caller's PID) so multiple concurrent watchers don't collide.
+ *
+ * Usage sequence:
+ *   1. Run buildAgentWatchCreateViewerCommand → create grouped session (via capturing runner).
+ *   2. Run buildAgentWatchAttachCommand → attach -r to the viewer session (via interactiveRunner).
+ *   3. Run buildAgentWatchKillViewerCommand → kill the viewer session on detach (via capturing runner).
+ */
+export function buildAgentWatchCreateViewerCommand(
+  agentName: string,
+  viewerSessionName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+): string[] {
+  return [
+    'tmux',
+    '-L',
+    socketName,
+    'new-session',
+    '-d',
+    '-t',
+    `=${agentName}`,
+    '-s',
+    viewerSessionName,
+  ];
+}
+
+/**
+ * Builds the interactive attach command for a viewer session (read-only).
+ * Must be run via interactiveRunner (stdio: 'inherit').
+ */
+export function buildAgentWatchAttachCommand(
+  viewerSessionName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+): string[] {
+  return ['tmux', '-L', socketName, 'attach', '-r', '-t', viewerSessionName];
+}
+
+/**
+ * Builds the kill-session command to clean up a viewer session after detach.
+ * Keeps the agent session intact.
+ */
+export function buildAgentWatchKillViewerCommand(
+  viewerSessionName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+): string[] {
+  return ['tmux', '-L', socketName, 'kill-session', '-t', viewerSessionName];
+}
+
+/**
+ * Returns a unique viewer session name for a given agent.
+ * Uses process.pid so concurrent watchers produce distinct names.
+ */
+export function buildViewerSessionName(agentName: string): string {
+  return `${agentName}-watch-${process.pid}`;
+}
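The three-step usage sequence documented above (create viewer, attach read-only, kill viewer) can be sketched as one function. This is an illustration under assumptions, not the PR's wiring: `watchAgent`, the loose `Runner` type, and the `try`/`finally` cleanup are hypothetical. The tmux arguments are the ones the builders above return.

```typescript
type Runner = (command: string, args: string[]) => Promise<unknown>;
type InteractiveRunner = (command: string, args: string[]) => Promise<number>;

// Hypothetical orchestration of the create / attach / kill sequence.
async function watchAgent(
  agentName: string,
  viewerName: string,
  run: Runner,
  runInteractive: InteractiveRunner,
  socketName = 'mosaic-factory',
): Promise<number> {
  const tmux = ['-L', socketName];
  // 1. Grouped viewer session: shares the agent's windows, sized independently.
  await run('tmux', [...tmux, 'new-session', '-d', '-t', `=${agentName}`, '-s', viewerName]);
  try {
    // 2. Read-only attach to the viewer, never to the agent session itself.
    return await runInteractive('tmux', [...tmux, 'attach', '-r', '-t', viewerName]);
  } finally {
    // 3. Always remove the viewer session; the agent session is left intact.
    await run('tmux', [...tmux, 'kill-session', '-t', viewerName]);
  }
}

// Dry run with stub runners that only record the tmux subcommand (args[2]).
const calls: string[] = [];
const stubRun: Runner = async (_cmd, args) => {
  calls.push(args[2] ?? '');
};
const stubAttach: InteractiveRunner = async (_cmd, args) => {
  calls.push(args[2] ?? '');
  return 0;
};

watchAgent('coder', 'coder-watch-1234', stubRun, stubAttach).then((code) => {
  console.log(code, calls.join(' -> ')); // 0 new-session -> attach -> kill-session
});
```

Putting the kill step in `finally` means the viewer session is removed even if the attach fails, which avoids leaving stray `*-watch-*` sessions on the fleet socket.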
+
+/**
+ * @deprecated Use buildAgentWatchCreateViewerCommand + buildAgentWatchAttachCommand +
+ * buildAgentWatchKillViewerCommand instead. This bare attach targets the agent session
+ * directly and can resize it when the viewer terminal is smaller than the agent's window.
+ *
+ * Kept for backward compatibility only.
+ */
+export function buildAgentWatchCommand(
+  agentName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+): string[] {
+  return ['tmux', '-L', socketName, 'attach', '-r', '-t', `=${agentName}`];
+}
+
+/**
+ * Builds the capture-pane command used to verify that agent send was accepted
+ * (not left as an unsubmitted draft). Captures the last N lines and checks for
+ * the draft heuristic.
+ */
+export function buildAgentVerifyAcceptedCommand(
+  agentName: string,
+  socketName = DEFAULT_SOCKET_NAME,
+  lines = 5,
+): string[] {
+  return [
+    'tmux',
+    '-L',
+    socketName,
+    'capture-pane',
+    '-t',
+    `=${agentName}:0.0`,
+    '-p',
+    '-S',
+    `-${lines}`,
+  ];
+}
+
+/**
+ * Result of a send-verify check.
+ * - 'accepted': positive evidence that the message was accepted (response content present).
+ * - 'draft': last non-empty line matches the draft heuristic (unsubmitted input).
+ * - 'unverifiable': pane did not change after send (stale or blank) — we cannot determine
+ *   acceptance; fails closed per FR-5.
+ */
+export type SendVerifyResult = 'accepted' | 'draft' | 'unverifiable';
+
+/**
+ * Classify the result of a send-verify check by comparing BEFORE and AFTER pane snapshots.
+ *
+ * This is the primary classifier for `send --verify`. It addresses the stale-pane
+ * false-success problem: if the pane content did not change after the send, the new
+ * message never registered in the TUI (wedged pane, send dropped, etc.).
+ *
+ * Classification logic:
+ *   'unverifiable' — AFTER is blank/empty OR AFTER == BEFORE (no pane change after send).
+ *   'draft' — AFTER differs from BEFORE AND the last non-empty line of AFTER starts
+ *     with the draft pattern ("> "); message was typed but not submitted.
+ *   'accepted' — AFTER differs from BEFORE AND AFTER does not end in a draft line;
+ *     positive evidence that the TUI accepted the message.
+ *
+ * NOTE on blank AFTER: Full-screen TUIs (claude, codex, opencode, pi) render blank for
+ * `tmux capture-pane`. A blank AFTER is indistinguishable from a wedged pane, so it
+ * is always classified 'unverifiable' (fail-closed).
+ *
+ * NOTE on definitive acceptance: Phase-2 can only observe the pane surface — there is no
+ * runtime acknowledgement (heartbeat-ack) at this phase. The pane-change check is the best
+ * signal available against an opaque TUI. Definitive acceptance ultimately requires a
+ * runtime acknowledgement (Phase-3 heartbeat-ack).
+ *
+ * Draft heuristic: a last non-empty line (after stripping ANSI escapes) that starts
+ * with "> " is treated as an unsubmitted input line. This pattern is specific to
+ * pi/claude TUIs; draft detection for codex/opencode TUIs is best-effort only.
+ *
+ * FR-5 requires `send --verify` to return non-zero when delivery cannot be verified.
+ *
+ * @param before Pane snapshot captured BEFORE the send command.
+ * @param after Pane snapshot captured AFTER the send command (after the delay).
+ */
+export function classifySendResult(before: string, after: string): SendVerifyResult {
+  const afterLines = after.split('\n').filter((l) => l.trim().length > 0);
+  // Blank/empty AFTER => full-screen TUI rendered blank, or pane is wedged => unverifiable.
+  if (afterLines.length === 0) return 'unverifiable';
+  // No change => message didn't register in the TUI (stale/wedged pane) => unverifiable.
+  if (after === before) return 'unverifiable';
+  // AFTER differs from BEFORE — check whether the pane is now showing a draft line.
+  const lastLine = afterLines[afterLines.length - 1]!;
+  const stripped = lastLine.replace(/\x1b\[[0-9;]*m/g, '').trim();
+  // Heuristic: if stripped last line starts with "> " — that's the common draft pattern
+  // in pi/claude TUIs for showing user input before submission.
+  // NOTE: this heuristic is pi/claude-specific; draft detection for codex/opencode
+  // TUIs is best-effort only and may miss other unsubmitted-input indicators.
+  if (/^>\s/.test(stripped)) return 'draft';
+  return 'accepted';
+}
|
|
||||||
|
/**
|
||||||
|
* Check whether a send was accepted (not left as draft), using only the AFTER snapshot.
|
||||||
|
*
|
||||||
|
* @deprecated Prefer classifySendResult(before, after) which guards against stale-pane
|
||||||
|
* false-successes. This single-snapshot variant cannot detect a wedged pane that still
|
||||||
|
* shows old non-empty content — it will incorrectly return 'accepted' in that case.
|
||||||
|
*
|
||||||
|
* Retained for unit-test compatibility with single-snapshot assertions.
|
||||||
|
*
|
||||||
|
* Returns:
|
||||||
|
* 'unverifiable' — blank/empty capture (full-screen TUIs render blank; we cannot tell).
|
||||||
|
* 'draft' — last non-empty line matches the draft heuristic.
|
||||||
|
* 'accepted' — non-blank and not a draft line (but may be stale — see above).
|
||||||
|
*/
|
||||||
|
export function isSendAccepted(capturedOutput: string): SendVerifyResult {
|
||||||
|
const lines = capturedOutput.split('\n').filter((l) => l.trim().length > 0);
|
||||||
|
// Blank/empty capture => full-screen TUI rendered blank => unverifiable.
|
||||||
|
// This is the known-unverifiable case; fail closed (not treated as success).
|
||||||
|
if (lines.length === 0) return 'unverifiable';
|
||||||
|
const lastLine = lines[lines.length - 1]!;
|
||||||
|
const stripped = lastLine.replace(/\x1b\[[0-9;]*m/g, '').trim();
|
||||||
|
// Heuristic: if stripped last line starts with "> " — that's the common draft pattern
|
||||||
|
// in pi/claude TUIs for showing user input before submission.
|
||||||
|
// NOTE: this heuristic is pi/claude-specific; draft detection for codex/opencode
|
||||||
|
// TUIs is best-effort only and may miss other unsubmitted-input indicators.
|
||||||
|
if (/^>\s/.test(stripped)) return 'draft';
|
||||||
|
return 'accepted';
|
||||||
|
}
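A minimal sketch of how the two-snapshot classification behaves on a few pane captures. The classifier is restated locally as `classify` so the block runs standalone; the `SendVerifyResult` union is an assumption inferred from the three string values in the diff, and every pane string is invented for illustration.

```typescript
// Assumed shape of the result type; the diff only shows its three string values.
type SendVerifyResult = 'accepted' | 'draft' | 'unverifiable';

// Local restatement of the rules in classifySendResult, for illustration only.
function classify(before: string, after: string): SendVerifyResult {
  const afterLines = after.split('\n').filter((l) => l.trim().length > 0);
  if (afterLines.length === 0) return 'unverifiable'; // blank capture
  if (after === before) return 'unverifiable'; // pane did not change
  const last = afterLines[afterLines.length - 1]!;
  const stripped = last.replace(/\x1b\[[0-9;]*m/g, '').trim();
  return /^>\s/.test(stripped) ? 'draft' : 'accepted';
}

// Invented pane snapshots; the label names the classification each one gets.
const before = 'agent ready\n';
const cases: Array<[string, string]> = [
  ['unverifiable (blank)', ''],
  ['unverifiable (unchanged)', before],
  ['draft (styled input line)', 'agent ready\n\x1b[2m> hello there\x1b[0m\n'],
  ['accepted (new output)', 'agent ready\nworking on: hello there\n'],
  // An empty prompt "> " trims to ">" and no longer matches /^>\s/.
  ['accepted (empty prompt)', 'agent ready\n> \n'],
];

for (const [label, after] of cases) {
  console.log(label, '=>', classify(before, after));
}
```

The last case is worth noting: because the line is trimmed before the test, a bare `> ` prompt with no text is not treated as a draft. The diff does not say whether that is intentional.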
|
||||||
|
|
||||||
export function registerFleetCommand(program: Command, deps: FleetCommandDeps = {}): Command {
|
export function registerFleetCommand(program: Command, deps: FleetCommandDeps = {}): Command {
|
||||||
const runner = deps.runner ?? runCommand;
|
const runner = deps.runner ?? runCommand;
|
||||||
const paths = resolveFleetPaths(deps.mosaicHome);
|
const paths = resolveFleetPaths(deps.mosaicHome);
|
||||||
@@ -360,6 +789,113 @@ export function registerFleetCommand(program: Command, deps: FleetCommandDeps =
|
|||||||
console.log(`Verified fleet on tmux socket ${socketName}.`);
|
console.log(`Verified fleet on tmux socket ${socketName}.`);
|
||||||
});
|
});
|
||||||
|
|
||||||
|
cmd
|
||||||
|
.command('ps')
|
||||||
|
.description('Show real-time status for all roster agents (systemd + tmux + heartbeat)')
|
||||||
|
.option('--json', 'Print JSON array')
|
||||||
|
.action(async (opts: { json?: boolean }) => {
|
||||||
|
const commandOpts = cmd.opts<{ mosaicHome: string; roster?: string }>();
|
||||||
|
const activePaths = resolveFleetPaths(commandOpts.mosaicHome);
|
||||||
|
const roster = await loadRosterForCommand(cmd);
|
||||||
|
const { tenant_id, host } = getDefaultTenantAndHost();
|
||||||
|
const nowMs = Date.now();
|
||||||
|
|
||||||
|
const rows: AgentPsRow[] = [];
|
||||||
|
|
||||||
|
for (const agent of roster.agents) {
|
||||||
|
// systemd show
|
||||||
|
const showResult = await runner(...splitCommand(buildSystemdShowCommand(agent.name)));
|
||||||
|
const sysInfo = parseSystemdShow(showResult.stdout);
|
||||||
|
|
||||||
|
// tmux list-panes
|
||||||
|
const panesResult = await runner(
|
||||||
|
...splitCommand(buildTmuxListPanesCommand(agent.name, roster.tmux.socketName)),
|
||||||
|
);
|
||||||
|
const paneInfo = parseTmuxListPanes(panesResult.stdout, nowMs);
|
||||||
|
|
||||||
|
// heartbeat
|
||||||
|
const hbFile = heartbeatPath(agent.name, activePaths.mosaicHome);
|
||||||
|
let hbContent: string | null = null;
|
||||||
|
try {
|
||||||
|
hbContent = await readFile(hbFile, 'utf8');
|
||||||
|
} catch {
|
||||||
|
hbContent = null;
|
||||||
|
}
|
||||||
|
const hb = parseHeartbeat(hbContent, nowMs);
|
||||||
|
|
||||||
|
// drift and boot-enable
|
||||||
|
const driftFlag = detectDrift(agent.runtime, paneInfo.command);
|
||||||
|
const bootEnableWarning =
|
||||||
|
sysInfo.ActiveState === 'active' && sysInfo.UnitFileState === 'disabled';
|
||||||
|
|
||||||
|
rows.push({
|
||||||
|
name: agent.name,
|
||||||
|
tenant_id,
|
||||||
|
host,
|
||||||
|
runtime: agent.runtime,
|
||||||
|
systemdActive: sysInfo.ActiveState,
|
||||||
|
systemdEnabled: sysInfo.UnitFileState,
|
||||||
|
paneAlive: !paneInfo.dead,
|
||||||
|
panePid: paneInfo.pid,
|
||||||
|
paneCommand: paneInfo.command,
|
||||||
|
idleSeconds: paneInfo.idleSeconds,
|
||||||
|
heartbeat: hb,
|
||||||
|
driftFlag,
|
||||||
|
bootEnableWarning,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
if (opts.json) {
|
||||||
|
console.log(JSON.stringify(rows, null, 2));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Table output
|
||||||
|
const header = [
|
||||||
|
'NAME'.padEnd(18),
|
||||||
|
'TENANT'.padEnd(12),
|
||||||
|
'HOST'.padEnd(12),
|
||||||
|
'RUNTIME'.padEnd(10),
|
||||||
|
'SYSTEMD'.padEnd(16),
|
||||||
|
'PANE'.padEnd(8),
|
||||||
|
'PID'.padEnd(8),
|
||||||
|
'IDLE'.padEnd(8),
|
||||||
|
'HB'.padEnd(12),
|
||||||
|
'FLAGS',
|
||||||
|
].join(' ');
|
||||||
|
console.log(header);
|
||||||
|
console.log('-'.repeat(header.length));
|
||||||
|
|
||||||
|
for (const row of rows) {
|
||||||
|
const systemd = `${row.systemdActive}/${row.systemdEnabled}`;
|
||||||
|
const pane = row.paneAlive ? 'alive' : 'dead';
|
||||||
|
const pid = row.panePid !== null ? String(row.panePid) : '-';
|
||||||
|
const idle = row.idleSeconds !== null ? `${row.idleSeconds}s` : '-';
|
||||||
|
const hbAge =
|
||||||
|
row.heartbeat.ageMs !== null
|
||||||
|
? `${Math.round(row.heartbeat.ageMs / 1000)}s/${row.heartbeat.health}`
|
||||||
|
: 'unknown';
|
||||||
|
const flags: string[] = [];
|
||||||
|
if (row.driftFlag) flags.push('DRIFT');
|
||||||
|
if (row.bootEnableWarning) flags.push('BOOT-ENABLE');
|
||||||
|
|
||||||
|
console.log(
|
||||||
|
[
|
||||||
|
row.name.padEnd(18),
|
||||||
|
row.tenant_id.padEnd(12),
|
||||||
|
row.host.padEnd(12),
|
||||||
|
row.runtime.padEnd(10),
|
||||||
|
systemd.padEnd(16),
|
||||||
|
pane.padEnd(8),
|
||||||
|
pid.padEnd(8),
|
||||||
|
idle.padEnd(8),
|
||||||
|
hbAge.padEnd(12),
|
||||||
|
flags.join(','),
|
||||||
|
].join(' '),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
return cmd;
|
return cmd;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -368,6 +904,8 @@ export function registerFleetAgentCommands(
|
|||||||
deps: FleetCommandDeps = {},
|
deps: FleetCommandDeps = {},
|
||||||
): void {
|
): void {
|
||||||
const runner = deps.runner ?? runCommand;
|
const runner = deps.runner ?? runCommand;
|
||||||
|
const iRunner = deps.interactiveRunner ?? spawnInteractive;
|
||||||
|
const sleepFn = deps.sleepFn ?? defaultSleep;
|
||||||
|
|
||||||
agentCommand
|
agentCommand
|
||||||
.command('roster')
|
.command('roster')
|
||||||
@@ -417,21 +955,141 @@ export function registerFleetAgentCommands(
|
|||||||
.requiredOption('--message <text>', 'Message text')
|
.requiredOption('--message <text>', 'Message text')
|
||||||
.option('--source-label <label>', 'Source label for the message preamble')
|
.option('--source-label <label>', 'Source label for the message preamble')
|
||||||
.option('--source <label>', 'Alias for --source-label')
|
.option('--source <label>', 'Alias for --source-label')
|
||||||
|
.option(
|
||||||
|
'--verify',
|
||||||
|
'Verify the message was accepted; exit non-zero if it is left as a draft or cannot be verified',
|
||||||
|
)
|
||||||
|
.option(
|
||||||
|
'--verify-timeout <ms>',
|
||||||
|
`Maximum time (ms) to poll for pane change when --verify is set (default: ${VERIFY_DEFAULT_TIMEOUT_MS})`,
|
||||||
|
String(VERIFY_DEFAULT_TIMEOUT_MS),
|
||||||
|
)
|
||||||
.action(
|
.action(
|
||||||
async (agent: string, opts: { message: string; sourceLabel?: string; source?: string }) => {
|
async (
|
||||||
|
agent: string,
|
||||||
|
opts: {
|
||||||
|
message: string;
|
||||||
|
sourceLabel?: string;
|
||||||
|
source?: string;
|
||||||
|
verify?: boolean;
|
||||||
|
verifyTimeout?: string;
|
||||||
|
},
|
||||||
|
) => {
|
||||||
const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome);
|
const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome);
|
||||||
getRosterAgent(roster, agent);
|
getRosterAgent(roster, agent);
|
||||||
const paths = resolveFleetPaths(
|
const paths = resolveFleetPaths(
|
||||||
resolveMosaicHomeFromCommand(agentCommand, deps.mosaicHome),
|
resolveMosaicHomeFromCommand(agentCommand, deps.mosaicHome),
|
||||||
);
|
);
|
||||||
const sourceLabel = opts.sourceLabel ?? opts.source ?? getDefaultOperatorSourceLabel();
|
const sourceLabel = opts.sourceLabel ?? opts.source ?? getDefaultOperatorSourceLabel();
|
||||||
|
if (opts.verify) {
|
||||||
|
const parsedTimeout =
|
||||||
|
opts.verifyTimeout !== undefined ? Number.parseInt(opts.verifyTimeout, 10) : Number.NaN;
|
||||||
|
const timeoutMs = Number.isFinite(parsedTimeout)
|
||||||
|
? Math.max(0, parsedTimeout)
|
||||||
|
: VERIFY_DEFAULT_TIMEOUT_MS;
|
||||||
|
|
||||||
|
// Capture BEFORE snapshot so we can detect stale-pane false-successes.
|
||||||
|
// A wedged pane that still shows old non-empty content must not be reported
|
||||||
|
// as 'accepted' — we compare BEFORE vs AFTER to guard against that case.
|
||||||
|
const beforeResult = await runner(
|
||||||
|
...splitCommand(buildAgentVerifyAcceptedCommand(agent, roster.tmux.socketName)),
|
||||||
|
);
|
||||||
|
if (beforeResult.exitCode !== 0) {
|
||||||
|
throw new Error(
|
||||||
|
`send --verify: could not capture pane output before send (tmux exited ${beforeResult.exitCode}).`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
const beforeSnapshot = beforeResult.stdout;
|
||||||
|
|
||||||
await runChecked(
|
await runChecked(
|
||||||
runner,
|
runner,
|
||||||
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel),
|
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel),
|
||||||
);
|
);
|
||||||
|
|
||||||
|
// Bounded polling loop: poll capture-pane every VERIFY_POLL_INTERVAL_MS up to
|
||||||
|
// timeoutMs. Stop polling as soon as the result is 'accepted' or 'draft';
|
||||||
|
// keep polling while 'unverifiable' (no pane change yet). Fail closed after
|
||||||
|
// timeout with the existing "no pane change after send" message.
|
||||||
|
const deadline = Date.now() + timeoutMs;
|
||||||
|
let verifyResult: SendVerifyResult = 'unverifiable';
|
||||||
|
|
||||||
|
while (true) {
|
||||||
|
await sleepFn(VERIFY_POLL_INTERVAL_MS);
|
||||||
|
const afterResult = await runner(
|
||||||
|
...splitCommand(buildAgentVerifyAcceptedCommand(agent, roster.tmux.socketName)),
|
||||||
|
);
|
||||||
|
if (afterResult.exitCode !== 0) {
|
||||||
|
throw new Error(
|
||||||
|
`send --verify: could not capture pane output to verify acceptance (tmux exited ${afterResult.exitCode}).`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
verifyResult = classifySendResult(beforeSnapshot, afterResult.stdout);
|
||||||
|
// Terminal classification ('accepted' or 'draft'): stop polling immediately.
|
||||||
|
if (verifyResult === 'accepted' || verifyResult === 'draft') {
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
// Still unverifiable — check if we have time left to poll again.
|
||||||
|
if (Date.now() >= deadline) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (verifyResult === 'draft') {
|
||||||
|
process.exitCode = 1;
|
||||||
|
process.stderr.write(
|
||||||
|
`send --verify: message left as unsubmitted draft in agent "${agent}".\n`,
|
||||||
|
);
|
||||||
|
} else if (verifyResult === 'unverifiable') {
|
||||||
|
process.exitCode = 1;
|
||||||
|
process.stderr.write(
|
||||||
|
`send --verify: could not verify delivery (no pane change after send) for agent "${agent}".\n`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
await runChecked(
|
||||||
|
runner,
|
||||||
|
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel),
|
||||||
|
);
|
||||||
|
}
|
||||||
},
|
},
|
||||||
);
|
);
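The `--verify-timeout` handling in the action above can be isolated into a small function to see how edge-case inputs resolve. This is a sketch: `resolveVerifyTimeout` is a hypothetical name, and `ASSUMED_DEFAULT_TIMEOUT_MS` is a placeholder because the real `VERIFY_DEFAULT_TIMEOUT_MS` value is not visible in this diff.

```typescript
// Placeholder only: the real VERIFY_DEFAULT_TIMEOUT_MS value is not shown in the diff.
const ASSUMED_DEFAULT_TIMEOUT_MS = 5000;

// Hypothetical helper mirroring the parse-and-clamp logic used when --verify is set.
function resolveVerifyTimeout(raw: string | undefined, fallbackMs: number): number {
  const parsed = raw !== undefined ? Number.parseInt(raw, 10) : Number.NaN;
  // Negative values clamp to 0; unparseable values fall back to the default.
  return Number.isFinite(parsed) ? Math.max(0, parsed) : fallbackMs;
}

for (const raw of ['1500', '1500ms', '-20', 'soon', undefined]) {
  console.log(String(raw), '=>', resolveVerifyTimeout(raw, ASSUMED_DEFAULT_TIMEOUT_MS));
}
```

`Number.parseInt` is lenient about trailing text, so `1500ms` resolves to 1500 rather than to the default, and a value such as `1.9e3` would resolve to 1. A timeout of 0 still results in one poll, because the loop sleeps and captures before it checks the deadline.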
|
||||||
|
|
||||||
|
agentCommand
|
||||||
|
.command('watch <agent>')
|
||||||
|
.description('Open a read-only view of a fleet agent tmux session (cannot send keystrokes)')
|
||||||
|
.action(async (agent: string) => {
|
||||||
|
const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome);
|
||||||
|
getRosterAgent(roster, agent);
|
||||||
|
|
||||||
|
// Use a GROUPED VIEWER SESSION to prevent the observer from resizing the agent's
|
||||||
|
// window. A bare `tmux attach -r` against the agent session itself still lets the
|
||||||
|
// client shrink the session to its terminal size; a grouped session gets INDEPENDENT
|
||||||
|
// sizing so the agent's window is never affected by the viewer's terminal dimensions.
|
||||||
|
//
|
||||||
|
// Sequence:
|
||||||
|
// 1. Create a throwaway grouped session targeting the agent (capturing runner).
|
||||||
|
// 2. Attach -r (read-only) to the viewer session (interactiveRunner / TTY).
|
||||||
|
// 3. Kill the viewer session on detach so stale sessions don't accumulate.
|
||||||
|
const viewerName = buildViewerSessionName(agent);
|
||||||
|
const socketName = roster.tmux.socketName;
|
||||||
|
|
||||||
|
await runChecked(runner, buildAgentWatchCreateViewerCommand(agent, viewerName, socketName));
|
||||||
|
|
||||||
|
const [bin, args] = splitCommand(buildAgentWatchAttachCommand(viewerName, socketName));
|
||||||
|
const exitCode = await iRunner(bin, args);
|
||||||
|
|
||||||
|
// Best-effort cleanup of the viewer session regardless of how the user detached.
|
||||||
|
// Errors here are intentionally suppressed — the agent session is unaffected.
|
||||||
|
const killResult = await runner(
|
||||||
|
...splitCommand(buildAgentWatchKillViewerCommand(viewerName, socketName)),
|
||||||
|
);
|
||||||
|
void killResult; // result is intentionally ignored
|
||||||
|
|
||||||
|
if (exitCode !== 0) {
|
||||||
|
process.exitCode = exitCode;
|
||||||
|
}
|
||||||
|
});
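The create, attach, clean-up sequence described in the comment above can be sketched with injected runners. The `buildAgentWatch*` helpers are not shown in this diff, so the tmux argument lists below are assumptions built from standard tmux flags (`new-session -d -t … -s …`, `attach-session -r`, `kill-session -t`). The names `viewerArgs`, `watchReadOnly` and `Run`, and the use of the agent name as the tmux session target, are hypothetical.

```typescript
type Run = (bin: string, args: string[]) => Promise<number>;

const TMUX = 'tmux';

// Hypothetical argument builders; the real helpers may produce different argv.
function viewerArgs(agent: string, viewer: string, socket: string) {
  const base = ['-L', socket];
  return {
    create: [...base, 'new-session', '-d', '-t', agent, '-s', viewer], // grouped, detached
    attach: [...base, 'attach-session', '-r', '-t', viewer], // read-only client
    kill: [...base, 'kill-session', '-t', viewer], // remove the throwaway viewer
  };
}

// Create the viewer, attach read-only, and always attempt cleanup afterwards.
async function watchReadOnly(
  agent: string,
  viewer: string,
  socket: string,
  run: Run,
  attach: Run,
): Promise<number> {
  const args = viewerArgs(agent, viewer, socket);
  const created = await run(TMUX, args.create);
  if (created !== 0) return created;
  try {
    return await attach(TMUX, args.attach);
  } finally {
    // Best-effort: a failed kill must not mask the attach exit code.
    await run(TMUX, args.kill).catch(() => undefined);
  }
}
```

Unlike the action in the diff, this variant puts the kill step in a `finally` block, so the viewer session is also removed if the attach runner rejects. That is a design variation for illustration, not a description of the diff's behaviour.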
|
||||||
|
|
||||||
agentCommand
|
agentCommand
|
||||||
.command('reset <agent>')
|
.command('reset <agent>')
|
||||||
.description('Reset a local fleet agent by sending the runtime reset command')
|
.description('Reset a local fleet agent by sending the runtime reset command')
|
||||||
@@ -864,6 +1522,32 @@ function resolveFrameworkRoot(): string {
|
|||||||
return resolve(dirname(currentFile), '..', '..', 'framework');
|
return resolve(dirname(currentFile), '..', '..', 'framework');
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Default InteractiveRunner implementation: spawns the command with inherited
|
||||||
|
* stdio so the terminal is passed through to the child process. This is required
|
||||||
|
* for commands like `tmux attach` that are full-screen interactive and cannot be
|
||||||
|
* captured through a pipe.
|
||||||
|
*/
|
||||||
|
function spawnInteractive(command: string, args: string[]): Promise<number> {
|
||||||
|
return new Promise((resolvePromise) => {
|
||||||
|
const child = spawn(command, args, { stdio: 'inherit' });
|
||||||
|
child.on('error', () => {
|
||||||
|
resolvePromise(127);
|
||||||
|
});
|
||||||
|
child.on('close', (code) => {
|
||||||
|
resolvePromise(code ?? 1);
|
||||||
|
});
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Default SleepFn implementation backed by setTimeout.
|
||||||
|
* Tests inject a stub to avoid real delays in the send --verify polling loop.
|
||||||
|
*/
|
||||||
|
function defaultSleep(ms: number): Promise<void> {
|
||||||
|
return new Promise<void>((res) => setTimeout(res, ms));
|
||||||
|
}
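The `send --verify` loop takes its delay from an injectable `sleepFn` but reads the deadline from `Date.now()`. A sketch of the same bounded-polling shape with the clock injected as well is shown here, which lets a test drive both with virtual time. `pollUntil`, `probe` and `now` are hypothetical names that do not appear in the diff.

```typescript
type SleepFn = (ms: number) => Promise<void>;

// Hypothetical helper: poll `probe` until it yields a non-null value or the
// deadline passes. Like the loop in the diff, it always probes at least once.
async function pollUntil<T>(
  probe: () => Promise<T | null>,
  opts: { timeoutMs: number; intervalMs: number; sleep: SleepFn; now: () => number },
): Promise<T | null> {
  const deadline = opts.now() + opts.timeoutMs;
  while (true) {
    await opts.sleep(opts.intervalMs);
    const value = await probe();
    if (value !== null) return value; // terminal result, stop polling
    if (opts.now() >= deadline) return null; // fail closed on timeout
  }
}

// Virtual clock: each sleep advances time, so no real delay occurs.
async function demo(succeedOnCall: number): Promise<[string | null, number]> {
  let t = 0;
  let calls = 0;
  const result = await pollUntil<string>(
    async () => (++calls >= succeedOnCall ? 'accepted' : null),
    {
      timeoutMs: 1000,
      intervalMs: 250,
      sleep: async (ms) => {
        t += ms;
      },
      now: () => t,
    },
  );
  return [result, calls];
}

demo(3).then(([result, calls]) => console.log(result, 'after', calls, 'probes'));
```

With a stubbed sleep and a real clock, a loop of this shape keeps probing as fast as the runner allows until the wall-clock deadline. Injecting `now` makes the number of probes deterministic in tests.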
|
||||||
|
|
||||||
async function canRead(path: string): Promise<boolean> {
|
async function canRead(path: string): Promise<boolean> {
|
||||||
try {
|
try {
|
||||||
await access(path, constants.R_OK);
|
await access(path, constants.R_OK);
|
||||||
|
|||||||