Compare commits

...

7 Commits

Author SHA1 Message Date
Jarvis
6dbd3d691c test(fleet): hermetic heartbeat tests — temp run-dir + no leaked sidecars
All checks were successful
ci/woodpecker/pr/ci Pipeline was successful
ci/woodpecker/push/ci Pipeline was successful
Tests 3, 4, 5 previously returned synthetic pane PIDs (99999/99998/99997)
from their fake list-panes shims but did not set MOSAIC_HEARTBEAT_RUN_DIR,
causing the launcher to fall back to the real ~/.config/mosaic/fleet/run
and potentially spawn a background sidecar against an arbitrary host PID.

Fix:
- list-panes in tests 3/4/5 now returns empty string → PANE_PID stays
  unset → no sidecar is spawned for tests where heartbeat is not under test.
- MOSAIC_HEARTBEAT_RUN_DIR is exported to a per-test mktemp dir in each
  fake-tmux test (3, 4, 5) as defence-in-depth so even if the sidecar
  code path changes, it can never write to the real fleet run dir.
- New temp dirs are registered in CLEANUP_DIRS so they are removed by the
  existing EXIT trap.
- Tests 6 and 7 (the dedicated heartbeat tests) are unchanged: test 6 uses
  a real tmux pane PID + its own HB_RUN_DIR, test 7 intercepts via a fake
  setsid shim that captures args and exits immediately.
- All 7 tests pass; verify-sanitized.sh passes; no stray sidecar processes
  or unexpected .hb files are written to ~/.config/mosaic/fleet/run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
2026-06-21 15:58:35 -05:00
Jarvis
b50a062021 feat(fleet): launcher heartbeat sidecar — HB for all runtimes (pi/claude/codex)
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/pr/ci Pipeline was canceled
Replace the terminal `exec tmux` with a plain `tmux new-session -d` so the
launcher continues running after creating the pane. The script then resolves
the pane PID via `tmux list-panes -F '#{pane_pid}'` (with a brief retry loop)
and spawns a detached, runtime-agnostic heartbeat sidecar via `setsid bash -c
... &` + `disown`. The sidecar loops while `kill -0 <pane_pid>` succeeds,
writing ~/.config/mosaic/fleet/run/<AGENT>.hb atomically (tmp + mv) every
MOSAIC_HEARTBEAT_INTERVAL seconds (default 15), then exits naturally when the
runtime process dies — making `mosaic fleet ps` show stale then dead.
HB_RUN_DIR and interval are configurable via env; sidecar startup is
best-effort (failures warn but do not abort the launch). Two new shell tests
cover pane-PID resolution (test 6, real tmux) and sidecar invocation
correctness (test 7, fake-tmux + fake-setsid shims).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
2026-06-21 15:52:20 -05:00
afcbbb302f feat(fleet): auto-enable units on install + drift recognizes wrapped runtimes (#583)
Some checks failed
ci/woodpecker/push/ci Pipeline failed
ci/woodpecker/push/publish Pipeline was successful
2026-06-21 20:02:19 +00:00
c2c0b5fe8d chore(release): bump @mosaicstack/mosaic 0.0.34 -> 0.0.35 (#582)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/publish Pipeline was successful
2026-06-21 18:59:39 +00:00
c9cfe36204 docs(framework): P3.1 fast-follow — governance wording + gate scope + bare-launch note (#577)
Some checks failed
ci/woodpecker/push/ci Pipeline was canceled
ci/woodpecker/push/publish Pipeline was canceled
Co-authored-by: Jason Woltje <jason@diversecanvas.com>
Co-committed-by: Jason Woltje <jason@diversecanvas.com>
2026-06-21 18:56:50 +00:00
fc90c89913 fix(fleet): durable runtime PATH for detached agent launch (#581)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/publish Pipeline was successful
2026-06-21 17:30:40 +00:00
af2eede7a9 feat(fleet): Phase-2 observability — fleet ps + watch + send verify (#579)
All checks were successful
ci/woodpecker/push/ci Pipeline was successful
ci/woodpecker/push/publish Pipeline was successful
2026-06-21 04:23:51 +00:00
12 changed files with 2751 additions and 15 deletions

109
docs/fleet/PRD.md Normal file
View File

@@ -0,0 +1,109 @@
# PRD — Fleet Phase 2: Operator Observability
> **Workstream:** W-FLEET under `mvp-20260312` · **Phase:** 2
> **North star:** [docs/fleet/north-star.md](./north-star.md)
> **Source umbrella PRD:** [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Tracks task:** `fleet-observability-1` — restore operator observability into fleet agent sessions.
## Problem
The durable tmux fleet runs on the isolated `mosaic-factory` socket. That isolation
(which protects the operator's default tmux) makes the fleet **invisible** to default
tooling, and truth is split across three planes no single command joins — systemd
(`systemctl --user`), tmux (`-L mosaic-factory`), and the process tree (`pstree`).
`agent tail` (`capture-pane`) returns **blank for full-screen TUIs**, and `agent send`
confirms only keystroke injection, not acceptance. Net: the operator has near-zero
observability and no safe way to watch a session.
## Goals
1. One command shows the **whole fleet's** real state, joining all three planes.
2. **Liveness is truthful**: healthy = answered a heartbeat, not "pane alive".
3. The operator can **watch** any session read-only without disrupting it.
4. `send` reports **delivered-and-accepted**, not just injected.
5. Every record/address carries **`tenant_id` + `host`** (zero foreclosure for multi-tenant/multi-host).
## Non-goals (this phase)
- No webUI (Phase 5; rides federation for cross-host).
- No `fleetd` daemon or persistent history store.
- No real-runtime swap (Phase 3) — instrument the live **dogfood stub** fleet.
- No cross-host aggregation yet (addressing is host-tagged but queries stay local).
## Functional requirements
| ID | Requirement |
| ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| FR-1 | `mosaic fleet ps [--json]` prints one row per roster agent joining: name · tenant · host · runtime · systemd(active/enabled) · pane(alive/dead) · pid · idle · **last-heartbeat age** · **drift** flag (roster runtime ≠ actual pane command) · **boot-enable** warning (active but `UnitFileState=disabled`). |
| FR-2 | **Heartbeat protocol v1** (see below); `dogfood-agent.py` implements the responder. `fleet ps` issues probes (or reads last-seen) and reports health per FR-1. |
| FR-3 | `mosaic agent watch <name>` opens a **read-only** view of the pane (grouped session or `tmux attach -r`) that cannot send keystrokes and does not shrink the agent's window. |
| FR-4 | `mosaic agent attach <name>` remains the **explicit** interactive-takeover path (separate verb, documented as the only one that can type). |
| FR-5 | `mosaic agent send <name> --verify` confirms the message was **accepted** (not left as an unsubmitted draft) and returns non-zero if delivery cannot be verified. |
| FR-6 | All structured output (`--json`) includes `tenant_id` and `host` fields. |
## Heartbeat protocol v1
- **Probe:** operator/`fleet ps` writes a sentinel line to the agent's input or a
well-known per-agent heartbeat file path `~/.config/mosaic/fleet/run/<agent>.hb`.
- **Response:** the runtime updates `<agent>.hb` with `ts=<iso8601> pid=<pid> status=<ok|busy>`
on a fixed interval (default 15s) and on demand when probed.
- **Health rule:** `healthy` if `now - ts <= 3 × interval`; else `stale`; missing file = `unknown`.
- **Contract:** every runtime (dogfood stub now; claude/codex/pi/opencode in Phase 3)
MUST emit the heartbeat. The protocol is file-based so it works for headless stubs and
full-screen TUIs alike (no `capture-pane` dependency).
- `ASSUMPTION:` file-based heartbeat (vs in-pane echo) — chosen because it is TUI-safe and
uid-scoped, fitting per-tenant isolation. Open to an OTEL-span variant in Phase 3 (MVP-X6).
## Acceptance criteria
- `mosaic fleet ps` shows all 5 live sessions on `mosaic-factory` with correct
pane/pid/idle and flags the dogfood **drift** (`canary-pi` runtime=pi but pane runs
`dogfood-agent.py`) and the **boot-enable** gap (active but disabled).
- Killing one agent's pane flips its row to dead/stale within one `interval`.
- `agent watch` shows live output and provably cannot type into the pane; detaching
leaves the agent's window size unchanged.
- `agent send --verify` returns success on an accepting pane and non-zero on a wedged/draft pane.
- Quality gates green: `pnpm typecheck`, `pnpm lint`, `pnpm format:check`, plus
`pnpm --filter @mosaicstack/mosaic test`.
- Independent review passed; dogfood evidence captured against the live fleet.
## Test plan
- Unit/CLI specs in `packages/mosaic/src/commands/fleet.spec.ts` (and a new
`fleet-ps`/`watch`/`send-verify` spec) using the injected `CommandRunner` to assert
exact tmux/systemd command construction and JSON shape (tenant+host present).
- Situational: run against the live `mosaic-factory` fleet; capture `fleet ps` output,
a kill-and-detect cycle, a read-only `watch`, and a `send --verify` pass/fail pair.
## Known limitations
- **Verify heuristic is best-effort:** `agent send --verify` uses a `>` -prefix draft
heuristic that is specific to pi/claude TUIs. Draft detection for codex and opencode
TUIs is best-effort only; those runtimes may not use the same input-line indicator.
- **Pane-change check is the best Phase-2 signal; verify now polls up to a bounded
timeout:** `agent send --verify` captures a BEFORE snapshot, sends the message, then
polls `capture-pane` every ~400 ms up to a configurable total timeout (default ~6 s,
controlled by `--verify-timeout <ms>`). On each poll it runs classifySendResult: if
the pane shows 'accepted' or 'draft' the loop exits immediately; while the result is
'unverifiable' (no pane change yet) it keeps polling. After the timeout with no
definitive result, it fails closed: exit 1 with "no pane change after send". This
eliminates false 'unverifiable' failures for slow/loaded TUIs that were previously
caused by the old fixed 300 ms single-capture. Definitive acceptance ultimately
requires a runtime acknowledgement (Phase-3 heartbeat-ack); the bounded pane-change
poll is the best signal available against an opaque TUI for Phase-2.
- **Blank AFTER capture fails closed:** Full-screen TUIs (claude, codex, opencode, pi)
render blank for `tmux capture-pane`. When the AFTER snapshot is empty, `send --verify`
returns non-zero with an "unverifiable" message rather than silently succeeding. This
is an intentional fail-closed design (FR-5).
- **`agent watch` uses a grouped viewer session:** `tmux attach -r` directly against the
agent session lets the viewer terminal shrink the agent's window. `agent watch` instead
creates a throwaway grouped session (`tmux new-session -d -t '=<agent>' -s
'<agent>-watch-<pid>'`), attaches read-only to that session, and kills it on detach.
The grouped session shares the agent's windows but has independent sizing, so the
agent's window is never affected. `tmux attach` is still interactive and requires
inherited stdio; the `interactiveRunner` handles TTY passthrough.
## Surfaces & parity (MVP-X1)
CLI lands this phase. TUI surface follows in the `packages/mosaic` wizard; webUI in
Phase 5 via federation. PRD records the parity debt explicitly so it is not lost.

27
docs/fleet/TASKS.md Normal file
View File

@@ -0,0 +1,27 @@
# Tasks — W-FLEET (Fleet) Phase 2: Observability
> Workstream task file for the Fleet. Single-writer: Fleet workstream lead (orchestrator).
> Workers read but never modify. This is **not** the MVP rollup (`docs/TASKS.md`) — a
> rollup row is proposed to the MVP orchestrator, not written here.
>
> Mission: `mvp-20260312` · PRD: [docs/fleet/PRD.md](./PRD.md) · North star: [docs/fleet/north-star.md](./north-star.md)
> Status: `not-started` | `in-progress` | `done` | `blocked` | `failed`
| id | status | description | depends_on | agent | pr | notes |
| ------------- | ----------- | ------------------------------------------------------------------------------------------------------------------ | --------------------- | ----------- | --- | ----------------------------------------------------------------------------------------------------------------------------- |
| FLEET-OBS-000 | done | Plan: north-star + Phase-2 PRD + workstream scaffolding | — | lead | — | persisted 2026-06-20 on `feat/fleet-observability` |
| FLEET-OBS-001 | done | Heartbeat protocol v1 spec finalized in PRD + framework doc | FLEET-OBS-000 | lead | — | file-based `~/.config/mosaic/fleet/run/<agent>.hb`; spec in PRD |
| FLEET-OBS-002 | in-progress | Implement heartbeat responder in `dogfood-agent.py` | FLEET-OBS-001 | fleet-coder | — | dispatched to ad-hoc `mosaic yolo` fleet agent (dogfood) |
| FLEET-OBS-003 | done | `mosaic fleet ps` — join systemd+tmux+proc+idle+heartbeat; tenant+host tagged; drift + boot-enable flags; `--json` | FLEET-OBS-001 | worker | — | commit ab47831; LIVE-verified on mosaic-factory; caught canary-pi DRIFT + BOOT-ENABLE. Polish: idleSeconds parse returns null |
| FLEET-OBS-004 | done | `mosaic agent watch <name>` — read-only join (no resize, no keystrokes) | FLEET-OBS-000 | worker | — | `attach -r`; verb wired |
| FLEET-OBS-005 | done | `mosaic agent send --verify` — delivery/acceptance receipt | FLEET-OBS-000 | worker | — | --verify flag; draft-heuristic verify |
| FLEET-OBS-006 | done | CLI specs for ps/watch/send-verify (tenant+host shape, command construction) | FLEET-OBS-003,004,005 | worker | — | 62 tests green (31 new); re-verified by lead |
| FLEET-OBS-007 | not-started | Framework doc: fleet observability guide + verbs | FLEET-OBS-003,004,005 | lead | — | `docs/guides/` or `framework/tools/.../README` |
| FLEET-OBS-008 | not-started | Independent review + dogfood verification on live fleet | FLEET-OBS-002..007 | reviewer | — | author ≠ reviewer; capture evidence in scratchpad |
| FLEET-OBS-009 | not-started | Open PR → green CI (queue guard) → squash-merge → close `fleet-observability-1` | FLEET-OBS-008 | lead | — | trunk merge; no direct push to main |
## Proposed MVP rollup row (for the MVP orchestrator — not written by this workstream)
```
| W-FLEET | in-progress | Fleet (agent-session execution layer) | Phase 2/5 | docs/fleet/TASKS.md | observability dogfooded on live stub fleet; control plane rides federation (W1) |
```

133
docs/fleet/north-star.md Normal file
View File

@@ -0,0 +1,133 @@
# Mosaic Fleet — North Star
> **Workstream:** W-FLEET (Fleet) under mission `mvp-20260312`
> **Umbrella:** [docs/MISSION-MANIFEST.md](../MISSION-MANIFEST.md) · [docs/PRD.md](../PRD.md) (Mosaic Stack v0.1.0)
> **Status:** doctrine — authored 2026-06-20. Owner of this file: Fleet workstream lead.
> This document does **not** modify the MVP rollup; a rollup row is proposed, not written here.
## Vision
A **customizable, multi-tenant fleet of always-on AI agents** — each defined by role,
materialized as a durable, joinable runtime session, coordinated by the proven
orchestrator/worker model, and observable end-to-end across hosts. Coding today;
finance, analytics, research as roster entries tomorrow — same primitives, different
roster. The fleet is the **agent-session execution layer** of the Mosaic Stack MVP:
the thing federation makes reachable across hosts and the webUI/TUI/CLI make visible.
The USC tmux PoC (durable sessions + `agent-send` comms) proved the model. This
workstream makes it an official, observable, multi-tenant Mosaic Stack capability.
## The Fleet as means of production (bootstrapping)
The Fleet has a **dual role**, and that is the point:
- **As product** — a multi-tenant agent-fleet capability of Mosaic Stack (this workstream).
- **As means of production** — the orchestrator/worker fleet that _actually builds the
entire MVP_ (federation W1, webUI, TUI, CLI, and the Fleet itself).
We are **building the system that builds the system.** Every other MVP workstream is
delivered _by_ the fleet, so fleet observability and control are not merely product
features — they are the **operational floor of the whole delivery effort**. If we cannot
see and steer the agents, we cannot trust what they ship. This is why Phase 2
(observability) leads: it is the instrument panel for the factory, dogfooded on the live
fleet that is, recursively, building Mosaic Stack.
The discipline that makes great power safe is the same gate chain the fleet enforces:
independent review before merge, green CI, honest completion, decide-and-inform cadence,
and no irreversible action without authority. The bootstrap is only as trustworthy as
those gates.
## Alignment with MVP cross-cutting requirements
The Fleet inherits — does not re-invent — the MVP's hard requirements:
| MVP req | What it means for the Fleet |
| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| MVP-X1 three-surface parity | fleet observability/control reachable via **CLI + TUI + webUI** (CLI first; webUI is required for parity, not optional) |
| MVP-X2 multi-tenant isolation | one tenant = one **Linux uid** (own `systemd --user`, socket, `~/.config/mosaic`); no cross-tenant leakage |
| MVP-X3 auth (BetterAuth/SSO) | operator→fleet and cross-host views are auth-gated through the platform's existing auth |
| MVP-X4 quality gates | `pnpm typecheck`/`lint`/`format:check` green before any push |
| MVP-X5 federated topology | cross-host fleet visibility rides the **federation** boundary (W1), not a bespoke broker |
| MVP-X6 OTEL tracing | heartbeats, sends, and lifecycle events emit spans; `traceparent` crosses the federation boundary |
| MVP-X7 trunk merge | branch from `main`, squash-merge via PR, never push to `main` |
## The stack — where every concern lives
One **definition** is the source of truth; the **session** is how it runs.
| Layer | Owner | Phase-2 reality | Destination |
| -------------------------------- | ------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------- |
| **Definition + identity + auth** | gateway / `mosaic-as` (scoped tokens, #541) | `roster.yaml` (tenant-tagged) | one definition; `mosaic agent --new` materializes it |
| **Tenancy boundary** | **Linux uid per tenant** (linger, own `systemd --user`, own socket, own `~/.config/mosaic`) | one tenant: `jarvis` = tenant zero | uid-per-tenant; federation aggregates across hosts |
| **Runtime** | per-tenant tmux session on isolated socket | dogfood stub sessions (live now on `mosaic-factory`) | claude/codex/pi/opencode TUIs |
| **Liveness** | **heartbeat protocol** every runtime answers | protocol defined + dogfood stub answers it | all runtimes answer; "healthy" ≠ "pane alive" |
| **Observation** | read-only `watch` (native tmux) + `pipe-pane` stream | CLI `watch`/`ps`; explicit opt-in `attach` for control | + auth-gated webUI streams |
| **Control plane** | **federation** across hosts × tenants | records already carry `tenant_id` + `host` | federated gateways expose fleet state; webUI in Phase 5 |
## Operating model (inherited, not reinvented)
The AI-guide law stands: one accountable **orchestrator**, isolated **workers** that
stop at PR-open, the serialized **gate chain** (independent review → green CI →
diff-sanity → squash-merge → verify), **decide-and-inform** cadence, and a durable
**board** so missions survive session death. The Fleet is the infrastructure _under_
this model. See `mosaicstack-aiguide` whitepapers 01 (inter-agent comms) and 03
(orchestration model) for the rationale.
## Invariants — "maximal vision, incremental delivery, zero foreclosure"
Every artifact, starting Phase 2, MUST:
1. Carry **`tenant_id` + `host`** in schema and message addressing — even with one of each today.
2. Treat **isolation socket ≠ invisibility** — anything isolated is surfaced by one command.
3. Define **healthy = answered a heartbeat within N seconds**, never just "pane alive".
4. Make **observation read-only by default**; control is an explicit, separate, opt-in verb.
## Observation model
| Verb | Behavior |
| ----------------------------------- | -------------------------------------------------------------------------------------------------- |
| `mosaic fleet ps` | one table joining systemd + tmux + process + idle + last-heartbeat, with drift + boot-enable flags |
| `mosaic agent watch <name>` | **read-only** join (grouped session / `-r`), no resize tyranny, no keystrokes |
| `mosaic agent attach <name>` | explicit interactive takeover (the only path that can type) |
| `mosaic agent send <name> --verify` | confirms message **accepted**, not merely keystroke-injected |
> Why the current PoC blocks observation: sessions live on the isolated `mosaic-factory`
> socket (invisible to default `tmux ls`), the only sanctioned read is `capture-pane`
> (blank for full-screen TUIs), and `attach` is read-write + resizes the session. The
> verbs above restore "join and observe" safely.
## Phased roadmap
| Phase | Outcome | Status |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| 01 | tmux PoC, hardening, published CLI v0.0.34 (#565#568) | ✅ done |
| **2 — Observability** | `fleet ps` (host+tenant aware join), heartbeat protocol + dogfood stub answers it, `agent watch` (read-only), `agent send --verify` receipts | ▶ now |
| 3 — Real runtimes | claude/codex/pi/opencode answer heartbeat; **hybrid lifecycle** (core always-on: orchestrator+reviewer; ephemeral workers per lane) | planned |
| 4 — Unified definition | one agent schema in gateway; `mosaic agent --new` → materialized per-tenant session; uid-tenant provisioning | planned |
| 5 — Control plane | federation-backed cross-host × cross-tenant fleet view; **webUI** (surface chosen then) for MVP-X1 parity | planned |
## Decisions of record (2026-06-20, with Jason)
- Agent model: **config defines, session runs** (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: **multi-tenant from the start**; isolation = **per-tenant Linux uid**.
- Health: **heartbeat required** (dogfood stub implements the protocol now).
- Lifecycle: **hybrid** — core always-on + ephemeral workers per lane.
- Observation: **read-only default, opt-in takeover**.
- Multi-host: **designed-for from day one**; control plane **rides federation (W1)**.
- Delivery: **CLI-first now**, dogfood against the live stub fleet; webUI deferred to Phase 5.
- Runtimes: fleet agents default to **Codex / pi-on-Codex**; **Claude is reserved for Claude
Code only** (avoid alternate-harness API pricing). Validated durable recipe:
`mosaic yolo pi --model openai-codex/gpt-5.5:high`. Durable detached launch requires the
runtime-bin on PATH (baked into the pane command) + boot-survival (`enable` + linger),
which `fleet init` should automate.
## Assumptions (veto-able)
- `ASSUMPTION:` first-class runtimes = claude, codex, pi, opencode; a "role" (analyst,
finance, researcher) = persona + skills + tools on top of a runtime, shipped as a
starter role library in the framework.
- `ASSUMPTION:` the cross-host control plane is the **federation** layer (W1), not a
separate `fleetd` daemon.
- `ASSUMPTION:` Fleet is workstream **W-FLEET** under `mvp-20260312`; a rollup row in
`docs/TASKS.md` and a workstream declaration in `MISSION-MANIFEST.md` are proposed to
the MVP orchestrator, not written by this workstream.

View File

@@ -0,0 +1,100 @@
# Scratchpad — Fleet Phase 2: Observability (W-FLEET)
> Append-only. Mission `mvp-20260312` / workstream W-FLEET.
> Lead: Jarvis (Claude) at `W-jarvis:mos-claude-18`. Coordinating with `jwoltje@dragon-lin:coder0-0`.
## Mission prompt (2026-06-20)
Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability
for delivery. The USC tmux PoC is the proven base. Jason granted lead authority:
"The fleet is a great way to actually build the MVP — we are building the system that
builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate
with a second agent on `dragon-lin`.
## Decisions of record (with Jason, 2026-06-20)
- Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
- Health: heartbeat required; dogfood stub implements protocol now.
- Lifecycle: hybrid (core always-on + ephemeral workers).
- Observation: read-only default, opt-in takeover.
- Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
- Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
- Fleet is dual-role: product AND means of production (bootstrapping the MVP).
- Code review = **dual-engine**: Claude **and** gpt-5.5/Codex, run together (Jason: the
combination produces the best results). Launch reviewers via `mosaic yolo pi` / `codex`
(proven path) or `~/.config/mosaic/tools/codex/codex-code-review.sh`. Applies to all
code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
- Worktree discipline: do fleet work in `~/src/mosaicstack-stack-worktrees/<branch>`, NOT
the shared main checkout — concurrent processes mutate `main` there (learned 2026-06-20).
## Environment facts (verified 2026-06-20)
- Fleet is live on `W-jarvis` (uid 1000, `jarvis`, `Linger=yes`) on tmux socket
`mosaic-factory`: `_holder`, `canary-pi`, `dogfood-coder`, `dogfood-orchestrator`,
`dogfood-reviewer`. All panes run `~/.config/mosaic/fleet/dogfood-agent.py` (stub),
including `canary-pi` (roster says runtime=pi → **drift**).
- Holder + `mosaic-agent@*` units are `active (exited)` but `UnitFileState=disabled`
(reboot loses fleet → boot-enable gap to surface).
- Observation blocked by: isolated socket (hidden from default `tmux ls`), `capture-pane`
blank for TUIs, `attach` being read-write + resizing.
- Second agent: `jwoltje@dragon-lin`, session `coder0-0` (group `coder0`), running `node`,
default socket. ssh forward reach confirmed.
## Governance / collision-safety
- `mosaicstack-stack` has active mission `mvp-20260312` with single-writer locks on
`docs/MISSION-MANIFEST.md`, `docs/TASKS.md`, `docs/scratchpads/mvp-20260312.md`.
- This workstream touches NONE of those. All Fleet docs scoped under `docs/fleet/` +
this scratchpad. Rollup row proposed, not written.
## Session log
- 2026-06-20: Researched AI guide + fleet code + live state. Established north star with
Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
`docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms
with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`,
`agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on
mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON.
Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — `fleet ps` HB now
`healthy` for all 4 agents.
- Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572
(sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex caught an Alpine
blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
- **FINDINGS (north-star / Phase-3 blockers):**
1. Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a
detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub
survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY
in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps`
caught both dead panes instantly (tool validated).
2. `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's
global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via
tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
- Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close
`fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.
- 2026-06-21 (session 3): Phase-2 PR #579 merged (3 dual-engine rounds hardened
verify+watch). Then closed the launch-path question with Jason's input — CORRECTING
earlier findings:
- The ad-hoc launch deaths were NOT a fundamental TTY blocker: (a) codex was a stale
version (Jason updated it); (b) pi was misconfigured to Claude auth (Jason removed it;
default is now Codex). The REAL durable-launch bug is **PATH**: the detached tmux
launch shell is login+non-interactive, so it misses `~/.npm-global/bin` (added only in
`~/.bashrc`) -> `mosaic: command not found` (127) -> pane dies. tmux panes inherit the
tmux _server_ env, so PATH must be baked into the pane command.
- **Durable real-agent recipe (validated live on gpt-5.5, Claude-free):**
`mosaic yolo pi --model openai-codex/gpt-5.5:high` — pi tolerates detached tmux; a raw
interactive TUI (codex CLI) exits without an attached client. Status line confirmed
`(openai-codex) gpt-5.5 • high`.
- PATH fix landed in `start-agent-session.sh` (commit 32efc13, branch
feat/fleet-launch-path): derive runtime-bin prefix (MOSAIC_RUNTIME_BIN | npm prefix |
~/.npm-global/bin | ~/.local/bin), bake `export PATH=...; exec <cmd>` into the pane;
`exec` also fixes the drift false-positive. Live-tested under stripped PATH -> durable.
- Boot-survival: Jason ran `systemctl --user enable` (+ linger). TODO: auto-enable in
**fleet init** so operators never have to remember it (agentic-enhancement cycle).
- Future custom Pi harness build: pi cannot self-report its model (track
runtime/model/effort as fleet metadata); drift detection should recognize `node` as
pi's pane command (a node-wrapped pane can currently read as drift).
- Findings recorded in AI Guide playbooks/tmux-fleet.md (aiguide PR #7, merged).
- Policy: avoid Claude outside Claude Code (API pricing for alt-harness use) — fleet
runtimes default to Codex / pi-on-Codex; Claude stays in Claude Code only.

View File

@@ -70,6 +70,9 @@ Skills, hooks, MCP, and plugins are force multipliers you MUST use when applicab
## Missing core file ## Missing core file
If `CONSTITUTION.md`, `AGENTS.md`, `SOUL.md`, or the runtime contract is missing, stop and report it. If `CONSTITUTION.md`, `AGENTS.md`, `SOUL.md`, or the runtime contract is missing, stop and report it.
This agent-facing strictness is intentional and stricter than the launcher: the launcher injects
`CONSTITUTION.md` tolerantly (skipping it if absent so pre-upgrade hosts keep working), but once a host
is re-seeded a genuinely missing core file is a stop-and-report condition — not something to proceed past.
## Session Closure ## Session Closure

View File

@@ -2,8 +2,11 @@
The irreducible, non-negotiable law for every Mosaic agent on every harness. The irreducible, non-negotiable law for every Mosaic agent on every harness.
**Framework-owned.** This file is overwritten verbatim on every upgrade — do not edit it. To change **Framework-owned.** This file is overwritten verbatim on every upgrade — do not edit it. There is
behavior, add a `.local.md` overlay or a `policy/` file (tighten-only; see `constitution/LAYER-MODEL.md`). **no `CONSTITUTION.local.md`**: hard gates are not locally overridable. A lower layer may only make
behavior _stricter_, never relax or override a gate (see Precedence). Operator customization lives in
other layers — `SOUL.md` / `USER.md` and the tighten-only overlays `STANDARDS.local.md` /
`SOUL.local.md` / `USER.local.md` / `policy/*.md` (see `constitution/LAYER-MODEL.md`).
Authored in **capability verbs**: where a gate names a capability ("structured reasoning", "queue Authored in **capability verbs**: where a gate names a capability ("structured reasoning", "queue
guard"), the runtime adapter binds it to a concrete tool and states whether absence is a hard stop. guard"), the runtime adapter binds it to a concrete tool and states whether absence is a hard stop.

View File

@@ -6,6 +6,8 @@ MOSAIC_TMUX_SOCKET=${MOSAIC_TMUX_SOCKET:-mosaic-factory}
MOSAIC_AGENT_RUNTIME=${MOSAIC_AGENT_RUNTIME:-pi} MOSAIC_AGENT_RUNTIME=${MOSAIC_AGENT_RUNTIME:-pi}
MOSAIC_AGENT_WORKDIR=${MOSAIC_AGENT_WORKDIR:-$HOME} MOSAIC_AGENT_WORKDIR=${MOSAIC_AGENT_WORKDIR:-$HOME}
MOSAIC_AGENT_COMMAND=${MOSAIC_AGENT_COMMAND:-} MOSAIC_AGENT_COMMAND=${MOSAIC_AGENT_COMMAND:-}
MOSAIC_HEARTBEAT_RUN_DIR=${MOSAIC_HEARTBEAT_RUN_DIR:-$HOME/.config/mosaic/fleet/run}
MOSAIC_HEARTBEAT_INTERVAL=${MOSAIC_HEARTBEAT_INTERVAL:-15}
if [ -z "$AGENT_NAME" ]; then if [ -z "$AGENT_NAME" ]; then
echo "ERROR: agent name argument or MOSAIC_AGENT_NAME is required" >&2 echo "ERROR: agent name argument or MOSAIC_AGENT_NAME is required" >&2
@@ -26,5 +28,125 @@ if [ -z "$MOSAIC_AGENT_COMMAND" ]; then
MOSAIC_AGENT_COMMAND="mosaic yolo $MOSAIC_AGENT_RUNTIME" MOSAIC_AGENT_COMMAND="mosaic yolo $MOSAIC_AGENT_RUNTIME"
fi fi
# ── Derive a runtime-bin PATH prefix ─────────────────────────────────────────
# Precedence:
# 1. $MOSAIC_RUNTIME_BIN (explicit override)
# 2. $(npm config get prefix)/bin (if npm is on PATH)
# 3. Fallbacks: $HOME/.npm-global/bin and $HOME/.local/bin
#
# Only directories that already exist are included. The prefix is baked into
# the pane command regardless of what the LAUNCHER process's $PATH contains,
# because the tmux pane inherits the tmux SERVER environment (not this script's
# environment). A dir on the launcher's PATH may be absent from the server PATH,
# so every existing candidate must always be included. Dedup within the
# constructed prefix avoids listing the same dir twice.
_build_runtime_bin_prefix() {
local candidates=()
if [ -n "${MOSAIC_RUNTIME_BIN:-}" ]; then
candidates+=("$MOSAIC_RUNTIME_BIN")
fi
if command -v npm >/dev/null 2>&1; then
local npm_prefix
npm_prefix=$(npm config get prefix 2>/dev/null) || true
if [ -n "$npm_prefix" ]; then
candidates+=("${npm_prefix}/bin")
fi
fi
candidates+=("$HOME/.npm-global/bin")
candidates+=("$HOME/.local/bin")
local prefix=""
for dir in "${candidates[@]}"; do
[ -d "$dir" ] || continue
if [ -z "$prefix" ]; then
prefix="$dir"
else
case ":${prefix}:" in
*":${dir}:"*) ;; # already in our prefix — skip
*) prefix="${prefix}:${dir}" ;;
esac
fi
done
printf '%s' "$prefix"
}
MOSAIC_RUNTIME_BIN_PREFIX=$(_build_runtime_bin_prefix)
# ── Build the pane command ────────────────────────────────────────────────────
# The pane command must:
# - Export the augmented PATH so the runtime binary is found.
# - exec the agent command so the runtime is the pane's foreground process
# (makes `fleet ps` pane_current_command check reliable; no DRIFT false-positive).
#
# Quoting strategy: single-quote the inner shell snippet so that variable
# references in MOSAIC_AGENT_COMMAND are NOT expanded here — they expand inside
# the pane shell. However, MOSAIC_RUNTIME_BIN_PREFIX and PATH must be expanded
# NOW (in this script) because the pane shell inherits the tmux server
# environment, not this script's env.
#
# We build the snippet as a double-quoted here-string embedded in a printf call
# to avoid nested quoting problems.
if [ -n "$MOSAIC_RUNTIME_BIN_PREFIX" ]; then
PANE_SHELL_SNIPPET="export PATH=\"${MOSAIC_RUNTIME_BIN_PREFIX}:\${PATH}\"; exec ${MOSAIC_AGENT_COMMAND}"
else
PANE_SHELL_SNIPPET="exec ${MOSAIC_AGENT_COMMAND}"
fi
mkdir -p "$MOSAIC_AGENT_WORKDIR" mkdir -p "$MOSAIC_AGENT_WORKDIR"
exec tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "$AGENT_NAME" -c "$MOSAIC_AGENT_WORKDIR" "$MOSAIC_AGENT_COMMAND"
# ── Launch the tmux session (no exec — we continue to wire the heartbeat) ────
tmux -L "$MOSAIC_TMUX_SOCKET" new-session -d -s "$AGENT_NAME" -c "$MOSAIC_AGENT_WORKDIR" \
bash -c "$PANE_SHELL_SNIPPET"
# ── Resolve the pane PID (retry briefly to let the session initialise) ────────
PANE_PID=""
for _retry in 1 2 3 4 5; do
PANE_PID=$(tmux -L "$MOSAIC_TMUX_SOCKET" list-panes \
-t "=${AGENT_NAME}:0.0" -F '#{pane_pid}' 2>/dev/null || true)
[ -n "$PANE_PID" ] && break
sleep 0.2
done
# ── Spawn the heartbeat sidecar (detached, best-effort) ──────────────────────
# The sidecar writes ~/.config/mosaic/fleet/run/<AGENT>.hb atomically while the
# pane process is alive, then exits so the file goes stale (fleet ps shows stale
# then PANE=dead). It is runtime-agnostic: it only cares about the pane PID.
_start_heartbeat_sidecar() {
local agent="$1"
local pane_pid="$2"
local run_dir="$3"
local interval="$4"
local hb_file="${run_dir}/${agent}.hb"
mkdir -p "$run_dir"
# Write the sidecar as a self-contained bash one-liner so it carries no
# references to any variables from this script's environment.
local sidecar_script
sidecar_script=$(printf \
'hb=%s; pid=%s; iv=%s; mkdir -p "$(dirname "$hb")"; while kill -0 "$pid" 2>/dev/null; do tmp="$hb.tmp.$$"; printf "ts=%%s\npid=%%s\nstatus=ok\n" "$(date +%%Y-%%m-%%dT%%H:%%M:%%S%%z)" "$pid" > "$tmp" && mv "$tmp" "$hb"; sleep "$iv"; done' \
"$hb_file" "$pane_pid" "$interval")
# setsid + disown ensures the sidecar survives this script exiting.
# stderr/stdout go to /dev/null; failures are non-fatal.
if command -v setsid >/dev/null 2>&1; then
setsid bash -c "$sidecar_script" </dev/null >/dev/null 2>&1 &
else
bash -c "$sidecar_script" </dev/null >/dev/null 2>&1 &
fi
disown $! 2>/dev/null || true
}
if [ -n "$PANE_PID" ]; then
# Guard: do not let sidecar startup failures abort the launcher (set -e).
_start_heartbeat_sidecar "$AGENT_NAME" "$PANE_PID" \
"$MOSAIC_HEARTBEAT_RUN_DIR" "$MOSAIC_HEARTBEAT_INTERVAL" || \
echo "WARNING: heartbeat sidecar could not be started for $AGENT_NAME" >&2
else
echo "WARNING: could not resolve pane PID for $AGENT_NAME — heartbeat sidecar not started" >&2
fi

View File

@@ -6,13 +6,26 @@ START="$SCRIPT_DIR/start-agent-session.sh"
SOCKET="mosaic-agent-test-$RANDOM-$$" SOCKET="mosaic-agent-test-$RANDOM-$$"
AGENT="agent-$RANDOM" AGENT="agent-$RANDOM"
WORKDIR=$(mktemp -d) WORKDIR=$(mktemp -d)
trap 'tmux -L "$SOCKET" kill-server >/dev/null 2>&1 || true; rm -rf "$WORKDIR"' EXIT
# Keep a single cleanup trap that accumulates resources.
CLEANUP_DIRS=("$WORKDIR")
CLEANUP_SOCKETS=("$SOCKET")
trap '_cleanup' EXIT
_cleanup() {
for s in "${CLEANUP_SOCKETS[@]:-}"; do
tmux -L "$s" kill-server >/dev/null 2>&1 || true
done
for d in "${CLEANUP_DIRS[@]:-}"; do
rm -rf "$d"
done
}
fail() { fail() {
echo "FAIL: $*" >&2 echo "FAIL: $*" >&2
exit 1 exit 1
} }
# ── Test 1: basic session creation with workdir check ─────────────────────────
MOSAIC_TMUX_SOCKET="$SOCKET" \ MOSAIC_TMUX_SOCKET="$SOCKET" \
MOSAIC_AGENT_WORKDIR="$WORKDIR" \ MOSAIC_AGENT_WORKDIR="$WORKDIR" \
MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \ MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
@@ -22,6 +35,7 @@ tmux -L "$SOCKET" has-session -t "=$AGENT:0.0" || fail "agent session was not cr
actual_dir=$(tmux -L "$SOCKET" display-message -p -t "=$AGENT:0.0" '#{pane_current_path}') actual_dir=$(tmux -L "$SOCKET" display-message -p -t "=$AGENT:0.0" '#{pane_current_path}')
[ "$actual_dir" = "$WORKDIR" ] || fail "agent workdir mismatch: $actual_dir" [ "$actual_dir" = "$WORKDIR" ] || fail "agent workdir mismatch: $actual_dir"
# ── Test 2: idempotency (duplicate start prints 'already running') ─────────────
MOSAIC_TMUX_SOCKET="$SOCKET" \ MOSAIC_TMUX_SOCKET="$SOCKET" \
MOSAIC_AGENT_WORKDIR="$WORKDIR" \ MOSAIC_AGENT_WORKDIR="$WORKDIR" \
MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \ MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
@@ -29,4 +43,310 @@ MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
grep -qF 'already running' /tmp/mosaic-start-agent-idempotent.out || fail "duplicate start was not idempotent" grep -qF 'already running' /tmp/mosaic-start-agent-idempotent.out || fail "duplicate start was not idempotent"
# ── Test 3: runtime-bin PATH prefix is baked into the pane command ────────────
#
# We capture the command the script would hand to tmux by injecting a fake
# 'tmux' shim into PATH. The shim:
# - Intercepts 'new-session' calls and records its arguments to a file.
# - For 'has-session' calls, exits 1 (session does not exist) so the script
# proceeds to launch instead of printing "already running".
# - For 'list-panes' calls, returns empty so PANE_PID stays unset and the
# heartbeat sidecar is NOT spawned (heartbeat is not the focus of this test;
# test 6 and 7 cover that path). This prevents any real-filesystem side
# effects or leaked background processes.
# - For all other subcommands, exits 0.
#
# Assertions:
# a) 'export PATH=' with the synthetic MOSAIC_RUNTIME_BIN prefix appears.
# b) 'exec' appears so the runtime replaces the wrapper shell.
# c) MOSAIC_AGENT_COMMAND with flags is forwarded intact.
FAKE_BIN=$(mktemp -d)
FAKE_RUNTIME_BIN=$(mktemp -d)
TMUX_ARGS_FILE=$(mktemp)
HB_RUN_DIR3=$(mktemp -d)
CLEANUP_DIRS+=("$FAKE_BIN" "$FAKE_RUNTIME_BIN" "$HB_RUN_DIR3")
# Write the fake tmux shim (uses only positional args, no sourced vars).
cat > "$FAKE_BIN/tmux" <<SHIM
#!/usr/bin/env bash
# Fake tmux: record new-session args; report has-session as missing.
subcmd="\$3" # argv: tmux -L <socket> <subcmd> ...
if [ "\$subcmd" = "has-session" ]; then
exit 1 # session not found → script will attempt new-session
fi
if [ "\$subcmd" = "new-session" ]; then
printf '%s\n' "\$@" > "$TMUX_ARGS_FILE"
exit 0
fi
if [ "\$subcmd" = "list-panes" ]; then
# Return empty: no sidecar spawned (heartbeat is not the focus of this test).
echo ""
exit 0
fi
exit 0
SHIM
chmod +x "$FAKE_BIN/tmux"
SOCKET3="mosaic-agent-test3-$RANDOM-$$"
AGENT3="agent3-$RANDOM"
WORKDIR3=$(mktemp -d)
CLEANUP_DIRS+=("$WORKDIR3")
PATH="$FAKE_BIN:$PATH" \
MOSAIC_TMUX_SOCKET="$SOCKET3" \
MOSAIC_AGENT_WORKDIR="$WORKDIR3" \
MOSAIC_AGENT_RUNTIME="pi" \
MOSAIC_RUNTIME_BIN="$FAKE_RUNTIME_BIN" \
MOSAIC_AGENT_COMMAND="mosaic yolo pi --model openai-codex/gpt-5.5:high" \
MOSAIC_HEARTBEAT_RUN_DIR="$HB_RUN_DIR3" \
"$START" "$AGENT3"
all_args=$(cat "$TMUX_ARGS_FILE" 2>/dev/null || true)
rm -f "$TMUX_ARGS_FILE"
echo "--- captured tmux new-session args ---"
echo "$all_args"
echo "--- end args ---"
# a) PATH prefix containing FAKE_RUNTIME_BIN must appear.
echo "$all_args" | grep -qF "export PATH=" || fail "pane command does not export PATH"
echo "$all_args" | grep -qF "$FAKE_RUNTIME_BIN" || fail "pane command does not include MOSAIC_RUNTIME_BIN in PATH prefix"
# b) exec must appear so the runtime replaces the wrapper shell.
echo "$all_args" | grep -qF "exec " || fail "pane command does not use exec"
# c) Full MOSAIC_AGENT_COMMAND (with flags) must be forwarded.
echo "$all_args" | grep -qF "mosaic yolo pi --model openai-codex/gpt-5.5:high" || \
fail "pane command does not forward MOSAIC_AGENT_COMMAND with flags intact"
# ── Test 4: when no extra runtime-bin dirs exist, exec still appears ───────────
TMUX_ARGS_FILE2=$(mktemp)
FAKE_BIN2=$(mktemp -d)
HB_RUN_DIR4=$(mktemp -d)
CLEANUP_DIRS+=("$FAKE_BIN2" "$HB_RUN_DIR4")
cat > "$FAKE_BIN2/tmux" <<SHIM2
#!/usr/bin/env bash
subcmd="\$3"
if [ "\$subcmd" = "has-session" ]; then exit 1; fi
if [ "\$subcmd" = "new-session" ]; then
printf '%s\n' "\$@" > "$TMUX_ARGS_FILE2"
exit 0
fi
if [ "\$subcmd" = "list-panes" ]; then
# Return empty: no sidecar spawned (heartbeat is not the focus of this test).
echo ""
exit 0
fi
exit 0
SHIM2
chmod +x "$FAKE_BIN2/tmux"
SOCKET4="mosaic-agent-test4-$RANDOM-$$"
AGENT4="agent4-$RANDOM"
WORKDIR4=$(mktemp -d)
CLEANUP_DIRS+=("$WORKDIR4")
# MOSAIC_RUNTIME_BIN points to a non-existent dir so prefix will be empty;
# .npm-global/bin and .local/bin may or may not exist but we just want exec.
PATH="$FAKE_BIN2:$PATH" \
MOSAIC_TMUX_SOCKET="$SOCKET4" \
MOSAIC_AGENT_WORKDIR="$WORKDIR4" \
MOSAIC_AGENT_RUNTIME="pi" \
MOSAIC_RUNTIME_BIN="/nonexistent-dir-$$" \
MOSAIC_AGENT_COMMAND="mosaic yolo pi" \
MOSAIC_HEARTBEAT_RUN_DIR="$HB_RUN_DIR4" \
"$START" "$AGENT4"
all_args4=$(cat "$TMUX_ARGS_FILE2" 2>/dev/null || true)
rm -f "$TMUX_ARGS_FILE2"
rm -rf "$WORKDIR4"
echo "$all_args4" | grep -qF "exec " || fail "pane command (no prefix dirs) does not use exec"
echo "$all_args4" | grep -qF "mosaic yolo pi" || fail "pane command does not include agent command when no prefix"
# ── Test 5: candidate dir already in LAUNCHER $PATH is still baked into pane ──
#
# Regression guard for the bug where _build_runtime_bin_prefix() used to skip
# a candidate because it was already present in the launcher process's $PATH.
# That check was wrong: the pane inherits the tmux SERVER environment, not the
# launcher's env. Even if a dir is on the launcher's PATH it must always be
# baked into the pane's PATH export.
#
# We prove this by setting PATH to include FAKE_RUNTIME_BIN5 (the candidate),
# then asserting the generated new-session command still exports it.
TMUX_ARGS_FILE5=$(mktemp)
FAKE_BIN5=$(mktemp -d)
FAKE_RUNTIME_BIN5=$(mktemp -d) # this dir IS on the launcher's PATH below
HB_RUN_DIR5=$(mktemp -d)
CLEANUP_DIRS+=("$FAKE_BIN5" "$FAKE_RUNTIME_BIN5" "$HB_RUN_DIR5")
cat > "$FAKE_BIN5/tmux" <<SHIM5
#!/usr/bin/env bash
subcmd="\$3"
if [ "\$subcmd" = "has-session" ]; then exit 1; fi
if [ "\$subcmd" = "new-session" ]; then
printf '%s\n' "\$@" > "$TMUX_ARGS_FILE5"
exit 0
fi
if [ "\$subcmd" = "list-panes" ]; then
# Return empty: no sidecar spawned (heartbeat is not the focus of this test).
echo ""
exit 0
fi
exit 0
SHIM5
chmod +x "$FAKE_BIN5/tmux"
SOCKET5="mosaic-agent-test5-$RANDOM-$$"
AGENT5="agent5-$RANDOM"
WORKDIR5=$(mktemp -d)
CLEANUP_DIRS+=("$WORKDIR5")
CLEANUP_SOCKETS+=("$SOCKET5")
# FAKE_RUNTIME_BIN5 is deliberately placed on the LAUNCHER PATH so that the
# old (buggy) code would have skipped it. The correct code must still include
# it in the pane PATH export.
PATH="$FAKE_BIN5:$FAKE_RUNTIME_BIN5:$PATH" \
MOSAIC_TMUX_SOCKET="$SOCKET5" \
MOSAIC_AGENT_WORKDIR="$WORKDIR5" \
MOSAIC_AGENT_RUNTIME="pi" \
MOSAIC_RUNTIME_BIN="$FAKE_RUNTIME_BIN5" \
MOSAIC_AGENT_COMMAND="mosaic yolo pi" \
MOSAIC_HEARTBEAT_RUN_DIR="$HB_RUN_DIR5" \
"$START" "$AGENT5"
all_args5=$(cat "$TMUX_ARGS_FILE5" 2>/dev/null || true)
rm -f "$TMUX_ARGS_FILE5"
rm -rf "$WORKDIR5"
echo "--- test 5: launcher-PATH candidate must still appear in pane export ---"
echo "$all_args5"
echo "--- end test 5 args ---"
echo "$all_args5" | grep -qF "export PATH=" || \
fail "test5: pane command does not export PATH when candidate is on launcher PATH"
echo "$all_args5" | grep -qF "$FAKE_RUNTIME_BIN5" || \
fail "test5: candidate dir (already on launcher PATH) was NOT baked into pane PATH — regression"
# ── Test 6: heartbeat sidecar — pane PID resolved + .hb file written ──────────
#
# Uses a real tmux session (same socket as test 1 which already has $AGENT) so
# list-panes returns a real pane PID. We override MOSAIC_HEARTBEAT_RUN_DIR to
# a temp dir and set a 1-second interval, then wait up to 3 s for the .hb file
# to appear and check its content.
HB_RUN_DIR=$(mktemp -d)
CLEANUP_DIRS+=("$HB_RUN_DIR")
# Re-use the session+agent created in Test 1 (still alive on $SOCKET / $AGENT).
# We need to invoke the script for a NEW agent on the same socket to exercise
# the heartbeat path with a real pane PID.
AGENT6="agent6-$RANDOM"
MOSAIC_TMUX_SOCKET="$SOCKET" \
MOSAIC_AGENT_WORKDIR="$WORKDIR" \
MOSAIC_AGENT_COMMAND='bash --noprofile --norc -i' \
MOSAIC_HEARTBEAT_RUN_DIR="$HB_RUN_DIR" \
MOSAIC_HEARTBEAT_INTERVAL="1" \
"$START" "$AGENT6"
HB_FILE="$HB_RUN_DIR/${AGENT6}.hb"
# Wait up to 5 seconds for the heartbeat file to appear.
_waited=0
until [ -f "$HB_FILE" ] || [ "$_waited" -ge 5 ]; do
sleep 0.5
_waited=$((_waited + 1))
done
[ -f "$HB_FILE" ] || fail "test6: heartbeat file not written at $HB_FILE within 5s"
hb_content=$(cat "$HB_FILE")
echo "--- test 6: heartbeat file content ---"
echo "$hb_content"
echo "--- end test 6 ---"
# Verify required fields are present.
echo "$hb_content" | grep -qE '^ts=[0-9]{4}-[0-9]{2}-[0-9]{2}T' || \
fail "test6: heartbeat ts field missing or malformed"
echo "$hb_content" | grep -qE '^pid=[0-9]+' || \
fail "test6: heartbeat pid field missing or malformed"
echo "$hb_content" | grep -qF 'status=ok' || \
fail "test6: heartbeat status=ok missing"
# ── Test 7: heartbeat sidecar — targets correct .hb path per agent name ────────
#
# Uses the fake-tmux shim approach (like tests 3-5) to capture the sidecar
# invocation without needing a real session. A fake setsid shim records its
# arguments so we can assert the sidecar script targets the expected .hb path
# and uses the configured interval.
FAKE_BIN7=$(mktemp -d)
FAKE_RUNTIME_BIN7=$(mktemp -d)
SETSID_ARGS_FILE=$(mktemp)
HB_RUN_DIR7=$(mktemp -d)
CLEANUP_DIRS+=("$FAKE_BIN7" "$FAKE_RUNTIME_BIN7" "$HB_RUN_DIR7")
AGENT7="my-fleet-agent-$RANDOM"
INTERVAL7="42"
# Fake tmux: has-session → not found; new-session → ok; list-panes → known PID.
cat > "$FAKE_BIN7/tmux" <<SHIM7
#!/usr/bin/env bash
subcmd="\$3"
if [ "\$subcmd" = "has-session" ]; then exit 1; fi
if [ "\$subcmd" = "new-session" ]; then exit 0; fi
if [ "\$subcmd" = "list-panes" ]; then echo "88888"; exit 0; fi
exit 0
SHIM7
chmod +x "$FAKE_BIN7/tmux"
# Fake setsid: capture the bash -c <script> argument for inspection, then
# background an actual bash subshell so disown succeeds in the caller.
cat > "$FAKE_BIN7/setsid" <<'SETSID_SHIM'
#!/usr/bin/env bash
# argv: setsid bash -c <sidecar_script>
# Record the full argument list to the capture file, then exit cleanly.
printf '%s\0' "$@" > __SETSID_ARGS_FILE__
exit 0
SETSID_SHIM
# Patch the placeholder with the real capture-file path (avoids heredoc expansion issues).
sed -i "s|__SETSID_ARGS_FILE__|${SETSID_ARGS_FILE}|g" "$FAKE_BIN7/setsid"
chmod +x "$FAKE_BIN7/setsid"
SOCKET7="mosaic-agent-test7-$RANDOM-$$"
WORKDIR7=$(mktemp -d)
CLEANUP_DIRS+=("$WORKDIR7")
PATH="$FAKE_BIN7:$PATH" \
MOSAIC_TMUX_SOCKET="$SOCKET7" \
MOSAIC_AGENT_WORKDIR="$WORKDIR7" \
MOSAIC_AGENT_RUNTIME="pi" \
MOSAIC_RUNTIME_BIN="$FAKE_RUNTIME_BIN7" \
MOSAIC_AGENT_COMMAND="mosaic yolo pi" \
MOSAIC_HEARTBEAT_RUN_DIR="$HB_RUN_DIR7" \
MOSAIC_HEARTBEAT_INTERVAL="$INTERVAL7" \
"$START" "$AGENT7"
# Give the background setsid shim a moment to finish writing the capture file.
sleep 0.5
setsid_args=$(cat "$SETSID_ARGS_FILE" 2>/dev/null | tr '\0' '\n' || true)
rm -f "$SETSID_ARGS_FILE"
rm -rf "$WORKDIR7"
echo "--- test 7: captured setsid args ---"
echo "$setsid_args"
echo "--- end test 7 ---"
# The sidecar script (bash -c <script>) must reference the correct .hb path.
expected_hb="${HB_RUN_DIR7}/${AGENT7}.hb"
echo "$setsid_args" | grep -qF "$expected_hb" || \
fail "test7: sidecar script does not reference correct .hb path ($expected_hb)"
# The sidecar script must use the configured interval.
echo "$setsid_args" | grep -qF "$INTERVAL7" || \
fail "test7: sidecar script does not reference configured interval ($INTERVAL7)"
echo "ok - start-agent-session" echo "ok - start-agent-session"

View File

@@ -12,7 +12,7 @@
# 2. STRUCTURAL (private $HOME default in *.sh) — scanned everywhere EXCEPT examples/, # 2. STRUCTURAL (private $HOME default in *.sh) — scanned everywhere EXCEPT examples/,
# because worked example overlays/personas legitimately show placeholder paths. # because worked example overlays/personas legitimately show placeholder paths.
# #
# File types: *.md, *.sh, *.ps1, *.json, and the extensionless CLI scripts under # File types: *.md, *.sh, *.ps1, *.json, *.yml/*.yaml, *.toml, *.env, *.service, and the CLI scripts under
# tools/_scripts/. Excludes node_modules/ and this gate file. # tools/_scripts/. Excludes node_modules/ and this gate file.
# #
# NOTE: '\bPDA\b' intentionally matches "PDA-friendly" (the contamination removed in P2); # NOTE: '\bPDA\b' intentionally matches "PDA-friendly" (the contamination removed in P2);
@@ -39,7 +39,7 @@ cd "$FRAMEWORK_ROOT" || { echo "FRAMEWORK_ROOT not found: $FRAMEWORK_ROOT" >&2;
# Identity scope = ALL shipped text files (examples/ INCLUDED). # Identity scope = ALL shipped text files (examples/ INCLUDED).
_files_identity() { _files_identity() {
find . -type f \ find . -type f \
\( -name '*.md' -o -name '*.sh' -o -name '*.ps1' -o -name '*.json' -o -path '*/tools/_scripts/*' \) \ \( -name '*.md' -o -name '*.sh' -o -name '*.ps1' -o -name '*.json' -o -name '*.yml' -o -name '*.yaml' -o -name '*.toml' -o -name '*.env' -o -name '*.service' -o -path '*/tools/_scripts/*' \) \
-not -path '*/node_modules/*' -not -path "./$SELF_REL" -print0 -not -path '*/node_modules/*' -not -path "./$SELF_REL" -print0
} }
# Structural scope = shipped scripts, examples/ EXCLUDED. # Structural scope = shipped scripts, examples/ EXCLUDED.

View File

@@ -1,6 +1,6 @@
{ {
"name": "@mosaicstack/mosaic", "name": "@mosaicstack/mosaic",
"version": "0.0.34", "version": "0.0.35",
"repository": { "repository": {
"type": "git", "type": "git",
"url": "https://git.mosaicstack.dev/mosaicstack/stack.git", "url": "https://git.mosaicstack.dev/mosaicstack/stack.git",

File diff suppressed because it is too large Load Diff

View File

@@ -1,12 +1,19 @@
import { constants } from 'node:fs'; import { constants } from 'node:fs';
import { access, chmod, copyFile, mkdir, readFile, writeFile } from 'node:fs/promises'; import { access, chmod, copyFile, mkdir, readFile, writeFile } from 'node:fs/promises';
import { homedir, hostname } from 'node:os'; import { homedir, hostname, userInfo } from 'node:os';
import { dirname, join, resolve } from 'node:path'; import { dirname, join, resolve } from 'node:path';
import { fileURLToPath } from 'node:url'; import { fileURLToPath } from 'node:url';
import { spawn } from 'node:child_process'; import { spawn } from 'node:child_process';
import type { Command } from 'commander'; import type { Command } from 'commander';
import YAML from 'yaml'; import YAML from 'yaml';
/**
* A function that spawns a command with inherited stdio (TTY passthrough).
* Used for interactive commands like `tmux attach` that need a real terminal.
* Resolves with the process exit code.
*/
export type InteractiveRunner = (command: string, args: string[]) => Promise<number>;
export interface CommandResult { export interface CommandResult {
stdout: string; stdout: string;
stderr: string; stderr: string;
@@ -15,8 +22,23 @@ export interface CommandResult {
export type CommandRunner = (command: string, args: string[]) => Promise<CommandResult>; export type CommandRunner = (command: string, args: string[]) => Promise<CommandResult>;
/**
* Injectable sleep helper used by the send --verify polling loop.
* Tests stub this to avoid real delays; production uses the default
* implementation backed by setTimeout.
*/
export type SleepFn = (ms: number) => Promise<void>;
export interface FleetCommandDeps { export interface FleetCommandDeps {
runner?: CommandRunner; runner?: CommandRunner;
/** Injectable interactive runner for commands needing inherited TTY (e.g., `tmux attach`). */
interactiveRunner?: InteractiveRunner;
/**
* Injectable sleep function for the send --verify polling loop.
* Defaults to a real setTimeout-based sleep. Tests stub this to avoid
* real delays; the default is used in production.
*/
sleepFn?: SleepFn;
mosaicHome?: string; mosaicHome?: string;
frameworkRoot?: string; frameworkRoot?: string;
} }
@@ -92,6 +114,18 @@ type FleetServiceAction = 'start' | 'stop' | 'restart' | 'status';
const DEFAULT_SOCKET_NAME = 'mosaic-factory'; const DEFAULT_SOCKET_NAME = 'mosaic-factory';
const DEFAULT_HOLDER_SESSION = '_holder'; const DEFAULT_HOLDER_SESSION = '_holder';
const DEFAULT_WORKING_DIRECTORY = '~/src'; const DEFAULT_WORKING_DIRECTORY = '~/src';
/**
* Default poll interval (ms) between capture-pane checks in `send --verify`.
* Kept short enough to react quickly while not hammering tmux on busy hosts.
*/
export const VERIFY_POLL_INTERVAL_MS = 400;
/**
* Default total timeout (ms) for the `send --verify` polling loop.
* Configurable via `--verify-timeout <ms>` on `agent send`.
*/
export const VERIFY_DEFAULT_TIMEOUT_MS = 6_000;
const DEFAULT_RUNTIME_RESETS: Record<string, { resetCommand: string }> = { const DEFAULT_RUNTIME_RESETS: Record<string, { resetCommand: string }> = {
claude: { resetCommand: '/clear' }, claude: { resetCommand: '/clear' },
codex: { resetCommand: '/clear' }, codex: { resetCommand: '/clear' },
@@ -176,6 +210,93 @@ export function buildFleetServiceCommand(action: FleetServiceAction, agentName?:
return ['systemctl', '--user', action, service]; return ['systemctl', '--user', action, service];
} }
/**
* Returns the systemctl --user enable command for a given unit.
* Used by the install auto-enable step to persist units across reboots.
*/
export function buildSystemdEnableCommand(unit: string): string[] {
return ['systemctl', '--user', 'enable', unit];
}
/**
* Returns the loginctl enable-linger command for a given user.
* Linger allows user systemd services to survive logout.
*/
export function buildEnableLingerCommand(user: string): string[] {
return ['loginctl', 'enable-linger', user];
}
/**
* Enable fleet units for boot-survival after install.
* Non-fatal: if systemctl enable returns non-zero, a warning is printed and we continue.
* If opts.enable === false (--no-enable flag), the whole step is skipped.
*/
export async function enableFleetUnits(
runner: CommandRunner,
roster: FleetRoster,
opts: { enable?: boolean },
): Promise<void> {
if (opts.enable === false) {
return;
}
try {
let succeeded = 0;
let failed = 0;
const holderResult = await runner(
...splitCommand(buildSystemdEnableCommand('mosaic-tmux-holder.service')),
);
if (holderResult.exitCode === 0) {
succeeded++;
} else {
failed++;
process.stderr.write(
`Warning: could not enable mosaic-tmux-holder.service: ${holderResult.stderr || holderResult.stdout || 'non-zero exit'}\n`,
);
}
for (const agent of roster.agents) {
const unit = `mosaic-agent@${agent.name}.service`;
const result = await runner(...splitCommand(buildSystemdEnableCommand(unit)));
if (result.exitCode === 0) {
succeeded++;
} else {
failed++;
process.stderr.write(
`Warning: could not enable ${unit}: ${result.stderr || result.stdout || 'non-zero exit'}\n`,
);
}
}
if (succeeded > 0) {
console.log(`Enabled ${succeeded} unit(s) for boot-survival.`);
}
if (failed > 0) {
process.stderr.write(
`Warning: ${failed} unit(s) could not be enabled (systemctl unavailable?). Run manually if needed.\n`,
);
}
// Best-effort linger
let username: string;
try {
username = userInfo().username;
} catch {
username = process.env['USER'] ?? process.env['LOGNAME'] ?? 'unknown';
}
const lingerResult = await runner(...splitCommand(buildEnableLingerCommand(username)));
if (lingerResult.exitCode !== 0) {
process.stderr.write(
`Hint: run 'loginctl enable-linger ${username}' as root to survive logout.\n`,
);
}
} catch (err) {
process.stderr.write(
`Warning: auto-enable step failed unexpectedly: ${err instanceof Error ? err.message : String(err)}\n`,
);
}
}
export function buildAgentSendCommand( export function buildAgentSendCommand(
paths: FleetPaths, paths: FleetPaths,
agentName: string, agentName: string,
@@ -236,6 +357,410 @@ export function buildAgentTailCommand(
]; ];
} }
// ---------------------------------------------------------------------------
// Fleet ps — phase 2 observability helpers
// ---------------------------------------------------------------------------
export const HEARTBEAT_INTERVAL_MS = 15_000;
export const HEARTBEAT_HEALTHY_MULTIPLIER = 3;
export interface HeartbeatInfo {
ts: Date | null;
pid: number | null;
status: 'ok' | 'busy' | null;
/** healthy | stale | unknown */
health: 'healthy' | 'stale' | 'unknown';
ageMs: number | null;
}
export interface AgentPsRow {
name: string;
tenant_id: string;
host: string;
runtime: string;
systemdActive: string;
systemdEnabled: string;
paneAlive: boolean;
panePid: number | null;
paneCommand: string | null;
idleSeconds: number | null;
heartbeat: HeartbeatInfo;
/** roster runtime !== actual pane command */
driftFlag: boolean;
/** active but UnitFileState=disabled */
bootEnableWarning: boolean;
}
/**
* Returns the systemd show command for an agent unit (active+enabled state).
* Returns: `systemctl --user show <unit> -p ActiveState -p SubState -p UnitFileState`
*/
export function buildSystemdShowCommand(agentName: string): string[] {
const unit = `mosaic-agent@${agentName}.service`;
return [
'systemctl',
'--user',
'show',
unit,
'-p',
'ActiveState',
'-p',
'SubState',
'-p',
'UnitFileState',
];
}
/**
* Returns the tmux list-panes command for an agent pane.
* Format: `#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}`
*/
export function buildTmuxListPanesCommand(
agentName: string,
socketName = DEFAULT_SOCKET_NAME,
): string[] {
return [
'tmux',
'-L',
socketName,
'list-panes',
'-t',
`=${agentName}:0.0`,
'-F',
'#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}',
];
}
/**
* Returns the heartbeat file path for an agent.
*/
export function heartbeatPath(agentName: string, mosaicHome = defaultMosaicHome()): string {
return join(mosaicHome, 'fleet', 'run', `${agentName}.hb`);
}
/**
* Parse a heartbeat file's contents into a HeartbeatInfo.
* File format (one key=value per line):
* ts=<iso8601>
* pid=<pid>
* status=<ok|busy>
*/
export function parseHeartbeat(content: string | null, nowMs = Date.now()): HeartbeatInfo {
if (content === null) {
return { ts: null, pid: null, status: null, health: 'unknown', ageMs: null };
}
const lines = content.split('\n');
let ts: Date | null = null;
let pid: number | null = null;
let status: 'ok' | 'busy' | null = null;
for (const line of lines) {
const [key, ...rest] = line.split('=');
const val = rest.join('=').trim();
if (key === 'ts' && val) {
const d = new Date(val);
if (!Number.isNaN(d.getTime())) ts = d;
} else if (key === 'pid' && val) {
const n = Number.parseInt(val, 10);
if (Number.isFinite(n)) pid = n;
} else if (key === 'status' && (val === 'ok' || val === 'busy')) {
status = val;
}
}
const thresholdMs = HEARTBEAT_INTERVAL_MS * HEARTBEAT_HEALTHY_MULTIPLIER;
let health: 'healthy' | 'stale' | 'unknown' = 'unknown';
let ageMs: number | null = null;
if (ts !== null) {
ageMs = nowMs - ts.getTime();
health = ageMs <= thresholdMs ? 'healthy' : 'stale';
}
return { ts, pid, status, health, ageMs };
}
/**
* Parse the output of `systemctl --user show ... -p ActiveState -p SubState -p UnitFileState`
* Returns an object with the three properties.
*/
export function parseSystemdShow(output: string): {
ActiveState: string;
SubState: string;
UnitFileState: string;
} {
const result: Record<string, string> = {};
for (const line of output.split('\n')) {
const eq = line.indexOf('=');
if (eq !== -1) {
result[line.slice(0, eq)] = line.slice(eq + 1).trim();
}
}
return {
ActiveState: result['ActiveState'] ?? 'unknown',
SubState: result['SubState'] ?? 'unknown',
UnitFileState: result['UnitFileState'] ?? 'unknown',
};
}
/**
* Parse the output of `tmux list-panes -F '#{pane_pid} #{pane_current_command} #{pane_dead} #{pane_activity}'`
* pane_activity is a Unix epoch timestamp (seconds).
*/
export function parseTmuxListPanes(
output: string,
nowMs = Date.now(),
): { pid: number | null; command: string | null; dead: boolean; idleSeconds: number | null } {
const line = output.trim().split('\n')[0];
if (!line) {
return { pid: null, command: null, dead: true, idleSeconds: null };
}
// format: <pid> <command> <dead(0|1)> <activity_epoch>
const parts = line.split(' ');
const pid = parts[0] ? (Number.isFinite(Number(parts[0])) ? Number(parts[0]) : null) : null;
const command = parts[1] ?? null;
const dead = parts[2] === '1';
const activityEpoch = parts[3] ? Number(parts[3]) : NaN;
const idleSeconds =
Number.isFinite(activityEpoch) && activityEpoch > 0
? Math.floor((nowMs - activityEpoch * 1000) / 1000)
: null;
return { pid, command, dead, idleSeconds };
}
/**
* Maps each known runtime to the set of acceptable pane commands.
* A pane running any of these commands for the given runtime is NOT considered drifted.
* Runtimes launched via `mosaic yolo` wrap in node, so 'node' is acceptable for most.
* The dogfood runtime accepts python3/python (the canary-pi dogfood stub).
*/
export const RUNTIME_ACCEPTABLE_COMMANDS: Record<string, readonly string[]> = {
claude: ['claude', 'node'],
codex: ['codex', 'node'],
opencode: ['opencode', 'node'],
pi: ['pi', 'node'],
dogfood: ['python3', 'python'],
};
/**
* Determine if there is a runtime drift: roster says one runtime but the pane
* is actually running something from a different runtime. We detect this by
* checking if the pane command doesn't match a known acceptable command for the
* roster's declared runtime.
*
* Known acceptable commands per runtime (see RUNTIME_ACCEPTABLE_COMMANDS):
* claude → claude, node (node covers mosaic yolo wrapper)
* codex → codex, node
* opencode → opencode, node
* pi → pi, node (python3 still flags drift for canary-pi dogfood stub)
* dogfood → python3, python
*
* If the pane is running something else (e.g., python3/dogfood-agent.py) for
* an agent whose roster runtime is "pi", that's a drift.
*/
export function detectDrift(rosterRuntime: string, paneCommand: string | null): boolean {
if (!paneCommand) return false;
const acceptable = RUNTIME_ACCEPTABLE_COMMANDS[rosterRuntime];
if (!acceptable) return false;
return !acceptable.includes(paneCommand);
}
/**
* Returns the default tenant_id (OS username) and host (short hostname).
* These MUST appear in every --json record for multi-tenant/multi-host zero-foreclosure.
*/
export function getDefaultTenantAndHost(): { tenant_id: string; host: string } {
let tenant_id: string;
try {
tenant_id = userInfo().username;
} catch {
tenant_id = process.env['USER'] ?? process.env['LOGNAME'] ?? 'unknown';
}
const host = hostname().split('.')[0] || 'localhost';
return { tenant_id, host };
}
/**
* Builds the command to create a grouped viewer session targeting an agent session.
* A grouped session shares the same windows as the target but gets INDEPENDENT sizing,
* so attaching the viewer never resizes the agent's window.
*
* The viewer session name is derived from the agent name and a unique suffix (typically
* the caller's PID) so multiple concurrent watchers don't collide.
*
* Usage sequence:
* 1. Run buildAgentWatchCreateViewerCommand → create grouped session (via capturing runner).
* 2. Run buildAgentWatchAttachCommand → attach -r to the viewer session (via interactiveRunner).
* 3. Run buildAgentWatchKillViewerCommand → kill the viewer session on detach (via capturing runner).
*/
export function buildAgentWatchCreateViewerCommand(
agentName: string,
viewerSessionName: string,
socketName = DEFAULT_SOCKET_NAME,
): string[] {
return [
'tmux',
'-L',
socketName,
'new-session',
'-d',
'-t',
`=${agentName}`,
'-s',
viewerSessionName,
];
}
/**
* Builds the interactive attach command for a viewer session (read-only).
* Must be run via interactiveRunner (stdio: 'inherit').
*/
export function buildAgentWatchAttachCommand(
viewerSessionName: string,
socketName = DEFAULT_SOCKET_NAME,
): string[] {
return ['tmux', '-L', socketName, 'attach', '-r', '-t', viewerSessionName];
}
/**
* Builds the kill-session command to clean up a viewer session after detach.
* Keeps the agent session intact.
*/
export function buildAgentWatchKillViewerCommand(
viewerSessionName: string,
socketName = DEFAULT_SOCKET_NAME,
): string[] {
return ['tmux', '-L', socketName, 'kill-session', '-t', viewerSessionName];
}
/**
* Returns a unique viewer session name for a given agent.
* Uses process.pid so concurrent watchers produce distinct names.
*/
export function buildViewerSessionName(agentName: string): string {
return `${agentName}-watch-${process.pid}`;
}
/**
* @deprecated Use buildAgentWatchCreateViewerCommand + buildAgentWatchAttachCommand +
* buildAgentWatchKillViewerCommand instead. This bare attach targets the agent session
* directly and can resize it when the viewer terminal is smaller than the agent's window.
*
* Kept for backward compatibility only.
*/
export function buildAgentWatchCommand(
agentName: string,
socketName = DEFAULT_SOCKET_NAME,
): string[] {
return ['tmux', '-L', socketName, 'attach', '-r', '-t', `=${agentName}`];
}
/**
* Builds the capture-pane command used to verify that agent send was accepted
* (not left as an unsubmitted draft). Captures the last N lines and checks for
* the draft heuristic.
*/
export function buildAgentVerifyAcceptedCommand(
agentName: string,
socketName = DEFAULT_SOCKET_NAME,
lines = 5,
): string[] {
return [
'tmux',
'-L',
socketName,
'capture-pane',
'-t',
`=${agentName}:0.0`,
'-p',
'-S',
`-${lines}`,
];
}
/**
* Result of a send-verify check.
* - 'accepted': positive evidence that the message was accepted (response content present).
* - 'draft': last non-empty line matches the draft heuristic (unsubmitted input).
* - 'unverifiable': pane did not change after send (stale or blank) — we cannot determine
* acceptance; fails closed per FR-5.
*/
export type SendVerifyResult = 'accepted' | 'draft' | 'unverifiable';
/**
* Classify the result of a send-verify check by comparing BEFORE and AFTER pane snapshots.
*
* This is the primary classifier for `send --verify`. It addresses the stale-pane
* false-success problem: if the pane content did not change after the send, the new
* message never registered in the TUI (wedged pane, send dropped, etc.).
*
* Classification logic:
* 'unverifiable' — AFTER is blank/empty OR AFTER == BEFORE (no pane change after send).
* 'draft' — AFTER differs from BEFORE AND the last non-empty line of AFTER starts
* with the draft pattern ("> "); message was typed but not submitted.
* 'accepted' — AFTER differs from BEFORE AND AFTER does not end in a draft line;
* positive evidence that the TUI accepted the message.
*
* NOTE on blank AFTER: Full-screen TUIs (claude, codex, opencode, pi) render blank for
* `tmux capture-pane`. A blank AFTER is indistinguishable from a wedged pane, so it
* is always classified 'unverifiable' (fail-closed).
*
* NOTE on definitive acceptance: Phase-2 can only observe the pane surface — there is no
* runtime acknowledgement (heartbeat-ack) at this phase. The pane-change check is the best
* signal available against an opaque TUI. Definitive acceptance ultimately requires a
* runtime acknowledgement (Phase-3 heartbeat-ack).
*
* Draft heuristic: a last non-empty line (after stripping ANSI escapes) that starts
* with "> " is treated as an unsubmitted input line. This pattern is specific to
* pi/claude TUIs; draft detection for codex/opencode TUIs is best-effort only.
*
* FR-5 requires `send --verify` to return non-zero when delivery cannot be verified.
*
* @param before Pane snapshot captured BEFORE the send command.
* @param after Pane snapshot captured AFTER the send command (after the delay).
*/
export function classifySendResult(before: string, after: string): SendVerifyResult {
const afterLines = after.split('\n').filter((l) => l.trim().length > 0);
// Blank/empty AFTER => full-screen TUI rendered blank, or pane is wedged => unverifiable.
if (afterLines.length === 0) return 'unverifiable';
// No change => message didn't register in the TUI (stale/wedged pane) => unverifiable.
if (after === before) return 'unverifiable';
// AFTER differs from BEFORE — check whether the pane is now showing a draft line.
const lastLine = afterLines[afterLines.length - 1]!;
const stripped = lastLine.replace(/\x1b\[[0-9;]*m/g, '').trim();
// Heuristic: if stripped last line starts with "> " — that's the common draft pattern
// in pi/claude TUIs for showing user input before submission.
// NOTE: this heuristic is pi/claude-specific; draft detection for codex/opencode
// TUIs is best-effort only and may miss other unsubmitted-input indicators.
if (/^>\s/.test(stripped)) return 'draft';
return 'accepted';
}
/**
* Check whether a send was accepted (not left as draft), using only the AFTER snapshot.
*
* @deprecated Prefer classifySendResult(before, after) which guards against stale-pane
* false-successes. This single-snapshot variant cannot detect a wedged pane that still
* shows old non-empty content — it will incorrectly return 'accepted' in that case.
*
* Retained for unit-test compatibility with single-snapshot assertions.
*
* Returns:
* 'unverifiable' — blank/empty capture (full-screen TUIs render blank; we cannot tell).
* 'draft' — last non-empty line matches the draft heuristic.
* 'accepted' — non-blank and not a draft line (but may be stale — see above).
*/
export function isSendAccepted(capturedOutput: string): SendVerifyResult {
const lines = capturedOutput.split('\n').filter((l) => l.trim().length > 0);
// Blank/empty capture => full-screen TUI rendered blank => unverifiable.
// This is the known-unverifiable case; fail closed (not treated as success).
if (lines.length === 0) return 'unverifiable';
const lastLine = lines[lines.length - 1]!;
const stripped = lastLine.replace(/\x1b\[[0-9;]*m/g, '').trim();
// Heuristic: if stripped last line starts with "> " — that's the common draft pattern
// in pi/claude TUIs for showing user input before submission.
// NOTE: this heuristic is pi/claude-specific; draft detection for codex/opencode
// TUIs is best-effort only and may miss other unsubmitted-input indicators.
if (/^>\s/.test(stripped)) return 'draft';
return 'accepted';
}
export function registerFleetCommand(program: Command, deps: FleetCommandDeps = {}): Command { export function registerFleetCommand(program: Command, deps: FleetCommandDeps = {}): Command {
const runner = deps.runner ?? runCommand; const runner = deps.runner ?? runCommand;
const paths = resolveFleetPaths(deps.mosaicHome); const paths = resolveFleetPaths(deps.mosaicHome);
@@ -277,12 +802,22 @@ export function registerFleetCommand(program: Command, deps: FleetCommandDeps =
cmd cmd
.command('install') .command('install')
.description('Install local fleet tools and user systemd units') .description('Install local fleet tools and user systemd units')
.action(async () => installFleet(cmd, frameworkRoot)); .option('--no-enable', 'Skip enabling units for boot-survival')
.action(async (opts: { enable?: boolean }) => {
await installFleet(cmd, frameworkRoot);
const roster = await loadRosterForCommand(cmd);
await enableFleetUnits(runner, roster, opts);
});
cmd cmd
.command('install-systemd') .command('install-systemd')
.description('Install local fleet tools and user systemd units') .description('Install local fleet tools and user systemd units')
.action(async () => installFleet(cmd, frameworkRoot)); .option('--no-enable', 'Skip enabling units for boot-survival')
.action(async (opts: { enable?: boolean }) => {
await installFleet(cmd, frameworkRoot);
const roster = await loadRosterForCommand(cmd);
await enableFleetUnits(runner, roster, opts);
});
for (const action of ['start', 'stop', 'restart'] as const) { for (const action of ['start', 'stop', 'restart'] as const) {
cmd cmd
@@ -360,6 +895,113 @@ export function registerFleetCommand(program: Command, deps: FleetCommandDeps =
console.log(`Verified fleet on tmux socket ${socketName}.`); console.log(`Verified fleet on tmux socket ${socketName}.`);
}); });
cmd
.command('ps')
.description('Show real-time status for all roster agents (systemd + tmux + heartbeat)')
.option('--json', 'Print JSON array')
.action(async (opts: { json?: boolean }) => {
const commandOpts = cmd.opts<{ mosaicHome: string; roster?: string }>();
const activePaths = resolveFleetPaths(commandOpts.mosaicHome);
const roster = await loadRosterForCommand(cmd);
const { tenant_id, host } = getDefaultTenantAndHost();
const nowMs = Date.now();
const rows: AgentPsRow[] = [];
for (const agent of roster.agents) {
// systemd show
const showResult = await runner(...splitCommand(buildSystemdShowCommand(agent.name)));
const sysInfo = parseSystemdShow(showResult.stdout);
// tmux list-panes
const panesResult = await runner(
...splitCommand(buildTmuxListPanesCommand(agent.name, roster.tmux.socketName)),
);
const paneInfo = parseTmuxListPanes(panesResult.stdout, nowMs);
// heartbeat
const hbFile = heartbeatPath(agent.name, activePaths.mosaicHome);
let hbContent: string | null = null;
try {
hbContent = await readFile(hbFile, 'utf8');
} catch {
hbContent = null;
}
const hb = parseHeartbeat(hbContent, nowMs);
// drift and boot-enable
const driftFlag = detectDrift(agent.runtime, paneInfo.command);
const bootEnableWarning =
sysInfo.ActiveState === 'active' && sysInfo.UnitFileState === 'disabled';
rows.push({
name: agent.name,
tenant_id,
host,
runtime: agent.runtime,
systemdActive: sysInfo.ActiveState,
systemdEnabled: sysInfo.UnitFileState,
paneAlive: !paneInfo.dead,
panePid: paneInfo.pid,
paneCommand: paneInfo.command,
idleSeconds: paneInfo.idleSeconds,
heartbeat: hb,
driftFlag,
bootEnableWarning,
});
}
if (opts.json) {
console.log(JSON.stringify(rows, null, 2));
return;
}
// Table output
const header = [
'NAME'.padEnd(18),
'TENANT'.padEnd(12),
'HOST'.padEnd(12),
'RUNTIME'.padEnd(10),
'SYSTEMD'.padEnd(16),
'PANE'.padEnd(8),
'PID'.padEnd(8),
'IDLE'.padEnd(8),
'HB'.padEnd(12),
'FLAGS',
].join(' ');
console.log(header);
console.log('-'.repeat(header.length));
for (const row of rows) {
const systemd = `${row.systemdActive}/${row.systemdEnabled}`;
const pane = row.paneAlive ? 'alive' : 'dead';
const pid = row.panePid !== null ? String(row.panePid) : '-';
const idle = row.idleSeconds !== null ? `${row.idleSeconds}s` : '-';
const hbAge =
row.heartbeat.ageMs !== null
? `${Math.round(row.heartbeat.ageMs / 1000)}s/${row.heartbeat.health}`
: `unknown`;
const flags: string[] = [];
if (row.driftFlag) flags.push('DRIFT');
if (row.bootEnableWarning) flags.push('BOOT-ENABLE');
console.log(
[
row.name.padEnd(18),
row.tenant_id.padEnd(12),
row.host.padEnd(12),
row.runtime.padEnd(10),
systemd.padEnd(16),
pane.padEnd(8),
pid.padEnd(8),
idle.padEnd(8),
hbAge.padEnd(12),
flags.join(','),
].join(' '),
);
}
});
return cmd; return cmd;
} }
@@ -368,6 +1010,8 @@ export function registerFleetAgentCommands(
deps: FleetCommandDeps = {}, deps: FleetCommandDeps = {},
): void { ): void {
const runner = deps.runner ?? runCommand; const runner = deps.runner ?? runCommand;
const iRunner = deps.interactiveRunner ?? spawnInteractive;
const sleepFn = deps.sleepFn ?? defaultSleep;
agentCommand agentCommand
.command('roster') .command('roster')
@@ -417,21 +1061,141 @@ export function registerFleetAgentCommands(
.requiredOption('--message <text>', 'Message text') .requiredOption('--message <text>', 'Message text')
.option('--source-label <label>', 'Source label for the message preamble') .option('--source-label <label>', 'Source label for the message preamble')
.option('--source <label>', 'Alias for --source-label') .option('--source <label>', 'Alias for --source-label')
.option(
'--verify',
'Verify message was accepted (not left as a draft); exit non-zero if unverifiable',
)
.option(
'--verify-timeout <ms>',
`Maximum time (ms) to poll for pane change when --verify is set (default: ${VERIFY_DEFAULT_TIMEOUT_MS})`,
String(VERIFY_DEFAULT_TIMEOUT_MS),
)
.action( .action(
async (agent: string, opts: { message: string; sourceLabel?: string; source?: string }) => { async (
agent: string,
opts: {
message: string;
sourceLabel?: string;
source?: string;
verify?: boolean;
verifyTimeout?: string;
},
) => {
const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome); const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome);
getRosterAgent(roster, agent); getRosterAgent(roster, agent);
const paths = resolveFleetPaths( const paths = resolveFleetPaths(
resolveMosaicHomeFromCommand(agentCommand, deps.mosaicHome), resolveMosaicHomeFromCommand(agentCommand, deps.mosaicHome),
); );
const sourceLabel = opts.sourceLabel ?? opts.source ?? getDefaultOperatorSourceLabel(); const sourceLabel = opts.sourceLabel ?? opts.source ?? getDefaultOperatorSourceLabel();
await runChecked( if (opts.verify) {
runner, const parsedTimeout =
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel), opts.verifyTimeout !== undefined ? Number.parseInt(opts.verifyTimeout, 10) : Number.NaN;
); const timeoutMs = Number.isFinite(parsedTimeout)
? Math.max(0, parsedTimeout)
: VERIFY_DEFAULT_TIMEOUT_MS;
// Capture BEFORE snapshot so we can detect stale-pane false-successes.
// A wedged pane that still shows old non-empty content must not be reported
// as 'accepted' — we compare BEFORE vs AFTER to guard against that case.
const beforeResult = await runner(
...splitCommand(buildAgentVerifyAcceptedCommand(agent, roster.tmux.socketName)),
);
if (beforeResult.exitCode !== 0) {
throw new Error(
`send --verify: could not capture pane output before send (tmux exited ${beforeResult.exitCode}).`,
);
}
const beforeSnapshot = beforeResult.stdout;
await runChecked(
runner,
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel),
);
// Bounded polling loop: poll capture-pane every VERIFY_POLL_INTERVAL_MS up to
// timeoutMs. Return immediately when the pane shows 'accepted' or 'draft';
// keep polling while 'unverifiable' (no pane change yet). Fail closed after
// timeout with the existing "no pane change after send" message.
const deadline = Date.now() + timeoutMs;
let verifyResult: SendVerifyResult = 'unverifiable';
while (true) {
await sleepFn(VERIFY_POLL_INTERVAL_MS);
const afterResult = await runner(
...splitCommand(buildAgentVerifyAcceptedCommand(agent, roster.tmux.socketName)),
);
if (afterResult.exitCode !== 0) {
throw new Error(
`send --verify: could not capture pane output to verify acceptance (tmux exited ${afterResult.exitCode}).`,
);
}
verifyResult = classifySendResult(beforeSnapshot, afterResult.stdout);
// Definitive result — stop polling immediately.
if (verifyResult === 'accepted' || verifyResult === 'draft') {
break;
}
// Still unverifiable — check if we have time left to poll again.
if (Date.now() >= deadline) {
break;
}
}
if (verifyResult === 'draft') {
process.exitCode = 1;
process.stderr.write(
`send --verify: message left as unsubmitted draft in agent "${agent}".\n`,
);
} else if (verifyResult === 'unverifiable') {
process.exitCode = 1;
process.stderr.write(
`send --verify: could not verify delivery (no pane change after send) for agent "${agent}".\n`,
);
}
} else {
await runChecked(
runner,
buildAgentSendCommand(paths, agent, opts.message, roster.tmux.socketName, sourceLabel),
);
}
}, },
); );
agentCommand
.command('watch <agent>')
.description('Open a read-only view of a fleet agent tmux session (cannot send keystrokes)')
.action(async (agent: string) => {
const roster = await loadRosterFromAgentCommand(agentCommand, deps.mosaicHome);
getRosterAgent(roster, agent);
// Use a GROUPED VIEWER SESSION to prevent the observer from resizing the agent's
// window. A bare `tmux attach -r` against the agent session itself still lets the
// client shrink the session to its terminal size; a grouped session gets INDEPENDENT
// sizing so the agent's window is never affected by the viewer's terminal dimensions.
//
// Sequence:
// 1. Create a throwaway grouped session targeting the agent (capturing runner).
// 2. Attach -r (read-only) to the viewer session (interactiveRunner / TTY).
// 3. Kill the viewer session on detach so stale sessions don't accumulate.
const viewerName = buildViewerSessionName(agent);
const socketName = roster.tmux.socketName;
await runChecked(runner, buildAgentWatchCreateViewerCommand(agent, viewerName, socketName));
const [bin, args] = splitCommand(buildAgentWatchAttachCommand(viewerName, socketName));
const exitCode = await iRunner(bin, args);
// Best-effort cleanup of the viewer session regardless of how the user detached.
// Errors here are intentionally suppressed — the agent session is unaffected.
const killResult = await runner(
...splitCommand(buildAgentWatchKillViewerCommand(viewerName, socketName)),
);
void killResult; // result is intentionally ignored
if (exitCode !== 0) {
process.exitCode = exitCode;
}
});
agentCommand agentCommand
.command('reset <agent>') .command('reset <agent>')
.description('Reset a local fleet agent by sending the runtime reset command') .description('Reset a local fleet agent by sending the runtime reset command')
@@ -864,6 +1628,32 @@ function resolveFrameworkRoot(): string {
return resolve(dirname(currentFile), '..', '..', 'framework'); return resolve(dirname(currentFile), '..', '..', 'framework');
} }
/**
* Default InteractiveRunner implementation: spawns the command with inherited
* stdio so the terminal is passed through to the child process. This is required
* for commands like `tmux attach` that are full-screen interactive and cannot be
* captured through a pipe.
*/
function spawnInteractive(command: string, args: string[]): Promise<number> {
return new Promise((resolvePromise) => {
const child = spawn(command, args, { stdio: 'inherit' });
child.on('error', () => {
resolvePromise(127);
});
child.on('close', (code) => {
resolvePromise(code ?? 1);
});
});
}
/**
* Default SleepFn implementation backed by setTimeout.
* Tests inject a stub to avoid real delays in the send --verify polling loop.
*/
function defaultSleep(ms: number): Promise<void> {
return new Promise<void>((res) => setTimeout(res, ms));
}
async function canRead(path: string): Promise<boolean> { async function canRead(path: string): Promise<boolean> {
try { try {
await access(path, constants.R_OK); await access(path, constants.R_OK);