# Scratchpad — Fleet Phase 2: Observability (W-FLEET)

> Append-only. Mission `mvp-20260312` / workstream W-FLEET.
> Lead: Jarvis (Claude) at `W-jarvis:mos-claude-18`. Coordinating with `jwoltje@dragon-lin:coder0-0`.

## Mission prompt (2026-06-20)

Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability
for delivery. The USC tmux PoC is the proven base. Jason granted lead authority:
"The fleet is a great way to actually build the MVP: we are building the system that
builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate
with a second agent on `dragon-lin`.

## Decisions of record (with Jason, 2026-06-20)

- Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
- Health: heartbeat required; dogfood stub implements protocol now.
- Lifecycle: hybrid (core always-on + ephemeral workers).
- Observation: read-only default, opt-in takeover.
- Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
- Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
- Fleet is dual-role: product AND means of production (bootstrapping the MVP).
- Code review = **dual-engine**: Claude **and** gpt-5.5/Codex, run together (Jason: the
  combination produces the best results). Launch reviewers via `mosaic yolo pi` / `codex`
  (proven path) or `~/.config/mosaic/tools/codex/codex-code-review.sh`. Applies to all
  code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
- Worktree discipline: do fleet work in `~/src/mosaicstack-stack-worktrees/<branch>`, NOT
  the shared main checkout: concurrent processes mutate `main` there (learned 2026-06-20).

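
The "read-only default, opt-in takeover" decision can be illustrated with a small command builder. This is a minimal sketch, not the `agent watch` implementation: the function names are hypothetical, and it assumes the fleet socket is a named tmux socket reachable with `-L` (it could equally be a socket path passed with `-S`). The `-r` flag of `attach-session` marks the client read-only; on recent tmux releases it is also expected to stop the observer from resizing the window, which is worth confirming against the installed version.

```python
from typing import List


def list_sessions_argv(socket_name: str) -> List[str]:
    """Build the command that lists sessions on a named (non-default) tmux socket."""
    return ["tmux", "-L", socket_name, "list-sessions"]


def attach_argv(session: str, socket_name: str, takeover: bool = False) -> List[str]:
    """Build a tmux attach command that is read-only unless takeover is requested.

    Hypothetical helper: the real CLI may assemble this differently.
    """
    argv = ["tmux", "-L", socket_name, "attach-session"]
    if not takeover:
        # -r attaches the client read-only, so keystrokes do not reach the pane.
        argv.append("-r")
    argv += ["-t", session]
    return argv
```

Making `takeover` an explicit keyword with a `False` default keeps the safe path the one you get without thinking; for example, `attach_argv("canary-pi", "mosaic-factory")` yields a read-only attach, and write access requires `takeover=True`.
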
## Environment facts (verified 2026-06-20)

- Fleet is live on `W-jarvis` (uid 1000, `jarvis`, `Linger=yes`) on tmux socket
  `mosaic-factory`: `_holder`, `canary-pi`, `dogfood-coder`, `dogfood-orchestrator`,
  `dogfood-reviewer`. All panes run `~/.config/mosaic/fleet/dogfood-agent.py` (stub),
  including `canary-pi` (roster says runtime=pi → **drift**).
- Holder + `mosaic-agent@*` units are `active (exited)` but `UnitFileState=disabled`
  (reboot loses fleet → boot-enable gap to surface).
- Observation blocked by: isolated socket (hidden from default `tmux ls`), `capture-pane`
  blank for TUIs, `attach` being read-write + resizing.
- Second agent: `jwoltje@dragon-lin`, session `coder0-0` (group `coder0`), running `node`,
  default socket. ssh forward reach confirmed.

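
The drift condition noted for `canary-pi` (roster says one runtime, the pane runs another) reduces to a lookup. The sketch below is illustrative only: the `EXPECTED_COMMANDS` table and `is_drift` are hypothetical names, the stub's pane command is assumed to show up as `python` or `python3`, and `node` is accepted for pi on the assumption that pi runs node-wrapped. The observed value would typically come from tmux's `#{pane_current_command}` format variable.

```python
from typing import Dict, FrozenSet

# Hypothetical mapping: roster runtime -> pane commands considered "not drift".
EXPECTED_COMMANDS: Dict[str, FrozenSet[str]] = {
    "stub": frozenset({"python", "python3"}),  # assumed name of the stub's process
    "pi": frozenset({"pi", "node"}),           # pi may appear as its node wrapper
    "codex": frozenset({"codex"}),
}


def is_drift(roster_runtime: str, pane_command: str) -> bool:
    """Return True when the live pane command does not match the roster runtime."""
    allowed = EXPECTED_COMMANDS.get(roster_runtime)
    if allowed is None:
        # A runtime the table does not know cannot be confirmed, so flag it.
        return True
    return pane_command not in allowed
```

With this table, a roster entry of `pi` whose pane reports `python3` is flagged, while the same entry reporting `node` is not.
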
## Governance / collision-safety

- `mosaicstack-stack` has active mission `mvp-20260312` with single-writer locks on
  `docs/MISSION-MANIFEST.md`, `docs/TASKS.md`, `docs/scratchpads/mvp-20260312.md`.
- This workstream touches NONE of those. All Fleet docs scoped under `docs/fleet/` +
  this scratchpad. Rollup row proposed, not written.

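
The "touches NONE of those" claim is easy to check mechanically before a commit. The following is a minimal sketch under stated assumptions: `locked_paths_touched` is a hypothetical helper, the changed-file list is assumed to be repo-relative POSIX paths (for example from `git diff --name-only`), and the lock list is copied from the bullets above rather than read from any lock registry.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Single-writer-locked files named in this scratchpad.
LOCKED = frozenset({
    "docs/MISSION-MANIFEST.md",
    "docs/TASKS.md",
    "docs/scratchpads/mvp-20260312.md",
})


def locked_paths_touched(changed: Iterable[str]) -> List[str]:
    """Return the locked files that appear in a list of changed repo-relative paths."""
    # PurePosixPath normalizes forms like "./docs/TASKS.md" to "docs/TASKS.md".
    normalized = (str(PurePosixPath(path)) for path in changed)
    return sorted(path for path in normalized if path in LOCKED)
```

An empty result means the change set stays inside the workstream's own scope; note that `docs/fleet/TASKS.md` is a different file from the locked `docs/TASKS.md` and does not match.
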
## Session log

- 2026-06-20: Researched AI guide + fleet code + live state. Established north star with
  Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
  `docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms
  with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`,
  `agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on
  mosaic-factory: correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON.
  Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002); `fleet ps` HB now
  `healthy` for all 4 agents.
  - Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572
    (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex caught an Alpine
    blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
  - **FINDINGS (north-star / Phase-3 blockers):**
    1. Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a
       detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub
       survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY
       in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps`
       caught both dead panes instantly (tool validated).
    2. `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's
       global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via
       tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
  - Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close
    `fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.
- 2026-06-21 (session 3): Phase-2 PR #579 merged (3 dual-engine rounds hardened
  verify+watch). Then closed the launch-path question with Jason's input, CORRECTING
  earlier findings:
  - The ad-hoc launch deaths were NOT a fundamental TTY blocker: (a) codex was a stale
    version (Jason updated it); (b) pi was misconfigured to Claude auth (Jason removed it;
    default is now Codex). The REAL durable-launch bug is **PATH**: the detached tmux
    launch shell is login+non-interactive, so it misses `~/.npm-global/bin` (added only in
    `~/.bashrc`) -> `mosaic: command not found` (127) -> pane dies. tmux panes inherit the
    tmux _server_ env, so PATH must be baked into the pane command.
  - **Durable real-agent recipe (validated live on gpt-5.5, Claude-free):**
    `mosaic yolo pi --model openai-codex/gpt-5.5:high`. pi tolerates detached tmux; a raw
    interactive TUI (codex CLI) exits without an attached client. Status line confirmed
    `(openai-codex) gpt-5.5 • high`.
  - PATH fix landed in `start-agent-session.sh` (commit 32efc13, branch
    feat/fleet-launch-path): derive runtime-bin prefix (MOSAIC_RUNTIME_BIN | npm prefix |
    ~/.npm-global/bin | ~/.local/bin), bake `export PATH=...; exec <cmd>` into the pane;
    `exec` also fixes the drift false-positive. Live-tested under stripped PATH -> durable.
  - Boot-survival: Jason ran `systemctl --user enable` (+ linger). TODO: auto-enable in
    **fleet init** so operators never have to remember it (agentic-enhancement cycle).
  - Future custom Pi harness build: pi cannot self-report its model (track
    runtime/model/effort as fleet metadata); drift detection should recognize `node` as
    pi's pane command (a node-wrapped pane can currently read as drift).
  - Findings recorded in AI Guide playbooks/tmux-fleet.md (aiguide PR #7, merged).
  - Policy: avoid Claude outside Claude Code (API pricing for alt-harness use). Fleet
    runtimes default to Codex / pi-on-Codex; Claude stays in Claude Code only.
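
The PATH fix from session 3 (derive a runtime-bin directory, then bake `export PATH=...; exec <cmd>` into the pane command) can be sketched in Python to make the logic concrete. This is not the contents of `start-agent-session.sh`, which is a shell script; the function names are hypothetical, and two behaviours are assumptions: that the first candidate directory that exists wins, and that the chosen directory is prepended to a base PATH. The npm prefix would come from something like `npm config get prefix`, passed in here as a plain string.

```python
import os
import shlex
from typing import Callable, List, Mapping, Optional, Sequence


def runtime_bin_candidates(
    env: Mapping[str, str], npm_prefix: Optional[str], home: str
) -> List[str]:
    """List candidate runtime-bin directories in the order given in the log."""
    candidates: List[str] = []
    if env.get("MOSAIC_RUNTIME_BIN"):
        candidates.append(env["MOSAIC_RUNTIME_BIN"])
    if npm_prefix:
        candidates.append(f"{npm_prefix}/bin")
    candidates.append(f"{home}/.npm-global/bin")
    candidates.append(f"{home}/.local/bin")
    return candidates


def pick_runtime_bin(
    candidates: Sequence[str], exists: Callable[[str], bool] = os.path.isdir
) -> Optional[str]:
    """Return the first candidate that exists (assumed selection rule)."""
    for candidate in candidates:
        if exists(candidate):
            return candidate
    return None


def pane_command(cmd: Sequence[str], runtime_bin: Optional[str], base_path: str) -> str:
    """Build the shell string handed to tmux as the pane command."""
    path = f"{runtime_bin}:{base_path}" if runtime_bin else base_path
    # exec replaces the launch shell, so the pane's process is the agent itself.
    return f"export PATH={shlex.quote(path)}; exec {shlex.join(cmd)}"
```

Because the PATH is written into the command string itself, the pane no longer depends on whatever environment the tmux server happened to start with, and `exec` leaves the agent as the pane's own process rather than a child of a wrapper shell.
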