Correct the launch-path finding (PATH, not TTY), record the validated durable real-agent recipe (pi on openai-codex/gpt-5.5), the Codex-default/Claude-reserved policy, and the fleet-init boot-survival automation TODO. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
Scratchpad — Fleet Phase 2: Observability (W-FLEET)
Append-only. Mission `mvp-20260312` / workstream W-FLEET. Lead: Jarvis (Claude) at `W-jarvis:mos-claude-18`. Coordinating with `jwoltje@dragon-lin:coder0-0`.
Mission prompt (2026-06-20)
Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability
for delivery. The USC tmux PoC is the proven base. Jason granted lead authority:
"The fleet is a great way to actually build the MVP — we are building the system that
builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate
with a second agent on dragon-lin.
Decisions of record (with Jason, 2026-06-20)
- Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
- Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
- Health: heartbeat required; dogfood stub implements protocol now.
- Lifecycle: hybrid (core always-on + ephemeral workers).
- Observation: read-only default, opt-in takeover.
- Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
- Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
- Fleet is dual-role: product AND means of production (bootstrapping the MVP).
- Code review = dual-engine: Claude and gpt-5.5/Codex, run together (Jason: the combination produces the best results). Launch reviewers via `mosaic yolo pi/codex` (proven path) or `~/.config/mosaic/tools/codex/codex-code-review.sh`. Applies to all code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
- Worktree discipline: do fleet work in `~/src/mosaicstack-stack-worktrees/<branch>`, NOT the shared main checkout; concurrent processes mutate `main` there (learned 2026-06-20).
Environment facts (verified 2026-06-20)
- Fleet is live on `W-jarvis` (uid 1000, `jarvis`, `Linger=yes`) on tmux socket `mosaic-factory`: `_holder`, `canary-pi`, `dogfood-coder`, `dogfood-orchestrator`, `dogfood-reviewer`. All panes run `~/.config/mosaic/fleet/dogfood-agent.py` (stub), including `canary-pi` (roster says runtime=pi → drift).
- Holder + `mosaic-agent@*` units are `active (exited)` but `UnitFileState=disabled` (reboot loses fleet → boot-enable gap to surface).
- Observation blocked by: isolated socket (hidden from default `tmux ls`), `capture-pane` blank for TUIs, `attach` being read-write + resizing.
- Second agent: `jwoltje@dragon-lin`, session `coder0-0` (group `coder0`), running `node`, default socket. ssh forward reach confirmed.
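The boot-enable gap noted in these facts (a unit that is active now but whose unit file is disabled) can be checked mechanically. The following is a minimal sketch, assuming the `KEY=VALUE` text printed by `systemctl --user show <unit> -p ActiveState -p UnitFileState` is already available as a string. The helper names `parse_unit_props` and `has_boot_enable_gap` are hypothetical and are not part of the fleet CLI.

```python
def parse_unit_props(show_output: str) -> dict:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines or lines without '='
            props[key.strip()] = value.strip()
    return props


def has_boot_enable_gap(props: dict) -> bool:
    """True when the unit is active now but disabled for the next boot."""
    running_now = props.get("ActiveState") == "active"
    disabled_at_boot = props.get("UnitFileState") == "disabled"
    return running_now and disabled_at_boot


# Illustrative input only; not captured from the live fleet.
sample = "ActiveState=active\nSubState=exited\nUnitFileState=disabled\n"
print(has_boot_enable_gap(parse_unit_props(sample)))
```

Keeping the parsing separate from the `systemctl` call means the check can be exercised without a running user manager.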
Governance / collision-safety
- `mosaicstack-stack` has active mission `mvp-20260312` with single-writer locks on `docs/MISSION-MANIFEST.md`, `docs/TASKS.md`, `docs/scratchpads/mvp-20260312.md`.
- This workstream touches NONE of those. All Fleet docs scoped under `docs/fleet/` + this scratchpad. Rollup row proposed, not written.
Session log
- 2026-06-20: Researched AI guide + fleet code + live state. Established north star with Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted `docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`, `agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on mosaic-factory: it correctly flagged canary-pi DRIFT + BOOT-ENABLE, with tenant_id + host in JSON. Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002); `fleet ps` HB now `healthy` for all 4 agents.
  - Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572 (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex caught an Alpine blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
  - FINDINGS (north-star / Phase-3 blockers):
    - Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps` caught both dead panes instantly (tool validated).
    - `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
  - Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close `fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.
- 2026-06-21 (session 3): Phase-2 PR #579 merged (3 dual-engine rounds hardened verify+watch). Then closed the launch-path question with Jason's input, CORRECTING earlier findings:
  - The ad-hoc launch deaths were NOT a fundamental TTY blocker: (a) codex was a stale version (Jason updated it); (b) pi was misconfigured to Claude auth (Jason removed it; default is now Codex). The REAL durable-launch bug is PATH: the detached tmux launch shell is login+non-interactive, so it misses `~/.npm-global/bin` (added only in `~/.bashrc`) -> `mosaic: command not found` (127) -> pane dies. tmux panes inherit the tmux server env, so PATH must be baked into the pane command.
  - Durable real-agent recipe (validated live on gpt-5.5, Claude-free): `mosaic yolo pi --model openai-codex/gpt-5.5:high`. pi tolerates detached tmux; a raw interactive TUI (codex CLI) exits without an attached client. Status line confirmed `(openai-codex) gpt-5.5 • high`.
  - PATH fix landed in `start-agent-session.sh` (commit `32efc13`, branch feat/fleet-launch-path): derive runtime-bin prefix (MOSAIC_RUNTIME_BIN | npm prefix | ~/.npm-global/bin | ~/.local/bin), bake `export PATH=...; exec <cmd>` into the pane; `exec` also fixes the drift false-positive. Live-tested under stripped PATH -> durable.
  - Boot-survival: Jason ran `systemctl --user enable` (+ linger). TODO: auto-enable in fleet init so operators never have to remember it (agentic-enhancement cycle).
  - Future custom Pi harness build: pi cannot self-report its model (track runtime/model/effort as fleet metadata); drift detection should recognize `node` as pi's pane command (a node-wrapped pane can currently read as drift).
  - Findings recorded in AI Guide playbooks/tmux-fleet.md (aiguide PR #7, merged).
  - Policy: avoid Claude outside Claude Code (API pricing for alt-harness use). Fleet runtimes default to Codex / pi-on-Codex; Claude stays in Claude Code only.
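
The PATH fix recorded in session 3 (derive a runtime-bin directory, then bake `export PATH=...; exec <cmd>` into the pane) can be illustrated as follows. This is a minimal sketch and not the contents of `start-agent-session.sh`. It makes three assumptions: that the `|` list is a first-match fallback order, that "npm prefix" means the `bin` directory under the npm prefix, and that the chosen directory is prepended to an existing PATH. The function names `pick_runtime_bin` and `build_pane_command` are hypothetical.

```python
import os
import shlex


def pick_runtime_bin(env, home, npm_prefix=None, exists=os.path.isdir):
    """Return the first existing candidate bin dir, or None if none exist."""
    candidates = [
        env.get("MOSAIC_RUNTIME_BIN"),
        # Assumption: the global npm bin dir is <prefix>/bin (Unix layout).
        os.path.join(npm_prefix, "bin") if npm_prefix else None,
        os.path.join(home, ".npm-global", "bin"),
        os.path.join(home, ".local", "bin"),
    ]
    for candidate in candidates:
        if candidate and exists(candidate):
            return candidate
    return None


def build_pane_command(argv, runtime_bin, base_path):
    """Build a shell string that sets PATH, then execs the agent command."""
    path_value = f"{runtime_bin}:{base_path}" if runtime_bin else base_path
    # `exec` replaces the launch shell with the agent process itself.
    return f"export PATH={shlex.quote(path_value)}; exec {shlex.join(argv)}"


if __name__ == "__main__":
    home = os.path.expanduser("~")
    runtime_bin = pick_runtime_bin(os.environ, home)
    argv = ["mosaic", "yolo", "pi", "--model", "openai-codex/gpt-5.5:high"]
    print(build_pane_command(argv, runtime_bin, os.environ.get("PATH", "/usr/bin:/bin")))
```

The resulting string is meant to be handed to tmux as the pane's shell command, so the pane no longer depends on what the login, non-interactive shell happens to put on PATH.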