stack/docs/scratchpads/fleet-observability-phase2.md
Jarvis 1908dab373
docs(fleet): record durable-launch findings + runtime-default policy
Correct the launch-path finding (PATH, not TTY), record the validated durable
real-agent recipe (pi on openai-codex/gpt-5.5), the Codex-default/Claude-reserved
policy, and the fleet-init boot-survival automation TODO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMoEx7hfdFGjUiCHuN1RRi
2026-06-21 12:08:39 -05:00

Scratchpad — Fleet Phase 2: Observability (W-FLEET)

Append-only. Mission mvp-20260312 / workstream W-FLEET. Lead: Jarvis (Claude) at W-jarvis:mos-claude-18. Coordinating with jwoltje@dragon-lin:coder0-0.

Mission prompt (2026-06-20)

Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability for delivery. The USC tmux PoC is the proven base. Jason granted lead authority: "The fleet is a great way to actually build the MVP — we are building the system that builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate with a second agent on dragon-lin.

Decisions of record (with Jason, 2026-06-20)

  • Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
  • Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
  • Health: heartbeat required; dogfood stub implements protocol now.
  • Lifecycle: hybrid (core always-on + ephemeral workers).
  • Observation: read-only default, opt-in takeover.
  • Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
  • Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
  • Fleet is dual-role: product AND means of production (bootstrapping the MVP).
  • Code review = dual-engine: Claude and gpt-5.5/Codex, run together (Jason: the combination produces the best results). Launch reviewers via mosaic yolo pi / codex (proven path) or ~/.config/mosaic/tools/codex/codex-code-review.sh. Applies to all code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
  • Worktree discipline: do fleet work in ~/src/mosaicstack-stack-worktrees/<branch>, NOT the shared main checkout — concurrent processes mutate main there (learned 2026-06-20).

Environment facts (verified 2026-06-20)

  • Fleet is live on W-jarvis (uid 1000, jarvis, Linger=yes) on tmux socket mosaic-factory: _holder, canary-pi, dogfood-coder, dogfood-orchestrator, dogfood-reviewer. All panes run ~/.config/mosaic/fleet/dogfood-agent.py (stub), including canary-pi (roster says runtime=pi → drift).
  • Holder + mosaic-agent@* units are active (exited) but UnitFileState=disabled (reboot loses fleet → boot-enable gap to surface).
  • Observation blocked by: isolated socket (hidden from default tmux ls), capture-pane blank for TUIs, attach being read-write + resizing.
  • Second agent: jwoltje@dragon-lin, session coder0-0 (group coder0), running node, default socket. ssh forward reach confirmed.
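
The observation blockers and the boot-enable gap listed above can be illustrated with a small sketch. It is not the fleet CLI's implementation: the helper names are hypothetical, the `-f read-only,ignore-size` attach flags assume a reasonably recent tmux (older versions only offer `-r`), and the gap check assumes `KEY=VALUE` output as printed by `systemctl --user show -p ActiveState -p UnitFileState <unit>`. The functions only build argv lists or parse text; nothing is executed.

```python
from typing import Dict, List

SOCKET = "mosaic-factory"  # isolated socket name taken from the notes above


def list_sessions_cmd(socket: str = SOCKET) -> List[str]:
    """argv listing sessions on a named socket (a plain `tmux ls` only sees the default socket)."""
    return ["tmux", "-L", socket, "list-sessions", "-F", "#{session_name}"]


def observe_cmd(session: str, socket: str = SOCKET) -> List[str]:
    """argv for a read-only attach that should not resize the observed window."""
    # read-only: input from this client is ignored (apart from detach/switch bindings)
    # ignore-size: this client's terminal size does not influence the window size
    return ["tmux", "-L", socket, "attach-session",
            "-f", "read-only,ignore-size", "-t", session]


def parse_show(output: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines such as those printed by `systemctl show -p ...`."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key.strip()] = value.strip()
    return props


def boot_enable_gap(output: str) -> bool:
    """True when a unit is running now but would not come back after a reboot."""
    props = parse_show(output)
    return props.get("ActiveState") == "active" and props.get("UnitFileState") != "enabled"
```

For example, `observe_cmd("canary-pi")` yields an argv that can be handed to `subprocess.run`, and `boot_enable_gap("ActiveState=active\nUnitFileState=disabled")` reports the gap described for the holder and `mosaic-agent@*` units.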

Governance / collision-safety

  • mosaicstack-stack has active mission mvp-20260312 with single-writer locks on docs/MISSION-MANIFEST.md, docs/TASKS.md, docs/scratchpads/mvp-20260312.md.
  • This workstream touches NONE of those. All Fleet docs scoped under docs/fleet/ + this scratchpad. Rollup row proposed, not written.

Session log

  • 2026-06-20: Researched AI guide + fleet code + live state. Established north star with Jason (8 forks decided). Branched feat/fleet-observability. Persisted docs/fleet/{north-star.md,PRD.md,TASKS.md} + this scratchpad. Next: establish comms with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + fleet ps).
  • 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): fleet ps, agent watch, agent send --verify, 62 tests. LIVE-verified fleet ps on mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON. Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — fleet ps HB now healthy for all 4 agents.
  • Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572 (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex flagged an Alpine blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
  • FINDINGS (north-star / Phase-3 blockers):
    1. Ad-hoc mosaic yolo {codex,pi} via start-agent-session.sh DIE immediately in a detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY in the detached shell) must be fixed before Phase-3 real-runtime swap. fleet ps caught both dead panes instantly (tool validated).
    2. MOSAIC_AGENT_NAME (set in systemd EnvironmentFile) is NOT propagated into tmux's global env, so agents defaulted to unknown. Worked around in dogfood-agent.py via tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
  • Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close fleet-observability-1. Defer launch-path + env-propagation fixes to Phase 3.
  • 2026-06-21 (session 3): Phase-2 PR #579 merged (3 dual-engine rounds hardened verify+watch). Then closed the launch-path question with Jason's input — CORRECTING earlier findings:
    • The ad-hoc launch deaths were NOT a fundamental TTY blocker: (a) codex was a stale version (Jason updated it); (b) pi was misconfigured to use Claude auth (Jason removed it; default is now Codex). The REAL durable-launch bug is PATH: the detached tmux launch shell is login + non-interactive, so it never picks up ~/.npm-global/bin (added only in ~/.bashrc) -> mosaic: command not found (exit 127) -> pane dies. tmux panes inherit the tmux server env, so PATH must be baked into the pane command.
    • Durable real-agent recipe (validated live on gpt-5.5, Claude-free): mosaic yolo pi --model openai-codex/gpt-5.5:high — pi tolerates detached tmux; a raw interactive TUI (codex CLI) exits without an attached client. Status line confirmed (openai-codex) gpt-5.5 • high.
    • PATH fix landed in start-agent-session.sh (commit 32efc13, branch feat/fleet-launch-path): derive runtime-bin prefix (MOSAIC_RUNTIME_BIN | npm prefix | ~/.npm-global/bin | ~/.local/bin), bake export PATH=...; exec <cmd> into the pane; exec also fixes the drift false-positive. Live-tested under stripped PATH -> durable.
    • Boot-survival: Jason ran systemctl --user enable (+ linger). TODO: auto-enable in fleet init so operators never have to remember it (agentic-enhancement cycle).
    • Future custom Pi harness build: pi cannot self-report its model (track runtime/model/effort as fleet metadata); drift detection should recognize node as pi's pane command (a node-wrapped pane can currently read as drift).
    • Findings recorded in AI Guide playbooks/tmux-fleet.md (aiguide PR #7, merged).
    • Policy: avoid Claude outside Claude Code (API pricing for alt-harness use) — fleet runtimes default to Codex / pi-on-Codex; Claude stays in Claude Code only.
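
The PATH fix recorded in session 3 can be sketched roughly as follows. This is an illustration in Python, not the contents of start-agent-session.sh: the function names are hypothetical, the npm prefix is passed in rather than queried, and prepending every candidate directory (instead of picking the first that exists) is an assumption about how the precedence list is applied.

```python
import os
import shlex
from typing import List, Mapping, Optional


def runtime_bin_candidates(env: Mapping[str, str],
                           npm_prefix: Optional[str] = None) -> List[str]:
    """Candidate bin dirs, in the order listed in the notes:
    MOSAIC_RUNTIME_BIN | npm prefix | ~/.npm-global/bin | ~/.local/bin."""
    home = env.get("HOME", os.path.expanduser("~"))
    candidates: List[str] = []
    if env.get("MOSAIC_RUNTIME_BIN"):
        candidates.append(env["MOSAIC_RUNTIME_BIN"])
    if npm_prefix:  # assumption: caller supplies the npm global prefix
        candidates.append(os.path.join(npm_prefix, "bin"))
    candidates.append(os.path.join(home, ".npm-global", "bin"))
    candidates.append(os.path.join(home, ".local", "bin"))
    return candidates


def bake_pane_command(cmd: List[str], bin_dirs: List[str], base_path: str) -> str:
    """Build the shell string for the pane: set PATH explicitly, then exec the runtime."""
    # Prepend the runtime dirs, keep the existing entries, drop duplicates in order.
    parts = list(dict.fromkeys(bin_dirs + [p for p in base_path.split(":") if p]))
    path = ":".join(parts)
    # exec replaces the launch shell, so the pane runs the runtime directly.
    return f"export PATH={shlex.quote(path)}; exec {shlex.join(cmd)}"


if __name__ == "__main__":
    dirs = runtime_bin_candidates(os.environ)
    recipe = ["mosaic", "yolo", "pi", "--model", "openai-codex/gpt-5.5:high"]
    print(bake_pane_command(recipe, dirs, os.environ.get("PATH", "")))
```

Baking PATH into the command string means the pane no longer depends on what the tmux server or ~/.bashrc happened to export, which is the failure mode described in the correction above.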