stack/docs/scratchpads/fleet-observability-phase2.md
2026-06-20 22:30:34 -05:00

Scratchpad — Fleet Phase 2: Observability (W-FLEET)

Append-only. Mission mvp-20260312 / workstream W-FLEET. Lead: Jarvis (Claude) at W-jarvis:mos-claude-18. Coordinating with jwoltje@dragon-lin:coder0-0.

Mission prompt (2026-06-20)

Establish the north star for the Mosaic Fleet feature and prepare Phase-2 observability for delivery. The USC tmux PoC is the proven base. Jason granted lead authority: "The fleet is a great way to actually build the MVP — we are building the system that builds the system." Dogfood actual agent construction + ad-hoc deployment; coordinate with a second agent on dragon-lin.

Decisions of record (with Jason, 2026-06-20)

  • Agent model: config defines, session runs (gateway = definition/identity/auth; tmux = runtime).
  • Tenancy: multi-tenant from the start; isolation = per-tenant Linux uid.
  • Health: heartbeat required; dogfood stub implements protocol now.
  • Lifecycle: hybrid (core always-on + ephemeral workers).
  • Observation: read-only default, opt-in takeover.
  • Multi-host: designed-for day one; control plane rides federation (W1), not a bespoke broker.
  • Delivery: CLI-first, dogfood on the live stub fleet; webUI deferred to Phase 5.
  • Fleet is dual-role: product AND means of production (bootstrapping the MVP).
  • Code review = dual-engine: Claude and gpt-5.5/Codex, run together (Jason: the combination produces the best results). Launch reviewers via mosaic yolo pi / codex (proven path) or ~/.config/mosaic/tools/codex/codex-code-review.sh. Applies to all code-review gates incl. FLEET-OBS-008. Per Jason 2026-06-20.
  • Worktree discipline: do fleet work in ~/src/mosaicstack-stack-worktrees/<branch>, NOT the shared main checkout — concurrent processes mutate main there (learned 2026-06-20).
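A minimal sketch of how the "heartbeat required" decision could be evaluated on the observer side. The scratchpad does not define the heartbeat protocol, so everything here is an assumption: the last-beat epoch timestamp as input, the 30-second threshold, and the status names are all hypothetical.

```python
import time
from typing import Optional

# Hypothetical threshold; the scratchpad does not specify a heartbeat interval.
STALE_AFTER_SECONDS = 30.0


def heartbeat_status(
    last_beat: Optional[float],
    now: Optional[float] = None,
    stale_after: float = STALE_AFTER_SECONDS,
) -> str:
    """Classify an agent from the epoch time of its last heartbeat.

    Returns "missing" when no heartbeat was ever seen, "healthy" when the
    last beat is recent enough, and "stale" otherwise.
    """
    if last_beat is None:
        return "missing"
    if now is None:
        now = time.time()
    age = now - last_beat
    return "healthy" if age <= stale_after else "stale"
```

Passing `now` explicitly keeps the check deterministic in tests; a real observer would presumably read the timestamp from wherever the agent publishes it.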

Environment facts (verified 2026-06-20)

  • Fleet is live on W-jarvis (uid 1000, jarvis, Linger=yes) on tmux socket mosaic-factory: _holder, canary-pi, dogfood-coder, dogfood-orchestrator, dogfood-reviewer. All panes run ~/.config/mosaic/fleet/dogfood-agent.py (stub), including canary-pi (roster says runtime=pi → drift).
  • Holder + mosaic-agent@* units are active (exited) but UnitFileState=disabled (reboot loses fleet → boot-enable gap to surface).
  • Observation blocked by: isolated socket (hidden from default tmux ls), capture-pane blank for TUIs, attach being read-write + resizing.
  • Second agent: jwoltje@dragon-lin, session coder0-0 (group coder0), running node, default socket. ssh forward reach confirmed.
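The observation blockers and the two gaps above (runtime drift, units not boot-enabled) can be illustrated with a small sketch. It only builds tmux argument lists and derives flags; it is not the fleet ps implementation. The function names and the idea of comparing a roster runtime string to an observed one are assumptions. The tmux options used (`-L` for a named socket, `attach-session -r` for a read-only client) are standard, but whether `-r` also avoids resizing the session depends on the tmux version and should be checked locally.

```python
from typing import List

SOCKET = "mosaic-factory"  # socket name taken from the environment facts above


def list_sessions_cmd(socket: str = SOCKET) -> List[str]:
    """Command to list sessions on a named socket.

    -L selects the socket by name, so sessions that a plain `tmux ls`
    (default socket) cannot see become visible.
    """
    return ["tmux", "-L", socket, "list-sessions", "-F", "#{session_name}"]


def observe_cmd(session: str, socket: str = SOCKET) -> List[str]:
    """Command to attach to a session as a read-only client (-r)."""
    return ["tmux", "-L", socket, "attach-session", "-r", "-t", session]


def fleet_flags(
    roster_runtime: str, observed_runtime: str, unit_file_state: str
) -> List[str]:
    """Derive warning flags for one agent (hypothetical logic).

    DRIFT: the roster's declared runtime differs from what is running.
    BOOT-ENABLE: the systemd unit would not start on boot.
    """
    flags: List[str] = []
    if roster_runtime != observed_runtime:
        flags.append("DRIFT")
    if unit_file_state != "enabled":
        flags.append("BOOT-ENABLE")
    return flags
```

Under these assumptions, the canary-pi case described above (roster runtime pi, stub actually running, unit disabled) would yield both flags.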

Governance / collision-safety

  • mosaicstack-stack has active mission mvp-20260312 with single-writer locks on docs/MISSION-MANIFEST.md, docs/TASKS.md, docs/scratchpads/mvp-20260312.md.
  • This workstream touches NONE of those. All Fleet docs scoped under docs/fleet/ + this scratchpad. Rollup row proposed, not written.

Session log

  • 2026-06-20: Researched AI guide + fleet code + live state. Established north star with Jason (8 forks decided). Branched feat/fleet-observability. Persisted docs/fleet/{north-star.md,PRD.md,TASKS.md} + this scratchpad. Next: establish comms with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + fleet ps).
  • 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): fleet ps, agent watch, agent send --verify, 62 tests. LIVE-verified fleet ps on mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON. Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — fleet ps HB now healthy for all 4 agents.
  • Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572 (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex flagged a suspected Alpine blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
  • FINDINGS (north-star / Phase-3 blockers):
    1. Ad-hoc mosaic yolo {codex,pi} via start-agent-session.sh DIE immediately in a detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY in the detached shell) must be fixed before Phase-3 real-runtime swap. fleet ps caught both dead panes instantly (tool validated).
    2. MOSAIC_AGENT_NAME (set in systemd EnvironmentFile) is NOT propagated into tmux's global env, so agents defaulted to unknown. Worked around in dogfood-agent.py via tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
  • Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close fleet-observability-1. Defer launch-path + env-propagation fixes to Phase 3.
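The session-name fallback described in finding 2 might look roughly like the sketch below. This is not the code in dogfood-agent.py; the function names and the injectable lookup are assumptions made for illustration. It relies only on `tmux display-message -p "#S"`, which prints the current session name when run inside a pane.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional


def tmux_session_name() -> Optional[str]:
    """Ask tmux for the current session name; None if that fails."""
    try:
        result = subprocess.run(
            ["tmux", "display-message", "-p", "#S"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # tmux missing, not inside a session, non-zero exit, or timeout.
        return None
    return result.stdout.strip() or None


def resolve_agent_name(
    env: Mapping[str, str] = os.environ,
    session_lookup: Callable[[], Optional[str]] = tmux_session_name,
) -> str:
    """Prefer MOSAIC_AGENT_NAME, then the tmux session name, then "unknown"."""
    name = env.get("MOSAIC_AGENT_NAME")
    if name:
        return name
    return session_lookup() or "unknown"
```

Injecting `env` and `session_lookup` keeps the fallback order testable without a running tmux server. It remains a workaround: the variable still has to reach the pane's environment for the primary path to work.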