diff --git a/docs/scratchpads/fleet-observability-phase2.md b/docs/scratchpads/fleet-observability-phase2.md
index ed2460b..22f0694 100644
--- a/docs/scratchpads/fleet-observability-phase2.md
+++ b/docs/scratchpads/fleet-observability-phase2.md
@@ -54,3 +54,22 @@ with a second agent on `dragon-lin`.
   Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
   `docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms with
   dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
+- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`,
+  `agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on
+  mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON.
+  Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — `fleet ps` HB now
+  `healthy` for all 4 agents.
+- Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572
+  (sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex caught an Alpine
+  blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
+- **FINDINGS (north-star / Phase-3 blockers):**
+  1. Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a
+     detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub
+     survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY
+     in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps`
+     caught both dead panes instantly (tool validated).
+  2. `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's
+     global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via
+     tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
+- Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close
+  `fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.