feat(fleet): Phase-2 observability — fleet ps + watch + send verify #579

Merged
jason.woltje merged 9 commits from feat/fleet-observability into main 2026-06-21 04:23:52 +00:00
Showing only changes of commit 11c4dbe6f3

@@ -54,3 +54,22 @@ with a second agent on `dragon-lin`.
Jason (8 forks decided). Branched `feat/fleet-observability`. Persisted
`docs/fleet/{north-star.md,PRD.md,TASKS.md}` + this scratchpad. Next: establish comms
with dragon-lin coder, commit docs, begin Phase-2 delivery (heartbeat + `fleet ps`).
- 2026-06-20 (session 2): Built Phase-2 CLI via worker (commit ab47831): `fleet ps`,
`agent watch`, `agent send --verify`, 62 tests. LIVE-verified `fleet ps` on
mosaic-factory — correctly flagged canary-pi DRIFT + BOOT-ENABLE, tenant_id+host in JSON.
Heartbeat responder added to dogfood-agent.py (FLEET-OBS-002) — `fleet ps` HB now
`healthy` for all 4 agents.
- Coordination: dual-engine-reviewed (Claude+Codex) and merged framework PRs #572
(sanitization gate) + #575 (CONSTITUTION extraction) as Lead. Codex flagged an Alpine
blocker on #572 (refuted by CI); Claude caught a CI-breaking format failure on #575.
- **FINDINGS (north-star / Phase-3 blockers):**
1. Ad-hoc `mosaic yolo {codex,pi}` via `start-agent-session.sh` DIE immediately in a
detached tmux pane (codex: "stdin is not a terminal"; pi: same). Only the python stub
survives. => Real runtimes have NEVER run durably in the fleet. Launch path (PATH/TTY
in the detached shell) must be fixed before Phase-3 real-runtime swap. `fleet ps`
caught both dead panes instantly (tool validated).
2. `MOSAIC_AGENT_NAME` (set in systemd EnvironmentFile) is NOT propagated into tmux's
global env, so agents defaulted to `unknown`. Worked around in dogfood-agent.py via
tmux session-name fallback; the systemd/tmux env handoff needs a real fix.
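
The session-name fallback from finding 2 could look roughly like the sketch below. This is not the code in dogfood-agent.py; the function name `resolve_agent_name` and its injectable `env`/`run` parameters are hypothetical, and it assumes the tmux session is named after the agent. It relies only on `tmux display-message -p '#S'`, which prints the session name.

```python
import os
import subprocess


def resolve_agent_name(env=None, run=subprocess.run):
    """Hypothetical helper: prefer MOSAIC_AGENT_NAME, else the tmux session name."""
    env = os.environ if env is None else env

    # 1. The value systemd was supposed to hand over.
    name = env.get("MOSAIC_AGENT_NAME", "").strip()
    if name:
        return name

    # 2. Fallback: ask tmux for the session name, but only inside a tmux pane.
    if env.get("TMUX"):
        cmd = ["tmux", "display-message", "-p"]
        pane = env.get("TMUX_PANE")
        if pane:
            # Target our own pane so this also works with no client attached.
            cmd += ["-t", pane]
        cmd.append("#S")
        try:
            result = run(cmd, capture_output=True, text=True, timeout=2, check=False)
        except (OSError, subprocess.SubprocessError):
            return "unknown"
        session = result.stdout.strip()
        if result.returncode == 0 and session:
            return session

    # 3. Same default the agents fell back to before the workaround.
    return "unknown"
```

This only masks the problem inside one agent; it does not replace a real systemd/tmux env handoff.
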
- Next: rebase on merged main, open Phase-2 PR, dual-engine review, merge, close
`fleet-observability-1`. Defer launch-path + env-propagation fixes to Phase 3.
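
For the deferred launch-path work, a small preflight run inside the detached pane might help tell the two suspected causes from finding 1 (PATH vs. TTY) apart. A minimal sketch, assuming nothing about `start-agent-session.sh`; `launch_preflight` is a hypothetical name and uses only the standard library:

```python
import shutil
import sys


def launch_preflight(binary, stdin=None, path=None):
    """Hypothetical check: list reasons a runtime would likely die at launch."""
    stdin = sys.stdin if stdin is None else stdin
    problems = []

    # PATH check: is the runtime binary resolvable in this shell?
    if shutil.which(binary, path=path) is None:
        problems.append(f"{binary}: not found on PATH")

    # TTY check: mirrors the "stdin is not a terminal" error seen from codex.
    try:
        is_tty = stdin.isatty()
    except (AttributeError, ValueError):
        is_tty = False
    if not is_tty:
        problems.append("stdin is not a terminal")

    return problems


if __name__ == "__main__":
    # Usage: python preflight.py <binary>; prints one problem per line.
    target = sys.argv[1] if len(sys.argv) > 1 else "codex"
    for problem in launch_preflight(target):
        print(problem)
```

An empty result means neither suspect applies in that shell, which would point the investigation elsewhere.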