docs: add mission control and coordination resilience docs

2026-05-07 13:20:02 -05:00
parent 755df9079e
commit 079c5597ff
6 changed files with 978 additions and 0 deletions
--- a/docs/plans/2026-05-07-coordination-resilience.md
+++ b/docs/plans/2026-05-07-coordination-resilience.md
@@ -0,0 +1,233 @@
+# Mosaic Stack ↔ Hermes Coordination Resilience
+
+> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.
+
+## Summary
+
+The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.
+
+## SIBKISS operational summary
+
+- mission on
+- heartbeat always
+- resume from packet
+- block with `[BLOCKED]`
+- reassign
+- keep tasks tiny
+- auto-heal dead workers
+
+The design has four parts:
+
+1. Atomic task decomposition — workers operate only within a small, explicit scope.
+2. Distress signaling — workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
+3. Mechanical fallback — if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
+4. Auto-heal / reassignment — stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.
+
+## Why this exists
+
+Observed failure modes:
+
+- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
+- Silent failure / dead worker: the worker PID is gone, but the task is still marked `running` or `blocked`.
+- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.
+
+The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.
+
+## Core workflow
+
+### 1) Atomic task boundaries
+
+Every task should have:
+
+- one concern
+- explicit files/packages in scope
+- explicit files/packages out of scope
+- a maximum file count if possible
+- a stated expected iteration budget
+
+When a worker discovers work outside scope, it must stop fixing it and hand off.
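
The scope rule above can be sketched as a small guard that a worker calls before editing a file. This is a minimal sketch, not the Hermes implementation: `TaskScope`, `check_edit`, and the path-prefix convention are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical atomic-scope metadata carried in a task body."""
    concern: str
    in_scope: Tuple[str, ...]        # path prefixes the worker may edit
    out_of_scope: Tuple[str, ...]    # path prefixes the worker must not edit
    max_files: Optional[int] = None
    iteration_budget: Optional[int] = None


def _under(path: str, prefix: str) -> bool:
    """True if `path` equals `prefix` or lives somewhere below it."""
    p, root = PurePosixPath(path), PurePosixPath(prefix)
    return p == root or root in p.parents


def check_edit(scope: TaskScope, path: str, files_touched: Set[str]) -> str:
    """Return "ok" if the edit stays inside the scope, otherwise "handoff"."""
    # Explicit out-of-scope entries win over in-scope entries.
    if any(_under(path, bad) for bad in scope.out_of_scope):
        return "handoff"
    if not any(_under(path, good) for good in scope.in_scope):
        return "handoff"
    # Enforce the file-count cap, counting the file about to be edited.
    touched = files_touched | {path}
    if scope.max_files is not None and len(touched) > scope.max_files:
        return "handoff"
    return "ok"
```

A `"handoff"` result is the point at which the worker stops and raises a distress card instead of continuing the fix.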
+
+### 2) Worker-authored distress card
+
+If the worker can still report status, it creates a card like:
+
+- Title: `[BLOCKED] t_<source_task_id> <blocker_type>`
+- Assignee: `tuesday` / orchestrator role
+- Status: `ready`
+- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
+- Source task stays linked to the distress card so the recovery trail is auditable
+
+The orchestrator receives the card, acts on it, and closes the loop.
+
+### 3) Mechanical fallback for rate-limited workers
+
+If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.
+
+That watcher should:
+
+- inspect running / blocked tasks
+- detect repeated 429 / 503 / overload errors
+- create the same standardized `[BLOCKED]` card on behalf of the worker
+- link the distress card to the source task
+- add a comment to the source task
+- allow the dispatcher to pick up the new card immediately
+
+This closes the gap in the worker-authored protocol: a worker that cannot phone home still gets a distress card, because the watcher creates it mechanically.
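
One possible shape for such a watcher pass is sketched below. The `TaskRow` fields, the threshold of three matching errors, and the mapping of "overload" onto HTTP 503 are all assumptions made for illustration; the real task schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: overload errors surface as 503 alongside 429 rate limits.
RATE_LIMIT_CODES = {429, 503}


@dataclass
class TaskRow:
    """Hypothetical view of a task row plus its failure metadata."""
    task_id: str
    status: str
    recent_error_codes: List[int] = field(default_factory=list)
    distress_card_title: Optional[str] = None   # set once a card is linked
    comments: List[str] = field(default_factory=list)


def looks_rate_limited(task: TaskRow, min_repeats: int = 3) -> bool:
    """True when enough recent errors are rate-limit or overload codes."""
    hits = sum(1 for code in task.recent_error_codes if code in RATE_LIMIT_CODES)
    return hits >= min_repeats


def synthesize_distress_cards(tasks: List[TaskRow], min_repeats: int = 3) -> List[Dict[str, str]]:
    """Create one [BLOCKED] card per rate-limited task that has none yet."""
    new_cards = []
    for task in tasks:
        if task.status not in ("running", "blocked"):
            continue
        if task.distress_card_title is not None:
            continue  # already has a card, so the pass stays idempotent
        if not looks_rate_limited(task, min_repeats):
            continue
        card = {
            "title": f"[BLOCKED] t_{task.task_id} rate_limited",
            "assignee": "tuesday",
            "status": "ready",
            "source_task": task.task_id,
        }
        task.distress_card_title = card["title"]            # link card to source
        task.comments.append(f"Watcher created {card['title']}")
        new_cards.append(card)
    return new_cards
```

Because the function skips tasks that already carry a linked card, a cron-style scheduler can run it repeatedly without producing duplicates.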
+
+### 4) Auto-heal for dead workers
+
+A separate no-agent watcher should:
+
+- reap tasks stuck in `running` whose worker PID is dead
+- reset crash-loops whose failures are infrastructure-related
+- escalate tasks that have been reset too many times
+
+This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.
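
A minimal sketch of one auto-heal pass follows. The `WorkerTask` fields, the `max_resets` value, and the choice to escalate by moving the task to `blocked` are assumptions; the check that a failure is infrastructure-related is left out for brevity. The liveness probe uses `os.kill(pid, 0)`, which assumes a POSIX host and a watcher running on the same machine as the workers.

```python
import os
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


@dataclass
class WorkerTask:
    """Hypothetical task row as seen by the auto-heal watcher."""
    task_id: str
    status: str
    pid: Optional[int]
    reset_count: int = 0


def heal(
    tasks: List[WorkerTask],
    is_alive: Callable[[int], bool] = pid_alive,
    max_resets: int = 3,
) -> List[Tuple[str, str]]:
    """Reap dead workers; reset them, or escalate after too many resets."""
    actions = []
    for task in tasks:
        if task.status != "running" or task.pid is None:
            continue
        if is_alive(task.pid):
            continue
        task.pid = None  # reap: the recorded worker no longer exists
        if task.reset_count >= max_resets:
            task.status = "blocked"   # assumption: escalation parks the task
            actions.append((task.task_id, "escalate"))
        else:
            task.reset_count += 1
            task.status = "ready"     # back onto the dispatch queue
            actions.append((task.task_id, "reset"))
    return actions
```

Passing `is_alive` as a parameter keeps the pass testable without real processes and lets a remote deployment swap in a different probe.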
+
+## Distress card contract
+
+### Canonical title
+
+```text
+[BLOCKED] t_<source_task_id> <blocker_type>
+```
+
+### Canonical blocker types
+
+- `scope_boundary`
+- `env_blocker`
+- `credential_failure`
+- `dependency`
+- `iteration_budget`
+- `rate_limited`
+
+### Canonical body
+
+```markdown
+## Distress Signal
+- Blocked task: t_<source_task_id>
+- Worker: <profile_name>
+- Branch: <git_branch_name>
+- Workspace: <path>
+- Blocker type: <type>
+- Completed: <what was done>
+- Cannot touch: <out-of-scope packages/files>
+- Needs: <what the orchestrator should do>
+- State: committed | uncommitted | stashed(<stash_name>)
+
+## Scope Guard
+DO NOT touch: anything outside diagnosing and remediating the blocker described above
+Only fix: assign, split, reassign, or unblock the source task
+```
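
The contract above can be rendered from structured fields so that workers and watchers emit identical cards. This is a sketch only: the `Distress` dataclass and the `render_title` / `render_body` helpers are hypothetical names, while the title format, blocker types, and body lines are copied from the contract.

```python
from dataclasses import dataclass

BLOCKER_TYPES = frozenset({
    "scope_boundary", "env_blocker", "credential_failure",
    "dependency", "iteration_budget", "rate_limited",
})


@dataclass(frozen=True)
class Distress:
    """Hypothetical container for the fields of one distress card."""
    source_task_id: str
    worker: str
    branch: str
    workspace: str
    blocker_type: str
    completed: str
    cannot_touch: str
    needs: str
    state: str  # "committed", "uncommitted", or "stashed(<stash_name>)"


def render_title(d: Distress) -> str:
    """Build the canonical title, rejecting unknown blocker types."""
    if d.blocker_type not in BLOCKER_TYPES:
        raise ValueError(f"unknown blocker type: {d.blocker_type}")
    return f"[BLOCKED] t_{d.source_task_id} {d.blocker_type}"


def render_body(d: Distress) -> str:
    """Build the canonical Markdown body, including the scope guard."""
    lines = [
        "## Distress Signal",
        f"- Blocked task: t_{d.source_task_id}",
        f"- Worker: {d.worker}",
        f"- Branch: {d.branch}",
        f"- Workspace: {d.workspace}",
        f"- Blocker type: {d.blocker_type}",
        f"- Completed: {d.completed}",
        f"- Cannot touch: {d.cannot_touch}",
        f"- Needs: {d.needs}",
        f"- State: {d.state}",
        "",
        "## Scope Guard",
        "DO NOT touch: anything outside diagnosing and remediating the blocker described above",
        "Only fix: assign, split, reassign, or unblock the source task",
    ]
    return "\n".join(lines)
```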
+
+## Routing rules
+
+### Distress card routing
+
+- `[BLOCKED]` title prefix should bypass normal triage.
+- The card should go directly to the orchestration profile.
+- The orchestrator should start from a clean session each time.
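
As an illustration, the prefix rule could be implemented as a small router in front of triage. The queue names `"orchestrator"` and `"triage"` and the helper names are assumptions; the title pattern follows the canonical title given earlier.

```python
import re
from typing import Optional, Tuple

# Matches "[BLOCKED] t_<source_task_id> <blocker_type>".
TITLE_RE = re.compile(r"^\[BLOCKED\] t_(?P<source>\S+) (?P<blocker>[a-z_]+)$")


def parse_distress_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (source_task_id, blocker_type), or None if the title is malformed."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    return match.group("source"), match.group("blocker")


def route(title: str) -> str:
    """Send [BLOCKED] cards straight to the orchestrator, everything else to triage."""
    if title.startswith("[BLOCKED]"):
        return "orchestrator"
    return "triage"
```

Routing on the bare prefix rather than the full pattern means a malformed distress card still reaches the orchestrator, which can then decide how to handle it.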
+
+### Rate-limit fallback
+
+When the source task is rate-limited:
+
+- do not keep retrying in the worker
+- let the watcher synthesize the distress card
+- have the orchestrator reassign the source task to a different profile/provider combo
+
+### Provider fallback principle
+
+Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.
+
+### Suggested fallback order
+
+1. Keep the current task body and scope guards intact.
+2. Reassign to a different profile on a different provider.
+3. If no other provider is available, reassign to a different profile on the same provider, but only when the blocker is not a rate limit.
+4. If repeated failures continue, split the task into a narrower atomic card.
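
The fallback order above might be encoded as follows. `Profile` and `pick_reassignment` are hypothetical names, and returning `"split"` as soon as no eligible profile remains is a simplification of step 4, which in the text is triggered by repeated failures.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Profile:
    """Hypothetical worker profile bound to a model provider."""
    name: str
    provider: str


def pick_reassignment(
    current: Profile,
    candidates: List[Profile],
    blocker_type: str,
) -> Tuple[str, Optional[Profile]]:
    """Return ("reassign", profile) or ("split", None) per the fallback order."""
    others = [p for p in candidates if p.name != current.name]

    # Step 2: prefer a different profile on a different provider.
    for profile in others:
        if profile.provider != current.provider:
            return ("reassign", profile)

    # Step 3: same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited":
        for profile in others:
            if profile.provider == current.provider:
                return ("reassign", profile)

    # Step 4 (simplified): nothing eligible, so narrow the task instead.
    return ("split", None)
```

Step 1 is implicit here: the function only chooses an assignee and never touches the task body or its scope guards.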
+
+## Related recovery docs
+
+- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
+- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
+- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
+
+## Watchers to implement
+
+### Auto-heal watcher
+
+Responsibilities:
+
+- reap stale workers
+- reset dead-PID crash loops
+- track reset counts
+- escalate after repeated resets
+
+### Distress synthesizer watcher
+
+Responsibilities:
+
+- detect rate-limited / stuck workers
+- create `[BLOCKED]` cards mechanically
+- link the card to the source task
+- leave a comment for traceability
+
+### Iteration-budget watcher
+
+Responsibilities:
+
+- detect long-running tasks and repeated failure patterns
+- recommend splits when a task is clearly over-scoped
+- report tasks that need human review after multiple resets
+
+## Operational principle
+
+If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.
+
+This is what makes the system robust across compaction, rate limits, and dead workers.
+
+## Suggested implementation order
+
+1. Atomic task metadata in task bodies
+2. Worker-authored distress card protocol
+3. Mechanical distress synthesizer watcher
+4. Auto-heal watcher for dead workers
+5. Orchestrator routing rules for `[BLOCKED]`
+6. Rate-limit fallback / model reassignment table
+
+## Where this fits in Hermes
+
+- Kanban = durable work graph and status engine
+- Watchers = mechanical healing and distress synthesis
+- Orchestrator = split / reassign / unblock decision-maker
+- Workers = execution inside atomic task boundaries
+
+## Where this fits in Mosaic Stack
+
+- PRD / coordination infra should encode the same patterns
+- Mosaic can use the same distress-card contract and watcher logic
+- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue
+
+## Cross-project takeaway
+
+The important pattern is not the specific tool names. It is the mechanical feedback loop:
+
+- detect failure without requiring the failing worker to succeed
+- create a standardized help artifact
+- route that artifact to a fresh orchestrator context
+- repair the assignment graph
+- continue the mission
+
+That pattern is reusable anywhere.