# Mosaic Stack ↔ Hermes Coordination Resilience

> Purpose: document the self-healing coordination patterns that emerged while implementing the Hermes mission toolset, distress-card protocol, and auto-heal watchers, so the same mechanics can be reimplemented in Mosaic Stack or any similar agent platform.

## Summary

The coordination layer should be treated as a system of mechanical recovery loops rather than a single interactive agent session.

## SIBKISS operational summary

- mission on
- heartbeat always
- resume from packet
- block with `[BLOCKED]`
- reassign
- keep tasks tiny
- auto-heal dead workers

The design has four parts:

1. Atomic task decomposition: workers operate only within a small, explicit scope.
2. Distress signaling: workers create a standardized `[BLOCKED]` card when they encounter a blocker outside their scope.
3. Mechanical fallback: if the worker cannot phone home because of rate limits or dead context, a cron-style watcher synthesizes the distress card for them.
4. Auto-heal / reassignment: stale workers are reaped, crash-loops are reset, and rate-limited work is reassigned to a different profile/provider.

## Why this exists

Observed failure modes:

- Scope creep: a worker completes the target fix, then spends the rest of its budget chasing downstream cascade work.
- Silent failure / dead worker: the worker PID is gone, but the task remains running or blocked.
- Rate-limited worker: the worker is too constrained to create a help card itself, so it spins or fails without a clean handoff.

The answer is not to raise iteration caps or ask the worker to keep trying longer. The answer is to make the coordination layer self-healing and the work items atomic.

## Core workflow

### 1) Atomic task boundaries

Every task should have:

- one concern
- explicit files/packages in scope
- explicit files/packages out of scope
- a maximum file count if possible
- a stated expected iteration budget

When a worker discovers work outside its scope, it must stop, leave that work untouched, and hand off.

### 2) Worker-authored distress card

If the worker can still report status, it creates a card like:

- Title: `[BLOCKED] t_ `
- Assignee: `tuesday` / orchestrator role
- Status: `ready`
- Body: standardized distress template with source task, blocker type, completed work, cannot-touch scope, and needed action
- Source task stays linked to the distress card so the recovery trail is auditable

The orchestrator receives the card, acts on it, and closes the loop.

### 3) Mechanical fallback for rate-limited workers

If the worker is too rate-limited or unstable to create the distress card itself, a no-agent watcher must synthesize the card from the task row and failure metadata.

That watcher should:

- inspect running / blocked tasks
- detect repeated 429 / 503 / overload errors
- create the same standardized `[BLOCKED]` card on behalf of the worker
- link the distress card to the source task
- add a comment to the source task
- allow the dispatcher to pick up the new card immediately

This is the key fix for the logic issue: the worker does not need to be able to phone home if the watcher can do it mechanically.
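
The watcher steps above can be sketched in a few lines. This is a minimal illustration, not the Hermes implementation: the `Task` fields, the `REPEAT_THRESHOLD` value, and the function names are all hypothetical, and the exact title suffix after `[BLOCKED]` is not specified here, so the sketch simply appends the task id.

```python
from dataclasses import dataclass, field

# Status codes treated as provider pressure (429 / 503, per the list above).
OVERLOAD_CODES = {429, 503}
# Hypothetical threshold: how many overload errors count as "repeated".
REPEAT_THRESHOLD = 3


@dataclass
class Task:
    """Hypothetical task row; field names are illustrative only."""
    task_id: str
    status: str  # e.g. "running", "blocked", "ready"
    worker: str
    recent_error_codes: list = field(default_factory=list)
    comments: list = field(default_factory=list)
    linked_cards: list = field(default_factory=list)


def is_rate_limited(task: Task) -> bool:
    """True when a running/blocked task shows repeated overload errors."""
    if task.status not in {"running", "blocked"}:
        return False
    hits = sum(1 for code in task.recent_error_codes if code in OVERLOAD_CODES)
    return hits >= REPEAT_THRESHOLD


def synthesize_distress_cards(tasks: list, orchestrator: str = "tuesday") -> list:
    """Create one [BLOCKED] card per rate-limited task, without calling any agent."""
    cards = []
    for task in tasks:
        if not is_rate_limited(task) or task.linked_cards:
            continue  # healthy, or a distress card already exists
        card = {
            "title": f"[BLOCKED] {task.task_id}",
            "assignee": orchestrator,
            "status": "ready",  # ready, so the dispatcher can pick it up
            "source_task": task.task_id,
            "blocker_type": "rate_limited",
            "worker": task.worker,
        }
        # Link the card to the source task and leave a traceability comment.
        task.linked_cards.append(card["title"])
        task.comments.append(f"Watcher created distress card: {card['title']}")
        cards.append(card)
    return cards
```

Skipping tasks that already have a linked card keeps the watcher idempotent, so running it on a cron-style schedule should not flood the board with duplicate distress cards.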

### 4) Auto-heal for dead workers

A separate no-agent watcher should:

- reap dead PIDs stuck in `running`
- reset crash-loops whose failures are infrastructure-related
- escalate tasks that have been reset too many times

This watcher prevents stale tasks from clogging the board and keeps the dispatch queue moving.

## Distress card contract

### Canonical title

```text
[BLOCKED] t_ 
```

### Canonical blocker types

- `scope_boundary`
- `env_blocker`
- `credential_failure`
- `dependency`
- `iteration_budget`
- `rate_limited`

### Canonical body

```markdown
## Distress Signal
- Blocked task: t_xxx
- Worker:
- Branch:
- Workspace:
- Blocker type:
- Completed:
- Cannot touch:
- Needs:
- State: committed | uncommitted | stashed()

## Scope Guard
DO NOT touch: anything outside diagnosing and remediating the blocker described above
Only fix: assign, split, reassign, or unblock the source task
```

## Routing rules

### Distress card routing

- `[BLOCKED]` title prefix should bypass normal triage.
- The card should go directly to the orchestration profile.
- The orchestrator should start from a clean session each time.

### Rate-limit fallback

When the source task is rate-limited:

- do not keep retrying in the worker
- let the watcher synthesize the distress card
- have the orchestrator reassign the source task to a different profile/provider combo

### Provider fallback principle

Never reassign rate-limited work back to the same provider if the failure was provider pressure. Use a different provider when possible.

### Suggested fallback order

1. Keep the current task body and scope guards intact.
2. Reassign to a different profile on a different provider.
3. If that is impossible, reassign to a different profile on the same provider only for non-rate-limit blockers.
4. If repeated failures continue, split the task into a narrower atomic card.
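
The fallback order above could be encoded as a small selection function. This is a sketch under stated assumptions: `Profile`, `pick_reassignment`, and the profile and provider names in the usage example are hypothetical, and returning `None` is just one way to signal "no valid target, split the task instead".

```python
from typing import NamedTuple, Optional


class Profile(NamedTuple):
    """Hypothetical profile/provider pair."""
    name: str
    provider: str


def pick_reassignment(
    current: Profile, candidates: list, blocker_type: str
) -> Optional[Profile]:
    """Pick a reassignment target following the suggested fallback order.

    The task body and scope guards are never touched here (step 1);
    only the assignee changes.
    """
    others = [p for p in candidates if p.name != current.name]

    # Step 2: prefer a different profile on a different provider.
    for profile in others:
        if profile.provider != current.provider:
            return profile

    # Step 3: same provider is acceptable only for non-rate-limit blockers.
    if blocker_type != "rate_limited" and others:
        return others[0]

    # Step 4: nothing suitable; the caller should split the task instead.
    return None
```

For example, with `current = Profile("worker-a", "provider-x")` and a single candidate `Profile("worker-b", "provider-x")`, a `rate_limited` blocker yields `None`, while an `env_blocker` yields `worker-b`.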

## Related recovery docs

- Mission packet recovery contract: `/opt/hermes/docs/mission-toolset-heartbeat.md`
- Hermes mission implementation plan: `/opt/hermes/docs/plans/mission-toolset-implementation.md`
- The same packet-first resume rule applies: inspect the latest packet before re-reading mission files.
- New-session trigger: when a profile config changes, start a fresh session or `/reset` so the updated toolset is actually loaded.

## Watchers to implement

### Auto-heal watcher

Responsibilities:

- reap stale workers
- reset dead-PID crash loops
- track reset counts
- escalate after repeated resets

### Distress synthesizer watcher

Responsibilities:

- detect rate-limited / stuck workers
- create `[BLOCKED]` cards mechanically
- link the card to the source task
- leave a comment for traceability

### Iteration-budget watcher

Responsibilities:

- detect long-running tasks and repeated failure patterns
- recommend splits when a task is clearly over-scoped
- report tasks that need human review after multiple resets

## Operational principle

If a task cannot cleanly finish within its atomic scope, the right response is to surface a smaller coordination problem, not to keep burning context.

This is what makes the system robust across compaction, rate limits, and dead workers.

## Suggested implementation order

1. Atomic task metadata in task bodies
2. Worker-authored distress card protocol
3. Mechanical distress synthesizer watcher
4. Auto-heal watcher for dead workers
5. Orchestrator routing rules for `[BLOCKED]`
6. Rate-limit fallback / model reassignment table

## Where this fits in Hermes

- Kanban = durable work graph and status engine
- Watchers = mechanical healing and distress synthesis
- Orchestrator = split / reassign / unblock decision-maker
- Workers = execution inside atomic task boundaries

## Where this fits in Mosaic Stack

- PRD / coordination infra should encode the same patterns
- Mosaic can use the same distress-card contract and watcher logic
- The coordination model should be runtime-agnostic: any agent system can use it if it can write a task card and react to a ready queue

## Cross-project takeaway

The important pattern is not the specific tool names. It is the mechanical feedback loop:

- detect failure without requiring the failing worker to succeed
- create a standardized help artifact
- route that artifact to a fresh orchestrator context
- repair the assignment graph
- continue the mission

That pattern is reusable anywhere.
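
As one concrete instance of this loop, the auto-heal watcher's responsibilities (reap, reset, track, escalate) could be sketched as below. Everything here is an assumption for illustration: the `RunningTask` fields, the `MAX_RESETS` value, treating every dead PID as an infrastructure failure, and moving escalated tasks to `blocked`. The liveness probe uses `os.kill(pid, 0)`, which is a POSIX idiom; the `is_alive` parameter lets other platforms or tests swap it out.

```python
import os
from dataclasses import dataclass

MAX_RESETS = 3  # hypothetical cap before a task is escalated


@dataclass
class RunningTask:
    """Hypothetical task row; field names are illustrative only."""
    task_id: str
    status: str
    pid: int
    reset_count: int = 0


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def auto_heal(tasks: list, is_alive=pid_alive, max_resets: int = MAX_RESETS) -> list:
    """Reap dead workers stuck in `running`; return ids of escalated tasks."""
    escalated = []
    for task in tasks:
        if task.status != "running" or is_alive(task.pid):
            continue  # not running, or the worker is still alive
        if task.reset_count >= max_resets:
            # Reset too many times: stop retrying and surface for review.
            task.status = "blocked"
            escalated.append(task.task_id)
        else:
            # Dead PID: count the reset and return the task to the queue.
            task.reset_count += 1
            task.status = "ready"
    return escalated
```

Because the watcher only reads task rows and process state, it never depends on the failing worker being able to report anything, which is the property the loop above relies on.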